Documentation ¶
Overview ¶
large_wordcount is an example that demonstrates a more complex version of a wordcount pipeline. It uses a SplittableDoFn for reading the text files, then uses a map side input to build sorted shards.
This example, large_wordcount, is the fourth in a series of five successively more detailed 'word count' examples. You may first want to take a look at minimal_wordcount and wordcount. Then look at debugging_wordcount for some testing and validation concepts. After you've looked at this example, follow up with the windowed_wordcount pipeline, which introduces additional concepts.
Basic concepts, also in the minimal_wordcount and wordcount examples:
- Reading text files.
- Counting a PCollection.
- Executing a Pipeline both locally and using a selected runner.
- Defining DoFns.
New Concepts:
- Using a SplittableDoFn transform to read the input files.
- Using a Map Side Input to access values for specific keys.
- Testing your Pipeline via passert and metrics, using Go testing tools.
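A map side input lets a DoFn look up values for specific keys. The sketch below illustrates the general shape; shardFn, applyShard, and the element names are hypothetical, and the keyed-lookup parameter form `func(K) func(*V) bool` is the Go SDK's multimap side-input shape, assumed to be available in the SDK version in use:

```go
package main

import (
	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// shardFn is a hypothetical DoFn: for each word it looks up a
// precomputed value (e.g. a shard assignment) in a map side input.
// The parameter func(string) func(*int) bool is the multimap
// side-input form: calling it with a key returns an iterator over
// the values stored for that key.
func shardFn(word string, lookup func(string) func(*int) bool, emit func(string, int)) {
	var shard int
	iter := lookup(word)
	for iter(&shard) {
		emit(word, shard)
	}
}

// applyShard wires the side input into a ParDo. words holds the
// words to process; shards holds KV<string, int> pairs consumed
// as the map side input.
func applyShard(s beam.Scope, words, shards beam.PCollection) beam.PCollection {
	return beam.ParDo(s, shardFn, words, beam.SideInput{Input: shards})
}
```

The side input is fully materialized before the ParDo runs, so lookups are served from the runner's cached view rather than recomputed per element.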
Rather than enumerating the new concepts up front, this example documents them as they appear. Some concepts from previous examples are repeated.
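The testing concept above can be sketched with the SDK's ptest and passert packages. This is a minimal sketch, not the example's actual test: TestCountWords and the CountWords stage it exercises are hypothetical names standing in for whatever composite transform the pipeline under test exposes.

```go
package main

import (
	"testing"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/ptest"
)

// TestMain lets the test binary execute Beam pipelines.
func TestMain(m *testing.M) {
	ptest.Main(m)
}

// TestCountWords is a hypothetical unit test: it builds a small
// in-process pipeline, attaches a passert assertion on the output
// PCollection, and runs the whole thing with ptest. The assertion
// itself executes as part of the pipeline.
func TestCountWords(t *testing.T) {
	p, s := beam.NewPipelineWithRoot()
	lines := beam.Create(s, "the quick fox", "the lazy dog")
	// CountWords is assumed to be the pipeline's counting stage,
	// emitting formatted "word: count" strings.
	counted := CountWords(s, lines)
	passert.Equals(s, counted, "the: 2", "quick: 1", "fox: 1", "lazy: 1", "dog: 1")
	if err := ptest.Run(p); err != nil {
		t.Errorf("pipeline failed: %v", err)
	}
}
```

ptest.Run executes against the test runner (the direct runner by default), so failed passert assertions surface as ordinary Go test failures.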
To change the runner, specify:
--runner=YOUR_SELECTED_RUNNER
The input file defaults to a public data set containing the text of King Lear, by William Shakespeare. You can override it with --input.
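Putting the flags together, an invocation might look like the following. The bucket paths, the --output flag, and the runner name are illustrative, not taken from the example itself:

```shell
# Run locally with the default runner and default input:
go run ./large_wordcount

# Run with an explicit input, output location, and runner
# (paths and runner name are illustrative):
go run ./large_wordcount \
  --input=gs://my-bucket/kinglear.txt \
  --output=gs://my-bucket/counts \
  --runner=dataflow
```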