Documentation ¶
Overview ¶
large_wordcount is an example that demonstrates a more complex version of a wordcount pipeline. It uses a SplittableDoFn for reading the text files, then uses a map side input to build sorted shards.
This example, large_wordcount, is the fourth in a series of five successively more detailed 'word count' examples. You may first want to take a look at minimal_wordcount and wordcount. Then look at debugging_wordcount for some testing and validation concepts. After you've looked at this example, follow up with the windowed_wordcount pipeline, which introduces additional concepts.
Basic concepts, also in the minimal_wordcount and wordcount examples:
- Reading text files.
- Counting a PCollection.
- Executing a Pipeline both locally and using a selected runner.
- Defining DoFns.
New Concepts:
- Using a SplittableDoFn transform to read the input files.
- Using a Map Side Input to access values for specific keys.
- Testing your Pipeline via passert and metrics, using Go testing tools.
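A map side input lets a DoFn look up values for specific keys. The sketch below illustrates the general shape; shardFn, applyShard, and the element names are hypothetical, and the keyed-lookup parameter form `func(K) func(*V) bool` is the Go SDK's multimap side-input shape, assumed to be available in the SDK version in use:

```go
package main

import (
	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// shardFn is a hypothetical DoFn: for each word it looks up a
// precomputed value (e.g. a shard assignment) in a map side input.
// The parameter func(string) func(*int) bool is the multimap
// side-input form: calling it with a key returns an iterator over
// the values stored for that key.
func shardFn(word string, lookup func(string) func(*int) bool, emit func(string, int)) {
	var shard int
	iter := lookup(word)
	for iter(&shard) {
		emit(word, shard)
	}
}

// applyShard wires the side input into a ParDo. words holds the
// words to process; shards holds KV<string, int> pairs consumed
// as the map side input.
func applyShard(s beam.Scope, words, shards beam.PCollection) beam.PCollection {
	return beam.ParDo(s, shardFn, words, beam.SideInput{Input: shards})
}
```

The side input is fully materialized before the ParDo runs, so lookups are served from the runner's cached view rather than recomputed per element.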
Rather than enumerating the new concepts up front, this example documents them as they appear. Some concepts from previous examples are repeated.
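The testing concept above can be sketched with the SDK's ptest and passert packages. This is a minimal sketch, not the example's actual test: TestCountWords and the CountWords stage it exercises are hypothetical names standing in for whatever composite transform the pipeline under test exposes.

```go
package main

import (
	"testing"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/ptest"
)

// TestMain lets the test binary execute Beam pipelines.
func TestMain(m *testing.M) {
	ptest.Main(m)
}

// TestCountWords is a hypothetical unit test: it builds a small
// in-process pipeline, attaches a passert assertion on the output
// PCollection, and runs the whole thing with ptest. The assertion
// itself executes as part of the pipeline.
func TestCountWords(t *testing.T) {
	p, s := beam.NewPipelineWithRoot()
	lines := beam.Create(s, "the quick fox", "the lazy dog")
	// CountWords is assumed to be the pipeline's counting stage,
	// emitting formatted "word: count" strings.
	counted := CountWords(s, lines)
	passert.Equals(s, counted, "the: 2", "quick: 1", "fox: 1", "lazy: 1", "dog: 1")
	if err := ptest.Run(p); err != nil {
		t.Errorf("pipeline failed: %v", err)
	}
}
```

ptest.Run executes against the test runner (the direct runner by default), so failed passert assertions surface as ordinary Go test failures.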
To change the runner, specify:
--runner=YOUR_SELECTED_RUNNER
The input file defaults to a public data set containing the text of King Lear, by William Shakespeare. You can override it with --input.
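Putting the flags together, an invocation might look like the following. The bucket paths, the --output flag, and the runner name are illustrative, not taken from the example itself:

```shell
# Run locally with the default runner and default input:
go run ./large_wordcount

# Run with an explicit input, output location, and runner
# (paths and runner name are illustrative):
go run ./large_wordcount \
  --input=gs://my-bucket/kinglear.txt \
  --output=gs://my-bucket/counts \
  --runner=dataflow
```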