large_wordcount

command
v2.44.0-RC1
Published: Jan 6, 2023 License: Apache-2.0, BSD-3-Clause, MIT Imports: 24 Imported by: 0

Documentation

Overview

large_wordcount is an example that demonstrates a more complex version of a wordcount pipeline. It uses a SplittableDoFn for reading the text files, then uses a map side input to build sorted shards.

This example, large_wordcount, is the fourth in a series of five successively more detailed 'word count' examples. You may first want to take a look at minimal_wordcount and wordcount. Then look at debugging_wordcount for some testing and validation concepts. After you've looked at this example, follow up with the windowed_wordcount pipeline for an introduction to additional concepts.

Basic concepts, also in the minimal_wordcount and wordcount examples: Reading text files; counting a PCollection; executing a Pipeline both locally and using a selected runner; defining DoFns.
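
For reference, those basics look roughly like the following sketch. The flag defaults and the word-splitting regular expression are illustrative, and the anonymous DoFns are only suitable for the direct runner; the real examples use named, registered functions.

	package main

	import (
		"context"
		"flag"
		"fmt"
		"log"
		"regexp"

		"github.com/apache/beam/sdks/v2/go/pkg/beam"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
	)

	var (
		input  = flag.String("input", "gs://apache-beam-samples/shakespeare/kinglear.txt", "File(s) to read.")
		output = flag.String("output", "wordcounts.txt", "Output file.")
	)

	var wordRE = regexp.MustCompile(`[a-zA-Z]+`)

	func main() {
		flag.Parse()
		beam.Init()

		p := beam.NewPipeline()
		s := p.Root()

		// Read lines, split them into words, and count each distinct word.
		lines := textio.Read(s, *input)
		words := beam.ParDo(s, func(line string, emit func(string)) {
			for _, w := range wordRE.FindAllString(line, -1) {
				emit(w)
			}
		}, lines)
		counted := stats.Count(s, words) // PCollection<KV<string,int>>

		// Format each count as a line of text and write the result.
		formatted := beam.ParDo(s, func(w string, c int) string {
			return fmt.Sprintf("%s: %d", w, c)
		}, counted)
		textio.Write(s, *output, formatted)

		// beamx.Run respects --runner, so the same pipeline runs locally or on a distributed runner.
		if err := beamx.Run(context.Background(), p); err != nil {
			log.Fatalf("failed to execute job: %v", err)
		}
	}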

New Concepts:

  1. Using a SplittableDoFn transform to read the input files.
  2. Using a map side input to access values for specific keys (see the sketch after this list).
  3. Testing your Pipeline via passert and metrics, using Go testing tools (a test sketch follows below).
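
The second concept, consuming a key/value side input inside a DoFn, can be sketched as follows. This sketch uses the KV iterator form and materializes the pairs into an in-memory Go map; the actual example uses Beam's map side-input support, and all names and values below are invented for illustration.

	package main

	import (
		"context"
		"fmt"
		"log"

		"github.com/apache/beam/sdks/v2/go/pkg/beam"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
	)

	func main() {
		beam.Init()
		p := beam.NewPipeline()
		s := p.Root()

		// Side input: a small KV<string,int> PCollection, here mapping each word to its length.
		kvs := beam.ParDo(s, func(w string, emit func(string, int)) {
			emit(w, len(w))
		}, beam.Create(s, "king", "lear", "cordelia"))

		// Main input: the keys we want to look up.
		queries := beam.Create(s, "lear", "cordelia")

		// beam.SideInput makes kvs available for every element of queries. The DoFn
		// receives the KV pairs as an iterator and builds a map for per-key lookups.
		results := beam.ParDo(s, func(q string, iter func(*string, *int) bool, emit func(string)) {
			m := map[string]int{}
			var k string
			var v int
			for iter(&k, &v) {
				m[k] = v
			}
			emit(fmt.Sprintf("%s -> %d", q, m[q]))
		}, queries, beam.SideInput{Input: kvs})
		debug.Print(s, results)

		if err := beamx.Run(context.Background(), p); err != nil {
			log.Fatal(err)
		}
	}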

Unlike the earlier examples, this one does not enumerate concepts by number; instead, they are documented in the code as they appear. There may be some repetition from previous examples.
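
The third concept, testing, might look roughly like this in a _test.go file; the input lines and assertions are invented for illustration. Metrics-based checks (for example, counters created with beam.NewCounter) can additionally be verified by querying the pipeline result.

	package main

	import (
		"strings"
		"testing"

		"github.com/apache/beam/sdks/v2/go/pkg/beam"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/passert"
		"github.com/apache/beam/sdks/v2/go/pkg/beam/testing/ptest"
	)

	func TestMain(m *testing.M) {
		// ptest.Main initializes Beam and selects the runner used by the tests.
		ptest.Main(m)
	}

	func TestExtractWords(t *testing.T) {
		p, s := beam.NewPipelineWithRoot()
		lines := beam.Create(s, "the king", "the fool")

		// Inline stand-in for the transform under test.
		words := beam.ParDo(s, func(line string, emit func(string)) {
			for _, w := range strings.Fields(line) {
				emit(w)
			}
		}, lines)

		// passert assertions are themselves transforms; they fail the pipeline if violated.
		passert.Count(s, words, "four words", 4)
		passert.Equals(s, words, "the", "king", "the", "fool")

		if err := ptest.Run(p); err != nil {
			t.Fatalf("pipeline failed: %v", err)
		}
	}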

To change the runner, specify:

--runner=YOUR_SELECTED_RUNNER

The input file defaults to a public data set containing the text of King Lear, by William Shakespeare. You can override it and choose your own input with --input.
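
For example, a run from the directory containing this command might look like:

	go run . --input=YOUR_INPUT_FILE --runner=YOUR_SELECTED_RUNNER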
