Dataflow flex templates - Wordcount
📝 Docs: Using Flex Templates
Samples showing how to create and run an Apache Beam template with a custom Docker image on Google Cloud Dataflow.
Before you begin
Follow the Getting started with Google Cloud Dataflow page, and make sure you have a Google Cloud project with billing enabled and a service account JSON key set up in your GOOGLE_APPLICATION_CREDENTIALS environment variable.
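For example, assuming you saved the service account key as sa-key.json in your working directory (the file name here is only illustrative):
# Point the Google Cloud client libraries at your service account key.
export GOOGLE_APPLICATION_CREDENTIALS="$PWD/sa-key.json"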
Additionally, for this sample you need the following:
- Enable the APIs: App Engine, Cloud Build (see the command sketch after this list for one way to do this from the command line).
- Create a Cloud Storage bucket.
export BUCKET="your-gcs-bucket"
gsutil mb gs://$BUCKET
- Clone the golang-samples repository and navigate to the code sample.
git clone https://github.com/GoogleCloudPlatform/golang-samples.git
cd golang-samples/dataflow/flex-templates/wordcount
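For the first step above, the APIs can also be enabled with gcloud instead of the console; this sketch assumes the App Engine Admin and Cloud Build services are the ones needed:
# Enable the App Engine Admin and Cloud Build APIs for the current project.
gcloud services enable appengine.googleapis.com cloudbuild.googleapis.com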
Wordcount sample
This sample shows how to deploy an Apache Beam batch pipeline that reads text files from Google Cloud Storage, counts the occurrences of each word in the text, and outputs the results back to Google Cloud Storage.
Compiling the pipeline code
Go Flex Templates are built from compiled binaries, which keeps the container images small. However, this means the pipeline code must be compiled for the target environment rather than on the machine used to write the template. For more information, see the Apache Beam SDK documentation on cross-compilation.
We will compile the Go binary to execute on the linux-amd64 architecture used by Dataflow workers.
CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o wordcount .
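As a quick sanity check before building the image, you can confirm the binary targets the right platform (this assumes the common Unix file utility is installed):
# Expect output describing a statically linked x86-64 Linux executable.
file wordcount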
Building a container image
We will build the Docker image for the Apache Beam pipeline. We are using Cloud Build so we don't need a local installation of Docker.
ℹ️ You can speed up subsequent builds with the Kaniko cache in Cloud Build.
# (Optional) Enable the Kaniko cache by default.
gcloud config set builds/use_kaniko True
Cloud Build allows you to build a Docker image using a Dockerfile and saves it into Container Registry, where the image is accessible to other Google Cloud products.
export PROJECT="$(gcloud config get-value project)"
export TEMPLATE_IMAGE="gcr.io/$PROJECT/samples/dataflow/wordcount:latest"
# Build the image into Container Registry; this is roughly equivalent to:
# gcloud auth configure-docker
# docker image build -t $TEMPLATE_IMAGE .
# docker push $TEMPLATE_IMAGE
gcloud builds submit --tag "$TEMPLATE_IMAGE" .
Images starting with gcr.io/PROJECT/ are saved into your project's Container Registry.
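As an optional check that the push succeeded, you can list the tags stored for the image:
# List the tags of the image we just built.
gcloud container images list-tags "gcr.io/$PROJECT/samples/dataflow/wordcount"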
Creating a Flex Template
To run a template, you need to create a template spec file containing all the necessary information to run the job, such as the SDK information and metadata.
The metadata.json file contains additional information for the template, such as the "name", "description", and input "parameters" fields.
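For reference, a minimal metadata.json for a pipeline like this one has roughly the following shape. The cloned sample already includes its own metadata.json, so treat the field values below as an illustrative sketch of the format rather than the repository's exact contents:
# Sketch only: the sample already ships a metadata.json; this shows the format.
cat > metadata.json <<'EOF'
{
  "name": "Wordcount",
  "description": "Counts the occurrences of each word in text files.",
  "parameters": [
    {
      "name": "input",
      "label": "Input text file(s) in Cloud Storage",
      "helpText": "Path of the text file(s) to read from, for example gs://bucket/path/*.txt.",
      "isOptional": true
    },
    {
      "name": "output",
      "label": "Output path in Cloud Storage",
      "helpText": "Path of the file to write word counts to."
    }
  ]
}
EOF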
The template file must be created in a Cloud Storage location, and is used to run a new Dataflow job.
export TEMPLATE_PATH="gs://$BUCKET/samples/dataflow/templates/wordcount.json"
# Build the Flex Template.
gcloud dataflow flex-template build $TEMPLATE_PATH \
--image "$TEMPLATE_IMAGE" \
--sdk-language "GO" \
--metadata-file "metadata.json"
The template is now available through the template file in the Cloud Storage location that you specified.
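You can inspect the generated template spec directly if you're curious what the build produced:
# Print the template spec file stored in Cloud Storage.
gsutil cat "$TEMPLATE_PATH"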
Running a Dataflow Flex Template pipeline
You can now run the Apache Beam pipeline in Dataflow by referring to the template file and passing the template parameters required by the pipeline. For this pipeline the input is optional and will default to a public storage bucket holding the text of Shakespeare's King Lear.
export REGION="us-central1"
# Run the Flex Template.
gcloud dataflow flex-template run "wordcount-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location "$TEMPLATE_PATH" \
--parameters input="gs://dataflow-samples/shakespeare/kinglear.txt" \
--parameters output="gs://$BUCKET/counts.txt" \
--region "$REGION"
Once the job finishes, check the results in your GCS bucket by downloading the output (set LOCAL_PATH to wherever you want the file):
gsutil cp gs://$BUCKET/counts.txt $LOCAL_PATH
Cleaning up
After you've finished this tutorial, you can clean up the resources you created on Google Cloud so you won't be billed for them in the future. The following sections describe how to delete or turn off these resources.
Clean up the Flex template resources
- Delete the template spec file from Cloud Storage.
gsutil rm $TEMPLATE_PATH
- Delete the Flex Template container image from Container Registry.
gcloud container images delete $TEMPLATE_IMAGE --force-delete-tags
Clean up Google Cloud project resources
- Delete the Cloud Storage bucket; this alone does not incur any charges.
⚠️ The following command also deletes all objects in the bucket. These objects cannot be recovered.
gsutil rm -r gs://$BUCKET
Limitations
Certain limitations apply to Flex Template jobs.
📝 The Using Flex Templates page in the Google Cloud Dataflow documentation is the authoritative source for up-to-date information on these limitations.