Getting Started with Google Cloud Dataflow

We recently started a prototype to implement an end-to-end analytics solution using Google Cloud Dataflow. Google has done an excellent job of documenting the steps to get started and has provided several examples of analytics solutions implemented in Dataflow. Here’s a quick summary of what I had to do to get started.

Download the Google Cloud SDK from https://cloud.google.com/sdk/. You need this primarily to set up a Google Cloud Storage bucket. The SDK installs a binary called gcloud, which lets you create the storage bucket.

1. I was running this in a Windows environment and had to install Maven from http://maven.apache.org/download.cgi

2. The “mvn clean install” command can then be run to build the binaries for the SDK. However, I ran into Checkstyle errors because line endings differ between Unix and Windows. To address this error, I added the lines below to the checkstyle.xml file:

<module name="NewlineAtEndOfFile">
  <property name="lineSeparator" value="lf" />
</module>
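
If you want to see which files are affected before changing the Checkstyle configuration, a small helper can report how each file ends. The sketch below is not part of the Dataflow SDK or Checkstyle; the class name LineEndingCheck and the method trailingNewline are names I made up for illustration, and it only looks at the final line separator, on the assumption that this is what the check cares about.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper (not from the SDK): reports how each file ends.
public class LineEndingCheck {

    // Returns "CRLF" if the content ends with \r\n, "LF" if it ends with
    // a bare \n, and "NONE" if there is no trailing newline at all.
    static String trailingNewline(byte[] content) {
        int n = content.length;
        if (n == 0 || content[n - 1] != '\n') {
            return "NONE";
        }
        if (n >= 2 && content[n - 2] == '\r') {
            return "CRLF";
        }
        return "LF";
    }

    // Usage: java LineEndingCheck <file> [<file> ...]
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            byte[] bytes = Files.readAllBytes(Paths.get(arg));
            System.out.println(arg + ": " + trailingNewline(bytes));
        }
    }
}
```

Files reported as CRLF are the likely candidates if the errors persist, for example because Git converted line endings on checkout.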

3. Here’s the command I ran to execute the WordCount sample:

..\apache-maven-3.3.3\bin\mvn exec:java -pl examples -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount -Dexec.args="--project=ace-scarab-94723 --stagingLocation=gs://springml-bucket1/staging --runner=BlockingDataflowPipelineRunner --output=gs://springml-bucket1/output.txt"