Cloud Dataflow Developer Experience

We got the opportunity to collaborate with Google Cloud Platform this year, soon after the relationship between Salesforce and Google Cloud Platform was announced.  We worked with product managers from both companies to outline a use case involving Salesforce CRM, Wave, and Google Cloud Dataflow.  Here’s a more detailed description of the use case.

In this post we would like to talk about our experience developing a source and a sink for the Salesforce CRM and Wave platforms using Cloud Dataflow.  This gets a bit geeky, so apologies in advance to any non-tech readers.

Google Cloud Dataflow is a simple, flexible, and powerful system you can use to perform data processing tasks of any size.  It consists of two components – a set of SDKs to define processing jobs and a managed cloud service.  We got started by reading the documentation and working through the word count example.  This helped us understand the framework and become familiar with Cloud Dataflow nomenclature, including PTransforms, DoFns, Pipelines, user-defined sources and sinks, etc.  All of these concepts are very well documented.  In addition, the Stack Overflow forum is very active, with Google developers themselves answering a lot of the questions.
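
To make that nomenclature concrete, here is a minimal word-count-style pipeline in the spirit of the SDK example we started from.  This is only a sketch – the GCS paths are placeholders, and the structure follows the published example rather than our production pipeline.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalWordCount {
  public static void main(String[] args) {
    // Pipeline options (runner, project, staging location, etc.) come from the command line.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*"))   // read lines of text
     .apply(ParDo.of(new DoFn<String, String>() {                      // a DoFn splits lines into words
       @Override
       public void processElement(ProcessContext c) {
         for (String word : c.element().split("[^a-zA-Z']+")) {
           if (!word.isEmpty()) {
             c.output(word);
           }
         }
       }
     }))
     .apply(Count.<String>perElement())                                // a composite PTransform
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {            // format the results
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://your-bucket/wordcounts"));           // placeholder output path

    p.run();
  }
}
```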

Here are some things we learned when developing our initial solution.

  1. All Source and Sink classes must be immutable: there should be no setters, all fields must be final, and fields of mutable types (e.g. collections) must be effectively immutable. We changed “setter-like” methods to return a new object with all fields copied except the field being updated (a sketch of this pattern follows the list).
  2. When we were getting started, we were curious about the number of workers used on any given job, but couldn’t find an easy way to determine it.  In the Developers Console, navigating to Compute > Compute Engine > VM instances shows CPU, network, and disk utilization, but not the total number of VMs used.
  3. Cloud Dataflow has a great console that allows developers and administrators to monitor job executions.  Here’s a screenshot of the execution flow of a particular job; clicking on each of the boxes takes you to a more detailed view.  This is a great way to get the status of the overall job as well as of each individual step.
  4. When implementing the user-defined source for Salesforce, we ran into a couple of issues around how to parallelize the Salesforce read.  The read is essentially a SQL query, and parallelizing reads from a single query is in general a non-trivial task.  So we discussed an approach that reads the query results sequentially but parallelizes the subsequent processing: each Reader processes a bundle, with no additional queries in the reader’s start/advance methods since the data is passed to the reader.  Another approach was to just write a PTransform rather than a user-defined source.  For cases where the source can’t parallelize reading from itself, there seems to be no benefit to using the Source API over just using a PTransform (see the second sketch after the list).
  5. We faced similar issues when writing the user-defined sink that pushes data to Wave.  The Wave API limits the amount of data that can be written in a single call (10 MB) and the total number of writes that can happen in parallel.  So we needed a way to control the total number of writes by splitting the dataset into multiple parts to avoid API errors; in addition, each part had to be numbered sequentially from 1 to N.  The transform job computes the number of parts based on the dataset size – e.g. for a 200 MB dataset, set the number of parts to, say, 30 so that each part stays under 10 MB.  A GroupBy then takes the number of parts from the previous step as a side input and groups the dataset into 30 KV pairs, assigning each key by hash modulo to keep the buckets roughly even (see the third sketch after the list).
  6. One thing that’s available on other platforms (e.g. Spark) is the ability to write transformations in SQL-like languages such as HiveQL.  That greatly reduces the amount of code to write and could be a faster way to implement a pipeline, given the relative ease of writing HiveQL queries.
  7. Given the complex nature of pipelines that deal with large amounts of data, testing becomes an important part of the overall code.  Google Cloud Dataflow has very good support for writing unit tests, and there is also good documentation in this regard (see the last sketch after the list).
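
Here are a few sketches that illustrate points 1, 4, 5, and 7 above.  First, the “copy instead of set” pattern from point 1.  The class and field names below are hypothetical – our real source has more configuration – but the shape is the same: final fields, no setters, and “with” methods that return a new instance.

```java
// Hypothetical illustration of point 1: no setters, final fields, and
// "setter-like" methods that return a new instance instead of mutating this one.
public class SalesforceSourceConfig implements java.io.Serializable {
  private final String query;
  private final int batchSize;

  private SalesforceSourceConfig(String query, int batchSize) {
    this.query = query;
    this.batchSize = batchSize;
  }

  public static SalesforceSourceConfig withQuery(String query) {
    return new SalesforceSourceConfig(query, 1000);
  }

  // Copies every field except the one being "updated".
  public SalesforceSourceConfig withBatchSize(int batchSize) {
    return new SalesforceSourceConfig(this.query, batchSize);
  }

  public String getQuery() { return query; }
  public int getBatchSize() { return batchSize; }
}
```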
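
Next, a rough sketch of the PTransform alternative from point 4: a single seed element triggers the sequential query, and the emitted records are then processed in parallel by downstream transforms.  `runQuery` is a placeholder for the actual Salesforce API call, not a real helper in the SDK.

```java
import java.util.Collections;
import java.util.List;

import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.PTransform;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PBegin;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class SalesforceRead extends PTransform<PBegin, PCollection<String>> {
  private final String query;

  public SalesforceRead(String query) {
    this.query = query;
  }

  @Override
  public PCollection<String> apply(PBegin input) {
    return input
        .apply(Create.of(query))                       // a single seed element carrying the query
        .apply(ParDo.of(new DoFn<String, String>() {
          @Override
          public void processElement(ProcessContext c) {
            // The read itself is sequential; downstream transforms parallelize the processing.
            for (String record : runQuery(c.element())) {
              c.output(record);
            }
          }
        }));
  }

  // Placeholder for the actual Salesforce query call.
  private static List<String> runQuery(String query) {
    return Collections.emptyList();
  }
}
```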
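
The partitioning step from point 5 looked roughly like the following.  For simplicity the number of parts is a plain parameter here; in the actual pipeline it was computed from the dataset size in a previous step and passed in as a side input.

```java
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class PartitionForWave {
  // Assigns each serialized row a key in [0, numParts) via hash modulo so the
  // buckets come out roughly even, then groups rows by that key so each key
  // becomes one part small enough for a single Wave API call.
  public static PCollection<KV<Integer, Iterable<String>>> partition(
      PCollection<String> rows, final int numParts) {
    return rows
        .apply(ParDo.of(new DoFn<String, KV<Integer, String>>() {
          @Override
          public void processElement(ProcessContext c) {
            int key = (c.element().hashCode() & Integer.MAX_VALUE) % numParts;
            c.output(KV.of(key, c.element()));
          }
        }))
        .apply(GroupByKey.<Integer, String>create());
  }
}
```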
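
Finally, for point 7, `TestPipeline` and `DataflowAssert` let you run a transform on in-memory data created with `Create.of(...)` and assert on the output.  The trivial DoFn below exists only to show the test pattern.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.testing.DataflowAssert;
import com.google.cloud.dataflow.sdk.testing.TestPipeline;
import com.google.cloud.dataflow.sdk.transforms.Create;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;
import org.junit.Test;

public class UppercaseFnTest {
  // A trivial DoFn used only to show the test pattern.
  static class UppercaseFn extends DoFn<String, String> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(c.element().toUpperCase());
    }
  }

  @Test
  public void uppercasesEveryElement() {
    Pipeline p = TestPipeline.create();

    PCollection<String> output = p
        .apply(Create.of("account", "contact", "opportunity"))
        .apply(ParDo.of(new UppercaseFn()));

    DataflowAssert.that(output).containsInAnyOrder("ACCOUNT", "CONTACT", "OPPORTUNITY");
    p.run();  // the assertion is checked when the pipeline runs
  }
}
```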