ETL Best Practices

ETL is the process of extracting data from source systems and bringing it into a data warehouse like BigQuery. ETL stands for extract, transform, and load. When moving data from on-premise systems to BigQuery or from other cloud-based applications there are some key things to keep in mind.

  1. Keep in mind that you are typically dealing with large amounts of data (millions of rows, gigabytes or terabytes of data).  So where the transformation and execution happens plays a key role. BigQuery API offers client libraries so that scripts can be written in your favorite programming language (Python, Java, and others).  This provides the flexibility to extract data from BigQuery, apply transformations and push to a target (either file or another table) with most of the processing done as a BigQuery job.
  2. If multiple input source systems are involved and there are complex joins or transformations that cannot be handled via a BigQuery client then data will need to be processed outside of the data sources.  In such cases a Batch Dataflow job can be used – this allows flexibility to handle large data loads by leveraging Dataflow’s inbuilt scalability. Depending on the complexity this can be broken down into two steps – the first step brings data from disparate sources and places them in GCS buckets.  The second part can be offloaded to a simple script that batch loads this data into BigQuery.
  3. Consider DataPrep for visualizing and cleansing data.  DataPrep is a tool that is useful for visually exploring data and then applying transformations to correct any issues.  DataPrep internally can spawn Dataflow jobs so is capable of handling large amounts of data. DataPrep allows you to clean the data (e.g. deduplication) so that you have better data to work within your data warehouse.
  4. Split large tables into many smaller partitions, where each partition contains a day’s worth of data. BigQuery automatically loads data into a date-based partition of the table. You can then reduce the number of bytes processed by restricting your queries to specific partitions in the table.
  5. Have a clear plan to validate data being produced by ETL jobs.  Sometimes a separate job that operates on a smaller subset of data that can easily be verified may help.

A common requirement of ETL jobs is to schedule and monitor them.  There are a couple of ways to approach this. One is to schedule using Google App Engine cron as described here and here. The cron job will invoke a Python GAE application URL. On invoking this URL, Python app will start the retention job in BigQuery. Scheduling frequency is configured in cron.yaml.  The second option is to use Airflow.  Airflow offers a way to schedule and monitor several ETL jobs and also can manage dependencies within them.