BigQuery Tips & Techniques from the Trenches

In a previous post, we covered BQML, which showcased built-in machine learning in BigQuery. In this post, we would like to share some BigQuery techniques we learned on one of our most recent projects. The client is an emerging company in the retail space with over two years' worth of customer data from users interacting with their browser extension for coupons.

The goals of the project were to:

  • Apply a Customer Lifetime Value (CLV) model to project which customers would spend more, in dollar terms, based on how they interacted with the platform.
  • Create customer segments that isolate the customers who were most engaged at different stages of the sales process. Our client wanted to identify those highly engaged users for targeted ads and sales outreach, and weed out the looky-loos.

Here are a few things we would like to share with you for your next BigQuery project.

K-Means in BQML

We experimented with K-means in BigQuery and were able to segment the data into three clusters:

Cluster #1:  10.5   6169.7   6091.4
Cluster #2:   8.9    141.0    136.4
Cluster #3:  48.2    142.5    137.8
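
Creating a model like this in BQML is just a SQL statement. Here is a minimal sketch, assuming placeholder dataset, table, and feature names rather than the project's actual schema:

```sql
-- Minimal K-means sketch in BQML (all names are placeholders).
CREATE OR REPLACE MODEL `my_project.my_dataset.customer_segments`
OPTIONS (
  model_type = 'kmeans',
  num_clusters = 3
) AS
SELECT
  sessions_per_month,
  total_spend,
  avg_order_value
FROM `my_project.my_dataset.customer_features`;

-- Inspect the resulting cluster centroids.
SELECT * FROM ML.CENTROIDS(MODEL `my_project.my_dataset.customer_segments`);
```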

Although we chose a different approach in the end, it is still worth mentioning in case you need it for your next project.

Handling Partitioned Tables

We worked with partitioned tables, where much larger tables are divided into manageable parts. To query the data efficiently, we used a WHERE clause that filters on the partition timestamp.

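A sketch of such a filter, assuming an ingestion-time partitioned table with placeholder names, looks like this:

```sql
-- The _PARTITIONTIME pseudo column limits the scan to the relevant partitions.
SELECT
  user_id,
  event_timestamp,
  order_total
FROM `my_project.my_dataset.events`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2019-01-01') AND TIMESTAMP('2019-06-30');
```

For tables partitioned on a DATE or TIMESTAMP column, the same idea applies with that column in place of _PARTITIONTIME.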

RAND() Function

When building a predictive model, we used the RAND() function to draw a random sample of customers that still represented the population. Using all of the customer data was not feasible, so sampling this way helped us build an accurate model without using the entire dataset.
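
For illustration, a random sample can be drawn with something like the following (the fraction and table name are placeholders):

```sql
-- Keep roughly a 10% random sample of customers.
SELECT *
FROM `my_project.my_dataset.customers`
WHERE RAND() < 0.1;
```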

Feature Engineering with BigQuery Queries

We used BigQuery to do feature engineering. Here are some of the ways we did it:

  • We extracted information from timestamps with the EXTRACT function to pull out things like hour of the day, day of the week, and month. This helped us when we broke down spending over several months to determine the CLV of a user. It also helped when we compared customer usage in the a.m. vs. p.m., as well as weekday vs. weekend.
  • We used CASE to binarize data for modeling, wrapping it in an aggregation so it works with the GROUP BY statement. We also pulled out useful lead sources using LIKE, which helps find the lead source in the hyperlink of the campaign. Both techniques appear in the sketch after this list.

  • Some of the fields are nested data, which required us to use the UNNEST function to break the dataset apart (see the UNNEST sketch after this list).
  • The IFNULL function filled in missing values in the data.
  • We used BigQuery's aggregate functions such as AVG(), SUM(), MIN(), and MAX() to compute some simple statistics for us.
  • We applied the LIMIT clause to cap the number of rows once we had a sizable sample for building a model.
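
To make the list above concrete, here is a rough sketch that combines EXTRACT, CASE with a GROUP BY aggregation, LIKE, IFNULL, the aggregate functions, and LIMIT in one query; every table and column name is a placeholder, not the client's actual schema:

```sql
SELECT
  user_id,
  -- EXTRACT pulls the hour, day of week, and month out of the timestamp.
  EXTRACT(HOUR FROM event_timestamp)      AS event_hour,
  EXTRACT(DAYOFWEEK FROM event_timestamp) AS event_dow,
  EXTRACT(MONTH FROM event_timestamp)     AS event_month,
  -- CASE binarizes a field, wrapped in an aggregate so it works with GROUP BY.
  MAX(CASE WHEN coupon_applied = 'yes' THEN 1 ELSE 0 END) AS used_coupon,
  -- LIKE flags a lead source embedded in the campaign hyperlink.
  MAX(CASE WHEN campaign_url LIKE '%utm_source=email%' THEN 1 ELSE 0 END) AS email_lead,
  -- IFNULL fills missing values; AVG/SUM/MIN/MAX give simple statistics.
  SUM(IFNULL(order_total, 0)) AS total_spend,
  AVG(IFNULL(order_total, 0)) AS avg_order_value,
  MIN(event_timestamp)        AS first_seen,
  MAX(event_timestamp)        AS last_seen
FROM `my_project.my_dataset.events`
GROUP BY user_id, event_hour, event_dow, event_month
-- LIMIT caps the result once the sample is large enough for modeling.
LIMIT 100000;
```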
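
For the nested fields, a minimal UNNEST sketch (again with a placeholder schema, where items is an ARRAY of STRUCTs) looks like this:

```sql
SELECT
  order_id,
  item.product_name,
  item.price
FROM `my_project.my_dataset.orders`,
     UNNEST(items) AS item;
```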

Connecting to Datalab

We used Datalab, a Jupyter Notebook-like environment that connects directly to a Compute Engine instance in GCP without the hassle of authenticating the environment. We can run queries in the notebook, transform the results into a Pandas DataFrame, and then use scikit-learn packages to run machine learning.

  • To start, we import the packages that we need.
  • Next, we write the query as a string.
  • After the query is written, it is submitted as a query job.
  • Lastly, the query job results are converted into a Pandas DataFrame.
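
Here is a minimal sketch of that workflow, assuming the google.cloud.bigquery client library; the project, dataset, and table names are placeholders:

```python
# Minimal sketch of the Datalab workflow described above (placeholder names).
from google.cloud import bigquery

# The client picks up credentials from the environment, which is what makes
# Datalab convenient: no extra authentication step is needed in the notebook.
client = bigquery.Client()

# Write the query as a string.
query = """
    SELECT user_id, order_total, event_timestamp
    FROM `my_project.my_dataset.transactions`
    LIMIT 1000
"""

# Submit the query as a query job.
query_job = client.query(query)

# Convert the results into a Pandas DataFrame for use with scikit-learn.
df = query_job.to_dataframe()
print(df.head())
```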

GCloud SDK

With the GCloud SDK, it was easy to turn the virtual machine on and off and to access Datalab from my local machine with datalab connect datalab_vm_name --no-user-checking.
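
As a rough sketch, the day-to-day commands looked something like this (the instance name and zone are placeholders):

```
# Start the Datalab VM.
gcloud compute instances start datalab_vm_name --zone us-central1-a

# Connect to the Datalab instance from the local machine.
datalab connect datalab_vm_name --no-user-checking

# Stop the VM when finished to avoid idle charges.
gcloud compute instances stop datalab_vm_name --zone us-central1-a
```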


Summary

During the project, we were able to:

  • leverage the power of BigQuery to do feature engineering and statistics for predictive modeling
  • run BigQuery in Datalab to use scikit-learn tools for machine learning
  • access Datalab locally with the GCloud SDK

Want More Information?

If you are looking for additional resources, check out a few we found to be useful:

If you are interested in how to use BigQuery to build a Data Lake, download the SpringML whitepaper, Building a Data Lake on BigQuery. This whitepaper provides guidelines and best practices for customers looking to leverage BigQuery to build a Data Lake.

If you still have questions drop us a line at info@SpringML.com or tweet us @springmlinc.