3 Open Source Tools for Your Next Data Science Project in 2020

The data science landscape changes constantly, and it can be difficult to keep up. Luckily, new open source tools are making data exploration and presentation easier.

The 3 tools that we are going to talk about today are:

  1. Pandas Profiling
  2. hvPlot
  3. RISE 

To demonstrate these tools, we will first pull data from BigQuery into a Jupyter Notebook.

[Image 1: code screenshot]

The code above imports all the packages we need and, more importantly, sets the application credentials that grant access to GCP.
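For readers following along without the screenshot, here is a minimal sketch of what the credentials step might look like. It assumes you authenticate with a service account key file. The `use_service_account` helper and the key path are hypothetical names used for illustration; `GOOGLE_APPLICATION_CREDENTIALS` is the environment variable that Google's client libraries read.

```python
import os
from pathlib import Path


def use_service_account(key_path):
    """Point Google client libraries at a service account key file.

    Hypothetical helper: it only sets the environment variable and does
    not check that the file exists or that the key is valid.
    """
    path = Path(key_path).expanduser()
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return path


# Placeholder path (an assumption); replace it with your own key file.
key_file = use_service_account("/path/to/service-account-key.json")
```

Any BigQuery client created later in the same notebook session should pick up these credentials automatically.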

[Image 2]


From Google’s BigQuery public datasets, we chose the Google political ads dataset and looked at ad spending for Trump in 2019. We thought this would be an interesting dataset given the upcoming presidential election.

For detailed instructions on how to pull BigQuery data into a Jupyter Notebook, check out the blog post Connect your Jupyter Notebook to BigQuery. That said, feel free to use any dataset.
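As a rough sketch of the query step, the snippet below shows one way to load the data into a pandas DataFrame with the `google-cloud-bigquery` client. The table name `advertiser_weekly_spend`, the column names, and the `%TRUMP%` name filter are our assumptions about the public dataset, not the exact query used for this post; `load_ad_spend` is a hypothetical helper.

```python
# Parameterized SQL. The table and column names are assumptions about
# the public Google political ads dataset.
QUERY = """
SELECT advertiser_name, week_start_date, spend_usd, spend_eur
FROM `bigquery-public-data.google_political_ads.advertiser_weekly_spend`
WHERE UPPER(advertiser_name) LIKE @pattern
  AND EXTRACT(YEAR FROM week_start_date) = @year
"""


def load_ad_spend(pattern="%TRUMP%", year=2019):
    """Run QUERY against BigQuery and return the result as a DataFrame."""
    # Imported here so the sketch can be read and loaded without the
    # google-cloud-bigquery package installed.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses the application credentials set earlier
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("pattern", "STRING", pattern),
            bigquery.ScalarQueryParameter("year", "INT64", year),
        ]
    )
    return client.query(QUERY, job_config=job_config).to_dataframe()
```

Calling `df = load_ad_spend()` in a notebook cell would then give you a DataFrame to use in the rest of the walkthrough. Query parameters keep the filter values out of the SQL string itself.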

Now that we have imported the data into a pandas DataFrame, we can use Pandas Profiling to explore and quickly understand the dataset without writing much code.

Pandas Profiling

With just one line of code, we were able to interactively explore the dataset by studying the variables, missing values, correlations, and a sample of the dataset. Getting the same result would normally take several commands in pandas and matplotlib. You can also save the results as an HTML file and view them offline in your browser.

You can do so by using this line of code:

[Image: line of code]
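In case the image is not available, here is a minimal sketch of generating the report and saving it as HTML. It assumes the `pandas-profiling` package as it existed in 2020 (the project has since been renamed `ydata-profiling`); `profile_to_html`, the output file name, and the report title are hypothetical choices for illustration.

```python
def profile_to_html(df, output_path="profile_report.html", title="Ad spend profile"):
    """Build a profiling report for df and save it as a standalone HTML file."""
    # Imported here so the sketch loads even without pandas-profiling installed.
    from pandas_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    report.to_file(output_path)  # writes the HTML report to disk
    return output_path
```

In a notebook you would call `profile_to_html(df)` with the DataFrame loaded from BigQuery, then open the resulting file in your browser.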

hvPlot

With hvPlot, we easily created box plots and violin plots of spending in US dollars and euros, grouped for “Trump’s Make America Great Again Committee” and “Donald Trump For President”.

hvPlot gives us richer, more interactive, and more appealing visualizations through a high-level wrapper. The nice part is that the command structure is not that different from how you would typically plot with pandas. The toolbar on the right is a handy addition for toggling between advertiser names.
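To give a feel for that command structure, here is a minimal sketch of the two plots. It assumes a DataFrame with `spend_usd`, `spend_eur`, and `advertiser_name` columns (our guess at the column names, not confirmed by the screenshots); `spend_box_and_violin` is a hypothetical helper.

```python
def spend_box_and_violin(df):
    """Return a box plot and a violin plot of USD and EUR spending.

    Assumes df has spend_usd, spend_eur, and advertiser_name columns.
    """
    # Importing hvplot.pandas registers the .hvplot accessor on DataFrames.
    import hvplot.pandas  # noqa: F401

    columns = ["spend_usd", "spend_eur"]
    # groupby adds a selector widget for switching between advertisers.
    box = df.hvplot.box(y=columns, groupby="advertiser_name")
    violin = df.hvplot.violin(y=columns, groupby="advertiser_name")
    return box, violin
```

Displaying either returned object in a notebook cell renders the interactive plot, much as `df.plot.box()` would render a static one in plain pandas.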

RISE

RISE is a great tool for turning your Jupyter Notebook into a presentation without having to copy your plots and printouts into a slide deck. Check out this YouTube video about it here.

Here is a presentation using RISE:

[Image: a Jupyter Notebook presented with RISE]

It is also nice that you can run code and interact with plots within the presentation. This adds a dimension to presenting your data science project that a slide deck cannot offer.
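RISE decides how each cell appears from the `slideshow` entry in that cell's metadata. You would normally set it through the notebook's Slideshow cell toolbar, but as a rough illustration the sketch below sets it programmatically on a tiny stand-in notebook. The `tag_slides` helper and the sample cells are hypothetical.

```python
import json

# Slide types that can appear in a cell's "slideshow" metadata.
VALID_SLIDE_TYPES = {"slide", "subslide", "fragment", "skip", "notes", "-"}


def tag_slides(notebook, slide_types):
    """Set slideshow metadata on the leading cells of a notebook dict.

    Cells beyond the length of slide_types are left unchanged.
    """
    for cell, slide_type in zip(notebook["cells"], slide_types):
        if slide_type not in VALID_SLIDE_TYPES:
            raise ValueError(f"unknown slide type: {slide_type!r}")
        cell.setdefault("metadata", {})["slideshow"] = {"slide_type": slide_type}
    return notebook


# A tiny stand-in for a parsed .ipynb file (normally loaded with json.load).
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Ad spend"]},
        {"cell_type": "code", "metadata": {}, "source": ["df.head()"]},
    ]
}
tag_slides(notebook, ["slide", "fragment"])
print(json.dumps(notebook["cells"][0]["metadata"]))
```

Here the first cell starts a new slide and the second is revealed as a fragment on that slide.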

You can find the entire notebook code here.


Summary


These new tools make it easier for data lovers to explore, visualize, and present data. That way, you can focus on analyzing, interpreting, and building models from the data instead.

If you have any questions or examples you’d like to share, get in touch. We’d love to hear which tools you find useful. Drop us a line at info@SpringML.com or tweet us @springmlinc.

About SpringML

SpringML is a premier Google Cloud partner that offers full service consulting, implementation and managed services. Our specializations include Marketing Analytics, Contact Center AI, Data Warehouse migrations, SAP workflow automation, and intelligent document processing.