Apache Spark and Databricks

Databricks is a company founded by the creators of Apache Spark. As you can tell from our other blogs, we believe Apache Spark is a revolutionary big data technology, and springML provides expert consulting services for Spark implementations. We have been working on the Databricks platform and love the work they've done to simplify Spark implementations. Here are a few things we've tried out.

  • They make the creation of the Spark cluster simple. You simply sign up for one of their plans and link it to your AWS account.  Within a few minutes you have a Spark environment up and running.
  • They provide support for creating notebooks. Notebooks have become the de facto way for data scientists to do ad hoc analysis. Databricks supports notebooks in Python, Scala, R and SQL; Java, which Spark also supports, is not available as a notebook language.
  • Databricks also allows these notebooks to be scheduled, so they can run in a production environment without manual intervention.
  • Notebooks also integrate with GitHub, so code stored in a GitHub repository can be pulled into a notebook seamlessly.
  • Collaboration is possible not just via GitHub but also natively, by sharing notebooks with other users.