Using Apache Spark for big data processing

Introduction to Spark

Over the last few years, Hadoop has emerged as the standard platform for big data processing. At a high level, Hadoop consists of distributed storage (HDFS) and distributed computing (MapReduce). However, MapReduce has a couple of limitations. Programming in MapReduce is difficult, and complex analytical operations, including machine learning with its iterative execution, require chaining multiple MapReduce jobs in sequence. In addition, the output of each MapReduce operation has to be serialized to disk, which incurs a high IO cost and slows down performance. A few different projects are underway to address these challenges, and Apache Spark is one that has risen to prominence in recent years.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. Unlike Hadoop's two-stage MapReduce operation described above, Spark uses an efficient in-memory architecture that can deliver performance gains of up to 100 times for certain applications. By allowing user programs to load data into a cluster's memory and query it repeatedly, Spark is well suited to machine learning algorithms.

Spark's primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). An RDD is a collection of data with the following properties:

  • Immutable – once an object is created, it cannot be changed.
  • Lazily evaluated – transformations not computed until needed.
  • Cacheable – because the data is immutable, it can safely be cached and reused.
  • Type inferred – element types are inferred (e.g. by the Scala compiler) rather than declared explicitly.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
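
For example, here is a minimal sketch in Scala of the difference between a transformation and an action, assuming a SparkContext named sc is already available (as it is in the spark-shell):

    // Create an RDD from a local collection.
    val numbers = sc.parallelize(1 to 10)

    // map is a transformation: it is recorded lazily and returns a new RDD;
    // nothing is computed yet.
    val squares = numbers.map(n => n * n)

    // reduce is an action: it triggers the computation and returns a value
    // to the driver program.
    val sumOfSquares = squares.reduce(_ + _)   // 385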

ETL in Spark. Source data usually has to be transformed, cleaned, joined, etc. before it is ready for querying. In Spark, ETL-type jobs can be written using plain RDD programming in Scala, Java or Python. We at SpringML have written a Spark package that connects to Salesforce Wave to push data. This package is available here, and the code is published on GitHub here.
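
As an illustration (not the SpringML package itself), here is a hedged sketch of a simple ETL-style RDD job in Scala; the file paths and the comma-separated id,name,amount layout are assumptions:

    // Read the raw file, drop blank lines and the header row, and keep
    // only well-formed records.
    val raw = sc.textFile("hdfs:///data/transactions.csv")

    val cleaned = raw
      .filter(line => line.nonEmpty && !line.startsWith("id,"))
      .map(_.split(","))
      .filter(_.length == 3)
      .map(fields => (fields(0), fields(2).toDouble))   // (id, amount)

    // Aggregate amounts per id and write the result back out.
    val totals = cleaned.reduceByKey(_ + _)
    totals.saveAsTextFile("hdfs:///output/transaction_totals")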

Hadoop vs Spark. Hadoop provides both distributed storage (HDFS) and distributed compute (MapReduce). Spark is compute only and needs to be combined with distributed storage; it is storage agnostic and can be used with S3, Cassandra or HDFS. In Spark, data can live either on disk or in memory, which makes Spark great at iterative work for machine learning, e.g. iterating 20 times before converging. Why? Hadoop fetches data from HDFS, goes through a map and reduce cycle, and must write the data back to HDFS before the next MapReduce job, so a lot of time is spent on IO. In Spark, data can be cached and kept in memory between iterations.
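
A minimal sketch of that difference in Scala, assuming sc is a SparkContext and the (hypothetical) input fits in cluster memory:

    // Parse once and cache; the first action materializes the RDD in memory.
    val values = sc.textFile("hdfs:///data/values.txt")
      .map(_.toDouble)
      .cache()

    // Every pass after the first is served from memory rather than HDFS,
    // which is what makes repeated iterations cheap.
    val count = values.count()
    val mean  = values.reduce(_ + _) / count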

Spark complements Hadoop well. For small to medium data (up to hundreds of gigabytes), Spark is a good fit. For truly large data (terabytes and petabytes), Hadoop is probably a better choice.

  • Coding in Spark is easier than in Hadoop, and programs are shorter.
  • Friendlier for data scientists and analysts – the interactive shell and notebooks allow for fast development cycles and ad hoc exploration.
  • Language support – Java, Scala, Python, R.
  • Great for small (gigabytes) and medium (hundreds of gigabytes) data sets.

Streaming support in the Hadoop ecosystem is provided by the Apache Storm project. In Spark, support for streaming is more native. Storm is low level and event based, while Spark is based on micro-batches and sliding windows. Both support streaming use cases such as exactly-once delivery and exactly-once computation.
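
A hedged sketch of Spark's micro-batch and sliding-window model in Scala, assuming sc is an existing SparkContext and that a text source is listening on localhost port 9999 (both assumptions, for illustration only):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // 5-second micro-batches.
    val ssc = new StreamingContext(sc, Seconds(5))

    // Count words over a 30-second window that slides every 5 seconds.
    val counts = ssc.socketTextStream("localhost", 9999)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(5))

    counts.print()
    ssc.start()
    ssc.awaitTermination()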

SQL Processing. If you have lots of Hive queries in your Hadoop environment, you can move them to Spark pretty easily. Spark supports HiveQL and can read Hive tables natively.
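
For example, a minimal sketch in Scala using HiveContext (Spark 1.x); the table name and columns below are hypothetical:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Run an existing HiveQL query directly against Hive tables.
    val topRegions = hiveContext.sql(
      "SELECT region, SUM(amount) AS total FROM sales " +
      "GROUP BY region ORDER BY total DESC LIMIT 10")

    topRegions.show()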

Machine learning. This is where Spark excels. Iterating until convergence is a key requirement for machine learning, and Spark does this well and fast because it can load data into memory and compute in memory. In addition, its support for languages like Python and R helps data scientists who are at home in those languages.
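
A hedged sketch of an iterative MLlib job (k-means clustering) in Scala; the file path and the space-separated numeric format are assumptions:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Parse the feature vectors and cache them, because k-means makes
    // many passes over the same data while iterating to convergence.
    val features = sc.textFile("hdfs:///data/features.txt")
      .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
      .cache()

    // Cluster into 5 groups, iterating at most 20 times.
    val model = KMeans.train(features, 5, 20)
    println("Within-set sum of squared errors: " + model.computeCost(features))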