SparkR and machine learning

Iteration to convergence is a key requirement for machine learning, and Spark handles it well because it can load data into memory and keep it there across iterations of a computation. In addition, its support for languages like Python and R helps data scientists who are at home with those languages.

SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. SparkR provides a distributed dataframe implementation that supports operations like selection, filtering, and aggregation, with a syntax similar to what the dplyr package offers for R dataframes.
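To make this concrete, here is a minimal sketch of those DataFrame operations, assuming a local Spark session and using R's built-in faithful data set for illustration:

```r
library(SparkR)
sparkR.session()

# Convert a local R data frame to a distributed SparkR DataFrame
df <- as.DataFrame(faithful)

# Selection and filtering, similar in spirit to dplyr verbs
waiting <- select(filter(df, df$waiting > 50), df$eruptions, df$waiting)

# Aggregation: count of observations per waiting time
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(counts)

sparkR.session.stop()
```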

There are several reasons why R is popular with data scientists. It provides tools like dataframes for manipulating data, strong support for data visualization, and many packages for statistics and machine learning. However, R has limitations: it is restricted to a single thread and to the memory available on one machine.

Here are some of the predominant patterns for combining R with big data.

  • Big data is distilled to a small problem. A large amount of data resides in HDFS or Hive tables; it is filtered and aggregated into a new, much smaller data set that fits on a single machine. This is a very common use case, and traditional R machine learning algorithms can then be applied (see the first sketch after this list).
  • The data fits on a single machine, but you need to run several different algorithms on it and then combine the results, e.g. for ensemble methods such as boosting. In this case you may want to run all the algorithms in parallel, and the SparkR architecture supports such parallelization (see the spark.lapply sketch below).
  • You truly have large-scale data to which machine learning algorithms must be applied. Here Spark provides the option of using the MLlib library, which SparkR exposes directly (see the last sketch below).
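For the first pattern, a minimal sketch, assuming the aggregated result is small enough to collect: filter and aggregate on the cluster with SparkR, collect the result into a local R data frame, and fit an ordinary R model on it. The table name flights and its column names are hypothetical.

```r
library(SparkR)
sparkR.session()

# Hypothetical Hive table; filter and aggregate on the cluster
flights <- sql("SELECT * FROM flights")
delayed <- filter(flights, flights$dep_delay > 0)
by_day  <- summarize(groupBy(delayed, delayed$day_of_week),
                     avg_delay = avg(delayed$dep_delay),
                     n = n(delayed$dep_delay))

# The distilled result is small; pull it into a local R data frame
local_df <- collect(by_day)

# Apply a traditional R algorithm to the small data set
fit <- lm(avg_delay ~ day_of_week, data = local_df)
summary(fit)
```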
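For the second pattern, SparkR's spark.lapply runs independent R computations on the cluster, which is one way to train several models in parallel. A sketch, assuming a small data set (here R's built-in mtcars) is available on every worker:

```r
library(SparkR)
sparkR.session()

# Train several different model formulas in parallel; each worker
# runs plain R on its own copy of the small data set
formulas <- list(mpg ~ wt, mpg ~ wt + hp, mpg ~ wt + hp + disp)

models <- spark.lapply(formulas, function(f) {
  lm(f, data = mtcars)   # ordinary R fit on each worker
})

# Combine the results, e.g. compare R-squared values
sapply(models, function(m) summary(m)$r.squared)
```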
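For the third pattern, SparkR exposes several MLlib algorithms directly; for example, spark.glm fits a generalized linear model on a distributed DataFrame. A minimal sketch, again using the faithful data set only for illustration:

```r
library(SparkR)
sparkR.session()

df <- as.DataFrame(faithful)

# Fit a Gaussian GLM with MLlib; training runs on the cluster
model <- spark.glm(df, waiting ~ eruptions, family = "gaussian")
summary(model)

# Distributed prediction returns another SparkDataFrame
preds <- predict(model, df)
head(select(preds, "waiting", "prediction"))
```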

SparkR handles all three of these use cases well.