Open Source Projects that Make Predictive Analytics Accessible

A transformation began in the tech industry when open source initiatives were first introduced in the late 1990s. These initiatives made programming languages and databases accessible to researchers, students and the broader community, who could now use open source technologies in their own solutions or contribute their own code. Today, there are hundreds of open source projects with thousands of contributors. Many of these initiatives, such as Google's TensorFlow, Caffe and H2O.ai, focus on machine learning and predictive analytics.

At SpringML, our goal is to build predictive analytics solutions for the ‘doers’ in companies who need data to make better business decisions. When we began building our product, which should be released any day now, we wanted our development team to focus on making something unique and incredibly useful for the customers we were targeting – executives, managers and individuals in sales, customer service, marketing, finance and other functions. We needed a language in which to write algorithms and build predictive models, as well as a platform that supports large-scale data processing. However, we didn’t want to reinvent the wheel, so looking for open source initiatives in these areas was a logical choice for us.

After analyzing the available open source technologies, we selected two – R and Apache Spark – as key elements of our predictive solutions.

R Language

R is a software environment for statistical computing and graphics, and we selected it because of the maturity of the language and its support for various predictive models. It is the language upon which we can build the specific use cases needed for particular organizations or personas. Many people in the tech community have a strong skill set in R, which also gives us access to a large talent pool from which we can hire our data scientists to write the predictive models that get built into our application.

Apache Spark

Apache Spark is the platform, or engine, at the foundation of our app. It has a massive developer community, and since 2009 more than 1,000 developers have contributed to the code. It is one of the most active initiatives in the big data space, able to handle large volumes and varied types of data and to process millions of records. We see it as the perfect fit for what we’re building. The ability to run multiple algorithms and interpret the data ensures that our customers will benefit from near real-time insight. Plus, Apache Spark is designed to process data in memory, so the analysis happens at run time. Nothing is copied or stored outside of the customer site, ensuring data privacy. Spark also has strong support for R, so our data scientists who know the language well can easily use it. If you want more details about SpringML’s support of Apache Spark, visit our technical page at https://springml.com/apache-spark/.

Data preparation – It’s all about data

Algorithms and machine learning alone do not deliver predictive analytics. The key piece in the overall solution is still data. Spending time getting the right level of data and performing data munging and wrangling remains an important prerequisite before running predictive models. Girish Reddy, our CTO, always says “garbage in, garbage out”: no predictive model performs well if the data quality is poor.
Open source packages for data preparation include dplyr (an R library) and Pandas (a Python package), among others. We use dplyr in our framework to prepare data for machine learning and advanced analytics.
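
To make the “garbage in, garbage out” point concrete, here is a minimal sketch of the kind of cleanup that typically happens before modeling. It is not our dplyr pipeline: it uses plain Python with no external packages so that it stays self-contained, and the records, the field names (account, region, revenue) and the helper functions are all hypothetical.

```python
from typing import Optional

# Hypothetical raw records; the field names are illustrative only.
RAW_ROWS = [
    {"account": " Acme ", "region": "west", "revenue": "1200.50"},
    {"account": "Acme", "region": "West", "revenue": "1200.50"},  # duplicate once cleaned
    {"account": "Globex", "region": "EAST", "revenue": ""},       # missing revenue
    {"account": "Initech", "region": "east", "revenue": "n/a"},   # unparseable revenue
    {"account": "Umbrella", "region": " east", "revenue": "830"},
]


def parse_revenue(text: str) -> Optional[float]:
    """Return the revenue as a float, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def clean_rows(rows):
    """Normalise text fields, drop rows without usable revenue, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        revenue = parse_revenue(row["revenue"])
        if revenue is None:
            continue  # a model cannot learn from a missing or garbled value
        record = (row["account"].strip(), row["region"].strip().lower(), revenue)
        if record in seen:
            continue  # skip rows that are identical after normalisation
        seen.add(record)
        cleaned.append(
            {"account": record[0], "region": record[1], "revenue": record[2]}
        )
    return cleaned


CLEAN_ROWS = clean_rows(RAW_ROWS)
```

Of the five raw rows, only two survive: one is a duplicate once whitespace and letter case are normalised, and two have no usable revenue figure. In a dplyr or Pandas pipeline, the same steps would roughly correspond to filtering, transforming and de-duplicating a data frame.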

Using open source technologies such as R and Apache Spark has enabled us to build a solid working model of our predictive analytics application in six months. Our solution, and in turn our customers, benefit from the speed, ease of use and sophisticated analytics of these technologies. Customers get security and privacy protection, which is critical in today’s business environment. We’re backed by the power and intelligence of the collective community that contributes to these technologies and keeps them current.