Integration with Amazon Redshift Data Warehouse

Since its launch in 2013, Amazon Redshift has become one of the most popular cloud data warehouse products. Using massively parallel processing, Redshift can analyze large-scale datasets (up to petabytes) with your existing business intelligence tools.

One of the questions our clients frequently ask is how they can use Redshift data for machine learning and for integration with Salesforce Analytics. In this blog we detail the different options available for integrating with Redshift.

ETL Tool:

The traditional and most proven way to integrate with any data warehouse is to use an ETL tool.

The ETL approach is recommended when you need:

1) Data translation/transformation rules

2) Frequent data integration

Almost all of the ETL platforms in use today support a Redshift connector over JDBC. It is very easy to set up, and the integration can be up and running in short order.
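
As a quick illustration of what sits underneath these connectors, here is a minimal sketch of a plain JDBC connection to Redshift from Scala. The cluster endpoint, database name, and credentials below are placeholders, and the Amazon Redshift JDBC driver jar is assumed to be on the classpath:

import java.sql.DriverManager

// Placeholder cluster endpoint, database, and credentials -- replace with your own.
val url = "jdbc:redshift://examplecluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dev"

// Requires the Amazon Redshift JDBC driver jar on the classpath.
val conn = DriverManager.getConnection(url, "masteruser", "password")
try {
  val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM babynames")
  while (rs.next()) println("row count: " + rs.getLong(1))
} finally {
  conn.close()
}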

The tool we are considering in this blog is Talend.

Talend is open-source data integration software for the enterprise. The following is a quick overview of Redshift integration using Talend.

1. Set up the Redshift connection.

2. Set up the Wave connection.

3. Test and run the data integration.

Apache Spark:

Next up is the Apache Spark connector. By now, any data fanatic has surely heard of, or better yet used, Spark. IBM Bluemix, Microsoft Azure, Amazon, and Google all provide managed Spark services, and the major Hadoop distributors are planning to replace MapReduce with Spark.

It didn’t take us long to fall in love with Spark. Its architecture, ease of use, API, and adoption make it the de facto standard for data processing for the next few years.

The first step is to install the Redshift connector.

Once the connector is installed, we can quickly explore Redshift data in Databricks Cloud or Zeppelin.
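
For example, assuming the spark-redshift package has been added to the cluster (the package version, JDBC URL, table name, and S3 temp directory below are placeholders), a Redshift table can be loaded into a DataFrame like this:

// Assumed package coordinates; pick the version matching your Spark/Scala build, e.g.:
// spark-shell --packages com.databricks:spark-redshift_2.10:1.1.0

val df = sqlContext.read.
  format("com.databricks.spark.redshift").
  option("url", "jdbc:redshift://examplecluster:5439/dev?user=masteruser&password=password").
  option("dbtable", "babynames").
  option("tempdir", "s3n://my-bucket/tmp").
  load()

df.printSchema()
println(df.count())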

The JDBC connection is easier to set up, but funneling results through a single master node over JDBC can become a bottleneck when executing queries that return large datasets.

For large datasets, another option to consider is exporting the data to S3.

As mentioned earlier, one of Redshift's major advantages is its parallel processing capability. Exporting to S3 takes advantage of it, since the cluster writes the data to S3 in parallel.

The following commands can be executed against the Redshift cluster:

  1. Load (COPY) the comma-separated data from S3 into Redshift.
  2. UPDATE the data to contain a pipe character that needs escaping.
  3. Dump (UNLOAD … ESCAPE) the pipe-delimited data back into S3, preserving the pipe character that needed escaping.

Here’s an example:

COPY babynames FROM 's3://$s3_bucket/$load_from_dir'
CREDENTIALS 'aws_access_key_id=$aws_key_id;aws_secret_access_key=$aws_key_secret'
DELIMITER ',';

UNLOAD ('select * from babynames')
TO 's3://$s3_bucket/$unload_to_dir'
CREDENTIALS 'aws_access_key_id=$aws_key_id;aws_secret_access_key=$aws_key_secret'
ESCAPE ALLOWOVERWRITE;
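
The pipe-delimited files that UNLOAD writes to S3 can then be read back into a Spark DataFrame in parallel. As a rough sketch, assuming the spark-csv package and a placeholder bucket path:

// Assumes the spark-csv package (e.g. com.databricks:spark-csv_2.10:1.5.0) is available.
val df2 = sqlContext.read.
  format("com.databricks.spark.csv").
  option("delimiter", "|").
  option("header", "false").
  load("s3n://my-bucket/unload_to_dir/")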

Now we can write the data into Wave.

The resulting DataFrame can easily be sent to Wave using SpringML's Wave connector.

df2.write.
  format("com.springml.spark.salesforce").
  option("username", "*******@springml.com").
  option("password", "***************").
  option("datasetName", "redshiftdata").
  save()

For a detailed demo, please contact info@springml.com