Friday, August 29, 2014

Bulk data transfer from MySQL to Amazon Redshift

Introduction:

This post demonstrates how to export bulk data from a source DB into an S3 bucket, and then load it from S3 into a Redshift DB.

We will achieve this using Talend Open Studio for Data Integration.

Getting started:

Purpose:

The purpose of this tutorial is to transfer bulk data from a source DB to a data warehouse DB via an S3 bucket, using Talend.

Step by step instructions:

 1. Pull data from our DB and save it to a CSV file:

The first step is to create a Talend job design to pull data from our DB. (If you are new to Talend, see how to create a basic job in Talend.)

Components used:

  • tMySqlInput - Reads data from a MySQL table and provides it as input to other components.
  • tMap - Maps the input columns to the output columns.
  • tFileOutputDelimited - Writes the data to the desired file format; in our example, a CSV file.
  • tS3Put - Uploads the file to the S3 bucket.
  • tRedshiftRow - Lets us write our own set of queries against Redshift.

 Get the data from our main DB (in our example, MySQL) using the tMySqlInput component. The tMap component maps the input data to the output schema, and the tFileOutputDelimited component takes its input from tMap. Provide the local path where the CSV file should be saved.

Since the output is a comma-separated file, it is recommended to enable the CSV option in tFileOutputDelimited, which wraps field values in quotes.
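For readers who prefer to see the step as code, here is a minimal Python sketch of what tMySqlInput plus tFileOutputDelimited accomplish, assuming the mysql-connector-python package; the table name, column list, and connection settings are hypothetical placeholders.

# Minimal Python equivalent of tMySqlInput + tFileOutputDelimited.
# Table, columns, and credentials below are placeholders for illustration.
import csv
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app_user", password="secret", database="app_db"
)
cursor = conn.cursor()
cursor.execute("SELECT id, name, created_at FROM orders")  # hypothetical table

with open("orders.csv", "w", newline="") as f:
    # quoting=csv.QUOTE_ALL mirrors Talend's CSV option, which quotes every field
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    for row in cursor:
        writer.writerow(row)

cursor.close()
conn.close()

Quoting every field here is what makes the removequotes keyword necessary later in the COPY command.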

 2. Push the CSV file to the S3 bucket:

In the tS3Put component, provide the access key and secret key used to access the S3 bucket, and point the component at the saved CSV file. This uploads the CSV file to the S3 bucket.
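The same upload can be scripted outside Talend; here is a rough Python sketch using boto3, with hypothetical bucket and key names.

# Minimal Python equivalent of tS3Put, using boto3 (placeholder names throughout).
import boto3

s3 = boto3.client(
    "s3",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
)
# Uploads the local CSV file to s3://my-bucket/exports/orders.csv
s3.upload_file("orders.csv", "my-bucket", "exports/orders.csv")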

3. Using the COPY command:

What is the COPY command in Amazon Redshift? Amazon Redshift provides a built-in COPY command to load data from S3 into a Redshift DB. To perform this action, we will use the tRedshiftRow component, which allows us to write our own set of queries.

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter ',' removequotes;

Do not forget to include the removequotes keyword in the COPY command: since the CSV option in tFileOutputDelimited wraps field values in quotes, removequotes is needed to strip them during the load.
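For reference, here is a minimal Python sketch of what tRedshiftRow effectively runs, using psycopg2 (Redshift speaks the PostgreSQL wire protocol); the cluster endpoint, credentials, and table name are hypothetical placeholders.

# Minimal Python equivalent of tRedshiftRow: run the COPY statement against
# Redshift over its PostgreSQL-compatible protocol (placeholder credentials).
import psycopg2

conn = psycopg2.connect(
    host="mycluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="secret",
)
copy_sql = """
    copy orders from 's3://my-bucket/exports/orders.csv'
    credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
    delimiter ',' removequotes;
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # COPY runs in a transaction; committed on success
conn.close()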

The main advantage of this method is that data loss can be avoided when transferring bulk data from a source DB to a data warehouse DB: the file staged in S3 persists, so a failed load can simply be re-run.

To apply the same approach to more than one table (parallel processing), see the follow-up blog post.

***** My first blog ends here *****
*****Comments and suggestions are appreciated*****
