Friday, August 29, 2014

Delete duplicate data from a Redshift table.

Purpose:

I got tired of searching for a way to eliminate duplicate data in Redshift tables.

So here is a simple query that removes duplicate records in Redshift.
 

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER ( PARTITION BY column1, column2, column3, ... ORDER BY sortcolumn ASC ) AS rownum
    FROM tablename
) t
WHERE rownum = 1;

The ORDER BY clause plays a major role in this query. Usually there is a single column that decides which of the duplicate records you want to keep; use that column in the ORDER BY clause so that the row you want to keep gets rownum = 1.
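For example, if the duplicate rows differ only in a last-modified timestamp and you want to keep the most recent one, order the window by that column in descending order. A minimal sketch, assuming a hypothetical updated_at column:

SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER ( PARTITION BY column1, column2, column3 ORDER BY updated_at DESC ) AS rownum
    FROM tablename
) t
WHERE rownum = 1;   -- keeps the row with the latest updated_at in each group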

How does this query avoid duplicates?

Follow the approach below to remove the duplicate records.

Consider the table "Employee" below, which has two duplicate records (EmpId 2 and EmpId 4 each appear twice):



EmpId | EmpName | Salary
------+---------+-------
1     | AAAA    | 20000
2     | BBBB    | 30000
2     | BBBB    | 30000
3     | CCCC    | 25000
4     | DDDD    | 20000
4     | DDDD    | 20000
 

Now create a temporary table with the same schema, as sketched below, and insert the de-duplicated data into it using the INSERT query that follows.
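Redshift's CREATE TABLE ... (LIKE ...) syntax copies the schema of an existing table; a minimal sketch, using the table names from this post:

CREATE TABLE Employee_temp (LIKE Employee);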


INSERT INTO Employee_temp
SELECT EmpId, EmpName, Salary
FROM (
    SELECT *,
           ROW_NUMBER() OVER ( PARTITION BY EmpId, EmpName, Salary ORDER BY EmpId ASC ) AS rownum
    FROM Employee
) t
WHERE rownum = 1;


Now the temp table contains only unique data, as shown below:



EmpId | EmpName | Salary
------+---------+-------
1     | AAAA    | 20000
2     | BBBB    | 30000
3     | CCCC    | 25000
4     | DDDD    | 20000
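To double-check that the temp table really is free of duplicates, a verification query like the following can be run (a sketch against the Employee_temp table above; it should return no rows):

SELECT EmpId, EmpName, Salary, COUNT(*) AS cnt
FROM Employee_temp
GROUP BY EmpId, EmpName, Salary
HAVING COUNT(*) > 1;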

Now drop the Employee table and rename the Employee_temp table to Employee.
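In Redshift this can be done with a DROP followed by a RENAME (a sketch; check for dependent objects such as views on the Employee table before dropping it):

DROP TABLE Employee;
ALTER TABLE Employee_temp RENAME TO Employee;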

Hope this helps you remove duplicate records from your Redshift tables.


Bulk data transfer from MySQL to Redshift DB

Introduction:

This post demonstrates how to move bulk data from one DB into an S3 bucket, and then from S3 into a Redshift DB.

We are going to achieve this using Talend Open Studio for Data Integration.

Getting started:

Purpose:

The purpose of this tutorial is to transfer bulk data from a source DB to a data warehouse DB via an S3 bucket, using Talend.

Step by Step instruction:

 1. Pull data from our DB and save it to a CSV file:

The first step is to create a Talend job design in order to pull data from our DB (see: How to create a basic job in Talend?).

Components used:

  • tMySqlInput - Gets data from a MySQL table and provides it as input to other components.
  • tFileOutputDelimited - Writes the data to the desired file format, a CSV file in our example.
  • tS3Put - Uploads the file to the S3 bucket.
  • tRedshiftRow - Lets us write our own set of queries.

Get the data from our main DB (MySQL in our example) using the tMySqlInput component. The tMap component is used to map the input data to the output schema. The tFileOutputDelimited component takes its input from tMap; provide the local path where the CSV file should be saved. A sample extraction query for tMySqlInput is sketched below.
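The query configured in the tMySqlInput component can be a plain table select; a minimal sketch, assuming the Employee table from the first part of this post:

SELECT EmpId, EmpName, Salary
FROM Employee;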



Since the output is a comma-separated values (CSV) file, it is recommended to enable the CSV option in tFileOutputDelimited.

 2. Push the CSV file to the S3 bucket:

Provide the access key and secret key for the S3 bucket in the tS3Put component, and provide the saved CSV file as the input to the component.



This uploads the CSV file to the S3 bucket.

3. Using the COPY command:

What is the COPY command in Amazon Redshift? Amazon Redshift provides a built-in COPY command to load data from S3 into a Redshift table. To run it, we use the tRedshiftRow component, which allows us to write our own set of queries.

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
 
 



Do not forget to include the removequotes keyword in the COPY command, since the CSV option was used when writing the file and the fields are therefore quoted.
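Putting it together, a COPY command for the comma-delimited, quoted file could look like this (a sketch; employee is a hypothetical target table, and the delimiter must be set to ',' because COPY defaults to the pipe character):

copy employee from 's3://<bucket_name>/<object_prefix>'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
delimiter ','
removequotes;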

The main advantage of this method is that data loss can be avoided when transferring bulk data from the source DB to the data warehouse DB.

To apply the same method to more than one table (parallel processing), see my other blog post on that topic.

***** My first blog ends here *****
*****Comments and suggestions are appreciated*****