Tuesday, September 23, 2014

Difference between the MAX(NUMERIC) function and the MAX(VARCHAR) function in Redshift

Introduction:

Here is a post on a quite interesting topic.

Many may not know about this hidden behaviour of the MAX() function in Redshift queries. It applies not only to Redshift but to any database that supports PostgreSQL-style SQL.

Difference between the MAX(NUMERIC) function and the MAX(VARCHAR) function:

MAX(NUMERIC):

Most of us already know how the MAX() function behaves on a numeric column.


   SELECT MAX(salary) as SALARY
      FROM employee
         WHERE  experience > 3;

Let's consider the following employee table:

ID  NAME       DESIGNATION                EXPERIENCE  SALARY
1   Employee1  Software Engineer          2           30000
2   Employee2  Software Engineer          1           25000
3   Employee3  Senior Software Engineer   4           50000
4   Employee4  Senior Software Engineer   5           55000
5   Employee5  Software Engineer          1           20000
6   Employee6  Team Lead                  8           75000
7   Employee7  Trainee Software Engineer  0           15000
8   Employee8  Software Engineer          5           55000

The result of the MAX() query against the table will be

SALARY
75000

Hence it is clear that the query returned the largest salary among employees whose experience is greater than 3.

MAX(VARCHAR):

Let us now dive into something different: the MAX() function applied to a VARCHAR column.

What will happen when we apply the MAX() function to a column of VARCHAR type? Let's see:


   SELECT MAX(designation) as DESIGNATION
      FROM employee;

The above query will return

DESIGNATION
Trainee Software Engineer

Surprising, isn't it?

MAX() on a VARCHAR column compares the values as strings, character by character, so it returns the value that sorts last in alphabetical (collation) order. It is not a numeric maximum, and it is not the most frequently occurring value.

In our example, 'Trainee Software Engineer' sorts after every other designation, so it is the value returned, even though 'Software Engineer' appears four times.
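This string ordering is even more surprising when numbers are stored in a VARCHAR column. A minimal sketch (hypothetical values, not part of the employee table):

   -- '9' sorts after '10' when compared character by character
   SELECT MAX(val) AS max_val
      FROM (SELECT '9' AS val
            UNION ALL
            SELECT '10') t;
   -- returns 9, not 10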

Keep watching for more posts.

Wednesday, September 10, 2014

Fetch the 'nth' element or row from a Redshift table

Purpose:

A simple query to fetch the nth row in a database table.



        SELECT * 
            FROM tablename 
               LIMIT 1
                 OFFSET n-1;



OFFSET plays a vital role in this query. It is not specific to Redshift; any database that supports PostgreSQL-style queries supports OFFSET.

If you need the data in the 21st row, set OFFSET to 20: OFFSET skips the first 20 rows, so fetching starts from the 21st row.
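For example, to fetch the 21st row from the employee table used in the earlier post (a sketch: the table and its id column are assumed, and the ORDER BY makes "the 21st row" deterministic, since without it the row order is not guaranteed):

   SELECT *
      FROM employee
         ORDER BY id
            LIMIT 1
               OFFSET 20;   -- skip the first 20 rows, return the 21st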

Friday, August 29, 2014

Delete duplicate data from a Redshift table

Purpose:

I was tired of searching for a way to eliminate duplicate data in Redshift tables.

So here is a simple query to remove duplicate records in Redshift.
 

-- keep only the first row of each group of identical values
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER ( PARTITION BY column1, column2, column3, ... ORDER BY sortcolumn ASC ) AS rownum
    FROM tablename ) t
WHERE rownum = 1;

The ORDER BY clause plays a major role in this query. Usually there is a single column that distinguishes the duplicate records; use that column in the ORDER BY clause.

How does this query avoid duplicates?

Follow the approach below to remove the duplicate records.

Consider the table "Employee" below, which has two duplicate records:



EmpId  EmpName  Salary
1      AAAA     20000
2      BBBB     30000
2      BBBB     30000
3      CCCC     25000
4      DDDD     20000
4      DDDD     20000

Now create a temporary table with the same schema and insert data into it using the query below.
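If the temporary table does not exist yet, one way to create it with the same schema is Redshift's CREATE TABLE ... LIKE syntax (a sketch; Employee_temp is just the name used in this example):

   -- copy the column definitions of Employee into a new, empty table
   CREATE TABLE Employee_temp (LIKE Employee);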


INSERT INTO Employee_temp
SELECT EmpId, EmpName, Salary
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( PARTITION BY EmpId, EmpName, Salary ORDER BY EmpId ASC ) AS rownum
     FROM Employee ) t
WHERE rownum = 1;


Now the temp table will contain only unique data, as below:



EmpId  EmpName  Salary
1      AAAA     20000
2      BBBB     30000
3      CCCC     25000
4      DDDD     20000

Now drop the Employee table and rename the Employee_temp table to Employee.
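Those two final steps look roughly like this (a sketch; check for dependent views or permissions on the original table before dropping it):

   DROP TABLE Employee;
   ALTER TABLE Employee_temp RENAME TO Employee;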

Hopefully this helps you avoid or delete duplicate records.


Bulk data transfer from MySQL to RedshiftDB

Introduction:

This post is going to demonstrate how to move bulk data from one DB into an S3 bucket, and then from S3 into the Redshift DB.

We are going to achieve this using Talend Open Studio for Data Integration.

Getting started:

Purpose:

The purpose of this tutorial is to transfer bulk data from one DB to a data warehouse DB via an S3 bucket, using Talend.

Step by Step instruction:

 1. Pull data from our DB and save it in a CSV file:

The first step is to create a Talend job design in order to pull data from our DB. (See: How to create a basic job in Talend?)

Components used:

  • tMySqlInput - Gets data from the MySQL table and provides it as input to other components.
  • tFileOutputDelimited - Writes the data to the desired file format; in our example, a CSV file.
  • tS3Put - Puts the file into the S3 bucket.
  • tRedshiftRow - Lets us write our own set of queries.

 Get the data from our main DB (in our example, MySQL) using the tMySqlInput component. The tMap component is used to map the input data to the output. The tFileOutputDelimited component takes its input from the tMap component; provide the local path where the CSV file should be saved.
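The query configured inside the tMySqlInput component is just an ordinary SELECT. A minimal sketch, assuming the Employee table from the earlier posts:

   SELECT EmpId, EmpName, Salary
      FROM Employee;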



As a CSV file contains comma-separated values, it is recommended to enable the CSV option when writing the file.

 2. Push the CSV file to the S3 bucket:

Provide the access key and secret key for the S3 bucket in the tS3Put component, and provide the saved CSV file as the input to the component.



This saves/pushes the CSV file into the S3 bucket.

3. Using the COPY command:

What is the COPY command in Amazon Redshift? Amazon Redshift provides a built-in COPY command to load data from S3 into a Redshift DB. To perform this action, we are going to use the tRedshiftRow component, which allows us to write our own set of queries.

copy <table_name> from 's3://<bucket_name>/<object_prefix>'
credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>';
 
 



Do not forget to include the removequotes option in the COPY command, since we used the CSV option when writing the file and the field values may be wrapped in quotes.
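Putting it together, the statement run from tRedshiftRow would look roughly like this (a sketch; the table name, bucket, prefix and keys are placeholders to fill in):

   copy <table_name> from 's3://<bucket_name>/<object_prefix>'
   credentials 'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
   delimiter ','
   removequotes;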

The main advantage of this method is that data loss can be avoided when transferring bulk data from one DB to a data warehouse DB.

To apply the same method to more than one table (parallel processing), here is another blog post for that.

***** My first blog ends here *****
*****Comments and suggestions are appreciated*****