Showing posts with label duplicate records in redshift. Show all posts
Showing posts with label duplicate records in redshift. Show all posts

Wednesday, May 18, 2016

Another simple way to remove duplicate data

Purpose:

In my previous blog, I have explained a way to avoid duplicate records. Here I am with a simple query to achieve the same.

SELECT DISTINCT column1, column2, column3, .......

FROM tablename;

                                                              OR

SELECT column1, column2, column3,....
FROM(
     SELECT COUNT(1) count_value, column1, column2, column3, .......
     FROM tablename
     GROUP BY column1, column2, column3, .......);


                                                              OR

SELECT column1, column2, column3, .......

FROM tablename
GROUP BY column1, column2, column3, .......;


If you notice the sub-query contains COUNT(1), which is not in the main query.

How this query avoids duplicate?

Follow the below approach to avoid to the duplicate in records or delete duplicate records.

Consider the below Table "Employee" which has two duplicate records


EmpId
EmpName
Salary
1
AAAA
20000
2
BBBB
30000
2
BBBB
30000
3
CCCC
25000
4
DDDD
20000
4
DDDD
20000

SELECT COUNT(1) count_value, EmpId, EmpName, Salary
FROM Employee
GROUP BY EmpId, EmpName, Salary;

The above query will return the below result


count_value
EmpId
EmpName
Salary
1
1
AAAA
20000
2
2
BBBB
30000
1
3
CCCC
25000
2
4
DDDD
20000

It is clear from the above query that EmpId 2 and EmpId 4 has duplicate records.

So with the above query, we can able to group the duplicate records.

Now create a temporary table of the same schema and insert data into that temp table using the below query


INSERT into Employee_temp 
SELECT EmpId, EmpName, Salary
FROM(
     SELECT COUNT(1) count_value, EmpId, EmpName, Salary
     FROM Employee
     GROUP BY EmpId, EmpName, Salary);


                                                                     OR
INSERT into Employee_temp 
SELECT EmpId, EmpName, Salary
FROM Employee
GROUP BY EmpId, EmpName, Salary;

                                                                     OR
INSERT into Employee_temp 
SELECT DISTINCT EmpId, EmpName, Salary
FROM Employee;



Now the temp table will have unique data as below



EmpId
EmpName
Salary
1
AAAA
20000
2
BBBB
30000
3
CCCC
25000
4
DDDD
20000

Now drop the Employee table and Rename the Employee_temp table to Employee.

Note: This query will not work for the below records, if you need all the columns.


EmpId
EmpName
Salary
Department
1
AAAA
20000
Dept1
2
BBBB
30000
Dept1
2
BBBB
30000
Dept2
3
CCCC
25000
Dept2
4
DDDD
20000
Dept2
4
DDDD
20000
Dept3

If you note this, the department for EmpId 2 and EmpId 4 are different in both the records.

If you do not need the department column, then this query will work.

Hope this should avoid duplicate records or delete duplicate records.

Friday, August 29, 2014

Delete duplicate data from redshift table.

Purpose:

I am tired searching a way to eliminate duplicate data in redshift tables.

Here I am with a simple query to avoid duplicate records in redshift.
 

SELECT * FROM (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY column1,column2,column3,... ORDER BY sortcolumn ASC ) rownum 
FROM tablename)
WHERE rownum = 1;

ORDER BY clause places a major role in this query. Usually, there is a single column which is responsible for the duplicate records. Use that column in the ORDER BY clause.

How this query avoids duplicate?

Follow the below approach to avoid to the duplicate in records or delete duplicate records.

Consider the below Table "Employee" which has two duplicate records



EmpId
EmpName
Salary
1
AAAA
20000
2
BBBB
30000
2
BBBB
30000
3
CCCC
25000
4
DDDD
20000
4
DDDD
20000
 

Now create a temporary table of the same schema and insert data into that temp table using the below query


INSERT into Employee_temp 
SELECT EmpId, EmpName, Salary 
From(
     SELECT *, 
                ROW_NUMBER() OVER ( PARTITION BY EmpId,EmpName,Salary ORDER BY EmpId ASC ) rownum 
     FROM
Employee)
WHERE rownum = 1;


Now the temp table will have unique data as below



EmpId
EmpName
Salary
1
AAAA
20000
2
BBBB
30000
3
CCCC
25000
4
DDDD
20000

Now drop the Employee table and Rename the Employee_temp table to Employee.

Hope this should avoid duplicate records or delete duplicate records.