Viki's Blog: duplicate records in redshift

Purpose:

In my previous blog, I have explained a way to avoid duplicate records. Here I am with a simple query to achieve the same.

SELECT DISTINCT column1, column2, column3, .......

FROM tablename;

SELECT column1, column2, column3,....
FROM(
SELECT COUNT(1) count_value, column1, column2, column3, .......

FROM tablename
GROUP BY column1, column2, column3, .......);

OR

SELECT column1, column2, column3, .......

FROM tablename
GROUP BY column1, column2, column3, .......;

If you notice the sub-query contains COUNT(1), which is not in the main query.

How this query avoids duplicate?

Follow the below approach to avoid to the duplicate in records or delete duplicate records.

Consider the below Table "Employee" which has two duplicate records

EmpId	EmpName	Salary
1	AAAA	20000
2	BBBB	30000
2	BBBB	30000
3	CCCC	25000
4	DDDD	20000
4	DDDD	20000

SELECT COUNT(1) count_value, EmpId, EmpName, Salary

FROM Employee
GROUP BY EmpId, EmpName, Salary;

The above query will return the below result

count_value	EmpId	EmpName	Salary
1	1	AAAA	20000
2	2	BBBB	30000
1	3	CCCC	25000
2	4	DDDD	20000

It is clear from the above query that EmpId 2 and EmpId 4 has duplicate records.

So with the above query, we can able to group the duplicate records.

Now create a temporary table of the same schema and insert data into that temp table using the below query

INSERT into Employee_temp
SELECT EmpId, EmpName, Salary
FROM(
SELECT COUNT(1) count_value, EmpId, EmpName, Salary

FROM Employee
GROUP BY EmpId, EmpName, Salary);

OR
INSERT into Employee_temp
SELECT EmpId, EmpName, Salary

FROM Employee

GROUP BY EmpId, EmpName, Salary;

OR
INSERT into Employee_temp
SELECT DISTINCT EmpId, EmpName, Salary

FROM Employee;

Now the temp table will have unique data as below

EmpId	EmpName	Salary
1	AAAA	20000
2	BBBB	30000
3	CCCC	25000
4	DDDD	20000

Now drop the Employee table and Rename the Employee_temp table to Employee.

Note: This query will not work for the below records, if you need all the columns.

EmpId	EmpName	Salary	Department
1	AAAA	20000	Dept1
2	BBBB	30000	Dept1
2	BBBB	30000	Dept2
3	CCCC	25000	Dept2
4	DDDD	20000	Dept2
4	DDDD	20000	Dept3

If you note this, the department for EmpId 2 and EmpId 4 are different in both the records.

If you do not need the department column, then this query will work.

Hope this should avoid duplicate records or delete duplicate records.

Purpose:

I am tired searching a way to eliminate duplicate data in redshift tables.

Here I am with a simple query to avoid duplicate records in redshift.

SELECT * FROM (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY column1,column2,column3,... ORDER BY sortcolumn ASC ) rownum
FROM tablename)
WHERE rownum = 1;

ORDER BY clause places a major role in this query. Usually, there is a single column which is responsible for the duplicate records. Use that column in the ORDER BY clause.

How this query avoids duplicate?

Follow the below approach to avoid to the duplicate in records or delete duplicate records.

Consider the below Table "Employee" which has two duplicate records

EmpId	EmpName	Salary
1	AAAA	20000
2	BBBB	30000
2	BBBB	30000
3	CCCC	25000
4	DDDD	20000
4	DDDD	20000

Now create a temporary table of the same schema and insert data into that temp table using the below query

INSERT into Employee_temp
SELECT EmpId, EmpName, Salary
From(
SELECT *,
ROW_NUMBER() OVER ( PARTITION BY EmpId,EmpName,Salary ORDER BY EmpId ASC ) rownum
FROM Employee)
WHERE rownum = 1;

Now the temp table will have unique data as below

EmpId	EmpName	Salary
1	AAAA	20000
2	BBBB	30000
3	CCCC	25000
4	DDDD	20000

Now drop the Employee table and Rename the Employee_temp table to Employee.

Hope this should avoid duplicate records or delete duplicate records.

Viki's Blog

Wednesday, May 18, 2016

Another simple way to remove duplicate data

Purpose:

How this query avoids duplicate?

Friday, August 29, 2014

Delete duplicate data from redshift table.

Purpose:

How this query avoids duplicate?

Blog Archive