Purpose:
I am tired searching a way to eliminate duplicate data in redshift tables.Here I am with a simple query to avoid duplicate records in redshift.
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER ( PARTITION BY column1,column2,column3,... ORDER BY sortcolumn ASC ) rownum
FROM tablename)
WHERE rownum = 1;
SELECT *, ROW_NUMBER() OVER ( PARTITION BY column1,column2,column3,... ORDER BY sortcolumn ASC ) rownum
FROM tablename)
WHERE rownum = 1;
ORDER BY clause places a major role in this query. Usually, there is a single column which is responsible for the duplicate records. Use that column in the ORDER BY clause.
How this query avoids duplicate?
Follow the below approach to avoid to the duplicate in records or delete duplicate records.
Consider the below Table "Employee" which has two duplicate records
EmpId
|
EmpName
|
Salary
|
1
|
AAAA
|
20000
|
2
|
BBBB
|
30000
|
2
|
BBBB
|
30000
|
3
|
CCCC
|
25000
|
4
|
DDDD
|
20000
|
4
|
DDDD
|
20000
|
Now create a temporary table of the same schema and insert data into that temp table using the below query
INSERT into Employee_temp
SELECT EmpId, EmpName, Salary
From(
SELECT *,
ROW_NUMBER() OVER ( PARTITION BY EmpId,EmpName,Salary ORDER BY EmpId ASC ) rownum
FROM Employee)
WHERE rownum = 1;
Now the temp table will have unique data as below
EmpId
|
EmpName
|
Salary
|
1
|
AAAA
|
20000
|
2
|
BBBB
|
30000
|
3
|
CCCC
|
25000
|
4
|
DDDD
|
20000
|
Now drop the Employee table and Rename the Employee_temp table to Employee.
Hope this should avoid duplicate records or delete duplicate records.
Hope this should avoid duplicate records or delete duplicate records.