How to delete duplicate rows from a table?
In this post, we will take a look at the most frequently asked and interviewers' favorite question "How to delete duplicate rows from a table?"
Creating Database Table
The following scripts can be used to create a sample table that holds the duplicate records.
CREATE TABLE EMP_DUP
(EMPNO NUMBER(4,0) NOT NULL ENABLE,
ENAME VARCHAR2(10),
JOB VARCHAR2(9),
MGR NUMBER(4,0),
HIREDATE DATE,
SAL NUMBER(7,2),
COMM NUMBER(7,2),
DEPTNO NUMBER(2,0),
CREATED_AT DATE
);
/
INSERT INTO EMP_DUP VALUES (7839,'KING','PRESIDENT',null,to_date('17-11-81','DD-MM-RR'),5000,null,10,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7698,'BLAKE','MANAGER',7839,to_date('01-05-81','DD-MM-RR'),2850,null,30,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7782,'CLARK','MANAGER',7839,to_date('09-06-81','DD-MM-RR'),2450,null,10,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7566,'JONES','MANAGER',7839,to_date('02-04-81','DD-MM-RR'),2975,null,20,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7788,'SCOTT','ANALYST',7566,to_date('09-12-82','DD-MM-RR'),3000,null,20,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7902,'FORD','ANALYST',7566,to_date('03-12-81','DD-MM-RR'),3000,null,20,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7499,'ALLEN','SALESMAN',7698,to_date('20-02-81','DD-MM-RR'),1600,315,30,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7654,'MARTIN','SALESMAN',7698,to_date('28-09-81','DD-MM-RR'),1250,1400,30,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7844,'TURNER','SALESMAN',7698,to_date('08-09-81','DD-MM-RR'),1500,15,30,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7876,'ADAMS','CLERK',7788,to_date('12-01-83','DD-MM-RR'),1100,null,20,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7900,'JAMES','CLERK',7698,to_date('03-12-81','DD-MM-RR'),950,null,30,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7934,'MILLER','CLERK',7782,to_date('23-01-82','DD-MM-RR'),1300,null,10,to_date('25-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7839,'KING','PRESIDENT',null,to_date('17-11-81','DD-MM-RR'),5000,null,10,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7698,'BLAKE','MANAGER',7839,to_date('01-05-81','DD-MM-RR'),2850,null,30,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7782,'CLARK','MANAGER',7839,to_date('09-06-81','DD-MM-RR'),2450,null,10,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7566,'JONES','MANAGER',7839,to_date('02-04-81','DD-MM-RR'),2975,null,20,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7788,'SCOTT','ANALYST',7566,to_date('09-12-82','DD-MM-RR'),3000,null,20,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7902,'FORD','ANALYST',7566,to_date('03-12-81','DD-MM-RR'),3000,null,20,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7499,'ALLEN','SALESMAN',7698,to_date('20-02-81','DD-MM-RR'),1600,315,30,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7654,'MARTIN','SALESMAN',7698,to_date('28-09-81','DD-MM-RR'),1250,1400,30,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7844,'TURNER','SALESMAN',7698,to_date('08-09-81','DD-MM-RR'),1500,15,30,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7876,'ADAMS','CLERK',7788,to_date('12-01-83','DD-MM-RR'),1100,null,20,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7900,'JAMES','CLERK',7698,to_date('03-12-81','DD-MM-RR'),950,null,30,to_date('27-06-21','DD-MM-RR'));
INSERT INTO EMP_DUP VALUES (7934,'MILLER','CLERK',7782,to_date('23-01-82','DD-MM-RR'),1300,null,10,to_date('27-06-21','DD-MM-RR'));
/
COMMIT;
/
This sample table EMP_DUP contains the employee information and all the records of this table are fully duplicated, In other words, for each record, all the columns have identical values except CREATED_AT which holds the date when records were inserted in the table. We will see three methods to delete the duplicate records.
But before that first, we need to identify the columns which form the duplicate records and we also need to identify which records hold the correct values. In our sample table which holds employees' records, the EMPNO column is a good candidate to identify the repeating rows. Also, other columns have identical values so we can delete any of the records. But this is not the case all the time. When all the columns of a record are not identical and only a subset of columns have repeated values then finding the records which hold the correct values become important. You need to look for other columns based on which you can differentiate between two records and find the records having correct values.
Method 1: Using Subquery
In this first method, we use the ROWID pseudo column to uniquely identify each record in the table and then using the min and max aggregate function we retain the one set of records and delete others.
DELETE
FROM EMP_DUP
WHERE ROWID IN (SELECT MAX(ROWID)
FROM EMP_DUP
GROUP BY EMPNO);
DELETE
FROM EMP_DUP
WHERE ROWID IN ( SELECT MIN(ROWID)
FROM EMP_DUP
GROUP BY EMPNO);
Method 2: Using Correlated Subquery
In this method, the core logic is the same but instead of a subquery, we use a correlated subquery.
DELETE
FROM EMP_DUP A
WHERE ROWID > (SELECT MIN(ROWID)
FROM EMP_DUP B
WHERE A.EMPNO = B.EMPNO);
DELETE
FROM EMP_DUP A
WHERE ROWID < (SELECT MAX(ROWID)
FROM EMP_DUP B
WHERE A.EMPNO = B.EMPNO);
Method 3: Using Analytic Function
Now let's consider that our table does not have fully identical records for all the columns and has two sets of records for each employee. One record is for the old salary and the second record is for the new salary Now the requirement is to keep the newly updated salary and delete the old records. In our table, we have a CRETAED_AT column that holds the insert date of the records, and using this column we can differentiate between the old and new records.
Now we have identified the old records but using the previous two methods we cannot delete those records. Analytic function comes to the rescue. Using the ROW_NUMBER analytical function and CREATED_AT column we can identify the ROWID of the older records and then delete those records. For e.g. using the below query, we can keep the most recent records in the EMP_DUP table and delete older records.
DELETE
FROM EMP_DUP
WHERE ROWID IN (
SELECT RID
FROM (
SELECT ROWID RID,
ROW_NUMBER() OVER(PARTITION BY EMPNO ORDER BY CREATED_AT DESC) RN
FROM EMP_DUP)
WHERE RN > 1
);
ConclusionIn our daily work, we often encountered a situation where we need to delete duplicate records. Here, I explained the three most basic methods which work for almost every scenario but there are other methods also. The important step in this process of deleting the duplicate rows is to identify the column group which forms the duplicate records and to find the records having correct values. Once you identified the column groups then based on your requirement you can use these methods to delete duplicate records.
Comments
Post a Comment