Find Duplicate Records In Table Without Unique Id

I recently came across a couple methods for listing out instances of duplicate records in a table where the table doesn't have an explicit unique identifier. Here is a post that explains how to do this when a unique identifier is present.

If the table doesn't have an explicit primary key or other uniquely identifying value, then we'll have to get some help from PostgreSQL's internal system columns — namely the ctid.

The ctid is:

The physical location of the row version within its table.

Let's use the example of the mailing_list table with potential duplicate email values.

Here is the first approach:

delete from mailing_list
where ctid not in (
  select min(ctid)
  from mailing_list
  group by email
);

This uses a subquery to find the first occurrence of every unique email and then deletes the rest. The ctid is the unique value that we can call the min aggregate on.

A second approach:

delete from mailing_list ml1
  using mailing_list ml2
where ml1.ctid < ml2.ctid
  and ml1.email = ml2.email;

This uses delete using to join the table against itself as a cartesian product to compare every entry to every other entry.

Last updated