More SQL: Queries for Duplicates and Missing Values

Some of the courses I tried in data analytics use clean data to work with, which is nice yet utopian :) That’s why whenever I see data cleaning mentioned somewhere, I’m all ears.

As far as I know now, there are three types of data that need to be checked during Data Cleaning process:

  1. Duplicates
  2. Missing Data
  3. Outliers

Today I’ll touch on the first two types (SQL-wise): queries that I stumbled upon while studying.

Check for duplicates and Remove duplicates

Here’s a simple table with duplicates.

-- Initial table
SELECT *
FROM category
ORDER BY category_id;

Screenshot - initial table output

It has only 2 columns, so in order to check for duplicates we are going to:

  • group by both of these columns;
  • count number of duplicates (=how many category_id+name combinations);
  • filter out those that get count of 1 (=no duplicates).

I. Checking

-- Check for Duplicates
SELECT category_id,
        name,
        COUNT(*) as duplicate_count
FROM category
GROUP BY category_id,
        name
HAVING COUNT(*) > 1
ORDER BY category_id;

Screenshot - checking for duplicates output

Removing the duplicates is a bit trickier, but step by step, it’s easily digestible.

II. Removing

This is the query:

-- Remove Duplicates
DELETE FROM category
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM category
    GROUP BY category_id,
            name
);

What it does?

Let’s break it down. The ctid shows where the row is stored.

-- Let's see what ctid does
SELECT ctid,
        category_id,
        name
    FROM category;

Screenshot - output with ctid column

Numbers (0,1), (0,2), … mean: 0 - page number; 1, 2, etc - row number. Each row is unique, regardless of whether the contents of the rows are identical.

So, when we choose MIN(ctid), it gives first occurence of the category_id+name combination.

-- See 1st occurence of category_id+name combination
SELECT MIN(ctid),
        category_id,
        name
FROM category
GROUP BY category_id,
        name;

Screenshot - output with MIN ctids

Thus, to delete only duplicates while keeping the first occurence, we use NOT IN MIN(ctid):

-- Remove Duplicates with CTID
DELETE FROM category
WHERE ctid NOT IN (
    SELECT MIN(ctid)
    FROM category
    GROUP BY category_id,
            name
);

-- Check what we've done
SELECT *
FROM category
ORDER BY category_id;

Screenshot - check if duplicates are deleted, final output

Check for Missing Data

Now, consider a table like this:

Screenshot - table with missing values

It’s quite straightforward to check for missing values by filtering for NULLs.

-- Check for missing values
SELECT *
FROM category
WHERE name IS NULL;

Screenshot - output for checking for missing values

BUT

Wait, where is the 5th row? It is obviously missing some data as well?

In SQL, NULL represents true absence, while an empty string ‘’ is technically a value — just an empty one. As well as zero. Well, with 0 it’s more straightforward — it does take space even for the naked eye, and it is a value. Meanwhile, empty strings can sneak around unnoticed.

So, we need to modify our query a bit:

-- Check for NULL or Empty Strings
SELECT *
FROM category
WHERE name IS NULL OR TRIM(name) = '';

Screenshot - output for checking for missing values - with empty strings

Gotcha!

Depending on the quantity, it wight be a good idea to start by just counting them:

-- Count Missing or Empty Values
SELECT COUNT(*) AS missing_count
FROM category
WHERE name IS NULL OR TRIM(name) = '';

Screenshot - output for counting missing values

Once found, what to do with them?

  • delete
  • replace (for example, fill them in with ‘n/a’ or averages)

I saw a comment that regression can be used to predict values. Haven’t tried it yet, but sounds interesting.

Until next time.