Saturday, June 21, 2014

Stripping whitespaces and deleting duplicate rows in a PostgreSQL database

Two weeks back, in our project, we noticed that there were lot of values with leading and trailing white spaces in our database and also many duplicate row entries, as a result of which our program was not working correctly. I was assigned the task of cleaning up our PostgreSQL database.

There were two clean ups to be done:
1. Stripping of leading and trailing white spaces in string values.
2. Deleting duplicate rows from tables

I started learning a bit of plpgsql scripts to accomplish the above tasks. Since, its a very generic, common issue that most of us face, I thought I would document the steps.

Stripping white spaces:
For this, we need to fetch the list of all text columns in all tables in the database, loop though them and run a update query stripping the leading and trailing white spaces. Since the column names are available only while running the script, we need to write a dynamic sql query for the update process.

Below is the script for that:
(In all the below scripts, replace db_name with your database name)

If you also want to coalesce multiple whitespaces into a single whitespace in a string, you can use the regexp_replace function as shown below:

Deleting duplicate rows from tables:
The problem of duplicate rows will occur when the only primary key in the table is an autoincrementing id. So let's say we import data to the table by reading from a file. And if the file has duplicate rows in it, then, since there is no other primary key check, we would run into this issue. In such a case, to delete all those duplicate records, we would use:

DELETE FROM table1 a
USING table1 b
WHERE a.col1 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3
--These are the non-unique columns
AND a.id > b.id; --(or a.id < b.id depending on whether we want to retain the first or last row among duplicates) 

But we need to run the above delete query for all the tables in the database. So, we need to construct a dynamic sql query that will take the table names and group by the columns present in the table.

Below is the script for that:

To execute the above scripts, save the script as a .sql file and run the below command:
psql -U db_username -d db_name -f filename.sql