Saturday, June 21, 2014

Stripping whitespaces and deleting duplicate rows in a PostgreSQL database

Two weeks back, in our project, we noticed that there were lot of values with leading and trailing white spaces in our database and also many duplicate row entries, as a result of which our program was not working correctly. I was assigned the task of cleaning up our PostgreSQL database.

There were two clean ups to be done:
1. Stripping of leading and trailing white spaces in string values.
2. Deleting duplicate rows from tables

I started learning a bit of plpgsql scripts to accomplish the above tasks. Since, its a very generic, common issue that most of us face, I thought I would document the steps.

Stripping white spaces:
For this, we need to fetch the list of all text columns in all tables in the database, loop though them and run a update query stripping the leading and trailing white spaces. Since the column names are available only while running the script, we need to write a dynamic sql query for the update process.

Below is the script for that:
(In all the below scripts, replace db_name with your database name)

/* This plpgsql script strips leading and trailing whitespaces
in all varchar columns of all tables
*/
DO $$DECLARE r record;
DECLARE str text;
BEGIN
-- Get all varchar column names of all tables and loop through them
FOR r IN SELECT column_name, table_name FROM information_schema.columns
WHERE table_catalog = 'db_name' and table_schema = 'public' and data_type = 'character varying'
LOOP
RAISE NOTICE 'Stripping whitespaces in column % of table %', r.column_name, r.table_name;
-- An update query to trim leading and trailing whitespaces from the column
str = 'UPDATE ' || quote_ident(r.table_name)
|| ' SET ' || quote_ident(r.column_name) || ' = trim(both '' '' from ' || quote_ident(r.column_name) || ');';
EXECUTE str;
RAISE NOTICE 'Finished stripping whitespaces in column % of table %', r.column_name, r.table_name;
END LOOP;
END$$;
If you also want to coalesce multiple whitespaces into a single whitespace in a string, you can use the regexp_replace function as shown below:

/* This plpgsql script coalesces multiple whitespaces into
a single whitespace in all varchar columns of all tables
*/
DO $$DECLARE r record;
DECLARE str text;
BEGIN
-- Get all varchar column names of all tables and loop through them
FOR r IN SELECT column_name, table_name FROM information_schema.columns
WHERE table_catalog = 'db_name' and table_schema = 'public' and data_type = 'character varying'
LOOP
RAISE NOTICE 'Coalescing multiple whitespaces in column % of table %', r.column_name, r.table_name;
-- An update query to coalesce multiple whitespaces into a single whitespace
str = 'UPDATE ' || quote_ident(r.table_name)
|| ' SET ' || quote_ident(r.column_name) ||
' = regexp_replace(' || quote_ident(r.column_name) || ', E''[ tnr]+'', '' '', ''g'');';
EXECUTE str;
RAISE NOTICE 'Finished coalescing multiple whitespaces in column % of table %', r.column_name, r.table_name;
END LOOP;
END$$;
Deleting duplicate rows from tables:
The problem of duplicate rows will occur when the only primary key in the table is an autoincrementing id. So let's say we import data to the table by reading from a file. And if the file has duplicate rows in it, then, since there is no other primary key check, we would run into this issue. In such a case, to delete all those duplicate records, we would use:

DELETE FROM table1 a
USING table1 b
WHERE a.col1 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3
--These are the non-unique columns
AND a.id > b.id; --(or a.id < b.id depending on whether we want to retain the first or last row among duplicates) 

But we need to run the above delete query for all the tables in the database. So, we need to construct a dynamic sql query that will take the table names and group by the columns present in the table.

Below is the script for that:

/* This plpgsql script deletes duplicate rows in all tables
Duplicate row is a row which has all values as the same
as another row except the id field.
*/
DO $$DECLARE r1 record;
DECLARE r2 record;
DECLARE str text;
BEGIN
-- Select all the tables names and loop through them
FOR r1 IN SELECT table_name FROM information_schema.tables
WHERE table_catalog = 'db_name' and table_schema = 'public' and table_type = 'BASE TABLE'
LOOP
RAISE NOTICE 'Removing duplicates from %', r1.table_name;
-- Query to delete duplicate rows
str = 'DELETE FROM ' || quote_ident(r1.table_name)
|| ' a USING ' || quote_ident(r1.table_name) || ' b WHERE ';
-- Select all columns of table except id column and check equality
FOR r2 IN SELECT column_name FROM information_schema.columns
WHERE table_name = r1.table_name AND column_name <> 'id'
LOOP
str = str || 'a.' || r2.column_name || ' = b.' || r2.column_name || ' and ';
END LOOP;
str = str || ' a.id > b.id;';
-- Execute the query to delete duplicate rows
EXECUTE str;
RAISE NOTICE 'Finished removing duplicates from %', r1.table_name;
END LOOP;
END$$;
To execute the above scripts, save the script as a .sql file and run the below command:
psql -U db_username -d db_name -f filename.sql
 

0 comments:

Post a Comment