Two weeks back, in our project, we noticed that there were lot of values with leading and trailing white spaces in our database and also many duplicate row entries, as a result of which our program was not working correctly. I was assigned the task of cleaning up our PostgreSQL database.
There were two clean ups to be done:
1. Stripping of leading and trailing white spaces in string values.
2. Deleting duplicate rows from tables
I started learning a bit of plpgsql scripts to accomplish the above tasks. Since, its a very generic, common issue that most of us face, I thought I would document the steps.
Stripping white spaces:
For this, we need to fetch the list of all text columns in all tables in the database, loop though them and run a update query stripping the leading and trailing white spaces. Since the column names are available only while running the script, we need to write a dynamic sql query for the update process.
Below is the script for that:
(In all the below scripts, replace db_name with your database name)
If you also want to coalesce multiple whitespaces into a single whitespace in a string, you can use the regexp_replace function as shown below:
Deleting duplicate rows from tables:
The problem of duplicate rows will occur when the only primary key in the table is an autoincrementing id. So let's say we import data to the table by reading from a file. And if the file has duplicate rows in it, then, since there is no other primary key check, we would run into this issue. In such a case, to delete all those duplicate records, we would use:
DELETE FROM table1 a
USING table1 b
WHERE a.col1 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3
--These are the non-unique columns
AND a.id > b.id; --(or a.id < b.id depending on whether we want to retain the first or last row among duplicates)
There were two clean ups to be done:
1. Stripping of leading and trailing white spaces in string values.
2. Deleting duplicate rows from tables
I started learning a bit of plpgsql scripts to accomplish the above tasks. Since, its a very generic, common issue that most of us face, I thought I would document the steps.
Stripping white spaces:
For this, we need to fetch the list of all text columns in all tables in the database, loop though them and run a update query stripping the leading and trailing white spaces. Since the column names are available only while running the script, we need to write a dynamic sql query for the update process.
Below is the script for that:
(In all the below scripts, replace db_name with your database name)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* This plpgsql script strips leading and trailing whitespaces | |
in all varchar columns of all tables | |
*/ | |
DO $$DECLARE r record; | |
DECLARE str text; | |
BEGIN | |
-- Get all varchar column names of all tables and loop through them | |
FOR r IN SELECT column_name, table_name FROM information_schema.columns | |
WHERE table_catalog = 'db_name' and table_schema = 'public' and data_type = 'character varying' | |
LOOP | |
RAISE NOTICE 'Stripping whitespaces in column % of table %', r.column_name, r.table_name; | |
-- An update query to trim leading and trailing whitespaces from the column | |
str = 'UPDATE ' || quote_ident(r.table_name) | |
|| ' SET ' || quote_ident(r.column_name) || ' = trim(both '' '' from ' || quote_ident(r.column_name) || ');'; | |
EXECUTE str; | |
RAISE NOTICE 'Finished stripping whitespaces in column % of table %', r.column_name, r.table_name; | |
END LOOP; | |
END$$; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* This plpgsql script coalesces multiple whitespaces into | |
a single whitespace in all varchar columns of all tables | |
*/ | |
DO $$DECLARE r record; | |
DECLARE str text; | |
BEGIN | |
-- Get all varchar column names of all tables and loop through them | |
FOR r IN SELECT column_name, table_name FROM information_schema.columns | |
WHERE table_catalog = 'db_name' and table_schema = 'public' and data_type = 'character varying' | |
LOOP | |
RAISE NOTICE 'Coalescing multiple whitespaces in column % of table %', r.column_name, r.table_name; | |
-- An update query to coalesce multiple whitespaces into a single whitespace | |
str = 'UPDATE ' || quote_ident(r.table_name) | |
|| ' SET ' || quote_ident(r.column_name) || | |
' = regexp_replace(' || quote_ident(r.column_name) || ', E''[ tnr]+'', '' '', ''g'');'; | |
EXECUTE str; | |
RAISE NOTICE 'Finished coalescing multiple whitespaces in column % of table %', r.column_name, r.table_name; | |
END LOOP; | |
END$$; |
The problem of duplicate rows will occur when the only primary key in the table is an autoincrementing id. So let's say we import data to the table by reading from a file. And if the file has duplicate rows in it, then, since there is no other primary key check, we would run into this issue. In such a case, to delete all those duplicate records, we would use:
DELETE FROM table1 a
USING table1 b
WHERE a.col1 = b.col1 and a.col2 = b.col2 and a.col3 = b.col3
--These are the non-unique columns
AND a.id > b.id; --(or a.id < b.id depending on whether we want to retain the first or last row among duplicates)
But we need to run the above delete query for all the tables in the database. So, we need to construct a dynamic sql query that will take the table names and group by the columns present in the table.
Below is the script for that:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
/* This plpgsql script deletes duplicate rows in all tables | |
Duplicate row is a row which has all values as the same | |
as another row except the id field. | |
*/ | |
DO $$DECLARE r1 record; | |
DECLARE r2 record; | |
DECLARE str text; | |
BEGIN | |
-- Select all the tables names and loop through them | |
FOR r1 IN SELECT table_name FROM information_schema.tables | |
WHERE table_catalog = 'db_name' and table_schema = 'public' and table_type = 'BASE TABLE' | |
LOOP | |
RAISE NOTICE 'Removing duplicates from %', r1.table_name; | |
-- Query to delete duplicate rows | |
str = 'DELETE FROM ' || quote_ident(r1.table_name) | |
|| ' a USING ' || quote_ident(r1.table_name) || ' b WHERE '; | |
-- Select all columns of table except id column and check equality | |
FOR r2 IN SELECT column_name FROM information_schema.columns | |
WHERE table_name = r1.table_name AND column_name <> 'id' | |
LOOP | |
str = str || 'a.' || r2.column_name || ' = b.' || r2.column_name || ' and '; | |
END LOOP; | |
str = str || ' a.id > b.id;'; | |
-- Execute the query to delete duplicate rows | |
EXECUTE str; | |
RAISE NOTICE 'Finished removing duplicates from %', r1.table_name; | |
END LOOP; | |
END$$; |
psql -U db_username -d db_name -f filename.sql
0 comments:
Post a Comment