PGSQL Phriday #011. Partitioning: we need it!

I almost missed this month’s PGSQL Phriday, but I hope I still have time to make a small contribution. 

Partitioning in PostgreSQL is a topic that recently became painfully familiar to me. Like other people who have already contributed to PGSQL Phriday #011, I will talk only about my industry experience rather than “how things should be.”

Like many others, I never did sharding “for real,” that is, in a production environment. From the pure PostgreSQL perspective, “sharding” is partitioning where different partitions may be located on different physical nodes. I am very interested to hear about other people’s experience, because the claim that “you can use FDW for sharding” has been repeated for many years, but I have never talked to anybody who had actually done it. With everything I know about PostgreSQL partitioning, I can’t imagine any justification for treating distributed partitions “as one table,” so naturally, I am looking forward to learning about real-life use cases.

Now, back to “classic” declarative PostgreSQL partitions. Like anybody who switched from Oracle to PostgreSQL, I was very disappointed by the lack of adequate partitioning support in Postgres. Twelve years later, despite all the improvements, the situation is still not ideal. Whenever I suggest somebody partition their large tables, I must warn them that partitioning will not make queries run faster and that I will do my best to ensure the queries will not slow down (and that’s a huge commitment!).

On OLTP systems, if proper indexes are in place, PostgreSQL does a fantastic job delivering results fast, even on gigantic tables, so the users wonder what is wrong and why they should partition.

Most of the time, there are two reasons for partitioning: maintenance and archiving. When a table reaches a certain size, it becomes increasingly difficult for autovacuum to complete on time, no matter how well it is tuned. The same is true for ANALYZE and many other operations. Just think about what it would take to pg_dump a 1 TB table and to restore it (and if you wonder why you would ever need to do anything like this, recall that most data loss disasters are caused by human errors).
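To make the later examples concrete, here is a minimal sketch of what a monthly range-partitioned report table could look like (the column list and partition names are illustrative, not taken from any real schema):

-- Hypothetical parent table, range-partitioned by report_date
CREATE TABLE report (
    rep_id      bigint,
    employee_id bigint,
    report_date date NOT NULL,
    paid_date   date,
    amount      numeric(12,2)
) PARTITION BY RANGE (report_date);

-- One partition per month; each partition is a regular table underneath,
-- so autovacuum and ANALYZE work on each month separately
CREATE TABLE report_2023_06 PARTITION OF report
    FOR VALUES FROM ('2023-06-01') TO ('2023-07-01');
CREATE TABLE report_2023_07 PARTITION OF report
    FOR VALUES FROM ('2023-07-01') TO ('2023-08-01');

This is exactly why maintenance gets easier: each of those operations, including pg_dump with the --table option, can now target one modestly sized partition instead of the whole giant table.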

Archiving, that is, moving old data to archive storage and removing it from the “active” table, is a common practice, especially in OLTP systems. There are enough articles and blog posts explaining why dropping a partition is better than a mass delete (and why it is better to detach an old partition and attach a new one than to update 1M rows).
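Assuming the report table is partitioned by month, archiving a month then reduces to two cheap operations instead of deleting millions of rows (the partition name here is hypothetical):

-- Detach: the data stays in report_2022_06 as a standalone table;
-- the CONCURRENTLY option (available since PostgreSQL 14) avoids
-- blocking concurrent queries against report
ALTER TABLE report DETACH PARTITION report_2022_06 CONCURRENTLY;

-- ...copy report_2022_06 to archive storage, then:
DROP TABLE report_2022_06;

Neither step rewrites any rows, so there is no bloat to vacuum away afterward, which is the main argument against mass DELETEs.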

For one or both of these reasons, there are times when I tell my customers (external or internal): we must partition this table. However, partitioning does not come without cost, and most of the time, you need to make sure your queries won’t slow down.

The most important thing is that when you query a partitioned table, you must explicitly specify the value that identifies the proper partition(s) to query. For example, if the table report is partitioned by range on report_date, each query should contain an extra condition on report_date; for example:

SELECT rep_id, paid_date, amount
FROM report
WHERE employee_id = 123456
AND report_date BETWEEN '2023-06-01' AND '2023-07-01'

If you don’t pass the date range, PostgreSQL will search all partitions, even if employee 123456 has no reports outside this date range. Moreover, if you write a statement like this:

SELECT rep_id, paid_date, amount
FROM report
WHERE employee_id = 123456
AND report_date BETWEEN
    (SELECT start_date FROM conference
     WHERE conf_name = 'Awesome PG Conference')
    AND current_date - 7

then PostgreSQL does not know the date boundaries when the plan is built, so it cannot prune partitions at planning time; at best (on PostgreSQL 11 and later), partitions are pruned at execution time, and on older versions they are not pruned at all. There might be cases when partitioning would improve half of your queries and slow down the rest, and you might need to evaluate which way of partitioning will cause the least harm.
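One way to see which kind of pruning (if any) a query gets is to look at its plan. With constant boundaries, pruned partitions simply do not appear in the EXPLAIN output; with a subquery, execution-time pruning shows up in EXPLAIN ANALYZE as “Subplans Removed” (the exact plan shape will vary by version and data):

-- Plan-time pruning: only the June partition(s) should appear in the plan
EXPLAIN
SELECT rep_id, paid_date, amount
FROM report
WHERE employee_id = 123456
  AND report_date BETWEEN '2023-06-01' AND '2023-07-01';

-- Execution-time pruning (PostgreSQL 11+): look for "Subplans Removed"
EXPLAIN (ANALYZE)
SELECT rep_id, paid_date, amount
FROM report
WHERE employee_id = 123456
  AND report_date >= (SELECT start_date FROM conference
                      WHERE conf_name = 'Awesome PG Conference');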

Lastly, how many partitions are practical? You can find all kinds of recommendations about the optimal number of partitions. In my practical experience, a partitioned table can stay happy with several thousand partitions. The only real problem with “too many” partitions is DDL. To change the partitioned table structure (add or remove a column), all partitions must be locked, which, with too many partitions, will raise an “out of shared memory” error once the limit implied by max_locks_per_transaction is reached.
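If you do hit that error, one mitigation (a sketch; the number 256 is illustrative) is to enlarge the lock table. Keep in mind that the shared lock table holds roughly max_locks_per_transaction × (max_connections + max_prepared_transactions) entries, and changing this parameter requires a server restart:

-- Check the current setting (the default is 64)
SHOW max_locks_per_transaction;

-- Raise it; this takes effect only after a restart
ALTER SYSTEM SET max_locks_per_transaction = 256;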

Summary: most likely, at some point, you will need to partition some of your production tables. PostgreSQL declarative partitioning support is improving with each new version. However, coming up with a good partitioning scheme that won’t decrease your system’s performance is still more art than craftsmanship.

I look forward to hearing about other people’s experiences and learning some tricks I might not know!
