
PostgreSQL limitations when working with big data and arrays

This is not an extensive article, just some brief notes about PostgreSQL limitations. Finding the culprit of performance issues is not easy, and these limitations were found during months of research.

In particular this article was written while scaling Pushpad to millions of notifications per hour.

Testing was mainly performed with PG 11, but all the limitations should still be present in the current version (13).

PostgreSQL stats for array elements are wrong

PostgreSQL does not collect stats about the array elements in an array field.

So if you have a table with millions of rows and you run this query:

labels @> ARRAY['label5']

PostgreSQL doesn't know whether _label5_ is a common value or not and thus can produce bad execution plans (e.g. a wrong choice of index).
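To see what statistics the planner actually has for the array column, you can query the pg_stats view (a quick check; the table and column names here match the examples used later in this article):

SELECT most_common_vals, most_common_elems
FROM pg_stats
WHERE tablename = 'notifications' AND attname = 'labels';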

PostgreSQL GIN indexes don’t support index-only scans

Look at this query plan:

EXPLAIN ANALYZE SELECT COUNT(*) FROM "notifications" WHERE "notifications"."project_id" = 123 AND (labels @> ARRAY['label5']::varchar[]);

Aggregate  (cost=62419.67..62419.68 rows=1 width=8) (actual time=154.610..154.612 rows=1 loops=1)
  ->  Bitmap Heap Scan on notifications  (cost=972.50..62240.03 rows=71854 width=0) (actual time=102.254..137.849 rows=167787 loops=1)
        Recheck Cond: ((project_id = 123) AND (labels @> '{label5}'::character varying[]))
        Heap Blocks: exact=27705
        ->  Bitmap Index Scan on index_notifications_on_labels  (cost=0.00..954.54 rows=71854 width=0) (actual time=95.266..95.266 rows=167787 loops=1)
              Index Cond: ((project_id = 123) AND (labels @> '{label5}'::character varying[]))
Planning Time: 0.422 ms
Execution Time: 154.700 ms
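For reference, the plan above assumes a multicolumn GIN index roughly like the following (a sketch of what index_notifications_on_labels could look like; indexing the scalar project_id column with GIN requires the btree_gin extension):

CREATE EXTENSION IF NOT EXISTS btree_gin;

CREATE INDEX index_notifications_on_labels ON notifications USING gin (project_id, labels);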

There is a Bitmap Heap Scan that is apparently unnecessary: all the data required for the query was already available in the index, so why recheck the heap?

If you have some advanced knowledge about how PG and MVCC work, you might expect the problem to be that the visibility map marks many pages as not all-visible, so those pages have to be rechecked in the heap.

In order to fix that you run:

VACUUM (ANALYZE) notifications;

However, nothing changes.

And then you find a line in the PG documentation that says:

GIN indexes cannot support index-only scans because each index entry typically holds only part of the original data value.

It is not obvious what that means here, since in the specific query above all the data required for the COUNT is present in the index (we don't need other fields). However, now we know that GIN indexes in PostgreSQL never support index-only scans.

PostgreSQL cannot include additional data, like timestamps, inside GIN indexes

Suppose that you have millions of rows with some tags and a timestamp. It seems like a common scenario (e.g. something like a log).

Then you may want to filter based on tags and order by timestamp.

PostgreSQL GIN indexes can’t store the timestamp and use it for sorting.

For example PostgreSQL can find all the rows that have tag1, but then it will have to access the heap in order to get the timestamps. That is extremely slow if the query returns millions of matches.
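This is the kind of query we are talking about (a sketch with hypothetical table and column names):

SELECT * FROM events
WHERE tags @> ARRAY['tag1']
ORDER BY created_at DESC
LIMIT 50;

A GIN index on tags can serve the WHERE clause, but the timestamps needed for the sort still have to be fetched from the heap for every matching row.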

A solution would be to store the timestamp directly in the index (in the posting list).

There is an unofficial extension called RUM index that would solve the problem, but you would have to trust a third-party extension (which is not always feasible). It would be great to see this effort merged into the main PostgreSQL repository in the future.

It would be even better if the posting list were kept in order (a btree for each GIN index key) based on a field (like a timestamp). I think that in the past there was an experiment for that called Vodka.

PostgreSQL MVCC is not ideal for big data and frequently updated tables

Unlike the other points, this is not a PG limitation but rather a consequence of its design (Multi Version Concurrency Control). Unfortunately this design is not ideal for very large tables that are frequently updated.

Basically, each time a row is updated, it is copied to a new location. The existing row version is left in place because there might be older transactions that still need it.

This causes:

  • lots of inserts in the indexes (that need to point to the new location)
  • an enormous growth in index size (previous row versions are kept in the index)
  • pages lose their all-visible flag in the visibility map (and index-only scans become rare)

The only solution is to run aggressive autovacuum frequently (e.g. hourly). Don't believe those who say that vacuum causes performance issues: if you don't run it very frequently, your tables and indexes fill up with dead rows (trash) and you also reduce the probability of index-only scans.
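To see how many dead rows a table has accumulated and when it was last vacuumed, you can check pg_stat_user_tables (the table name here matches the earlier examples):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'notifications';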

Note that this is even more important if you use COUNT, since COUNT has to go through large portions of the index and then, for each row, check that it is visible (in the visibility map or, if the page is not marked all-visible there, in the heap).
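You can also get a rough idea of how much of a table is marked all-visible (and thus eligible for index-only scans) from pg_class; these counters are estimates updated by VACUUM and ANALYZE:

SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'notifications';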

Workarounds

Currently there are two workarounds for the above issues:

  • use normalized tables, like in the old days, as sketched below (btree indexes on scalar columns have better support in PostgreSQL, but you will still have to face joins on massive tables…)
  • use expensive / good hardware (with proper tuning) that can compensate for the technical limitations.
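A normalized schema could look roughly like this (a sketch with hypothetical names; it replaces the labels array with a join table):

CREATE TABLE notification_labels (
  notification_id bigint NOT NULL REFERENCES notifications (id),
  label varchar NOT NULL
);

CREATE INDEX ON notification_labels (label, notification_id);

A btree index like this gets per-value statistics and supports index-only scans, at the cost of a join back to notifications.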

Extra: PostgreSQL default settings are completely wrong for large databases

Remember to tune the PG settings carefully based on your machine's resources. PostgreSQL won't do that for you.

In particular (a sample configuration follows this list):

  • set frequent vacuum (e.g. autovacuum_vacuum_threshold = 10000)
  • increase the memory-related settings (e.g. shared_buffers = 1/3 of RAM, effective_cache_size = 3/4 of RAM)
  • reduce random_page_cost (e.g. 1.1 for SSDs)
  • set the number of workers equal to the number of CPU cores
  • etc.
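As a concrete illustration, this is the kind of configuration you might end up with on a dedicated machine with, say, 16 GB of RAM, 8 CPU cores and an SSD (the numbers are assumptions to adapt to your hardware, not recommendations):

ALTER SYSTEM SET shared_buffers = '5GB';               -- ~1/3 of RAM (requires a restart)
ALTER SYSTEM SET effective_cache_size = '12GB';        -- ~3/4 of RAM
ALTER SYSTEM SET random_page_cost = 1.1;               -- SSD storage
ALTER SYSTEM SET max_worker_processes = 8;             -- number of CPU cores (requires a restart)
ALTER SYSTEM SET max_parallel_workers = 8;
ALTER SYSTEM SET autovacuum_vacuum_threshold = 10000;

The same values can of course be set directly in postgresql.conf.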

I hope that the above tips will save you some weeks of research, wondering why things are not as fast as you would expect. I would also love to hear from you in the comments.