The Surprising Discovery That Freed 20GB of Unused Index Space
Every few months we get an alert from our database monitoring, warning us that we are about to run out of space. Usually we just provision more storage and forget about it, but this time we were under quarantine, and the system in question was under less load than usual. We thought this was a good opportunity to do some cleanups that would otherwise be much more challenging.
To start from the end: we ended up freeing more than 70GB of un-optimized and un-utilized space without dropping a single index or deleting any data!
Using conventional techniques such as rebuilding indexes and tables we cleared up a lot of space, but then one surprising discovery helped us clear an additional ~20GB of unused indexed values!
This is what the free storage chart of one of our databases looked like in the process:
Provisioning storage is something we do every once in a while, but before we throw money at the problem we like to make sure we make good use of the storage we already have. To do that, we start with the usual suspects.
Unused Indexes
Unused indexes are double-edged swords: you create them to make things faster, but they end up taking space and slowing down inserts and updates. Unused indexes are the first thing we always check when we need to clear up storage.
To find unused indexes we use the following query:
SELECT
relname,
indexrelname,
idx_scan,
idx_tup_read,
idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelname::regclass)) as size
FROM
pg_stat_all_indexes
WHERE
schemaname = 'public'
AND indexrelname NOT LIKE 'pg_toast_%'
AND idx_scan = 0
AND idx_tup_read = 0
AND idx_tup_fetch = 0
ORDER BY
pg_relation_size(indexrelname::regclass) DESC;
The query looks for indexes that were not scanned or fetched since the last time the statistics were reset.
Some indexes may look like they were not used when in fact they were:
- The documentation lists a few scenarios where this is possible. For example, when the optimizer uses metadata from the index, but not the index itself.
- Indexes used to enforce unique or primary key constraints for tables that were not updated recently. The indexes will look like they were not used, but that does not mean we can get rid of them.
To find the unused indexes you can actually drop, you usually have to go over the list one by one and make a decision. This can be time consuming the first couple of times, but after you get rid of most unused indexes it becomes easier.
It's also a good idea to reset the statistics counters every once in a while, usually right after you finish inspecting the list. PostgreSQL provides a few functions to reset statistics at different levels. When we find an index we suspect is not being used, or when we add new indexes in place of old ones, we usually reset the counters for the table and wait for a while:
-- Find table oid by name
SELECT oid FROM pg_class c WHERE relname = 'table_name';
-- Reset counters for all indexes of the table
SELECT pg_stat_reset_single_table_counters(14662536);
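If you'd rather not look up the oid separately, the same reset can be done in a single step by casting the table name to regclass; a small variation on the snippet above:
-- Same reset in one step, letting the regclass cast resolve the table's oid
SELECT pg_stat_reset_single_table_counters('table_name'::regclass);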
We do this every once in a while, so in our case there were no unused indexes to drop.
Index and Table Bloat
The next suspect is bloat. When you update rows in a table, PostgreSQL marks the old tuple as dead and adds the updated tuple in the next available space. This process creates what's called “bloat”, which can cause tables to consume more space than they really need. Bloat also affects indexes, so when trying to free up space, bloat is a good place to look.
Estimating bloat in tables and indexes is apparently not a simple task. Lucky for us, some smart people on the world wide web already did the hard work and wrote queries to estimate table bloat and index bloat. After running these queries you will most likely find some bloat, so the next thing to do is clear up that space.
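Before reaching for those estimation queries, a very rough first impression can be had from the dead-tuple counters PostgreSQL already tracks. This is only a sketch; it reflects dead tuples accumulated since the last vacuum rather than actual bloat:
-- Very rough indication of dead tuples per table (not a real bloat estimate)
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 2) AS dead_pct,
    last_autovacuum
FROM
    pg_stat_user_tables
ORDER BY
    n_dead_tup DESC
LIMIT 10;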
Clearing Bloat in Indexes
To clear bloat in an index, you need to rebuild it. There are several ways to rebuild an index:
- Re-create the index: If you re-create the index, it will be built in an optimal way.
- Rebuild the index: Instead of dropping and creating the index yourself, PostgreSQL provides a way to rebuild an existing index in place using the REINDEX command:
REINDEX INDEX index_name;
- Rebuild the index concurrently: The previous methods obtain a lock on the table and prevent it from being modified while the operation is in progress, which is usually unacceptable. To rebuild the index without locking it for updates, you can rebuild the index concurrently:
REINDEX INDEX CONCURRENTLY index_name;
When using REINDEX CONCURRENTLY, PostgreSQL creates a new index with a name suffixed with _ccnew, and syncs any changes made to the table in the meantime. When the rebuild is done, it switches the old index with the new one, and drops the old index.
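Rebuilding a large index can take a while. On PostgreSQL 12 and later, the progress of the rebuild can be followed in the pg_stat_progress_create_index view, which covers REINDEX as well. A sketch, assuming the standard columns of that view:
-- Follow the progress of a running CREATE INDEX / REINDEX (PostgreSQL 12+)
SELECT
    p.command,
    p.phase,
    c.relname AS table_name,
    p.blocks_done,
    p.blocks_total,
    p.tuples_done,
    p.tuples_total
FROM
    pg_stat_progress_create_index p
    JOIN pg_class c ON c.oid = p.relid;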
If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space. To identify invalid indexes that were created during REINDEX, we use the following query:
-- Identify invalid indexes that were created during index rebuild
SELECT
c.relname as index_name,
pg_size_pretty(pg_relation_size(c.oid))
FROM
pg_index i
JOIN pg_class c ON i.indexrelid = c.oid
WHERE
-- New index built using REINDEX CONCURRENTLY
c.relname LIKE '%_ccnew'
-- In INVALID state
AND NOT indisvalid
LIMIT 10;
Once the rebuild process is no longer active, it should be safe to drop any remaining invalid indexes.
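Dropping such a leftover index is a plain DROP INDEX; the name below merely illustrates the _ccnew suffix pattern:
-- Drop a leftover invalid index from an interrupted REINDEX CONCURRENTLY
DROP INDEX CONCURRENTLY IF EXISTS index_name_ccnew;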
Activating B-Tree Index Deduplication
PostgreSQL 13 introduced a new, efficient way of storing duplicate values in B-Tree indexes called “B-Tree deduplication”.
For each indexed value, a B-Tree index holds in its leaf both the value and a pointer to the row (TID). The larger the indexed values, the larger the index. Up until PostgreSQL 12, when the index contained many duplicate values, all of these duplicates were stored in the index leaves. This is not very efficient and can take up a lot of space.
Starting with PostgreSQL 13, when B-Tree deduplication is activated, duplicate values are only stored once. This can make a big impact on the size of indexes with many duplicate values.
In PostgreSQL 13 index deduplication is enabled by default, unless you deactivate it:
-- Activating deduplication for a B-Tree index; this is the default:
CREATE INDEX index_name ON table_name(column_name) WITH (deduplicate_items = ON);
If you are migrating from PostgreSQL versions prior to 13, you need to rebuild the indexes using the REINDEX command in order to get the full benefits of index deduplication.
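Deduplication can also be toggled on an existing index, but the setting only affects newly written pages, so a rebuild is what actually compacts the existing data. A sketch with a placeholder index name:
-- Enable deduplication on an existing index, then rebuild it to compact it
ALTER INDEX index_name SET (deduplicate_items = on);
REINDEX INDEX CONCURRENTLY index_name;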
To illustrate the effect of B-Tree deduplication on the size of an index, create a table with a unique column and a non-unique column, and populate it with 1M rows. On each column create two B-Tree indexes, one with deduplication enabled and another with deduplication disabled:
db=# CREATE TABLE test_btree_dedup (n_unique serial, n_not_unique integer);
CREATE TABLE
db=# INSERT INTO test_btree_dedup (n_not_unique)
SELECT (random() * 100)::int FROM generate_series(1, 1000000);
INSERT 0 1000000
db=# CREATE INDEX ix1 ON test_btree_dedup (n_unique) WITH (deduplicate_items = OFF);
CREATE INDEX
db=# CREATE INDEX ix2 ON test_btree_dedup (n_unique) WITH (deduplicate_items = ON);
CREATE INDEX
db=# CREATE INDEX ix3 ON test_btree_dedup (n_not_unique) WITH (deduplicate_items = OFF);
CREATE INDEX
db=# CREATE INDEX ix4 ON test_btree_dedup (n_not_unique) WITH (deduplicate_items = ON);
CREATE INDEX
Next, compare the sizes of the four indexes:
Column | Deduplication | Size
---|---|---
Not unique | Yes | 6840 kB
Not unique | No | 21 MB
Unique | Yes | 21 MB
Unique | No | 21 MB
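The sizes can be pulled with a query along these lines; a sketch using the test table from above:
-- Compare the sizes of the indexes created on the test table
SELECT
    indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM
    pg_stat_all_indexes
WHERE
    relname = 'test_btree_dedup'
ORDER BY
    indexrelname;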
As expected, deduplication had no effect on the unique index, but it had a significant effect on the index with many duplicate values.
Sadly for us, PostgreSQL 13 was still fresh at the time and our cloud provider did not support it yet, so we were unable to use deduplication to clear space.
Clearing Bloat in Tables
Just like indexes, tables can also contain dead tuples that cause bloat and fragmentation. However, unlike an index, which only contains data from an associated table, a table cannot simply be re-created. To re-create a table you would have to create a new table, migrate the data over while keeping it synced with new data, and create all of the indexes, constraints and any referential constraints in other tables. Only after all of this is done can you switch the old table with the new one.
There are several ways to rebuild a table and reduce bloat:
- Re-create the table: Using this method as described above usually requires a lot of development, especially if the table is actively being used while it's being rebuilt.
- Vacuum the table: PostgreSQL provides a way to reclaim space occupied by bloat and dead tuples in a table using the VACUUM FULL command. VACUUM FULL requires a lock on the table, and is not an ideal solution for tables that need to be available while being vacuumed:
-- Will lock the table
VACUUM FULL table_name;
The two options above require either significant effort or some downtime.
Using pg_repack
Both built-in options for rebuilding tables are not ideal unless you can afford downtime. One popular solution for rebuilding tables and indexes without downtime is the pg_repack extension.
Being a popular extension, pg_repack is likely available from your package manager or already installed by your cloud provider. To use pg_repack, you first need to create the extension:
CREATE EXTENSION pg_repack;
To “repack” a table along with its indexes, issue the following command from the console:
$ pg_repack -k --table table_name db_name
To rebuild a table with no downtime, the extension creates a new table, loads the data from the original table into it while keeping it up to date with new data, and then also rebuilds the indexes. When the process is finished, the two tables are switched and the original table is dropped. See here for full details.
There are two caveats to be aware of when using pg_repack to rebuild tables:
- Requires storage roughly the size of the table to rebuild: the extension creates another table to copy the data to, so it requires additional storage roughly the size of the table and its indexes.
- May require some manual cleanup: if the “repack” process fails or is stopped manually, it may leave intermediate objects lying around, so you may need to do some manual cleanup.
Despite these caveats, pg_repack is a great option for rebuilding tables and indexes with no downtime. However, because it requires some additional storage to operate, it is not a good option when you are already out of storage. It's a good idea to monitor the free storage space and plan rebuilds in advance.
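Since the headroom needed is roughly the size of the table plus its indexes, a quick look at total relation sizes helps with that planning. A sketch; pg_total_relation_size includes indexes and TOAST data:
-- Largest tables in the public schema, including their indexes and TOAST
SELECT
    relname,
    pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM
    pg_class
WHERE
    relkind = 'r'
    AND relnamespace = 'public'::regnamespace
ORDER BY
    pg_total_relation_size(oid) DESC
LIMIT 10;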
At this point we had already used all the conventional techniques we could think of and cleared up quite a bit of space. We dropped unused indexes and cleared bloat from tables and indexes, but… there was still more space to shave off!
The “Aha Moment”
While we were looking at the sizes of the indexes after we finished rebuilding them, an interesting thing caught our eye.
One of our largest tables stores transaction data. In our system, after a payment is made, the user can choose to cancel and get a refund. This does not happen very often, and only a fraction of the transactions end up being cancelled.
In our transactions table, there are foreign keys to both the purchasing user and the cancelling user, and each field has a B-Tree index defined on it. The purchasing user has a NOT NULL constraint on it, so all of the rows hold a value. The cancelling user, on the other hand, is nullable, and only a fraction of the rows hold any data. Most of the values in the cancelling user field are NULL.
We expected the index on the cancelling user to be significantly smaller than the index on the purchasing user, but they were exactly the same size. Coming from Oracle, I was always taught that NULLs are not indexed, but in PostgreSQL they are! This “Aha” moment led us to the conclusion that we were indexing a lot of unnecessary values for no reason.
This was the original index we had for the cancelling user:
CREATE INDEX transaction_cancelled_by_ix ON transactions(cancelled_by_user_id);
To test our theory, we replaced the index with a partial index that excludes null values:
DROP INDEX transaction_cancelled_by_ix;
CREATE INDEX transaction_cancelled_by_part_ix ON transactions(cancelled_by_user_id)
WHERE cancelled_by_user_id IS NOT NULL;
The full index, after we reindexed it, was 769MB in size, with more than 99% null values. The partial index that excluded null values was less than 5MB. That's more than 99% of dead weight shaved off the index!
Index | Size
---|---
Full index | 769MB
Partial index | 5MB
Difference | -99%
To make sure these NULL values were indeed unnecessary, we reset the stats on the table and waited a while. Not long after, we saw that the index was being used just like the old one! We had just shaved off more than 760MB of unused indexed tuples without compromising performance!
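To watch the new index being picked up, we can look at its counters in pg_stat_all_indexes; a sketch using the index name from the example above:
-- Check that the new partial index is actually being scanned
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM
    pg_stat_all_indexes
WHERE
    indexrelname = 'transaction_cancelled_by_part_ix';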
Using Partial Indexes
Once we had a good experience with one partial index, we figured we might have more indexes like that. To find good candidates for partial indexes we wrote a query to search for indexes on fields with a high null_frac, the percentage of values in the column that PostgreSQL estimates to be NULL:
-- Find indexed columns with high null_frac
SELECT
c.oid,
c.relname AS index,
pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
i.indisunique AS "unique",
a.attname AS indexed_column,
CASE s.null_frac
WHEN 0 THEN ''
ELSE to_char(s.null_frac * 100, '999.00%')
END AS null_frac,
pg_size_pretty((pg_relation_size(c.oid) * s.null_frac)::bigint) AS expected_saving
-- Uncomment to include the index definition
--, ixs.indexdef
FROM
pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
JOIN pg_attribute a ON a.attrelid = c.oid
JOIN pg_class c_table ON c_table.oid = i.indrelid
JOIN pg_indexes ixs ON c.relname = ixs.indexname
LEFT JOIN pg_stats s ON s.tablename = c_table.relname AND a.attname = s.attname
WHERE
-- Primary keys cannot be partial
NOT i.indisprimary
-- Exclude already partial indexes
AND i.indpred IS NULL
-- Exclude composite indexes
AND array_length(i.indkey, 1) = 1
-- Larger than 10MB
AND pg_relation_size(c.oid) > 10 * 1024 ^ 2
ORDER BY
pg_relation_size(c.oid) * s.null_frac DESC;
The results of this query can look like this:
   oid   |       index        | index_size | unique | indexed_column | null_frac | expected_saving
---------+--------------------+------------+--------+----------------+-----------+-----------------
  138247 | tx_cancelled_by_ix | 1418 MB    | f      | cancelled_by   |    96.15% | 1363 MB
   16988 | tx_op1_ix          | 1651 MB    | t      | op1            |     6.11% | 101 MB
 1473377 | tx_token_ix        | 22 MB      | t      | token          |    11.21% | 2494 kB
  138529 | tx_op_name_ix      | 1160 MB    | t      | name           |           | 0 bytes
In the table above we can identify several types of results:
- tx_cancelled_by_ix is a large index with many null values: great potential here!
- tx_op1_ix is a large index with few null values: there's not much potential here.
- tx_token_ix is a small index with few null values: not worth bothering with.
- tx_op_name_ix is a large index with no null values: nothing to do here.
The results show that by turning tx_cancelled_by_ix into a partial index that excludes nulls we can potentially save ~1.3GB.
Is it always beneficial to exclude nulls from indexes?
No. NULL is as meaningful as any other value. If your queries search for null values using IS NULL, they might benefit from an index on NULL.
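In that case, the opposite kind of partial index can make sense, keeping only the null values. A hypothetical example based on the table above:
-- Hypothetical: index only the NULL values, for queries filtering with IS NULL
CREATE INDEX transaction_cancelled_by_null_ix ON transactions(cancelled_by_user_id)
WHERE cancelled_by_user_id IS NULL;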
So is this technique useful only for null values?
Using partial indexes to exclude values that are queried rarely or not at all can be beneficial for any value, not just null values. NULL usually indicates a lack of value, and in our case not many queries were searching for null values, so it made sense to exclude them from the index.
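For example, an index on a status-like column where the overwhelming majority of rows have reached a final state could exclude that value. The column and value below are hypothetical, for illustration only:
-- Hypothetical: skip the overwhelmingly common final status value
CREATE INDEX transaction_active_status_ix ON transactions(status)
WHERE status != 'completed';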
So how did you end up clearing more than 20GB?
You may have noticed that the title mentions more than 20GB of freed space while the charts only show half of that. Well… indexes are also dropped from replicas! When you release 10GB from your primary database, you also release roughly the same amount of storage from each replica.
This story is taken from a large application built with Django. To put the above techniques into practice with Django, there are a few things to note.
Prevent Implicit Creation of Indexes on Foreign Keys
Unless you explicitly set db_index=False, Django will implicitly create a B-Tree index on a models.ForeignKey field. Consider the following example:
from django.db import models
from django.contrib.auth.models import User

class Transaction(models.Model):
    # ...
    cancelled_by_user = models.ForeignKey(
        to=User,
        null=True,
        on_delete=models.CASCADE,
    )
The model is used to keep track of transaction data. If a transaction is cancelled, we keep a reference to the user that cancelled it. As previously described, most transactions do not end up being cancelled, so we set null=True on the field.
In the ForeignKey definition above we did not explicitly set db_index, so Django will implicitly create a full index on the field. To create a partial index instead, make the following changes:
from django.db import models
from django.db.models import Q
from django.contrib.auth.models import User

class Transaction(models.Model):
    # ...
    cancelled_by_user = models.ForeignKey(
        to=User,
        null=True,
        on_delete=models.CASCADE,
        db_index=False,
    )

    class Meta:
        indexes = (
            models.Index(
                fields=('cancelled_by_user_id', ),
                name='%(class_name)s_cancelled_by_part_ix',
                condition=Q(cancelled_by_user_id__isnull=False),
            ),
        )
We first tell Django not to create the index on the FK field, and then add a partial index using models.Index.
Takeaway: Nullable foreign keys are good candidates for a partial index!
To prevent implicit features such as this one from sneaking in indexes without us noticing, we created Django checks to force ourselves to always explicitly set db_index on foreign keys.
Migrate Existing Full Indexes to Partial Indexes
One of the challenges we faced during this migration was replacing the existing full indexes with partial indexes without causing downtime or degraded performance in the process. After we identified the full indexes we wanted to replace, we took the following steps:
- Replace full indexes with partial indexes: Adjust the relevant Django models and replace full indexes with partial indexes, as demonstrated above. The migration Django generates will first disable the FK constraint (if the field is a foreign key), drop the existing full index and create the new partial index. Executing this migration could cause both downtime and degraded performance, so we won't actually run it.
- Create the partial indexes manually: Use Django's ./manage.py sqlmigrate utility to produce a script for the migration, extract only the CREATE INDEX statements and adjust them to create the indexes CONCURRENTLY. Then, create the indexes manually and concurrently in the database (see the sketch after this list). Since the full indexes are not dropped yet, they can still be used by queries, so performance should not be affected in the process. It is possible to create indexes concurrently in Django migrations, but this time we decided it was best to do it manually.
- Reset full index statistics counters: To make sure it was safe to drop the full indexes, we wanted to first make sure the new partial indexes were being used. To keep track of their use we reset the counters for the full indexes using pg_stat_reset_single_table_counters(<full index oid>).
- Monitor use of partial indexes: After resetting the stats we monitored both overall query performance and partial index usage by observing the values of idx_scan, idx_tup_read and idx_tup_fetch in the pg_stat_all_indexes table, for both the partial and the full indexes.
- Drop the full indexes: Once we were convinced the partial indexes were being used, we dropped the full indexes. This is a good point to check the sizes of both the partial and full indexes to find out exactly how much storage you are about to free.
- Fake the Django migration: Once the database state was effectively in sync with the model state, we faked the migration using ./manage.py migrate --fake. When faking a migration, Django registers the migration as executed but won't actually execute anything. This is useful for situations like this, when you need finer control over the migration process. Note that on other environments such as dev, QA or staging, where there are no downtime concerns, the Django migrations execute normally and the full indexes are replaced with the partial ones. For more advanced Django migration operations such as “fake”, check out How to Move a Django Model to Another App.
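The adjusted statements from the second step ended up looking roughly like this; a sketch using the example index from earlier, with the exact definition taken from the ./manage.py sqlmigrate output:
-- Hand-run, concurrent version of the generated CREATE INDEX statement
CREATE INDEX CONCURRENTLY transaction_cancelled_by_part_ix ON transactions(cancelled_by_user_id)
WHERE cancelled_by_user_id IS NOT NULL;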
Optimizing disks, storage parameters and configuration can only affect performance so much. At some point, to squeeze out that final drop of performance, you need to make changes to the underlying objects. In this case, it was the index definition.
To sum up the process we took to clear as much storage as we could:
- Remove unused indexes
- Repack tables and indexes (and activate B-Tree deduplication when possible)
- Utilize partial indexes to index only what is necessary
Hopefully, after applying these techniques you can gain a few more days before you need to reach into your pocket and provision more storage.