
The Surprising Find That Freed 20GB of Unused Index Space

2023-08-28 09:42:04


Every few months we get an alert from our database monitoring warning us that we're about to run out of space. Usually we just provision more storage and forget about it, but this time we were under quarantine and the system in question was under less load than usual. We thought this was a good opportunity to do some cleanups that would otherwise be much more challenging.

To start from the end: we ended up freeing more than 70GB of un-optimized and un-utilized space without dropping a single index or deleting any data!

Using conventional techniques such as rebuilding indexes and tables we cleared up a lot of space, but then one surprising find helped us clear an additional ~20GB of unused indexed values!

This is what the free storage chart of one of our databases looked like in the process:

Free space over time (higher means more free space)

Provisioning storage is something we do every once in a while, but before we throw money at the problem we like to make sure we're making good use of the storage we already have. To do that, we start with the usual suspects.

Unused Indexes

Indexes are a double-edged sword: you create them to make things faster, but they end up taking space and slowing down inserts and updates. Unused indexes are the first thing we always check when we need to clear up storage.

To find unused indexes we use the following query:

SELECT
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelname::regclass)) as size
FROM
    pg_stat_all_indexes
WHERE
    schemaname = 'public'
    AND indexrelname NOT LIKE 'pg_toast_%'
    AND idx_scan = 0
    AND idx_tup_read = 0
    AND idx_tup_fetch = 0
ORDER BY
    pg_relation_size(indexrelname::regclass) DESC;

The query looks for indexes that were not scanned or fetched since the last time the statistics were reset.

Some indexes may look like they were not used even though they actually were:

  • The documentation lists a few scenarios where this is possible. For example, when the optimizer uses metadata from the index, but not the index itself.

  • Indexes used to enforce unique or primary key constraints on tables that were not updated recently. The indexes will look like they were not used, but it doesn't mean we can get rid of them.

To find the unused indexes you can actually drop, you usually have to go over the list one by one and make a decision. This can be time consuming the first couple of times, but after you get rid of most unused indexes it becomes easier.
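One thing that can speed up the review is annotating the list with which indexes back a constraint. This is a minimal sketch, assuming a table named table_name; pg_constraint.conindexid points at the index that enforces a unique, primary key or exclusion constraint:

-- Flag "unused" indexes that actually back a constraint and must be kept
SELECT
    i.indexrelid::regclass AS index_name,
    i.indisunique AS is_unique,
    i.indisprimary AS is_primary,
    c.conname AS backing_constraint
FROM
    pg_index i
    LEFT JOIN pg_constraint c ON c.conindexid = i.indexrelid
WHERE
    i.indrelid = 'table_name'::regclass;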

It's also a good idea to reset the statistics counters every once in a while, usually right after you finish inspecting the list. PostgreSQL provides a few functions to reset statistics at different levels. When we find an index we suspect is not being used, or when we add new indexes in place of old ones, we usually reset the counters for the table and wait for a while:

-- Find the table oid by name
SELECT oid FROM pg_class c WHERE relname = 'table_name';
-- Reset the counters for all indexes of the table
SELECT pg_stat_reset_single_table_counters(14662536);

We do this every once in a while, so in our case there were no unused indexes to drop.

Index and Table Bloat

The next suspect is bloat. When you update rows in a table, PostgreSQL marks the old tuple as dead and adds the updated tuple in the next available space. This process creates what's called "bloat", which can cause tables to consume more space than they really need. Bloat also affects indexes, so to free up space, bloat is a good place to look.

Estimating bloat in tables and indexes is apparently not a simple task. Lucky for us, some good people on the world wide web already did the hard work and wrote queries to estimate table bloat and index bloat. After running these queries you will most likely find some bloat, so the next thing to do is clear up that space.
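If the pgstattuple extension is available on your instance (not every managed provider ships it, so treat this as optional), you can also measure bloat directly instead of estimating it:

-- Requires the pgstattuple extension
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Dead tuple and free space figures for a table
SELECT * FROM pgstattuple('table_name');

-- Leaf density and fragmentation for a B-Tree index
SELECT * FROM pgstatindex('index_name');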

Clearing Bloat in Indexes

To clear bloat in an index, you need to rebuild it. There are several ways to rebuild an index:

  1. Re-create the index: If you re-create the index, it will be built in an optimal way.

  2. Rebuild the index: Instead of dropping and creating the index yourself, PostgreSQL provides a way to rebuild an existing index in place using the REINDEX command:

REINDEX INDEX index_name;
  3. Rebuild the index concurrently: The previous methods will acquire a lock on the table and prevent it from being modified while the operation is in progress, which is usually unacceptable. To rebuild the index without locking it for writes, you can rebuild the index concurrently:
REINDEX INDEX CONCURRENTLY index_name;

When using REINDEX CONCURRENTLY, PostgreSQL creates a new index with a name suffixed with _ccnew, and syncs any changes made to the table in the meantime. When the rebuild is done, it will swap the old index with the new index, and drop the old one.

Clearing bloat in Indexes

If for some reason you had to stop the rebuild in the middle, the new index will not be dropped. Instead, it will be left in an invalid state and consume space. To identify invalid indexes that were created during REINDEX, we use the following query:

-- Identify invalid indexes that were created during index rebuild
SELECT
    c.relname as index_name,
    pg_size_pretty(pg_relation_size(c.oid))
FROM
    pg_index i
    JOIN pg_class c ON i.indexrelid = c.oid
WHERE
    -- New index built using REINDEX CONCURRENTLY
    c.relname LIKE  '%_ccnew'
    -- In INVALID state
    AND NOT indisvalid
LIMIT 10;

Once the rebuild process is no longer active, it should be safe to drop any remaining invalid indexes.
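Dropping a leftover invalid index is just a regular DROP INDEX; the name below is only an example of the _ccnew suffix pattern:

-- Drop a leftover invalid index from an interrupted REINDEX CONCURRENTLY
DROP INDEX CONCURRENTLY index_name_ccnew;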

Activating B-Tree Index Deduplication

PostgreSQL 13 introduced a new, more efficient way of storing duplicate values in B-Tree indexes called "B-Tree deduplication".

For every indexed value, a B-Tree index holds in its leaf both the value and a pointer to the row (TID). The larger the indexed values, the larger the index. Up until PostgreSQL 12, when the index contained many duplicate values, all of those duplicate values were stored in the index leaves. This is not very efficient and can take up a lot of space.

B-Tree Index Deduplication

Starting with PostgreSQL 13, when B-Tree deduplication is activated, duplicate values are only stored once. This can make a big impact on the size of indexes with many duplicate values.

In PostgreSQL 13, index deduplication is enabled by default, unless you deactivate it:

-- Activating deduplication for a B-Tree index, this is the default:
CREATE INDEX index_name ON table_name(column_name) WITH (deduplicate_items = ON);

If you are migrating from PostgreSQL versions prior to 13, you need to rebuild the indexes using the REINDEX command in order to get the full benefit of index deduplication.
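If you want to double check whether deduplication was explicitly turned off for an index, one way (a sketch) is to look at the storage parameters recorded in pg_class; a NULL reloptions means the defaults apply:

-- Show explicitly set storage parameters for an index
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'index_name';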

To illustrate the effect of B-Tree deduplication on the size of an index, create a table with a unique column and a non-unique column, and populate it with 1M rows. On each column create two B-Tree indexes, one with deduplication enabled and another with deduplication disabled:

db=# CREATE TABLE test_btree_dedup (n_unique serial, n_not_unique integer);
CREATE TABLE

db=# INSERT INTO test_btree_dedup (n_not_unique)
SELECT (random() * 100)::int FROM generate_series(1, 1000000);
INSERT 0 1000000

db=# CREATE INDEX ix1 ON test_btree_dedup (n_unique)     WITH (deduplicate_items = OFF);
CREATE INDEX

db=# CREATE INDEX ix2 ON test_btree_dedup (n_unique)     WITH (deduplicate_items = ON);
CREATE INDEX

db=# CREATE INDEX ix3 ON test_btree_dedup (n_not_unique) WITH (deduplicate_items = OFF);
CREATE INDEX

db=# CREATE INDEX ix4 ON test_btree_dedup (n_not_unique) WITH (deduplicate_items = ON);
CREATE INDEX

Next, compare the sizes of the four indexes:

Column      Deduplication  Size
Not unique  Yes            6840 kB
Not unique  No             21 MB
Unique      Yes            21 MB
Unique      No             21 MB

As expected, deduplication had no effect on the unique index, but it had a significant effect on the index with many duplicate values.

Unfortunately for us, PostgreSQL 13 was still fresh at the time and our cloud provider did not support it yet, so we were unable to use deduplication to clear space.

Clearing Bloat in Tables

Just like indexes, tables can also contain dead tuples that cause bloat and fragmentation. However, unlike an index, which only contains data from an associated table, a table cannot simply be re-created. To re-create a table you would have to create a new table, migrate the data over while keeping it synced with new data, and create all the indexes, constraints and any referential constraints in other tables. Only after all of this is done can you swap the old table with the new one.

Clearing bloat in Tables

There are several ways to rebuild a table and reduce bloat:

  1. Re-create the table: Using this method as described above usually requires a lot of development effort, especially if the table is actively being used while it's rebuilt.

  2. Vacuum the table: PostgreSQL provides a way to reclaim space occupied by bloat and dead tuples in a table using the VACUUM FULL command. VACUUM FULL requires an exclusive lock on the table, and is not a great solution for tables that need to remain accessible while being vacuumed (see the note on plain VACUUM right after this list):

-- Will lock the table
VACUUM FULL table_name;
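For context, a plain VACUUM (without FULL) does not take an exclusive lock; it marks dead tuples as reusable space but generally does not return that space to the operating system, so it won't move the free-storage chart on its own:

-- Does not block reads or writes, but usually does not shrink the table on disk
VACUUM table_name;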

The two options above require either significant effort or some downtime.

Using pg_repack

Both built-in options for rebuilding tables are not ideal unless you can afford downtime. One popular solution for rebuilding tables and indexes without downtime is the pg_repack extension.

Being a popular extension, pg_repack is likely available from your package manager or already installed by your cloud provider. To use pg_repack, you first need to create the extension:

CREATE EXTENSION pg_repack;

To “repack” a table along with its indexes, issue the following command from the console:

$ pg_repack -k --table table_name db_name

To rebuild a table with no downtime, the extension creates a new table, loads the data from the original table into it while keeping it up to date with new data, and then also rebuilds the indexes. When the process is finished, the two tables are switched and the original table is dropped. See here for full details.

There are two caveats to be aware of when using pg_repack to rebuild tables:

  • Requires free storage roughly the size of the table to rebuild: the extension creates another table to copy the data to, so it requires additional storage roughly the size of the table and its indexes.

  • May require some manual cleanup: if the “repack” process failed or was stopped manually, it may leave intermediate objects lying around, so you may need to do some manual cleanup.

Despite these caveats, pg_repack is a great option for rebuilding tables and indexes with no downtime. However, because it requires some additional storage to operate, it's not a good option when you are already out of storage. It's a good idea to monitor free storage space and plan rebuilds in advance.
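A quick way to check how much storage a repack would roughly need, and to keep an eye on the biggest relations, is something along these lines (the table name is a placeholder):

-- Size of a table including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('table_name'));

-- The largest tables and indexes in the current database
SELECT
    relname,
    pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM
    pg_class
WHERE
    relkind IN ('r', 'i')
ORDER BY
    pg_total_relation_size(oid) DESC
LIMIT 10;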


At this point we had already used all the conventional techniques we could think of and cleared up quite a bit of space. We dropped unused indexes and cleared bloat from tables and indexes, but… there was still more space to shave off!

The “Aha Moment”

While we were looking at the sizes of the indexes after we finished rebuilding them, an interesting thing caught our eye.

One of our largest tables stores transaction data. In our system, after a payment is made, the user can choose to cancel it and get a refund. This doesn't happen very often, and only a fraction of the transactions end up being cancelled.

In our transactions table, there are foreign keys to both the purchasing user and the cancelling user, and each field has a B-Tree index defined on it. The purchasing user has a NOT NULL constraint on it, so all of the rows hold a value. The cancelling user, on the other hand, is nullable, and only a fraction of the rows hold any data. Most of the values in the cancelling user field are NULL.

Transactions table with FKs to the purchasing and cancelling user

We expected the index on the cancelling user to be significantly smaller than the index on the purchasing user, but they were exactly the same size. Coming from Oracle, I was always taught that NULLs are not indexed, but in PostgreSQL they are! This "Aha" moment led us to the conclusion that we were indexing a lot of unnecessary values for no reason.

This was the original index we had on the cancelling user:

CREATE INDEX transaction_cancelled_by_ix ON transactions(cancelled_by_user_id);

To test our theory, we replaced the index with a partial index that excludes null values:

DROP INDEX transaction_cancelled_by_ix;

CREATE INDEX transaction_cancelled_by_part_ix ON transactions(cancelled_by_user_id)
WHERE cancelled_by_user_id IS NOT NULL;

The full index, after we reindexed it, was 769MB in size, with more than 99% null values. The partial index that excluded null values was less than 5MB. That's more than 99% of dead weight shaved off the index!

Index          Size
Full index     769 MB
Partial index  5 MB
Difference     -99%

To make sure these NULL values were indeed unnecessary, we reset the stats on the table and waited a while. Not long after, we saw that the new index was being used just like the old one! We had just shaved off more than 760MB of unused indexed tuples without compromising performance!
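The check itself was something along these lines, using the same statistics view as before and the index name from the example above:

-- Watch usage of the new partial index after resetting the stats
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM
    pg_stat_all_indexes
WHERE
    indexrelname = 'transaction_cancelled_by_part_ix';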

Clearing space, play by play

Using Partial Indexes

Once we had a good experience with one partial index, we figured we might have more indexes like that. To find good candidates for a partial index we wrote a query to search for indexes on fields with a high null_frac, the percentage of values in the column that PostgreSQL estimates are NULL:


-- Find indexed columns with a high null_frac
SELECT
    c.oid,
    c.relname AS index,
    pg_size_pretty(pg_relation_size(c.oid)) AS index_size,
    i.indisunique AS "unique",
    a.attname AS indexed_column,
    CASE s.null_frac
        WHEN 0 THEN ''
        ELSE to_char(s.null_frac * 100, '999.00%')
    END AS null_frac,
    pg_size_pretty((pg_relation_size(c.oid) * s.null_frac)::bigint) AS expected_saving
    -- Uncomment to include the index definition
    --, ixs.indexdef

FROM
    pg_class c
    JOIN pg_index i ON i.indexrelid = c.oid
    JOIN pg_attribute a ON a.attrelid = c.oid
    JOIN pg_class c_table ON c_table.oid = i.indrelid
    JOIN pg_indexes ixs ON c.relname = ixs.indexname
    LEFT JOIN pg_stats s ON s.tablename = c_table.relname AND a.attname = s.attname

WHERE
    -- Primary keys cannot be partial
    NOT i.indisprimary

    -- Exclude already partial indexes
    AND i.indpred IS NULL

    -- Exclude composite indexes
    AND array_length(i.indkey, 1) = 1

    -- Larger than 10MB
    AND pg_relation_size(c.oid) > 10 * 1024 ^ 2

ORDER BY
    pg_relation_size(c.oid) * s.null_frac DESC;

The results of this query can look like this:

   oid   |       index        | index_size | unique | indexed_column | null_frac | expected_saving
---------+--------------------+------------+--------+----------------+-----------+-----------------
  138247 | tx_cancelled_by_ix | 1418 MB    | f      | cancelled_by   |   96.15%  | 1363 MB
   16988 | tx_op1_ix          | 1651 MB    | t      | op1            |    6.11%  | 101 MB
 1473377 | tx_token_ix        | 22 MB      | t      | token          |   11.21%  | 2494 kB
  138529 | tx_op_name_ix      | 1160 MB    | t      | name           |           | 0 bytes

In the table above we can identify several types of results:

  • tx_cancelled_by_ix is a large index with many null values: great potential here!
  • tx_op1_ix is a large index with few null values: there's not much potential
  • tx_token_ix is a small index with few null values: I wouldn't bother with this index
  • tx_op_name_ix is a large index with no null values: nothing to do here

The results show that by turning tx_cancelled_by_ix into a partial index that excludes nulls, we can potentially save ~1.3GB.

Is it always beneficial to exclude nulls from indexes?

No. NULL is as meaningful as any other value. If your queries search for null values using IS NULL, those queries might benefit from an index on NULL.

So is this technique useful only for null values?

Using partial indexes to exclude values that are not queried very often, or at all, can be beneficial for any value, not just null values. NULL usually indicates a lack of value, and in our case not many queries were searching for null values, so it made sense to exclude them from the index.
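For example (a hypothetical status column, not from our actual schema), if the vast majority of rows sit in a terminal state that queries never filter on, the same trick applies:

-- Hypothetical: index only the rows that are still in-flight
CREATE INDEX transactions_pending_ix ON transactions(status)
WHERE status NOT IN ('done', 'cancelled');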

So how did you end up clearing more than 20GB?

You may have noticed that the title mentions more than 20GB of freed space but the charts only show about half of that. Well… indexes are also dropped from replicas! When you release 10GB from your primary database, you also release roughly the same amount of storage from each replica.


This story is taken from a large application built with Django. To put the above techniques into practice with Django, there are a few things to note.

Prevent Implicit Creation of Indexes on Foreign Keys

Unless you explicitly set db_index=False, Django will implicitly create a B-Tree index on a models.ForeignKey field. Consider the following example:

from django.db import models
from django.contrib.auth.models import User

class Transaction(models.Model):
    # ...
    cancelled_by_user = models.ForeignKey(
        to=User,
        null=True,
        on_delete=models.CASCADE,
    )

The model is used to keep track of transaction data. If a transaction is cancelled, we keep a reference to the user that cancelled it. As previously described, most transactions don't end up being cancelled, so we set null=True on the field.

In the ForeignKey definition above we didn't explicitly set db_index, so Django will implicitly create a full index on the field. To create a partial index instead, make the following changes:

from django.db import models
from django.db.models import Q
from django.contrib.auth.models import User

class Transaction(models.Model):
    # ...
    cancelled_by_user = models.ForeignKey(
        to=User,
        null=True,
        on_delete=models.CASCADE,
        # Prevent the implicit creation of a full index on the FK column
        db_index=False,
    )

    class Meta:
        indexes = (
            # Partial index that only covers rows where the cancelling user is set
            models.Index(
                fields=('cancelled_by_user', ),
                name='transaction_cancelled_by_part_ix',
                condition=Q(cancelled_by_user__isnull=False),
            ),
        )

We first tell Django not to create the index on the FK field, and then add a partial index using models.Index.

Takeaway

Nullable foreign keys are good candidates for a partial index!

To prevent implicit features such as this one from sneaking in indexes without us noticing, we created Django checks to force ourselves to always explicitly set db_index on foreign keys.

Migrate Existing Full Indexes to Partial Indexes

One of the challenges we faced during this migration was replacing the existing full indexes with partial indexes without causing downtime or degraded performance. After we identified the full indexes we wanted to replace, we took the following steps:

  1. Replace full indexes with partial indexes: Adjust the relevant Django models and replace full indexes with partial indexes, as demonstrated above. The migration Django generates will first disable the FK constraint (if the field is a foreign key), drop the existing full index and create the new partial index. Executing this migration could cause both downtime and degraded performance, so we are not actually going to run it.

  2. Create the partial indexes manually: Use Django's ./manage.py sqlmigrate utility to produce a script for the migration, extract only the CREATE INDEX statements and adjust them to create the indexes CONCURRENTLY (see the sketch after this list). Then, create the indexes manually and concurrently in the database. Since the full indexes are not dropped yet, they can still be used by queries, so performance should not be impacted in the process. It is possible to create indexes concurrently in Django migrations, but this time we decided it was best to do it manually.

  3. Reset full index statistics counters: To make sure it's safe to drop the full indexes, we first wanted to make sure the new partial indexes were being used. To keep track of their use, we reset the counters for the full indexes using pg_stat_reset_single_table_counters(<full index oid>).

  4. Monitor use of the partial indexes: After resetting the stats we monitored both overall query performance and the partial index usage, by observing the values of idx_scan, idx_tup_read and idx_tup_fetch in the pg_stat_all_indexes view, for both the partial and the full indexes.

  5. Drop the full indexes: Once we were convinced the partial indexes were being used, we dropped the full indexes. This is a good point to check the sizes of both the partial and full indexes, to find out exactly how much storage you are about to free.

  6. Fake the Django migration: Once the database state was effectively in sync with the model state, we faked the migration using ./manage.py migrate --fake. When faking a migration, Django registers the migration as executed, but doesn't actually execute anything. This is useful for situations like this, when you need better control over a migration process. Note that in other environments such as dev, QA or staging, where there are no downtime concerns, the Django migrations will execute normally and the full indexes will be replaced with the partial ones. For more advanced Django migration operations such as "fake", check out How to Move a Django Model to Another App.
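Steps 2 and 5 boil down to statements like the following, executed directly against the database (index names taken from the earlier example):

-- Step 2: build the partial index without blocking writes
CREATE INDEX CONCURRENTLY transaction_cancelled_by_part_ix
ON transactions (cancelled_by_user_id)
WHERE cancelled_by_user_id IS NOT NULL;

-- Step 5: once the partial index is verified to be in use, drop the full index
DROP INDEX CONCURRENTLY transaction_cancelled_by_ix;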


Optimizing disks, storage parameters and configuration can only affect performance so much. At some point, to squeeze out that final drop of performance, you need to make changes to the underlying objects. In this case, it was the index definition.

To sum up the process we took to clear as much storage as we could:

  • Remove unused indexes
  • Repack tables and indexes (and activate B-Tree deduplication when possible)
  • Utilize partial indexes to index only what's necessary

Hopefully, after applying these techniques you can gain a few more days before you need to reach into your pocket and provision more storage.
