PostgreSQL Logical Replication Gotchas

PostgreSQL 10 came with the welcome addition of the logical replication
feature. This provides a more flexible and easier way to replicate your
tables than the regular streaming replication mechanism. However, it does
have some limitations that may or may not prevent you from using it for
replication. Read on to learn more.

What Is Logical Replication Anyway?

Streaming Replication

Prior to v10, the only way to replicate data residing in a server was to
replicate the changes at WAL-level. During its operation, a PostgreSQL server
(the primary) generates a sequence of WAL files. The basic idea is to get
these files over to another PostgreSQL server (the standby), which takes in
these files and “replays” them to recreate the same changes happening on the
primary server. The standby server stays in a read-only mode called the
recovery mode, and any changes to the standby server are not permitted (that
is, only read-only transactions are allowed).

The process of shipping the WAL files from the primary to the standby is called
log shipping, and can be done manually (scripts to rsync changes from the
primary’s $PGDATA/pg_wal directory to the secondary’s) or through streaming
replication. Various features like replication slots, standby feedback and
failover have been added over time to improve the reliability and usefulness
of streaming replication.

One big “feature” of streaming replication is that it’s all or nothing. All
changes to all objects from all databases on the primary have to be shipped to
the standby, and the standby has to import every change. It is not possible to
selectively replicate a part of your database.

Logical Replication

Logical replication, added in v10, makes it possible to do just that –
replicate only a set of tables to other servers. It is best
explained with an example. Let’s take a database called src in a server, and
create a table in it:

src=> CREATE TABLE t (col1 int, col2 int);
CREATE TABLE
src=> INSERT INTO t VALUES (1,10), (2,20), (3,30);
INSERT 0 3

We’re also going to create a publication on this database (note that you need
superuser privileges to do this):

src=# CREATE PUBLICATION mypub FOR ALL TABLES;
CREATE PUBLICATION

Now let’s go to a database dst on another server and create a similar table:

dst=# CREATE TABLE t (col1 int, col2 int, col3 text NOT NULL DEFAULT 'foo');
CREATE TABLE

To pull in the changes from this publication, we need a subscription at the
destination. A subscription connects to a publication on the source, so you
need a user (here, repuser) on the source server with replication privileges
and read access to the published tables.
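As a rough sketch, such a role could be created on the source like this (the
role name and password simply match the connection string used below; adjust
both for your setup):

src=# CREATE ROLE repuser WITH LOGIN REPLICATION PASSWORD 'reppass';
CREATE ROLE
src=# GRANT SELECT ON ALL TABLES IN SCHEMA public TO repuser;
GRANT

With that in place, we can create the subscription at the destination, which
connects to the publication on the source and starts pulling in the changes: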

dst=# CREATE SUBSCRIPTION mysub CONNECTION 'user=repuser password=reppass host=127.0.0.1 port=5432 dbname=src' PUBLICATION mypub;
NOTICE:  created replication slot "mysub" on publisher
CREATE SUBSCRIPTION

The changes are synced, and you can see the rows on the destination side:

dst=# SELECT * FROM t;
 col1 | col2 | col3
------+------+------
    1 |   10 | foo
    2 |   20 | foo
    3 |   30 | foo
(3 rows)

The destination table has an extra column “col3”, which is not touched by the
replication. The changes are replicated “logically” – so, as long as it is possible
to insert a row with t.col1 and t.col2 alone, the replication process will do
so.

Compared to streaming replication, the logical replication feature is perfect
for replicating, say, a single schema or a set of tables in a specific database
to another server.
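Incidentally, if you don’t want to publish everything, a publication can also
name specific tables explicitly rather than using FOR ALL TABLES (a sketch
with hypothetical table names):

src=# CREATE PUBLICATION mypub2 FOR TABLE table1, table2;
CREATE PUBLICATION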

Replication of Schema Changes

Assume you have a Django application with its set of tables living in the
source database. It’s easy and efficient to set up logical replication to bring
over all these tables into another server, where you can run reporting,
analytics, batch jobs, developer/customer support apps and the like without
touching the “real” data and without affecting the production app.

Possibly the biggest limitation of logical replication currently is that it does
not replicate schema changes – any DDL command executed on the source database
does not cause a similar change in the destination database, unlike in streaming
replication. For example, if we do this on the source database:

src=# ALTER TABLE t ADD newcol int;
ALTER TABLE
src=# INSERT INTO t VALUES (-1, -10, -100);
INSERT 0 1

this gets logged in the destination’s log file:

ERROR:  logical replication target relation "public.t" is missing some replicated columns

and the replication stops. The column has to be added “manually” at the
destination, at which point the replication resumes:

dst=# SELECT * FROM t;
 col1 | col2 | col3
------+------+------
    1 |   10 | foo
    2 |   20 | foo
    3 |   30 | foo
(3 rows)

dst=# ALTER TABLE t ADD newcol int;
ALTER TABLE
dst=# SELECT * FROM t;
 col1 | col2 | col3 | newcol
------+------+------+--------
    1 |   10 | foo  |
    2 |   20 | foo  |
    3 |   30 | foo  |
   -1 |  -10 | foo  |   -100
(4 rows)

This means that if your Django application has added a new feature that needs
new columns or tables, and you have to run django-admin migrate on the source
database, the replication setup breaks.

Workaround

Your best bet to fix this issue would be to pause the subscription at the
destination, migrate the destination first, then the source, and then resume the
subscription. You can pause and resume subscriptions like this:

-- pause replication (destination side)
ALTER SUBSCRIPTION mysub DISABLE;

-- resume replication
ALTER SUBSCRIPTION mysub ENABLE;

If new tables are added and your publication isn’t “FOR ALL TABLES”, you’ll
need to add them to the publication manually:

ALTER PUBLICATION mypub ADD TABLE newly_added_table;

You’ll also have to “refresh” the subscription on the destination side to tell
Postgres to start syncing the new tables:

dst=# ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;
ALTER SUBSCRIPTION

Sequences

Consider this table at the source, which has a sequence:

src=# CREATE TABLE s (a serial PRIMARY KEY, b text);
CREATE TABLE
src=# INSERT INTO s (b) VALUES ('foo'), ('bar'), ('baz');
INSERT 0 3
src=# SELECT * FROM s;
 a |  b
---+-----
 1 | foo
 2 | bar
 3 | baz
(3 rows)

src=# SELECT currval('s_a_seq'), nextval('s_a_seq');
 currval | nextval
---------+---------
       3 |       4
(1 row)

The sequence s_a_seq was created to back the a column, which is of serial type.
This generates the autoincrementing values for s.a. Now let’s replicate this
into dst, and insert another row:

dst=# SELECT * FROM s;
 a |  b
---+-----
 1 | foo
 2 | bar
 3 | baz
(3 rows)

dst=# INSERT INTO s (b) VALUES ('foobaz');
ERROR:  duplicate key value violates unique constraint "s_pkey"
DETAIL:  Key (a)=(1) already exists.
dst=#  SELECT currval('s_a_seq'), nextval('s_a_seq');
 currval | nextval
---------+---------
       1 |       2
(1 row)

Oops, what just happened? The destination tried to start the sequence from
scratch and generated a value of 1 for a. This is because logical replication
does not replicate the values of sequences, since the next value of the
sequence is not stored in the table itself.

Workaround

If you think about it logically, you can’t modify the same “autoincrement” value
from two places without bidirectional synchronization. If you really need an
incrementing number in each row of a table, and need to insert into that table
from multiple servers, you could:

  • use an external source for the number, like ZooKeeper or etcd,
  • use non-overlapping ranges – for example, the first server generates and
    inserts numbers in the range 1 to 1 million, the second in the range 1
    million to 2 million, and so on (see the sketch below).
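As a minimal sketch of the non-overlapping ranges approach, the sequence on the
second server could be confined to its own range (the boundaries here are only
illustrative; the ranges have to be provisioned so that they never overlap or
run out):

dst=# ALTER SEQUENCE s_a_seq MINVALUE 1000001 MAXVALUE 2000000 RESTART WITH 1000001;
ALTER SEQUENCE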

Tables Without Unique Rows

Let’s try creating a table without a primary key, and replicating it:

src=# CREATE TABLE nopk (foo text);
CREATE TABLE
src=# INSERT INTO nopk VALUES ('new york');
INSERT 0 1
src=# INSERT INTO nopk VALUES ('boston');
INSERT 0 1

And the rows are now at the destination too:

dst=# SELECT * FROM nopk;
   foo
----------
 new york
 boston
(2 rows)

Now let’s try deleting the second row at the source:

src=# DELETE FROM nopk WHERE foo='boston';
ERROR:  cannot delete from table "nopk" because it does not have a replica identity and publishes deletes
HINT:  To enable deleting from the table, set REPLICA IDENTITY using ALTER TABLE.

This happens because the destination won’t be able to uniquely identify the row
that needs to be deleted (or updated) without a primary key.

Workaround

You can, of course, alter the schema to include a primary key. If you don’t
want to do that, you can ALTER TABLE and set the “replica identity” to the full
row or a unique index. For example:

src=# ALTER TABLE nopk REPLICA IDENTITY FULL;
ALTER TABLE
src=# DELETE FROM nopk WHERE foo='boston';
DELETE 1

The delete now succeeds, and the replication too:

dst=# SELECT * FROM nopk;
   foo
----------
 new york
(1 row)
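The unique-index variant would look something like this (a sketch; REPLICA
IDENTITY USING INDEX requires a unique, non-partial, non-deferrable index on
NOT NULL columns, so it assumes foo had been declared NOT NULL):

src=# CREATE UNIQUE INDEX nopk_foo_idx ON nopk (foo);
CREATE INDEX
src=# ALTER TABLE nopk REPLICA IDENTITY USING INDEX nopk_foo_idx;
ALTER TABLE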

If your table really has no way to uniquely identify rows, then you’re a bit
stuck. See the REPLICA IDENTITY section of ALTER TABLE for more information.

Differently Partitioned Destinations

Wouldn’t it be nice to have a source that is partitioned one way and a
destination that is partitioned another way? For example, at the source we
could keep partitions for each month, and at the destination for each year.
Presumably the destination is a bigger machine, and we need to keep historical
data, but rarely need that data.

Let’s create a monthly-partitioned table at the source:

src=# CREATE TABLE measurement (
src(#     logdate         date not null,
src(#     peaktemp        int
src(# ) PARTITION BY RANGE (logdate);
CREATE TABLE
src=#
src=# CREATE TABLE measurement_y2019m01 PARTITION OF measurement
src-# FOR VALUES FROM ('2019-01-01') TO ('2019-02-01');
CREATE TABLE
src=#
src=# CREATE TABLE measurement_y2019m02 PARTITION OF measurement
src-# FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
CREATE TABLE
src=#
src=# GRANT SELECT ON measurement, measurement_y2019m01, measurement_y2019m02 TO repuser;
GRANT

And try creating a yearly-partitioned table at the destination:

dst=# CREATE TABLE measurement (
dst(#     logdate         date not null,
dst(#     peaktemp        int
dst(# ) PARTITION BY RANGE (logdate);
CREATE TABLE
dst=#
dst=# CREATE TABLE measurement_y2018 PARTITION OF measurement
dst-# FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE
dst=#
dst=# CREATE TABLE measurement_y2019 PARTITION OF measurement
dst-# FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE
dst=#
dst=# ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;
ERROR:  relation "public.measurement_y2019m01" does not exist
dst=#

Postgres complains that it needs the partition table for Jan 2019, which we
have no intention of creating at the destination.

This happens because logical replication works not at the base table level, but
at the child table level. There is no real workaround for this – if you’re
using partitions, the partition hierarchy must be the same on both sides of a
logical replication setup.

Large Objects

Large objects cannot be replicated using logical replication. This is probably
not a big deal nowadays, as storing large objects is not a common modern-day
practice. It is also easier to store a reference to a large object on some
external, redundant storage (like NFS, S3 etc.) and replicate that reference
rather than storing and replicating the object itself.
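As a sketch of that pattern, the table carries only a reference, and that
reference is what gets replicated (hypothetical table and column names):

src=# CREATE TABLE documents (
src(#     id         bigserial PRIMARY KEY,
src(#     object_url text NOT NULL  -- e.g. an S3 or NFS location; the bytes live outside Postgres
src(# );
CREATE TABLE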

About pgDash

pgDash is a modern, in-depth monitoring solution designed
specifically for PostgreSQL deployments. pgDash shows you information and
metrics about every aspect of your PostgreSQL database server, collected using
the open-source tool pgmetrics.
pgDash provides core reporting and visualization
functionality, including collecting and displaying PostgreSQL information and
providing time-series graphs, detailed reports, alerting, teams and more.

pgDash Replication Monitoring

pgDash can monitor replication, locks, queries and more. It can also provide
quick diagnostics and extensive alerting. Learn more here or sign up today
for a free trial.
