The Internal Workings of Distributed Databases


This post examines the various replication strategies used by popular time-series and OLAP databases to implement high availability.
Our learnings from this research have inspired us as we continue to build replication
into our own performance-focused database, QuestDB.
The database is built in C++, low-latency Java, and recently Rust! The codebase
has no external dependencies and we implement SQL with native time-series
extensions.
After being unable to ride my new electric bike to work yet again because it was
in the shop (this time due to a wiring problem that prevented the bike from
running!), I started to think about how I could create some redundancy in my
cycling setup so I wouldn't be stuck riding the Tube for weeks at a time due to
simple maintenance or a supply chain issue. What if I had another bike to ride
while my current one is being fixed? That would definitely help, but electric
bikes are expensive, and there is little room to store a backup bike in my cozy
London apartment.
Believe it or not, this problem is similar to what I have been tackling at my day
job as a database developer. Just like bicycles, databases can break down too!
Sometimes hardware fails mysteriously, or a network config has changed and your
database is no longer reachable. Sure, you could always restore your data from a
backup onto new hardware in the event of a failure, but then you're stuck with
downtime and potentially lost data; and that's assuming you've tested your
restore process recently to make sure that it even works! Even if you're able to
run a restore seamlessly, just like I need to wait for the shop to finish fixing
my bike before I can start riding it to work again, you'd also need to wait
for the restore to complete and the new database to be configured before using
it.
But luckily, unlike my bike situation, modern databases have the ability to
stay online and continue their normal operations when things go wrong. This is
especially important in the world of time series databases, which are designed
to be constantly writing data, in many cases almost nonstop. There is no time to
restore from a backup, since precious data would be lost during any period of
downtime.
How can this happen? The answer is through replication! By synchronizing
multiple database nodes and building a solution for moving execution over to
healthy ones when a node breaks down (write failover), databases can ensure that
no data is lost. It's as if I got a flat tire on my commute and could
switch to a spare bike right on the spot in the middle of the road! Seems pretty
cool, right?
As we begin our journey towards making QuestDB distributed, we took the time to
analyze how several popular time-series/OLAP databases implement high
availability, to highlight the pros and cons of each approach. Along with a
review of the fundamentals, we also share QuestDB's own approach and our plans
for the future.
Failover
Let's take a look at a scenario where an application writes to a database but
suddenly gets a network disconnect. What should happen in this case?
The application should switch to a replicated node and continue inserting data.
For example, if I had an application `App` writing to `DB Node 1` with replica
`DB Node 2`, I could draw this scenario as a sequence diagram:

Seems simple, but is it easy to get a database to play its part in this
dance? I believe it isn't. To achieve this, the database has to be able to write
data into the same table `A` through multiple nodes. In the general case, the
database also has to evolve the schema to support adding/removing columns from
multiple nodes, and it has to apply the transactions in exactly the same order on all the
replicas. This is because transactions like `INSERT` and `UPDATE` that are
executed on the same row can have different outcomes if applied in the wrong
order.
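As a minimal SQL sketch of why ordering matters (the columns and values here are made up for illustration):

```sql
-- Applied in this order, the row for id = 1 ends with x = 2:
INSERT INTO A (id, x) VALUES (1, 1);
UPDATE A SET x = 2 WHERE id = 1;

-- If a replica applied the UPDATE before the INSERT, the UPDATE would
-- match no rows, and the row would end with x = 1 instead.
```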
To summarize, a database needs to be able to:

- Write data to the same table using multiple DB nodes (multi-master replication), or mirror the data to a read-only replicated node with automatic failover
- Evolve table schemas such that the same table columns can be added concurrently from multiple connections
- Maintain the global order of the writes and schema changes across replica nodes
Sync and Async replication
Different things can go wrong with databases when a failure happens. For
instance, there can be two reasons why the database `Node 1` did not reply to
the second insert, `insert into A values(2)`: either the node fails to receive
and process the transaction, or the data is inserted but the OK reply is not
delivered back to the application because of a network disconnect.
To avoid losing the second insert after disconnecting from `Node 1`, the
application has to repeat the transaction `insert into A values(2)` on `Node 2`. The
application also has to do the repeated `insert into A values(2)` attempt in
such a way that the database does not create a duplicate row in the case when
the same insert has already been processed by `Node 1` but the OK reply was not
delivered back to the application. In order to achieve these goals, all data
inserted into `Node 1` needs to be readable from `Node 2` immediately, so that the
application can perform the deduplication on failures. Alternatively, there must
be another built-in mechanism to de-duplicate the inserted data in the database.
This leads to the conclusion that at least one of the points below has to exist
for no-gap, no-duplicate write failover:
- Data written to one node is immediately selectable through all other replicas, so that the writing application can check for duplicates
- There is a de-duplication mechanism built into the insert protocol or table storage
The first bullet point above is also known as Synchronous (Sync) Replication, where
`Node 1` replies `OK` only after an insert is already replicated to `Node 2`. In
contrast, Asynchronous (Async) replication allows `Node 1` to reply `OK` to the
`App` before replicating the inserted data to `Node 2`. This way `Node 2` may
not receive the first insert before `Node 1` goes down, so the database will
need to reattempt to replicate the data when `Node 1` is started or connected
back.
Choice of replication flavor
Back to my bike analogy: a database configured as a single primary with a
read-only replica in Sync mode is something akin to riding one bike while
simultaneously trying to roll another one alongside you at all times. You can
imagine that it's quite hard to ride while balancing on two
bikes, and that it's nearly impossible to actually cycle quickly!

On the other hand, a database configured as a single primary with a read-only
Async replica is similar to me buying a spare bike, storing it at work, and
synchronizing my trip data when the bikes are in the same place. In case of a
breakdown, I would lose data from that particular trip but I would still be able
to continue commuting on the spare bike the same day.
I imagine that multi-master replication is something like riding with a friend
on two tandem bikes, where each of us rides a tandem from the front seat. If
one of the tandems breaks, its rider can move to the back seat of the other one.
Multi-master Sync replication is riding two tandems next to each other without
the freedom of turning or stopping independently. Async replication would be the
independent rides with occasional location/trip data catch-ups.
Async replication is the most reasonable choice in the world of bicycles; I do
not see people riding bikes in sync on the streets. It is also the default, or only,
choice for many in the database world. Sync replication can be simple to reason
about, and may appear to be the best solution overall, but it has a hefty
performance cost to pay, since every step needs to wait until it is completed on
every node. Counterintuitively, although synchronous replication makes data
available to multiple nodes at the same time, it also results in lower
availability of the cluster, since it dramatically reduces the transaction rate.
There is no silver bullet for the write failover problem, and every database
offers different replication flavors to choose from. To find the best option for
QuestDB, we did serious research into how replication in other time-series
databases handles write failover, and here are some of the results.
TimescaleDB
Everyone loves classic relational databases, and nearly every developer is
able to answer interview questions about ACID properties and sometimes even
about Transaction Isolation levels of different RDBMS systems (and what can go
wrong with each of them!). Building distributed Read / Write applications using
an RDBMS is not easy, but it is definitely possible.
PostgreSQL is one of the leading open source RDBMSes, and many say that it is
more than just a database. Postgres is also an extensible platform where you
can, for example, add a geographical location column type, geospatial SQL
query syntax, and indexes, effectively turning it into a GIS system. Similarly,
TimescaleDB is a PostgreSQL extension built to optimize the storage and query
performance of time-series workloads.
Out of the box, TimescaleDB inherits its replication functionality from
Postgres. PostgreSQL supports multiple read-only replicas with Sync or Async
replication, with all the ACID and Transaction Isolation properties,
and so does Timescale.
Unfortunately, automatic failover is solved neither by PostgreSQL nor
TimescaleDB, but there are third-party solutions like
Patroni that add support for that
functionality. PostgreSQL describes the process
of failover as STONITH (Shoot The Other Node In The Head),
meaning that the primary node has to be shot down once it starts to misbehave.
Running Sync replication can solve the data gaps and duplicate problems after
the failover. If the application detects the failover, it can re-run the last
non-confirmed `INSERT` as an UPSERT.
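A sketch of what such a retry could look like in PostgreSQL syntax; the table and its unique key are hypothetical:

```sql
-- If the original INSERT already reached the new primary, this statement
-- updates the existing row instead of creating a duplicate.
INSERT INTO a (id, value) VALUES (2, 2)
ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value;
```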
With Async replication, a number of recent transactions may be missing on the replica
that is promoted to primary. This is because the old primary node had to be shot
down (STONITHed) and there is no trivial way to move the missing data from that
"dead" node to the new primary node post-failover.
ClickHouse
ClickHouse has been developed in open source for many years by Yandex, a search
provider in Russia. ClickHouse's functionality in open source (Apache 2.0) is
comprehensive and includes high availability and horizontal scaling. There are
also quite a few independent managed cloud options that support ClickHouse.
It makes sense to talk about ClickHouse's replication in the context of its open
source product, since cloud solutions can vary dramatically from provider to
provider.
The great thing about ClickHouse open source is that it supports multi-master
replication. So suppose one creates a cluster with 2 nodes (`Node 1` and `Node 2`)
and a replicated table `A` using the `ReplicatedMergeTree` engine.
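The definition below is a minimal, illustrative sketch; the column, the ZooKeeper path, and the macros are placeholder values:

```sql
CREATE TABLE A (value Int32)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/A', '{replica}')
ORDER BY value;
```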
Now `Bob` sends an insert to `Node 1`, and `Alice` sends another one to `Node 2`:
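For illustration, the two statements might look like this (the values match those referenced in the text below):

```sql
-- Bob, connected to Node 1:
INSERT INTO A VALUES (1);

-- Alice, connected to Node 2:
INSERT INTO A VALUES (2);
```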
Both records will be written to each of the nodes. When the
`INSERT INTO A VALUES(1)` statement is received on `Node 1`, ClickHouse writes
it to part `1_1`. Next, `Node 1` registers the data part with ZooKeeper (or
ClickHouse Keeper). ZooKeeper notifies every node about the new part, and the
nodes download the data from the source and apply it to the local table replica.
The same process happens concurrently with `INSERT 2`.

In this architecture, inserts can be written to each of the nodes in parallel.
What about reading the data back? The ClickHouse documentation states that the
replication process is Asynchronous, and it can take some time for `Node 2` to
catch up with `Node 1`. There is, however, an option to specify `insert_quorum`
with every insert. If the `insert_quorum` is set to `2`, then the application
gets a confirmation back from the database after both `Node 1` and `Node 2` have
inserted the data, effectively turning this into Sync replication. There are a
few more settings to consider, like `insert_quorum_parallel`,
`insert_quorum_timeout`, and `select_sequential_consistency`, to define how
concurrent parallel inserts work.
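For instance, a quorum write on our two-node cluster could be requested roughly like this (the setting value and the inserted value are illustrative, not tuning advice):

```sql
-- Acknowledge the insert only after both replicas have the data
SET insert_quorum = 2;
INSERT INTO A VALUES (3);
```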
It is also possible to modify the table schema by adding new columns to the
replicated table. An `ALTER TABLE` statement can be sent to any of the nodes in
the cluster, and it will be replicated across the nodes. ClickHouse does not
allow concurrent table schema change execution, so if two of the nodes receive the
same non-conflicting statement at the same time, one of the nodes can reply with a failure.
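As a sketch, such a statement could be an `ALTER TABLE` like the one below; the column name and type are hypothetical:

```sql
-- The same ALTER sent to Node 1 and Node 2 at roughly the same time
ALTER TABLE A ADD COLUMN b Int32;
```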
SQL UPDATE statements are also written in the ClickHouse dialect as `ALTER TABLE`,
but fortunately they can be executed in parallel without that kind of error.
ClickHouse also has a useful method to resolve lost write confirmations; in cases
where an `INSERT` (or other) query confirmation is lost because of a network
disconnect or timeout, the client can resend the whole block of data in exactly
the same way to any other available node. The receiving node then calculates the
hash code of the data and does not apply it a second time if it is able to recognize
that this data has already been applied via another node.
There are more options and flavors of how to set up replication in ClickHouse,
the most popular approach being `ReplicatedMergeTree` storage.
InfluxDB
InfluxDB has the highest
DB engines time series ranking
at the time of writing, and I feel that it should be included here even though
InfluxData
removed the clustering product from the open source version in 2016
to sell it as a commercial product, InfluxDB Enterprise. Since then, their focus
has shifted from the enterprise version to the cloud offering in recent years,
where they have built InfluxDB Cloud v2. While this is a closed-source system,
its high-level architecture is deducible from the marketing diagrams InfluxData
officially provides.

InfluxDB Cloud v2 persists incoming writes to the Write Ahead Log (WAL) written
over a Kafka cluster. It is a clean solution that solves the durability and
distribution of the WAL and ensures that the data is already replicated when the
client receives a write confirmation.
WAL application to "queryable" table storage runs asynchronously in the Ingester
component, consuming messages from Kafka and writing them to two independent TSDB
copies. There is a delay before a message confirmed as written becomes readable; in
Influx terms this is called
Time to Become Readable.
Influx protocol messages are idempotent in the sense that the same message can
be processed many times without creating duplicates. This is because in
InfluxDB, the same set of tags can have only one row per timestamp value. So if
one sends a line:
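A hypothetical line protocol message; the measurement `m`, the tag `tag1=a`, and the nanosecond timestamp are invented, chosen to match the table further down:

```
m,tag1=a field1=1 1677851990000000000
```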
and then sends another line with the same measurement and timestamp:
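A second hypothetical message, with the same measurement and timestamp but a different field:

```
m,tag1=a field2=2000 1677851990000000000
```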
the new field value will be added to the same row, as if the fields had been sent
together in the same message:
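In the same hypothetical notation, the stored row is equivalent to:

```
m,tag1=a field1=1,field2=2000 1677851990000000000
```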
And if any of the above messages are sent again, Influx will not add a new row.
This approach solves the problem of resending data on timeout or lost replies.
So if the client does not receive a write confirmation from Influx Cloud, it can
re-send the same data again and again.
When data is sent with the same timestamp and tag set from different
connections:
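One possible sketch of such conflicting writes, reusing the hypothetical measurement from above and annotated with `#` lines for readability (the split between connections is assumed, chosen to produce the rows in the table below):

```
# sent over connection 1:
m,tag1=a field1=1 1677851990000000000
# sent over connection 2:
m,tag1=a field1=2,field2=2000 1677851990000000000
```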
the querying storage nodes can become inconsistent for a while, returning any
of the three rows:

| field1 | field2 | time |
|---|---|---|
| 1 | | 2023-03-03T13:59:50.000Z |
| 2 | 2000 | 2023-03-03T13:59:50.000Z |
| 1 | 2000 | 2023-03-03T13:59:50.000Z |
It can even be that the first query returns `field1=1`, a second query returns
`field1=2`, and then a third flips back to `field1=1`. Eventually, the
query result will become stable and return the same data on every run. This is a
very typical result for querying nodes in Round Robin with Async replication.
The Influx data model also solves the problem of a dynamically evolving schema.
Since there are no traditional columns (any unknown fields and tags that
are encountered are added automatically by the database engine), there is no
problem writing a different set of fields for the same measurement by design.
Influx also checks for schema conflicts and returns errors to the writing
application if there are any. For example, if the same field is sent as a number
and then as a string:
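A hypothetical pair of messages, reusing the field name `value` that appears in the error text below:

```
m,tag1=a value=1.0 1677851990000000000
m,tag1=a value="one" 1677851990000000001
```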
the write will fail with the error
`column value is type f64 but write has type string`
or
`column value is type string but write has type f64`.
Summary
Here is a summary of the replication features supported by time-series
databases relevant to the high-availability write use case:

| | PostgreSQL / TimescaleDB | ClickHouse | InfluxDB Cloud |
|---|---|---|---|
| Multi-master replication | No | Yes | No |
| Supports Sync replication | Yes | Yes | No |
| Supports Async replication | Yes | Yes | Yes |
| Concurrently evolves replicated table schema | Yes | Yes¹ | Yes² |
| Same Insert / Update order on all nodes | Yes | Yes | Yes² |
| No gaps and duplicates after failover | Sync mode only | Yes | Yes |
| Uses WAL for replication | Yes | Yes³ | Yes |
¹ Concurrent schema updates have to be re-tried
² InfluxDB Cloud is a closed-source system; certain conclusions are made on the
assumption of reasonable use of Kafka WAL partitioning, and the correctness
of this claim depends on the implementation.
³ In ClickHouse replication, the Data Part plays the role of the WAL
To draw some conclusions:

- All 3 systems replicate by writing to the Write Ahead Log and copying it across the nodes.
- Asynchronous replication, where data written to Node 1 is eventually visible at Node 2, is the most popular approach; it is used by InfluxDB Cloud and is the default in both ClickHouse and Postgres.
- Postgres / Timescale replication can be used in both synchronous and asynchronous modes, but it does not have multi-master replication and there is no option for automatic failover. It is not possible to solve write failover without additional software systems or human intervention.
- Multi-master replication is available in ClickHouse. There are also enough available settings to strike an acceptable balance between experiencing data loss (in extreme scenarios) and write throughput.
- InfluxDB does not offer replication support in its open source product. There is a closed-source cloud solution that leverages Kafka to solve automatic write failover. Kafka replication is not multi-master, but with the help of automatic failover, it solves high-availability write use cases.
QuestDB Replication Plans
QuestDB introduced Write Ahead Log table storage mode in v7.0 as the first step in
our replication journey. It uses a multi-master write architecture internally to
make non-locking writes to the same table possible from parallel connections.
Transactions are written in parallel to different WAL segments, and a global
order of commits is maintained in a Sequencer component. The most challenging
bit is automatic schema conflict resolution, so that table schema changes can
also be performed in parallel.
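As an illustration, a table can be declared in WAL mode roughly like this (the schema is made up; only the trailing `WAL` keyword is the relevant part):

```sql
-- Writes from parallel connections land in separate WAL segments;
-- the Sequencer assigns the global commit order.
CREATE TABLE trades (
    ts TIMESTAMP,
    symbol SYMBOL,
    price DOUBLE
) TIMESTAMP(ts) PARTITION BY DAY WAL;
```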

We tried to avoid having a Write Ahead Log for a long time, writing directly
into the table storage. It was not an easy decision to accept WAL write
amplification for the sake of a cleaner path to replication and non-locking
parallel writes. In the end, the additional write operations did not impact
overall throughput. On the contrary, thanks to better parallelism, we achieved
3x better write performance in the
Time Series Benchmarking Suite compared
to our own (quite extraordinary) performance for non-WAL tables.
Looking at how other databases solve the replication problem, we chose our goal
to be achieving multi-master replication with Async consistency. We believe that
this approach strikes the best balance between fault tolerance and transaction
throughput. It is also essential to have a built-in write de-duplication
mechanism for automatic write failover cases. The next steps for QuestDB will be
to move the built-in Sequencer component to a distributed environment and solve
WAL sharing between multiple instances.
Riding tandem bicycles with a friend is the best redundancy solution we see for
QuestDB. And as for my commute problem, well, I still don't know how to solve
it. You are more than welcome to join
our Slack Community and share your feedback. You
can also play with the QuestDB live demo or
play.questdb.io to see how fast it rides. And, of
course, open-source contributions to
our project on GitHub are more than
welcome.