How we migrated our PostgreSQL database with 11 seconds downtime

2024-01-18 15:51:05

GOV.UK Notify Team members working together

GOV.UK Notify is hosted on the GOV.UK Platform as a Service (PaaS). The PaaS is being retired, so we’re migrating all of our infrastructure into our own Amazon Web Services (AWS) account. This blog post explains how we migrated our PostgreSQL database with minimal downtime.

Graph showing a spike of errors over an 11-second period during our database migration.

Migrating our database

The PaaS provides a database for us and we use it to store all of our data – from information about every notification we send to the content of the hundreds of thousands of templates that service teams use to send those notifications. This is an AWS RDS PostgreSQL database and it lives in the PaaS’s AWS account. Our apps that run in the PaaS talk to this database. We’re going to call this database our ‘source database’.

We needed to set up a new database in our own AWS account, and get all of our apps to talk to the new database. We’re going to call this new database our ‘target database’.

Creating a new PostgreSQL database in our own AWS account is not too difficult. The hard part is moving all of our data and getting our apps to use this new database, whilst incurring minimal downtime.

A bit more about our source database

Our source database is about 400GB in size. It has about 1.3 billion rows, 85 tables, 185 indexes and 120 foreign keys. It’s PostgreSQL version 11.

On a normal weekday, we do somewhere in the region of 1,000 inserts or updates per second (sometimes much lower, sometimes much higher), plus a similar number of reads.

GOV.UK Notify sends millions of important and timely notifications each day, from flood alerts to updates about passport applications. Every notification we send requires talking to our database. Therefore it’s important that we minimise any downtime.

AWS Database Migration Service

The PaaS team offered us the ability to migrate databases using AWS Database Migration Service (DMS).

DMS is responsible for moving data from our source database to our target database. It can be run in either the source or target AWS account.

DMS works by:

  1. Copying across all of the data, table by table, up to a specific point in time. This is known as the ‘full load’ task.
  2. Entering replication mode, where it ensures that all new transactions on the source database are replayed onto the target database, so that the two databases stay in sync.

We would then be responsible for getting our apps to stop talking to the source database and start talking to the target database.
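
As a rough illustration of how a task like this is configured, a DMS task that performs the full load and then keeps replicating changes can be created with boto3 along the following lines. This is only a sketch: the ARNs, identifiers and table mapping are placeholders rather than our real values, and it doesn’t show the source and target endpoints being set up.

```
# Sketch only: create a DMS task that runs a full load and then continues
# with ongoing replication. All ARNs and identifiers are placeholders.
import json

import boto3

dms = boto3.client("dms", region_name="eu-west-1")

# Copy every table in the public schema; a real migration may need more
# selective table mappings.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all-tables",
            "object-locator": {"schema-name": "public", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="notify-db-migration",
    SourceEndpointArn="arn:aws:dms:eu-west-1:111111111111:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:eu-west-1:222222222222:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:eu-west-1:111111111111:rep:INSTANCE",
    MigrationType="full-load-and-cdc",  # full load first, then replication
    TableMappings=json.dumps(table_mappings),
)
```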

Database migration process

The database migration process was completed in a number of stages.

Setting up the DMS instance

In our case, the DMS instance was created in the source AWS account. We chose the source account because the PaaS team had already set up instances of DMS in their account and so were able to do this quickly and easily.

The DMS instance also needed to be given PostgreSQL credentials to talk to both the source and target databases.

The DMS instance and the target database live in different virtual private clouds (VPCs). With the help of the PaaS team, we set up VPC peering so that traffic from the DMS instance in the PaaS’s VPC could be routed directly to our VPC without going over the public internet.

Setting up our target database

We created our target RDS instance in our own AWS account. PostgreSQL version 11 was about to become unsupported, so we took this opportunity to upgrade by making our new database PostgreSQL 15.

We then took a dump of the database schema for our source database using `pg_dump`. This gave us a file with the SQL commands to recreate our database schema.

From our database schema, we took the declarations for our tables and applied these to our target database.

We didn’t apply our foreign keys at this point because DMS’s full load process doesn’t try to copy across the data in an order that matches your foreign key constraints.

We didn’t create our primary keys or indexes at this point because this would have massively slowed down our full load task. Each individual insert would take longer; it would need to update our indexes, and this would add up to a significant amount of time when inserting billions of rows. It was much quicker to copy all of our data across first and then add the indexes afterwards.
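
One way to split the schema like this is pg_dump’s `--section` flag: the `pre-data` section contains the table definitions, while `post-data` contains the primary keys, indexes and foreign key constraints. The sketch below shows the idea, with placeholder connection strings rather than our real ones:

```
# Sketch only: dump the schema in two parts so tables can be created before
# the full load, and indexes and constraints applied after it finishes.
# The connection strings are placeholders.
import subprocess

SOURCE = "postgresql://user:password@source-host:5432/notification_api"
TARGET = "postgresql://user:password@target-host:5432/notification_api"

# Table definitions only (no primary keys, indexes or foreign keys).
subprocess.run(
    ["pg_dump", "--section=pre-data", "--file=pre-data.sql", SOURCE],
    check=True,
)
# Primary keys, indexes and foreign key constraints.
subprocess.run(
    ["pg_dump", "--section=post-data", "--file=post-data.sql", SOURCE],
    check=True,
)

# Apply the table definitions to the target before starting the full load...
subprocess.run(["psql", "--file=pre-data.sql", TARGET], check=True)
# ...and apply post-data.sql only once the full load has completed.
```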

Full load

Once we had a target database with the tables created, we started the DMS full load task. This copies across all the data that existed when we pressed the ‘start full load’ button. It doesn’t copy across any new data or updates that come in after this point. It took about 6 hours for the full load task to finish.

After the full load task completed, we applied the remainder of our source database schema file, which adds our indexes and key constraints. Adding these took about 3 hours.

Replication

Once our full load task completed, the data in our target database matched the data in the source database at the point when we started the full load task. But many new inserts, updates and deletions had happened on our source database since then. And many more changes would keep coming in too.

To copy these new changes across, we then started the DMS ongoing replication (also known as change data capture) task. This reads all the transactions from our source database’s transaction log that were created after the full load task began and sends them to our target database. This ensures that our target database is in sync with our source database with, at most, a small amount of lag.

It only took a couple of hours for the replication process to catch up. At that point, we monitored the latency of the DMS replication process to make sure it could handle the number of changes happening on the source database and continued to stay in sync.
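
DMS publishes task metrics to Amazon CloudWatch, including `CDCLatencySource` and `CDCLatencyTarget`, which give a rough view of how far behind the replication is. As a sketch, that latency could be polled like this (the identifiers are placeholders, and the exact dimension names are worth checking against the DMS documentation):

```
# Sketch only: read recent CDC latency for a DMS task from CloudWatch.
# The identifiers are placeholders and the dimension names should be
# checked against the AWS DMS documentation.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/DMS",
    MetricName="CDCLatencyTarget",  # seconds the target lags behind the source
    Dimensions=[
        {"Name": "ReplicationInstanceIdentifier", "Value": "notify-dms-instance"},
        {"Name": "ReplicationTaskIdentifier", "Value": "notify-db-migration"},
    ],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"], "seconds behind")
```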

We ran the DMS replication process for about 10 days in the background, keeping everything in sync whilst we waited for the time for our apps to stop talking to the source database and start talking to the target database. We had announced this time to our users in advance and so already had a fixed slot for the migration of traffic.

Preparing to migrate traffic

A few months in advance, we planned how we would stop our apps talking to our source database and get them using our target database. This was the process we used:

  1. Stop all traffic from our apps to our source database. At this point we would enter a period of downtime where Notify was unavailable.
  2. Ensure our replication had caught up, so that all updates to our source database were reflected on our target database.
  3. Allow our apps to start talking to our target database. This would end our downtime.

It was important not to have some of our apps talking to our source database and the rest talking to our target database at the same time. If this happened, any changes made on our target database would not be reflected on our source database, which would mean users getting inconsistent data.

We wrote a Python script for this process so it could be explicit, easily repeatable and much quicker than doing it manually. The quicker it could be done, the less downtime for users of Notify. Our target was less than 5 minutes of downtime. We ended up using this script at least 40 times across our various tests and practice runs beforehand.

We picked a Saturday evening for the migration. This is because it’s one of our quietest times, without us having to be awake in the middle of the night when we wouldn’t be as alert.

Stopping traffic to our source database

Our script would stop all traffic to our source database by calling `pg_terminate_backend` on all the connections from our apps. This took less than a second. We also changed the password for the PostgreSQL user used by our apps, meaning that if the apps tried to reconnect to our source database they would get an authentication error.
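
As a simplified sketch of that step, assuming psycopg2 and a hypothetical application role called `notify_app` (the real script, role name and credential handling are different):

```
# Sketch only: cut off the apps' connections to the source database and
# stop them reconnecting. Assumes a hypothetical application role called
# 'notify_app'; the real role name and credential handling are different.
import psycopg2

ADMIN_SOURCE_URI = "postgresql://admin:admin-password@source-host:5432/notification_api"

conn = psycopg2.connect(ADMIN_SOURCE_URI)
conn.autocommit = True
cur = conn.cursor()

# Change the app user's password first, so terminated apps cannot simply
# reconnect with their old credentials.
cur.execute("ALTER ROLE notify_app WITH PASSWORD %s", ("a-new-random-password",))

# Terminate every existing connection belonging to the app user.
cur.execute(
    """
    SELECT pg_terminate_backend(pid)
    FROM pg_stat_activity
    WHERE usename = 'notify_app'
      AND pid <> pg_backend_pid()
    """
)

conn.close()
```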

Checking replication had caught up

DMS inserts some helpful tables into our target database about the status of the replication, which are updated every minute. These tables allow us to see how much lag there is between our target database and the source database. Our migration script would check these tables to make sure our target database was completely caught up.

To be extra safe, after our apps had stopped talking to our source database, our migration script would write a single record to our source database and then wait to see that it safely arrived in our target database. This gave us extra certainty that all changes had been replicated.
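
As a sketch of that final check, assuming psycopg2 and a hypothetical `migration_canary` table that exists on both databases (the real script and table are different):

```
# Sketch only: write a sentinel row to the source database and wait for
# DMS to replicate it to the target. Assumes a hypothetical
# 'migration_canary' table with an 'id' column on both databases.
import time
import uuid

import psycopg2

SOURCE_URI = "postgresql://admin:admin-password@source-host:5432/notification_api"
TARGET_URI = "postgresql://admin:admin-password@target-host:5432/notification_api"

sentinel = str(uuid.uuid4())

source = psycopg2.connect(SOURCE_URI)
source.autocommit = True
with source.cursor() as cur:
    cur.execute("INSERT INTO migration_canary (id) VALUES (%s)", (sentinel,))

target = psycopg2.connect(TARGET_URI)
target.autocommit = True
deadline = time.monotonic() + 60  # give up and investigate after a minute
while True:
    with target.cursor() as cur:
        cur.execute("SELECT 1 FROM migration_canary WHERE id = %s", (sentinel,))
        if cur.fetchone():
            print("sentinel row has arrived - replication has caught up")
            break
    if time.monotonic() > deadline:
        raise RuntimeError("sentinel row never arrived on the target database")
    time.sleep(0.5)
```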

Making a smooth swap of traffic

For our apps to connect to our database, they need to know the location of the database and also the username and password for a relevant PostgreSQL user. These are provided to our apps in an environment variable of the following format:

SQLALCHEMY_DATABASE_URI = postgresql://original-username:original-password@random-identifier.eu-west-1.rds.amazonaws.com:5432

If we want our apps to connect to a different database, we need to update the username, password and location in the URI and redeploy our apps so the change takes effect. Redeploying our apps takes about 5 minutes. If we redeployed our apps as part of our migration script, this would mean an extra 5 minutes of downtime. To minimise downtime, we made two changes in advance of our migration so that we could use a quick Domain Name System (DNS) change instead of redeploying our apps.

The first change was to create a user on both our source and target databases with the same username and password. This meant we didn’t need to change the username or password provided to the apps during the migration.

The second change was to create a DNS record in AWS Route53 for `database.notifications.service.gov.uk` with a 1 second TTL (time to live). It had two records with weightings:

  • 100% of DNS results were weighted to the source database location
  • 0% of DNS results were weighted to the target database location

We updated the URI used by our apps to use the new shared username and password, and the new domain name for the location of our database.

SQLALCHEMY_DATABASE_URI = postgresql://shared-username:shared-password@database.notifications.service.gov.uk:5432

Now, when we wanted to swap the database our apps were pointing at, our migration script just needed to update the DNS weighting in AWS so that 100% of results were sent to the target database location, and wait 1 second for the TTL to expire. Then, the next time our apps queried our database, they would be querying our target database.
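
The flip itself can be done with a single Route53 change batch that upserts both weighted records. As a sketch, with a placeholder hosted zone ID, set identifiers and record values rather than our real ones:

```
# Sketch only: flip the weighted DNS records so 100% of lookups resolve to
# the target database. The hosted zone ID, set identifiers and record
# values are placeholders.
import boto3

route53 = boto3.client("route53")


def weighted_cname(set_identifier, value, weight):
    """Build an UPSERT for one weighted CNAME record with a 1 second TTL."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "database.notifications.service.gov.uk",
            "Type": "CNAME",
            "SetIdentifier": set_identifier,
            "Weight": weight,
            "TTL": 1,
            "ResourceRecords": [{"Value": value}],
        },
    }


route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",
    ChangeBatch={
        "Comment": "Send all database traffic to the target database",
        "Changes": [
            weighted_cname("source-db", "source-identifier.eu-west-1.rds.amazonaws.com", 0),
            weighted_cname("target-db", "target-identifier.eu-west-1.rds.amazonaws.com", 100),
        ],
    },
)
```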

What happened on the day

When we gathered on the evening of Saturday 4 November, we had set up our target database, the full load process had run and new transactions were being copied across. We checked and had only a couple of seconds of lag between our target database and the source database.

We then successfully ran our migration script so that our apps would stop talking to our source database and start talking to our new target database. During the migration there was a short period of downtime, approximately 11 seconds. This was much less than our 5 minute target, so we were very pleased and so were our users.

What we learnt

We chose to use DMS because it was well supported by the GOV.UK PaaS and we could also get support from AWS. If we were doing a PostgreSQL to PostgreSQL database migration in the future, we would invest more time in trying alternative tools such as pglogical. DMS potentially added more complexity, and a less familiar replication process, than we would have found with other tools. This backs up what AWS themselves say about PostgreSQL to PostgreSQL migrations.

What’s next for GOV.UK Notify’s migration to AWS

Now we’ve migrated our database, our next step is to migrate our apps. Sneak peek – we’re moving them to AWS Elastic Container Service (ECS). We’ll blog about how that goes in the coming months.

 

 
