Three bugs in the Go MySQL Driver

Although GitHub.com is still a Rails monolith, over the past few years we've begun the process of extracting critical functionality from our main application by rewriting some of the code in Go—mostly addressing the pieces that need to run faster and more reliably than what we can accomplish with Ruby. Last year, we deployed a new authorizations service in production, authzd, which powers the "fine-grained authorizations" feature that we announced at GitHub Satellite 2019.
This has been a significant milestone and an interesting challenge, because authzd is our first Go service that reads from the production MySQL databases as part of a web request. In the past, we've deployed other Go services in our infrastructure that read and write to MySQL, but they're either internal management services (such as our Search cluster manager, manticore), or async batch jobs (such as gitbackups, which takes care of scheduling Git repository backups). authzd is called several times during a normal request to our Rails monolith, meaning that its requirements for performance and reliability are much more stringent.
Adding to this challenge, authzd is deployed to our Kubernetes clusters, where we've been experiencing issues with high latencies when opening new TCP connections, something that particularly affects the pooling of connections in the Go MySQL driver. One of the most dangerous lies that programmers tell themselves is that the network is reliable, because, well, most of the time the network is reliable. But when it gets slow or spotty, that's when things start breaking, and we get to find out the underlying issues in the libraries we take for granted.
Here's what it took, from a MySQL standpoint, to get authzd ready to serve all our production traffic while meeting our availability SLOs.
The crash
For a long time, "Invalid connection (unexpected EOF)"
was the most elusive of all known Go MySQL bugs. It's been over two years since I initially diagnosed and reported the bug upstream: we first came across this problem when deploying the initial versions of our repository backup orchestrator (gitbackups) to production. At the time, there was no consensus on how to fix it in the MySQL driver, so we fixed gitbackups with a workaround in the application's code. Last year, when doing our first authzd deploys to production, the issue was back in full effect: hundreds of Invalid connection errors per minute were dramatically increasing authzd's error response rate. This forced me to tackle the issue again, and I finally landed a fix upstream that solves the problem without having to change any application code. Let's dive into the issue and its fix.
Fixing the problem
The job of a Go SQL driver is to provide a simple primitive to the database/sql
package that comes in the Go standard library, a “SQL connection”. The connections that a driver provides are stateful, and cannot be used concurrently, because this is the default behavior for the connections of all major SQL servers (including MySQL and Postgres). This is where the database/sql
package comes in: it manages the access and lifetime of these connections. The DB
struct in the package, which is the entry point for all operations you can perform on a SQL database, is an abstraction over a connection pool, full of individual connections provided by the driver. Hence, calling methods like (*DB).Query
or (*DB).Exec
involves quite a bit of complex logic: polling the connection pool for a connection that is active but idle, or calling the underlying Driver to create a new connection if all the existing ones are busy, and then executing the command or query and returning the connection back to the pool, or discarding it if the pool is full.
The “unexpected EOF” crash we experienced in both gitbackups
and authzd
always happens when taking a connection out of the connection pool and trying to perform a query on it. The reason was obvious simply by reading the MySQL server logs. The server was closing connections that spent too long in Go’s sql.(*DB)
pool because they reached their idle timeout. Keep in mind, we run our production MySQL clusters with a pretty aggressive idle timeout (30s) to ensure we never have too many connections open at any given time. And yet, the fact that this invalid connection error was being reported to the app was puzzling (and a problem, because our service was continuously crashing).
It’s common for TCP connections to be interrupted. The database/sql
package is designed to handle an interruption in any of the hundreds of simultaneous connections that sit in a (*DB)
pool at any time, and the specific driver implementation is supposed to help with this. A driver connection can become unhealthy for a multitude of reasons, whether it’s because it’s been closed, it received a bad packet, or any of the many things that can go wrong with a SQL connection. Whenever a driver detects that a connection has become unhealthy, it’s supposed to return driver.ErrBadConn
on the next call to any of the methods in the connection.
ErrBadConn
is the magic word that signals the database/sql
package that this stateful connection is no longer valid. If the connection was in the pool, it needs to be removed from the pool, or if the connection was just pulled from the pool to perform a query, it must be discarded and a new connection must be pulled from the pool (or created) so the query can be performed. With this logic in mind, it’s expected that calling (*DB).Query
would never fail with an “invalid connection” error, because even if the library attempted to perform the query in an invalid connection, driver.ErrBadConn
would be returned and cause the query to be retried in another connection, transparently, and without the user noticing.
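To make that contract concrete, here's a minimal, hypothetical sketch of a driver connection honoring it—the type, its fields, and the omitted body are made up for illustration and are not the Go MySQL driver's actual code:

// mysqlConn is a made-up, pared-down connection type for illustration only;
// it is not the Go MySQL driver's real implementation.
type mysqlConn struct {
	closed bool
}

// Prepare illustrates the contract: once the connection is known to be broken,
// report driver.ErrBadConn so database/sql discards it and retries the
// operation on a fresh connection from the pool.
func (c *mysqlConn) Prepare(query string) (driver.Stmt, error) {
	if c.closed {
		return nil, driver.ErrBadConn // safe: nothing was sent on this connection
	}
	// ... write the prepare packet, read the server's response, etc.
	return nil, errors.New("omitted in this sketch")
}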
So, what’s going on in authzd
to cause these “invalid connection” errors?
There’s a nuanced mismatch between the TCP protocol and the MySQL wire protocol overlaid on top. A TCP connection is full-duplex, where data can be flowing in each direction independently of the other direction, while the overlaid MySQL protocol on top becomes fully client-managed once it’s in its Command Phase. A MySQL server only sends a packet to a MySQL client in response to a packet from the client. This becomes an issue when the server closes an active connection in the Command Phase, because it reached its idle timeout, or because connections are actively being pruned by processes such as pt-kill
. Let’s look at a network diagram:
Figure 1: A network flow diagram representing the packets sent between the Go MySQL driver and the MySQL server.
In this reproduction of the bug, we can see a standard TCP handshake, followed by a request and response pair at the MySQL level. We then sleep for several seconds. After a while without sending requests to the MySQL server, we reach the server-side timeout and the MySQL server performs an active close of our socket. When a TCP server closes a socket, the server's kernel sends a [FIN, ACK] pair to the client, which means that the server is done sending data. The client's kernel acknowledges the [FIN, ACK] with an [ACK], but it doesn't close its side of the connection—this is what we mean by TCP being full-duplex. The write side, client -> server, of the connection is independent, and must be explicitly closed by calling close on the client.
In most network protocols on top of TCP, this isn't an issue. The client is performing reads from the server, and as soon as it receives a [FIN, ACK], the next read returns an EOF error, because the kernel knows that the server won't write more data to this connection. However, as discussed earlier, once a MySQL connection is in its Command Phase, the MySQL protocol is client-managed. The client only reads from the server after it sends a request, because the server only sends data in response to requests from the client.
The previous network diagram clearly shows the practical result of this. After the sleep interval, we send a request to the server, and this write succeeds because our side of the connection is still open. The server then answers our request with a [RST] packet because it's actually closed—we just don't know about it yet. And then, when our client attempts to read the response from the MySQL server, it belatedly finds out that the connection is invalid as it receives an EOF error.
This answers why our connection is crashing, but not why our application code is crashing with it. Our MySQL connection is no longer valid, and we found out about it (better late than never), so why doesn't our MySQL driver return driver.ErrBadConn when this happens? And why doesn't it allow database/sql to retry our query on a brand new connection? Unfortunately, because transparently retrying the query is not safe to do in the general case.
The sequence of events depicted in the previous network diagram is very common. In our production MySQL clusters, it happens thousands of times per minute, particularly during off-peak hours when our service is mostly idle and the timeouts are reached. But that's far from the only thing that could cause a connection to be closed.
What would happen if we performed an UPDATE on a perfectly healthy connection, MySQL executed it, and then our network went down before it could reply to us? The Go MySQL driver would also receive an EOF after a valid write. But if it were to return driver.ErrBadConn, database/sql would run the UPDATE statement on a new connection. Disaster! Data corruption! And what a thoroughly annoying place to be in: we know that in the common case (when MySQL kills our connection), it's safe to retry these queries. There was no error when writing the query to the network, but we know MySQL didn't receive it — let alone execute it. But we must assume the worst, that the query did make it to the server. And hence, returning driver.ErrBadConn is not safe.
So, how do we go about fixing this? The most obvious solution, and the one we applied to gitbackups back in the day, is calling (*DB).SetConnMaxLifetime on our client to set the max lifetime of MySQL connections to a value lower than our MySQL cluster's idle timeout. This is far from ideal. For starters, SetConnMaxLifetime sets the maximum duration of any connection on the client, not its maximum idle duration. That means database/sql will continuously kill connections once they've lived that long, even when the connections are being actively used and could never trigger the server's idle timeout. The database/sql package doesn't provide a "maximum idle duration" API for databases, because not all SQL servers implement the concept of idle timeouts. Ah, the miseries of abstraction. In practice, this fix works OK, apart from the needless connection churn and the cases where the MySQL server is actively pruning connections, which this workaround cannot detect.
Clearly, the optimal solution would be checking a connection's health once it's pulled from the connection pool, and before we attempt to write any requests to it. This was unfortunately not possible until the introduction of SessionResetter in Go 1.10: before that interface was available, there was no way to know when a connection was being returned to the pool, which is why the bug stalled for almost two years before we could come up with an adequate fix.
There are two ways to check the health of that connection: at the MySQL level, or at the TCP level. A MySQL health check involves sending a PING packet and waiting for its response. It's always safe to return driver.ErrBadConn in that case, because the ping packet doesn't perform write operations in MySQL (one would hope). However, it has the downside of adding arbitrary latency to the first operation performed on a connection fresh off the pool. That's a bit spooky, because connections are pulled from and returned to the pool very often in a Go application. Consequently, we went with the cheaper and simpler fix, which is to simply check whether the MySQL server has closed its side of the connection at the TCP level.
Performing this check is very cheap; all we have to do is a non-blocking read on our connection before we attempt any writes. If the server has closed its side of the connection, we'll get an EOF immediately. If the connection is still open, the read also returns immediately, but with an EWOULDBLOCK error, signaling that no data exists to be read (yet). Now, the good news is that all sockets in a Go program are already set to non-blocking mode. Don't be fooled by the elegant abstraction of Goroutines: under the hood, the user-land Go scheduler uses an async event loop to put Goroutines to sleep and wake them once they have data to read (how quaint). The bad news is that we don't get to perform non-blocking reads by calling methods such as net.Conn.Read, because the scheduler will put us right to sleep (again, the elegant abstraction of Goroutines). The right interface to perform this non-blocking read wasn't introduced until Go 1.9: (*TCPConn).SyscallConn gives us access to the underlying file descriptor so we can use the read system call directly.
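Here's a simplified sketch of that health check, along the lines of the fix that eventually landed upstream. It's Unix-only (it reaches for the raw file descriptor), and the names are mine rather than the driver's:

import (
	"errors"
	"io"
	"net"
	"syscall"
)

var errUnexpectedRead = errors.New("unexpected read from socket")

// connCheck performs a cheap, non-blocking read on an idle connection before
// reusing it. EOF means the server closed its side; EAGAIN/EWOULDBLOCK means
// the connection is still open and simply has nothing to read (the good case).
func connCheck(conn net.Conn) error {
	sysConn, ok := conn.(syscall.Conn)
	if !ok {
		return nil // not a socket we can check; assume healthy
	}
	rawConn, err := sysConn.SyscallConn()
	if err != nil {
		return err
	}

	var checkErr error
	err = rawConn.Read(func(fd uintptr) bool {
		var buf [1]byte
		n, rerr := syscall.Read(int(fd), buf[:])
		switch {
		case n == 0 && rerr == nil:
			checkErr = io.EOF // server performed an orderly close
		case n > 0:
			checkErr = errUnexpectedRead // data arrived with no pending request
		case rerr == syscall.EAGAIN || rerr == syscall.EWOULDBLOCK:
			checkErr = nil // nothing to read: the connection is healthy
		default:
			checkErr = rerr
		}
		return true // don't wait for readability; we only wanted one peek
	})
	if err != nil {
		return err
	}
	return checkErr
}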
Armed with this new API, I was able to implement a connection health check that solved the "stale connections" issue with less than 5 microseconds of overhead. A non-blocking read is very fast because, huh, that's the point of non-blocking reads—they don't block.
After deploying the fix in production, all of the "Invalid Connection" needles disappeared immediately, resulting in our first "9" of availability for the service.
Figure 2: Client-side needles and retries in authzd after a deploy
Production tips
- If your MySQL server uses idle timeouts, or is actively pruning connections, you don't need to call (*DB).SetConnMaxLifetime in your production services. It's no longer needed, as the driver can now gracefully detect and retry stale connections. Setting a max lifetime for connections simply causes unnecessary churn by killing and re-opening healthy connections.
- A good pattern to manage high-throughput access to MySQL is configuring your sql.(*DB) pool ((*DB).SetMaxIdleConns and (*DB).SetMaxOpenConns) with values that support your peak-hour traffic for the service, and making sure that your MySQL server is actively pruning idle connections during off-hours. These pruned connections are detected by the MySQL driver and re-created when necessary, as sketched below.
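For example, a pool configured along these lines (the numbers are illustrative, not our production values):

db, err := sql.Open("mysql", dsn)
if err != nil {
	return err
}
// Size the pool for peak-hour traffic; idle connections left over during
// off-hours get pruned server-side and transparently re-created by the driver.
db.SetMaxOpenConns(100)
db.SetMaxIdleConns(100)
// Deliberately no db.SetConnMaxLifetime: the fixed driver detects stale
// connections with its cheap liveness check before reusing them.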
The timeout
When we put a service like authzd as a dependency in the standard request path for our monolith, we've fundamentally added its response latency to the response latency of our app—or at least to all the requests that perform some kind of authorization, which is a lot of them. It's crucial to ensure that authzd requests never take more time than what we've allotted for them, or they could very easily lead to a site-wide performance degradation.
From the Rails side, this means being very careful with how we perform RPC requests to the service and how we time them out. On the Go server side, it means one thing: Context. The context package was introduced to the standard library back in Go 1.7, although its API had already been available for years as a separate library. Its premise is simple: pass a context.Context to every single function that is involved in the life-cycle of your service's requests to make your service cancellation-aware. Each incoming HTTP request provides a Context object, which becomes canceled if the client disconnects early, and on top of it you can extend your Context with a deadline. This allows us to manage early request cancellation and request timeouts using a single API.
The engineering team in charge of developing the new service did a stellar job of ensuring that all methods in authzd
were passing around a Context
, including the methods that perform queries to MySQL. Passing a Context
to these methods is fundamental because they’re the most expensive part of our request life-cycle, and we need to ensure the database/sql
query methods are receiving our request’s Context
so they can cancel our MySQL queries early if they’re taking too long.
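As an illustration, a query helper in such a service looks roughly like this—the method name, table, and r.db field are made up; the important part is that the incoming request's ctx reaches the Context-aware query method:

func (r *resolver) actorCanRead(ctx context.Context, actorID, repoID int64) (bool, error) {
	// ctx carries the request's deadline; if it expires, the query is
	// canceled client-side instead of holding up the whole request.
	row := r.db.QueryRowContext(ctx,
		"SELECT COUNT(*) FROM abilities WHERE actor_id = ? AND repository_id = ?",
		actorID, repoID)

	var n int
	if err := row.Scan(&n); err != nil {
		return false, err
	}
	return n > 0, nil
}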
In practice, however, it seemed like authzd
wasn’t taking request cancellations or timeouts into account:
Figure 3: In-service response times for an authzd deployment. The resolver's timeout was set to 1000ms—you can tell from the helpful red line I've drawn on the graph, but definitely not from the many random spikes that climb all the way to 5000ms.
Just from looking at these metrics, it's pretty clear that even though we're successfully propagating our request's Context throughout our application's code, there's a point during the lifetime of some requests where cancellation is simply ignored. Finding this spot through code review would be quite hard, even though we strongly suspect that the source of the timeouts must be the Go MySQL driver, since it's the only part of our service that performs external requests. In this case, I captured a stack trace from a production host to find out where the code was blocking. It took several tries to capture a stack trace that looked relevant, but once I had one that was blocking on a QueryContext call, the issue became immediately apparent:
0 0x00000000007704cb in net.(*sysDialer).doDialTCP
at /usr/local/go/src/net/tcpsock_posix.go:64
1 0x000000000077041a in net.(*sysDialer).dialTCP
at /usr/local/go/src/net/tcpsock_posix.go:61
2 0x00000000007374de in net.(*sysDialer).dialSingle
at /usr/local/go/src/net/dial.go:571
3 0x0000000000736d03 in net.(*sysDialer).dialSerial
at /usr/local/go/src/net/dial.go:539
4 0x00000000007355ad in net.(*Dialer).DialContext
at /usr/local/go/src/net/dial.go:417
5 0x000000000073472c in net.(*Dialer).Dial
at /usr/local/go/src/net/dial.go:340
6 0x00000000008fe651 in github.com/github/authzd/vendor/github.com/go-sql-driver/mysql.MySQLDriver.Open
at /home/vmg/src/gopath/src/github.com/github/authzd/vendor/github.com/go-sql-driver/mysql/driver.go:77
7 0x000000000091f0ff in github.com/github/authzd/vendor/github.com/go-sql-driver/mysql.(*MySQLDriver).Open
at <autogenerated>:1
8 0x0000000000645c60 in database/sql.dsnConnector.Connect
at /usr/local/go/src/database/sql/sql.go:636
9 0x000000000065b10d in database/sql.(*dsnConnector).Connect
at <autogenerated>:1
10 0x000000000064968f in database/sql.(*DB).conn
at /usr/local/go/src/database/sql/sql.go:1176
11 0x000000000065313e in database/sql.(*Stmt).connStmt
at /usr/local/go/src/database/sql/sql.go:2409
12 0x0000000000653a44 in database/sql.(*Stmt).QueryContext
at /usr/local/go/src/database/sql/sql.go:2461
[...]
The issue is that we're not propagating our request's Context as deeply as we initially thought. Although we're passing the Context to QueryContext when performing SQL queries, and this context is being used to actually perform and time out the SQL queries, there's a corner case in the database/sql/driver API we're missing. When we're trying to perform a query but no connections are available in our connection pool, we need to create a new SQL connection by calling driver.Driver.Open(), but this interface method doesn't take a context.Context!
The previous stack trace clearly shows how in (8) we stop propagating our Context and simply call (*MySQLDriver).Open() with the DSN to our database. Opening a MySQL connection isn't a cheap operation: it involves doing a TCP open, SSL negotiation (if we're using SSL), performing the MySQL handshake and authentication, and setting any default connection options. In total, there are at least six network round trips which don't obey our request timeout, because they're not Context-aware.
What can we do about this? The first thing we tried was setting the timeout value on our connection's DSN, which configures the TCP open timeout that (*MySQLDriver).Open uses. But this isn't good enough, because the TCP open is only the first step of initializing the connection. The remaining steps (MySQL handshake, etc.) weren't obeying any timeouts, so we still had requests going well beyond the global 1000ms timeout.
The proper fix involved refactoring a large chunk of the Go MySQL driver. The underlying issue was introduced back in the Go 1.8 release, which implemented the QueryContext and ExecContext APIs. Although these APIs could be used to cancel SQL queries because they're Context-aware, the driver.Open method wasn't updated to actually take a context.Context argument. This new interface was only added two releases later, in Go 1.10: a new Connector interface was introduced, which had to be implemented separately from the driver itself. Hence, supporting both the old driver.Driver and the new driver.Connector interfaces required major changes to the structure of any SQL driver. Because of this complexity, and a lack of awareness that the old Driver.Open interface can easily lead to availability issues, the Go MySQL driver never got around to implementing the new interface.
Fortunately, I had the time and patience to undertake that refactoring. After shipping the resulting Go MySQL driver with the new Connector interface in authzd, we could verify from stack traces that the context passed to QueryContext was being propagated when creating new MySQL connections:
0 0x000000000076facb in net.(*sysDialer).doDialTCP
at /usr/local/go/src/net/tcpsock_posix.go:64
1 0x000000000076fa1a in net.(*sysDialer).dialTCP
at /usr/local/go/src/net/tcpsock_posix.go:61
2 0x0000000000736ade in net.(*sysDialer).dialSingle
at /usr/local/go/src/net/dial.go:571
3 0x0000000000736303 in net.(*sysDialer).dialSerial
at /usr/local/go/src/net/dial.go:539
4 0x0000000000734bad in net.(*Dialer).DialContext
at /usr/local/go/src/net/dial.go:417
5 0x00000000008fdf3e in github.com/github/authzd/vendor/github.com/go-sql-driver/mysql.(*connector).Connect
at /home/vmg/src/gopath/src/github.com/github/authzd/vendor/github.com/go-sql-driver/mysql/connector.go:43
6 0x00000000006491ef in database/sql.(*DB).conn
at /usr/local/go/src/database/sql/sql.go:1176
7 0x0000000000652c9e in database/sql.(*Stmt).connStmt
at /usr/local/go/src/database/sql/sql.go:2409
8 0x00000000006535a4 in database/sql.(*Stmt).QueryContext
at /usr/local/go/src/database/sql/sql.go:2461
[...]
Note how (*DB).conn now calls our driver's Connector interface directly, passing along the current Context. Likewise, it was very clear that the global request timeout was being obeyed for all requests:
Figure 4: Resolver response times with a timeout set at 90ms. You can tell that the Context is being respected because the 99th percentile looks more like a barcode than a graph
Production tips
- You don't need to change your sql.(*DB) initialization to use the new sql.OpenDB API. Even when calling sql.Open in Go 1.10, the database/sql package detects the Connector interface and new connections are created with Context awareness.
- However, changing your application to use sql.OpenDB with a mysql.NewConnector has several benefits, most notably the fact that the connection options for your MySQL cluster can be configured from a mysql.Config struct, without having to compose a DSN (see the sketch after this list).
- Don't set a ?timeout= (or its mysql.(Config).Timeout equivalent) on your MySQL driver. Having a static value for a dial timeout is a bad idea, because it doesn't take into account how much of your current request budget has been spent.
- Instead, make sure that every SQL operation in your app is using the QueryContext/ExecContext interfaces, so they can be canceled based on your current request's Context, regardless of whether the connection is failing to dial or the query is slow to execute.
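A sketch of the sql.OpenDB + mysql.NewConnector initialization mentioned above; the hostnames, credentials, and database name are placeholders:

cfg := mysql.NewConfig()
cfg.User = "authzd"
cfg.Passwd = os.Getenv("MYSQL_PASSWORD")
cfg.Net = "tcp"
cfg.Addr = "mysql-primary.internal:3306"
cfg.DBName = "authorizations"
// Deliberately no cfg.Timeout: per-query Contexts bound the dial instead.

connector, err := mysql.NewConnector(cfg)
if err != nil {
	return err
}
// sql.OpenDB wires up the driver.Connector, so new dials receive the
// Context of the query that triggered them.
db := sql.OpenDB(connector)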
The race
Last, here’s a very serious security issue caused from a data race, which is as devious as it is subtle. We’ve talked a lot about how sql.(*DB)
is simply a wrapper around a pool of stateful connections, but there’s a detail we haven’t discussed. Performing a SQL query, either via (*DB).Query
or (*DB).QueryContext
, which is one of the most common operations you can do on a database, actually steals a connection from the pool.
Unlike a simple call to (*DB).Exec
, which has no actual data returned from the database, (*DB).Query
may return one or more rows, and in order to read these rows from our app code, we must take control of the stateful connection where those rows will be written to by the server. That’s the purpose of the sql.(*Rows)
struct, which is returned from each call to Query
and QueryContext
: it wraps a connection we’re borrowing from the pool so we can read the individual Row
s from it. And this is why it’s critical to call (*Rows).Close
once we’re done processing the results from our query, otherwise the stolen connection never returns to the connection pool.
The flow of a normal SQL query with database/sql
has been identical since the very first stable release of Go. Something like:
rows, err := db.Query("SELECT a, b FROM some_table")
if err != nil {
return err
}
defer rows.Close()
for rows.Next() {
var a, b string
if err := rows.Scan(&a, &b); err != nil {
return err
}
...
}
That’s straightforward in practice: (*DB).Query
returns a sql.(*Rows)
, which is stealing a connection from the SQL pool—the connection on which we’ve performed the query. Subsequent calls to (*Rows).Next
will read a result row from the connection, whose contents we can extract by calling (*Rows).Scan
, until we finally call (*Rows).Close
and then any further calls to Next
will return false
and any calls to Scan
will return an error.
Under the hood, the implementation involving the underlying driver is more complex: the driver returns its own Rows interface, which is wrapped by sql.(*Rows)
. However, as an optimization, the driver's Rows doesn't have a Scan method; it has a Next(dest []Value) error iterator. The idea behind this iterator is that it returns Value objects for every column in the SQL query, so calling sql.(*Rows).Next simply calls driver.Rows.Next under the hood, keeping a pointer to the values returned by the driver. Then, when the user calls sql.(*Rows).Scan, the standard library converts the driver's Value objects into the target types that the user passed as arguments to Scan. This means that the driver doesn't have to perform argument conversion (the Go standard library takes care of it), and that the driver can return borrowed memory from its rows iterator instead of allocating new Values, since the Scan method converts the memory anyway.
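For reference, this is the driver-side iterator interface from database/sql/driver that this section describes (comments paraphrased):

// The driver-side rows iterator defined in database/sql/driver.
type Rows interface {
	// Columns returns the names of the result columns.
	Columns() []string

	// Close closes the rows iterator.
	Close() error

	// Next populates dest with the values of the next row. The driver may
	// hand out memory it owns (e.g. slices into its connection buffer),
	// which database/sql only reads until the following Next or Close call.
	Next(dest []Value) error
}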
As you may guess immediately, the security issue we're dealing with is caused by this borrowed memory. Returning temporary memory is actually an important optimization, which is explicitly encouraged in the driver contract, and it works well in practice for MySQL because it lets us return pointers into the connection buffer where we're reading results directly from MySQL. The row data for a SQL query, as it comes off the wire, can be passed directly to the Scan method, which will then parse its textual representation into the types (int, bool, string, etc.) that the user is expecting. And this has always been safe to do, because we don't overwrite our connection buffer until the user calls sql.(*Rows).Next again, so no data races are possible.
…This was the case, at least, until the introduction of the QueryContext APIs in Go 1.8. With these new interfaces, it's now possible to call db.QueryContext with a Context object that can interrupt a query early—we've discussed this at length in this post. But the fact that a query can be interrupted or timed out while we're scanning its resulting rows opens up a serious security vulnerability. Any time a SQL query is interrupted while inside a sql.(*Rows).Scan call, the driver overwrites the underlying MySQL connection buffer, and Scan returns corrupted data.
This may seem surprising, but it makes sense once we understand that canceling our Context means canceling the SQL query in the client, not in the MySQL server. The MySQL server continues serving the results of the query through our active MySQL connection, so if we want to be able to reuse a connection where a query has been canceled, we must first "drain" all the result packets that the server sent. To drain and discard all the remaining rows in the query, we need to read those packets into our connection buffer, and since query cancellation can happen concurrently with a Scan call, we overwrite the memory behind the Values the user is scanning from. The result is, with extremely high likelihood, corrupted data.
This race is scary enough as described so far, but here's the scariest thing of all: in practice, you won't be able to tell whether the race is happening or has happened in your app. In the earlier example of rows.Scan usage, even though we're performing error checking, there is no reliable way to tell whether our SQL query was canceled because our Context expired. If context cancellation happens inside the rows.Scan call (for instance, where it has to happen for this data race to trigger), the row values are scanned, possibly with corrupted data, but no error is returned, because rows.Scan only checks whether the rows are closed once, when entering the function. Hence, when the race triggers, we get back corrupted data without an error return. We won't be able to tell that the query has been canceled until our next call to rows.Scan, but that call never happens because rows.Next() returns false, again without reporting an error.
The only way to tell whether scanning the rows of our SQL query finished cleanly, or whether the query was canceled early, is checking the return value of rows.Close. And we're not doing that, because the rows are being closed in a defer statement. Oof. A quick Code Search for rows.Close() on GitHub shows that almost nobody is explicitly checking the return value of rows.Close() in their Go code. This was actually safe to do before the QueryContext APIs were introduced, because rows.Scan would always catch any errors, but since Go 1.8, this is no longer correct. Even after this data race is fixed in the Go MySQL driver, you won't be able to tell whether you've scanned all the rows of your SQL query unless you check the return value of rows.Close(). So update your linters accordingly.
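In other words, the earlier example needs an explicit Close check at the end—while keeping the defer, since closing the rows more than once is safe:

rows, err := db.QueryContext(ctx, "SELECT a, b FROM some_table")
if err != nil {
	return err
}
// Keep the defer: it releases the connection on panics and early returns.
defer rows.Close()

for rows.Next() {
	var a, b string
	if err := rows.Scan(&a, &b); err != nil {
		return err
	}
	// ...
}
// The explicit Close is what reports a query that was canceled mid-scan;
// Next and Scan may have swallowed it silently.
if err := rows.Close(); err != nil {
	return err
}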
I can think of several ways to fix this race. The most obvious is to always clone memory when returning from driver.Rows.Next, but that is also the slowest. The whole point of the driver contract is that the driver doesn't allocate memory when calling driver.Rows.Next, because the memory allocation is delayed until the user calls sql.(*Rows).Scan. If we allocate in Next, we're allocating twice for every row, which means this bug fix would cause a significant performance regression. The Go MySQL maintainers weren't OK with this, and I wasn't either. Other similar approaches, such as truncating the underlying connection buffer when calling sql.(*Rows).Close, were rejected for the same reason—they potentially allocate memory on every Scan. All these rejected fixes led to a stalemate between the maintainers, which caused this critical bug to go stale for more than six months. I personally find it terrifying that such a bug can go unfixed for so long, so I wanted to come up with a hot new take to fix the issue in our production hosts without causing a performance regression.
The first thing I tried was "draining" the MySQL connection without using the underlying connection buffer. If we don't write to the connection buffer, then the data race can't happen. This quickly became a nightmare of spaghetti code, because MySQL can send many kinds of packets which we must drain, and these packets all have different sizes, defined in their headers. Partially parsing those headers without a buffer was not pretty.
After some more thought, I finally came up with a solution that was simple and fast enough to eventually be merged upstream—double buffering. In ancient computer graphics stacks, one would write individual pixels directly to a frame buffer, and the monitor would concurrently read the pixels at its given refresh rate. When the frame buffer was being written to while the screen was reading from it, a graphical glitch occurred—what we commonly call flickering. Flickering is essentially a data race that you can see with your own eyes, and you'll agree that's a pretty cool concept. The most straightforward way to fix flickering in computer graphics is to allocate two frame buffers: one buffer that the screen will be reading from (the front buffer), and another buffer that the graphics processor will be writing to (the back buffer). When the graphics processor is done rendering a frame, we atomically flip the back buffer with the front buffer, so the screen never reads a frame that is currently being composed.
Figure 5: Double buffering in the Nintendo 64 graphics stack. If this is good enough for Mario, it's good enough for the MySQL driver
This situation sounds a lot like what's happening with the connection buffer in the MySQL driver, so why not fix it the same way? In our case, when our driver.Rows struct is being closed because a query was canceled early, we swap our connection buffer with a background buffer, so the user can still call sql.Rows.Scan on the front buffer while we're draining the MySQL connection into the back buffer. The next time a SQL query is performed on the same connection, we continue reading into the background buffer, until the Rows are closed again and we flip the buffers back to their original positions. This implementation is trivial, and although it allocates memory, it only does so once for the lifetime of a MySQL connection, so the allocation is easily amortized. We further optimized this corner case by delaying the allocation of the back buffer until the first time we need to drain data into it. In the common case where no queries are canceled early on a MySQL connection, no extra allocations are performed.
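Conceptually, the swap looks something like this—the names and structure are illustrative, not the driver's exact code:

type buffer struct {
	buf  []byte // active buffer: Scan may still hold pointers into it
	dbuf []byte // background buffer, allocated lazily on the first early cancel
}

// flip is called when a query is canceled early: draining the remaining
// result packets happens in the other buffer, so memory already handed out
// to Scan via driver.Value is never overwritten concurrently.
func (b *buffer) flip() {
	if b.dbuf == nil {
		b.dbuf = make([]byte, len(b.buf)) // one-time cost, amortized over the connection
	}
	b.buf, b.dbuf = b.dbuf, b.buf
}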
After carefully crafting some benchmarks to make sure that the double-buffering approach wasn't causing performance regressions in any cases, I finally landed the bug fix upstream. Sadly, we have no way to check how often we've performed corrupted reads in the past because of this bug, but it's probably been quite a few, because despite the subtlety of its origin, it's surprisingly easy to trigger.
Production tips
- Always explicitly check the error return of (*Rows).Close. It's the only way to detect whether the SQL query was interrupted during scanning. However, do not remove the defer rows.Close() call in your app, as it's the only way the Rows get closed if a panic or early return happens during your scanning. Calling (*Rows).Close several times is always safe.
- Never use (*Rows).Scan with a sql.RawBytes target. Even though the Go MySQL driver is now much more resilient, accessing RawBytes can and will fail with other SQL drivers if your query is canceled early. You will, most likely, read invalid data if that happens. The only difference between scanning into []byte and sql.RawBytes is that the raw version won't allocate extra memory. This tiny optimization isn't worth the potential data race in a Context-aware application.
Wrapping up
Deploying code in a new programming language to production is always a challenge, particularly at the scale GitHub operates. The throughput and complexity of our MySQL usage patterns means we've spent many years tweaking our MySQL client for Ruby, and now we're excited to do the same for the Go MySQL driver. The fixes we've upstreamed as part of our first Go deployment are now generally available in the 1.5.0 release of the Go MySQL driver, and we'll continue to contribute more fixes and features to the driver as we expand our production usage of Go.
Thanks to the Go MySQL maintainers, @methane and @julienschmidt, for their help reviewing and landing these patches!