Publish-quantum Cryptography for the Go Ecosystem

filippo.io/mlkem768 is a pure-Go implementation of ML-KEM-768 optimized for correctness and readability. ML-KEM (previously often known as Kyber, renamed as a result of we are able to’t have good issues) is a post-quantum key trade mechanism within the technique of being standardized by NIST and adopted by many of the business.
The bundle quantities to ~500 lines of code, plus 200 traces of feedback, and 650 traces of exams. It has no dependencies apart from golang.org/x/crypto/sha3. It’s meant for upstreaming into the Go commonplace library (initially as an internal-only bundle utilized in an opt-in crypto/tls experiment) and was designed to offer excessive safety assurance by means of ease of evaluate, simplicity, and thorough testing.
I livecoded a part of its growth on Twitch, and you’ll watch the replay on YouTube.
In contrast to most different implementations, this code was not ported from the reference pq-crystals library, however written from scratch not having ever carefully learn different codebases. This was an intentional train in spec validation, to indicate it’s potential to provide an interoperable implementation from the specification alone.
The FIPS 203 document turned out to be a superb implementation information, with detailed pseudo-code, exhaustive definitions, and constant kind info. (That is one thing I wish to ask of any giant specification doc: outline your varieties and use them and denote them!) To make the code each simpler to evaluate and higher as a studying useful resource, perform and variable names, and even operation ordering, are fastidiously picked to reflect the FIPS specification.
The specification truly requires pretty restricted math background, however to facilitate the work of implementers, I wrote up Enough Polynomials and Linear Algebra to Implement Kyber.
Past that, the one elements left as an train to the reader have been
- implementing arithmetic modulo the prime 3329;
- concretely implementing the compress and decompress features mapping values [0, 3329) to and from [0, 2ᵈ); and
- ensuring constant time operations.
Modulo arithmetic was reasonably easy, as we all collectively learned a lot about finite field arithmetic through years of RSA and elliptic curve implementations. The small prime actually makes the task feel unnaturally simple.
Compression and decompression turned out to be the most difficult part of the project. The specification defines them in abstract terms as fractions and rounding rules—“just” compute (2ᵈ/q)·x or (q/2ᵈ)·y and round to the closest integer—but in practice we need to implement them with constant time arithmetic and bitwise operations! In my public comments I pointed out that having each implementation figure out a strategy is risky and redundant. I was more correct than I thought: it turned out that the reference implementation and ~every implementation ported from it used a division which depending on compiler optimizations and platform might result in a DIV instruction, which is variable-time even when the divisor is fixed. This package was unaffected, because it used Barrett reduction from the start, like BoringSSL.
You can read the rest of my formal public comments on the pqc-forum mailing list.
Readability was a major goal of the implementation, and it was pursued even especially for complex functions like compression and decompression. A readable implementation has two purposes: first, it allows effective review, both during the code review process and later by interested researchers, improving security; second, it serves as an educational resource for the next generation of maintainers and cryptography engineers (or curious nerds). Reading the Go cryptography standard library is how I got started on the path that led me here, so it is especially important to me to preserve and improve it as a learning resource. It’s obviously subjective, but I believe this to be the most understandable public ML-KEM/Kyber implementation. Compare for example our compression/decompression functions with the reference implementation.
Sometimes improving readability and reviewability means making code longer and less reusable: for example for ML-KEM-768 we need to serialize 1-, 4-, 10-, and 12-bit integers in a packed format. A universal 1-to-12 bit encoder and decoder is a pretty gnarly piece of code to write correctly, but each of those four sizes are actually pretty easy to write a dedicated encoder/decoder for. This is why we have ringCompressAndEncode1/4/10
etc. instead of a single universal function. This also made it easy to work some special required checks into the 12-bit decoder.
This, by the way, was only possible because we targeted ML-KEM-768 specifically, or we’d have had to implement 5- and 11-bit encodings, as well. ML-KEM is specified at three security levels (-512, -768, and -1024). However, the Kyber team recommends using -768 over -512 for a more conservative security margin against novel cryptanalysis, while -1024 exists only for the same reasons 256-bit security levels exist: compliance and blind strength matching. Most protocols being tested or standardized coalesced around ML-KEM-768, so targeting only that improves not only readability, but also security (because there are fewer moving parts), and performance (because we can optimize allocation sizes, iteration counts, and encoding algorithms) at little to no cost.
After readability, testing is the main component in this package’s high security assurance strategy. Besides checking that key generation, encapsulation, and decapsulation round-trip correctly, and maintaining a test coverage of 95%+, we
- ensure interoperability with test vectors obtained from NIST and other implementations;
- exhaustively test every input combination for base field arithmetic operations (addition, subtraction, and multiplication modulo 3329) against expected values computed trivially with variable-time operations;
- exhaustively test compression and decompression against math/big.Rat (contributed by David Buchanan);
- test that pre-computed constants match their definition;
- check that incorrect lengths (both long and short) cause the appropriate error for every input of every function;
- run an extensive set of reusable test vectors we developed (see below);
- run test vectors provided by Sophie Schmieg which will be eventually included in Wycheproof.
Our test vectors are designed to be reusable by other implementations, and are published as part of the CCTV project along with detailed intermediate values for testing and debugging each intermediate step and partial algorithm, which we used during development. There are different sets of tests vectors, each designed to reach different edge cases.
-
Negative test vectors provide invalid encapsulation keys, where the coefficients are higher than 3329. These were often requested, since all the test vectors from the Kyber and NIST teams are for regular, correct inputs. These vectors individually test every value from 3329 to 2¹²-1 and every coefficient location, sharing the remaining coefficients so they compress from 1–3 MiB down to 12–28 KiB.
-
“Unlucky” vectors require an unusually large number of XOF reads. Kyber samples a matrix from a portion of public keys with rejection sampling: it gets a random value between 0 and 2 ¹²-1 and checks if it’s less than 3329, if not, it tries again. The amount of bytes needed to sample a matrix depends on how lucky you get with the sampling, and that’s a random function of the public key component. These vectors are regular public keys and require reading more than 575 bytes from the SHAKE-128 XOF in SampleNTT, which would ordinarily happen with probability 2⁻³⁸. Sophie’s vectors were bruteforced further, and require up to 591 bytes.
At this point I would like to thank our detection and response team for not killing my job(s) hashing vast amounts of random seeds and looking for zeroes in the output. — Sophie Schmieg
-
Special vectors fail if strcmp is used in ML-KEM.Decaps. In ML-KEM.Decaps the ciphertext is compared with the output of K-PKE.Encrypt for implicit rejection. If an implementation were to use
strcmp()
for that comparison it would fail to reject some ciphertexts if a zero byte terminates the comparison early. This one I hope is going to sit as a silent trap for years—who would usestrcmp()
in cryptographic code—and then ruthlessly kill a vulnerability, because of course someone will. -
Accumulated vectors (derived from the reference pq-crystals implementation) allow testing randomly reachable edge cases without checking in large amounts of data. The reference implementation of Kyber includes a
test_vectors.c
program that generates 300MB of random vectors. I had no intention of checking in the output or compiling C, but since they are just randomly generated vectors, we can regenerate them in our tests from the deterministic RNG (SHAKE-128 with an empty input) and check they hash to an expected value. We can even take it further, and produce hashes for a million random tests, beyond the 10k they generate.
I am happy to report that none of the tests, many introduced after completion of the implementation, identified any issues in filippo.io/mlkem768. There is at least one reported instance of the negative vectors identifying a defect in a major implementation, though.
Performance is not a primary goal (neither of this package nor of the Go cryptography packages) but the package needs to be fast enough to be useful. Thankfully, ML-KEM is pretty fast, to the point that this simple implementation is competitive with our assembly-optimized P-256 and X25519 implementations.
To compare apples to apples, note that we need to compare the whole operation that each side needs to perform for key establishment: for ECDH, two scalar multiplications (one of them by the fixed base point); for KEMs, key generation and decapsulation on one side, and encapsulation on the other. ECDH is symmetrical, ML-KEM key establishment is not.
The ECDH benchmarks below already include the two scalar multiplications, while the mlkem768 benchmarks are split as key generation and decapsulation under “Alice” and encapsulation under “Bob”. Since decapsulation includes a full encryption (to check the resulting ciphertext matches the input), Alice takes a lot longer than Bob: the latter does an encryption, while the former does an encryption, a decryption, and a key generation.
All in all, “Bob” is as fast as our X25519 or P-256, while “Alice” takes less than twice. Compared to some of the fastest ML-KEM implementations out there (BoringSSL and libcrux), this package takes approximately double the time. For such a simple and unoptimized implementation, this is more than satisfactory.
goos: darwin
goarch: arm64
cpu: Intel(R) Core(TM) i5-7400 CPU @ 3.00GHz
pkg: crypto/ecdh
│ sec/op │
ECDH/P256-8 49.43µ ± 0%
ECDH/X25519-8 77.46µ ± 0%
pkg: filippo.io/mlkem768
│ sec/op │
RoundTrip/Alice-8 109.4µ ± 0%
RoundTrip/Bob-8 56.19µ ± 0%
goos: linux
goarch: amd64
pkg: crypto/ecdh
│ sec/op │
ECDH/P256-4 78.88µ ± 1%
ECDH/X25519-4 115.6µ ± 2%
pkg: filippo.io/mlkem768
│ sec/op │
RoundTrip/Alice-4 223.8µ ± 2%
RoundTrip/Bob-4 114.7µ ± 1%
The performance wasn’t entirely free. In general, I followed high-performance Go programming patterns, trying for example to minimize heap allocations. Next, I reworked the x/crypto/sha3 package so it could be used without any heap allocation thanks to the mid-stack inlining trick. However, I haven’t merged those changes yet and they are not included in the benchmarks above, because they have a negative effect on Apple M2 processors. No idea why yet.
goos: darwin
goarch: arm64
pkg: filippo.io/mlkem768
│ sec/op │ sec/op vs base │
RoundTrip/Alice-8 109.4µ ± 0% 121.3µ ± 1% +10.91% (p=0.000 n=10)
RoundTrip/Bob-8 56.19µ ± 0% 59.94µ ± 2% +6.66% (p=0.000 n=10)
goos: linux
goarch: amd64
│ sec/op │ sec/op vs base │
RoundTrip/Alice-4 223.8µ ± 2% 218.6µ ± 1% -2.32% (p=0.000 n=10)
RoundTrip/Bob-4 114.7µ ± 1% 109.5µ ± 0% -4.57% (p=0.000 n=10)
The one successful optimization was complaining about the confusing result above on the Gophers Slack #performance
channel, which sniped Josh Bleecher Snyder into contributing a couple changes 🙂
There is some low hanging fruit still: key generation and decapsulation both sample a matrix from the same value, and since the two are usually done sequentially on the Alice side, the matrix could be stored saving around 10% time. There might be an opportunity to save a copy in the sha3 read path, too. After that, it’s a matter of optimizing the field implementation.
If you got this far, you might want to follow me on Bluesky at @filippo.abyssdomain.expert or on Mastodon at @filippo@abyssdomain.expert.
Bonus track: using a ML-KEM implementation as Kyber v3
NIST made a few small changes to the Round 3 submission of Kyber. They are summarized in Section 1.3 of the FIPS draft.
However, there are a few experimental protocols defined in terms of Kyber v3 (or “draft00”), including the main deployed PQ TLS key exchange. Do we have to make a separate package to support them?
Luckily, no we don’t.
One change adds some validation for an edge case (non-canonical coefficient encodings in public keys) that was undefined in Kyber. Honest implementations will not produce such keys, so we can reject them as specified in the FIPS draft. It will make it possible to fingerprint our implementation as Kyber-on-ML-KEM but will be otherwise harmless.
One change removed a hashing step applied to CSPRNG input. Since those bytes are random, it’s impossible for any party to tell the difference.
The final change is the major one, and the trickiest. The ciphertext used to be hashed into the shared secret. This difference would prevent interoperability. However, the mixing happens as an additional key derivation, which was entirely removed in ML-KEM, which instead returns the value K as-is. This means we can run ML-KEM to generate the shared secret K and then apply
SHAKE-256(K || c)[:32]
to generate the Kyber shared secret. No want to interrupt the ML-KEM abstraction.
There’s one wrinkle: each Kyber and ML-KEM carry out implicit rejection in Decapsulate by hashing a secret with the ciphertext and returning that because the shared secret. If we do the important thing derivation above on prime of ML-KEM, we’ll hash the ciphertext twice for implicit rejections. That’s okay, as a result of the output of implicit rejection is unpredictable by design, not an interoperation goal.
The image
In Berlin there’s an outdated closed airport, Tempelhof, which is now a public park. Strolling down the taxiways (pictured) or alongside the centrelines of the 09L/27R and 09R/27L crossed-out runways is kinda unsettling, no less than for me. (“Ought to I be talking with Floor or Tower? Can I enter this runway?”) Enjoyable truth, in 2010 a single-engine airplane forgot to change gas tank and did an emergency touchdown on 27L. Closed runways are the very best unhealthy locations to land, in spite of everything.
This work was funded by a Google Open Source Security Subsidy and by my superior purchasers—Sigsum, Latacora, Interchain, Smallstep, Ava Labs, Teleport, and Tailscale—who, by means of our retainer contracts, get face time and limitless entry to recommendation on Go and cryptography.
Listed below are a number of phrases from a few of them!
Latacora — We wrote about password hashing with delegation, a considerably much less identified password hashing primitive. It is a PBKDF with a particular property, that enables offloading hashing computation to a doubtlessly untrusted server. On this weblog put up, we describe this primitive and focus on its applicability within the context of Finish-to-Finish Encrypted (E2EE) backup methods.
Teleport — For the previous 5 years, assaults and compromises have been shifting from conventional malware and safety breaches to figuring out and compromising legitimate consumer accounts and credentials with social engineering, credential theft, or phishing. Teleport Identity Governance & Security is designed to eradicate weak entry patterns by means of entry monitoring, decrease assault floor with entry requests, and purge unused permissions by way of necessary entry opinions.
Ava Labs — We at Ava Labs, maintainer of AvalancheGo (probably the most broadly used consumer for interacting with the Avalanche Network), consider the sustainable upkeep and growth of open supply cryptographic protocols is important to the broad adoption of blockchain know-how. We’re proud to assist this vital and impactful work by means of our ongoing sponsorship of Filippo and his crew.