PGP signatures on PyPI: worse than ineffective

May 21, 2023
TL;DR: A large number of PGP signatures on PyPI can’t be correlated to any well-known
PGP key and, of the signatures that can be correlated, many are generated from weak keys or
malformed certificates. The results suggest widespread misuse of GPG and other PGP implementations by Python
packagers, with said misuse being encouraged by the PGP ecosystem’s poor defaults, opaque
and user-hostile interfaces, and outright dangerous recommendations.
Preword
I’ve been sitting on this post for a few months, partially due to travel
and partially because its (intended) scope was beginning to mirror PGP’s own fractal complexity.
The version that I’m publishing now has been significantly pared down to remove lengthy
digressions on how bad PGP’s packet format is, all the different ways in which a signature or
certificate packet can be broken, incorrectly bound, &c.
I’ve removed these things because I think the results, as presented, are sufficient evidence
for the exact claims I’d like to make, namely:
- That existing PGP signatures on PyPI serve no security purpose, and that all evidence
  points to nobody ever attempting to verify them;
- That even advanced technical communities, as a whole, largely fail to reduce PGP’s complexity
  and pointless agility into a reasonable and tractable subset.
And, just in case it needs to be said:
- This post isn’t meant to disparage PyPI: PyPI has done everything right, including
  purposely removing frontend support for PGP years ago.
- This post isn’t meant to disparage individual packagers and maintainers still uploading
  signatures to PyPI. I believe that much of the continued signature uploading is a result
  of long-forgotten automation and, even when it isn’t: developers cannot be blamed for
  their misuse of obtuse tools. Security tools, especially cryptographic ones, are
  only as good as their least-informed and most distracted user.
Background
PyPI has supported PGP signatures in some form or another for a very long time.
To this day, PGP is still (minimally) supported: package uploaders can still sign their package
distributions and upload the resulting .asc to PyPI for inclusion in the index. The
official uploading utility even supports invoking gpg directly via the --sign and --sign-with
arguments!
To a novice Python programmer attempting to publish their first package to PyPI, this might give the
following impressions:
- That PGP offers secure and modern cryptographic primitives;
- That PyPI encourages users to upload PGP signatures, or that doing so is best practice;
- That others expect PGP signatures, and that package adoption is (partially) predicated
  on supplying PGP signatures.
The first two are just wrong:
- PGP is an insecure and outdated ecosystem that hasn’t reflected
  cryptographic best practices in decades.
- PyPI’s support is vestigial in nature: signatures aren’t shown as part of the web interface,
  and are only obliquely referenced in the PEP 503 and JSON APIs.
The third is harder to immediately refute: PyPI still hosts signatures, after all. Absent any
other information, it’s entirely possible that companies and end users are quietly and diligently
verifying whatever signatures are present, using trust sets, tracking revoked and expired keys,
and so on.
Thus, my goal with this blog post:
- Determine how many signatures are on PyPI;
- Correlate those signatures to their signing keys;
- Analyze those signing keys for their practical value: their strength, liveness, &c.
Methodology
Relatively early in the process I decided not to collect every single signature on PyPI,
for two main reasons:
- Relevance: PyPI hosts many old package distributions, including distributions
  for Python 2.7 (and earlier!). Given that Python 2 has been EOL for over three years at
  this point, it didn’t feel relevant (or efficient) to retrieve large quantities of
  signatures for distributions that nobody is likely to ever try to install.
- Fairness: both PGP and Python have a lot of history, much of which predates
  modern understandings of cryptographic best practices.
  Given that, it didn’t feel fair to analyze extremely old
  signatures, especially if doing so would bias the statistics away from newer users
  who are doing more responsible things.
Given these considerations, I decided to limit my analysis to only signatures uploaded to PyPI
on or after 2020-03-27. I chose that date somewhat arbitrarily while
also satisfying a few constraints:
- It’s well after the 2018 deployment of the new PyPI,
  which didn’t emphasize support for PGP signatures (while still retaining it). In other words:
  signatures uploaded in 2020 or later were either produced by automation (implying some degree
  of sophistication) or were likely a conscious decision by a packager to continue signing
  with PGP.
- It’s very recent, and best practices around digital signatures haven’t changed
  significantly since 2020. In other words: a best-practices signature (and key) made in 2020
  should look very similar to a best-practices signature (and key) made in 2023, and someone
  signing in 2020 would have no good excuse for not making reasonable choices.
Actually retrieving the signatures was a multi-step process. To start, I used
PyPI’s BigQuery dataset
to give me some basic metadata on every distribution file with an associated signature:
SELECT name, version, filename, python_version, blake2_256_digest
FROM `bigquery-public-data.pypi.distribution_metadata`
WHERE has_signature
  AND upload_time > TIMESTAMP("2020-03-27 00:00:00")
This produced 52900 distributions uploaded since 2020-03-27 for which PyPI also
had a signature (subtract 1 for the CSV header):
$ wc -l inputs/dists-with-signatures.csv
52901 inputs/dists-with-signatures.csv
$ head -2 inputs/dists-with-signatures.csv
name,version,filename,python_version,blake2_256_digest
pantsbuild.pants.testutil,1.30.0,pantsbuild.pants.testutil-1.30.0-py36.py37.py38-none-any.whl,py36.py37.py38,7ecbe47906ddbe8a2f1ee2505c2edb7f9313348d4925855e429be1d316660a00
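As a small illustration, the exported CSV can be loaded into typed records for the later steps. This is a minimal sketch rather than the scripts actually used for this post: the `Dist` record and `load_dists` helper are hypothetical names, and the column names are assumed to match the BigQuery schema (`name`, `version`, `filename`, `python_version`, `blake2_256_digest`).

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Dist(NamedTuple):
    """One row of the export (hypothetical record type)."""

    name: str
    version: str
    filename: str
    python_version: str
    blake2_256_digest: str


def load_dists(lines: Iterable[str]) -> List[Dist]:
    """Parse CSV text (header row included) into Dist records."""
    reader = csv.DictReader(lines)
    return [Dist(*(row[field] for field in Dist._fields)) for row in reader]


# A two-line stand-in for inputs/dists-with-signatures.csv.
sample = io.StringIO(
    "name,version,filename,python_version,blake2_256_digest\n"
    "pantsbuild.pants.testutil,1.30.0,"
    "pantsbuild.pants.testutil-1.30.0-py36.py37.py38-none-any.whl,"
    "py36.py37.py38,"
    "7ecbe47906ddbe8a2f1ee2505c2edb7f9313348d4925855e429be1d316660a00\n"
)
dists = load_dists(sample)
print(len(dists), dists[0].name)
```

For the real export, the same helper should accept an open file handle, e.g. `open("inputs/dists-with-signatures.csv", newline="")`.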
From here, I needed to retrieve each release distribution’s detached signature, i.e.
the adjacent .asc URL in PyPI’s object storage.
I initially did this with the “conveyor” service, which turns
PEP 491 names into URLs like so:
https://files.pythonhosted.org/packages/source/{version}/{name[0]}/{name}/{dist}.asc
However, this was pretty lossy: for whatever reason my URLs were slightly off about 20% of the
time, resulting in a lot of missed signatures. I eventually realized that the BigQuery dataset
also includes the Blake2 digest for each distribution, meaning that I could use the actual
package URLs instead:
https://files.pythonhosted.org/packages/{digest[0:2]}/{digest[2:4]}/{digest[4:]}/{dist}.asc
…and this was completely dependable.
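The digest-based layout can be sketched as a small helper. This is an illustrative sketch, not the code used for the audit: `signature_url` and `FILES_HOST` are hypothetical names, and the input validation is an added assumption about what a well-formed hex-encoded 256-bit digest looks like.

```python
import string

FILES_HOST = "https://files.pythonhosted.org"


def signature_url(digest: str, filename: str) -> str:
    """Return the .asc URL next to a distribution, given its hex Blake2 digest."""
    if len(digest) != 64 or any(c not in string.hexdigits for c in digest):
        raise ValueError(f"expected 64 hex characters, got {digest!r}")
    # The first two byte pairs become directories; the rest is the third one.
    return (
        f"{FILES_HOST}/packages/"
        f"{digest[0:2]}/{digest[2:4]}/{digest[4:]}/{filename}.asc"
    )


url = signature_url(
    "7ecbe47906ddbe8a2f1ee2505c2edb7f9313348d4925855e429be1d316660a00",
    "pantsbuild.pants.testutil-1.30.0-py36.py37.py38-none-any.whl",
)
print(url)
```

Because the digest uniquely identifies the uploaded file, this construction avoids any guessing about name normalization.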
From here, I wanted to figure out (roughly) how many unique keys produced these ~50k signatures.
I decided to use PGPy for that; excerpted from dists-by-keyid.py:
# Excerpt from inside the per-distribution loop; assumes `import sys` and `import pgpy`.
sig = pgpy.PGPSignature.from_blob(sig_resp.content)
try:
    # Accessing .signer can raise on very old signatures; see
    # https://github.com/SecurityInnovation/PGPy/issues/433
    sig.signer
except AttributeError:
    print("barf: could not get signer, probably ancient", file=sys.stderr)
    _KEY_ID_MAP["<invalid signer>"].append(rec)
    continue
_KEY_ID_MAP[sig.signer].append(rec)
This left me with a large map of PGP key IDs to a list of distributions
signed by them, along with 26 distributions whose signatures PGPy couldn’t parse:
Package name | Distribution count |
---|---|
agraph-python | 2 |
excerpt-html | 4 |
lektor-index-pages | 6 |
lektor-expression-type | 2 |
lektor-git-timestamp | 2 |
lektor-datetime-helpers | 3 |
lektor-limit-dependencies | 2 |
lektorlib | 2 |
lektor-polymorphic-type | 3 |
This is a tiny failure (26 distributions out of 52900, or roughly 0.05%), but it
sets the tone for the rest of the post.
Aside from these 26 failures, the remaining 52874 signatures were produced from
1067 “unique” PGP keys.
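The grouping step can be sketched independently of any particular PGP library. In the sketch below, `group_by_signer` and the `signer_of` callable are hypothetical stand-ins (in the excerpt above, that role is played by PGPy’s signature parsing), and the fake parser exists only to make the example runnable.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

INVALID = "<invalid signer>"


def group_by_signer(
    records: Iterable[Tuple[str, bytes]],
    signer_of: Callable[[bytes], str],
) -> Dict[str, List[str]]:
    """Map each signer key ID to the distribution filenames it signed."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for filename, sig_blob in records:
        try:
            key_id = signer_of(sig_blob)
        except (AttributeError, ValueError):
            # Unparseable signatures are bucketed together instead of dropped.
            key_id = INVALID
        groups[key_id].append(filename)
    return dict(groups)


def fake_signer_of(blob: bytes) -> str:
    """Toy parser: treats the blob itself as an ASCII key ID."""
    if not blob:
        raise ValueError("unparseable signature")
    return blob.decode("ascii")


groups = group_by_signer(
    [
        ("a-1.0.tar.gz", b"DEADBEEF"),
        ("a-1.1.tar.gz", b"DEADBEEF"),
        ("b-0.1-py3-none-any.whl", b""),
    ],
    fake_signer_of,
)
unique_keys = len(groups) - (1 if INVALID in groups else 0)
print(unique_keys, groups)
```

Keeping the failures in their own bucket is what makes it possible to report them separately from the count of unique keys.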
Results
At this point, I had 1067 unique key IDs, each of which needed to be retrieved
from a keyserver.
My expectation was that this wouldn’t be a significant problem,
despite the widely publicized implosion of the SKS keyserver network back in
2018: there are still a few major
keyservers running, and package authors
pushing to PyPI should have the presence of mind to upload their keys. Right?
Pictured: your author immediately before attempting to retrieve PGP keys in 2023.
Wrong. Of the 1067 key IDs collected from signatures on PyPI, a full 308
(or roughly 29%) had no publicly discoverable key on the biggest remaining
keyservers. In other words: roughly 1/3rd of all keys used to sign on PyPI since 2020
aren’t discoverable by the PGP ecosystem’s own tooling.
They might exist, hidden on personal domains and documentation pages, but, for
all intents and purposes, these 29% of keys are useless.
So, our first graphic of the post: discoverable keys versus undiscoverable ones:
Pictured: a very normal and healthy signing ecosystem.
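That left 759 discovered keys to actually audit. To keep things
simple, I limited my analysis to just a few considerations: whether each key has a
binding signature (both at its creation time and at the time of the audit), and the
algorithm and parameters of its primary and “effective” signing keys.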
If that seems like a limited analysis, it’s because it is: there are too many
ways to produce a weirdly shaped PGP certificate and/or key packet sequence,
and the existing tooling (things like pgpdump and gpg --with-colons) wasn’t up to the task.
Instead, I wrote a little tool (pgpkeydump) to give me machine-readable
dumps of PGP keys, and then wrapped it in a bulk auditing script
that does some basic statistics on the results.
To summarize the results:
- Of the 759 discovered keys, 298 (39%) had no binding signature at their specified
  creation time. In other words: these keys’ certificates came with no verifiable evidence for
  an associated identity, expiry, or any of the other basic metadata conceptually associated
  with a PGP key, including its intended purpose.
- 375 (49%) had no binding signature at the time of the audit (2023-05-19), meaning that
  any binding signature that was present had already expired. In other words: half of all
  discovered keys used to sign on PyPI since 2020 are already expired. This strongly suggests that
  nobody is attempting to verify signatures from PyPI on any meaningful scale.
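To make the two checks concrete, here is a deliberately simplified model of “has a binding signature at time T”. It is a sketch under stated assumptions, not the logic of the actual auditing tool: `BindingSig` and `has_live_binding` are hypothetical names, and a real check would also need to verify each signature cryptographically and account for revocations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass(frozen=True)
class BindingSig:
    """Simplified binding signature: only its validity window is modeled."""

    created: datetime
    expires: Optional[datetime] = None  # None means "never expires"


def has_live_binding(sigs: Sequence[BindingSig], at: datetime) -> bool:
    """True if at least one binding signature covers the instant `at`."""
    return any(
        s.created <= at and (s.expires is None or at < s.expires) for s in sigs
    )


# A key created in 2020 whose only binding signature lasts two years.
key_created = datetime(2020, 4, 1)
audit_time = datetime(2023, 5, 19)
sigs = [BindingSig(created=key_created, expires=key_created + timedelta(days=730))]

live_at_creation = has_live_binding(sigs, key_created)
live_at_audit = has_live_binding(sigs, audit_time)
print(live_at_creation, live_at_audit)
```

In this model, a key with no binding signatures at all fails both checks, while a key whose signatures have all lapsed fails only the audit-time check.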
Then, on the algorithm and parameter sides:
Primary keys:
Key type | Count |
---|---|
RSA-4096 | 497 |
RSA-2048 | 127 |
RSA-3072 | 45 |
DSA-1024 | 40 |
EdDSA | 35 |
DSA-3072 | 7 |
DSA-2048 | 4 |
NIST P-521 | 1 |
RSA-4064 | 1 |
RSA-4032 | 1 |
“Effective” keys:
Key type | Count |
---|---|
RSA-4096 | 471 |
RSA-2048 | 151 |
RSA-3072 | 47 |
EdDSA | 43 |
DSA-1024 | 31 |
DSA-3072 | 7 |
DSA-2048 | 5 |
NIST P-521 | 1 |
brainpoolP512r1 | 1 |
RSA-4032 | 1 |
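The shares discussed in the rest of this section can be recomputed directly from the “effective” key counts in the table above. The sketch below only restates those counts; `share_of` is a hypothetical helper, and percentages are taken over the sum of the table rows.

```python
# Counts copied from the "effective" keys table.
effective = {
    "RSA-4096": 471,
    "RSA-2048": 151,
    "RSA-3072": 47,
    "EdDSA": 43,
    "DSA-1024": 31,
    "DSA-3072": 7,
    "DSA-2048": 5,
    "NIST P-521": 1,
    "brainpoolP512r1": 1,
    "RSA-4032": 1,
}
total = sum(effective.values())


def share_of(*prefixes: str) -> float:
    """Percentage of keys whose type starts with any of the given prefixes."""
    matching = sum(n for kind, n in effective.items() if kind.startswith(prefixes))
    return 100 * matching / total


print(f"total keys in table: {total}")
print(f"RSA-4096 + RSA-3072: {share_of('RSA-4096', 'RSA-3072'):.1f}%")
print(f"RSA-2048:            {share_of('RSA-2048'):.1f}%")
print(f"DSA (any size):      {share_of('DSA-'):.1f}%")
```

Matching on the `DSA-` prefix (with the hyphen) keeps EdDSA keys out of the DSA bucket.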
Or again, as pretty charts:
First, the “good” parts:
- While generally a bad choice, RSA is effectively
  the best you can do in terms of standard asymmetric signing algorithms in PGP. Over
  two thirds of keys used to sign on PyPI are using it with reasonable
  key sizes (4096 and 3072).
Then, the meh:
- A sizeable minority (20% of effective keys, and 17% of primary keys) are RSA-2048.
  NIST considers RSA-2048 to be equivalent to roughly 112 bits of security, and
  does not recommend its use on data that’s expected to have a security life
  of 15 years…starting in 2015. That means that PyPI-hosted signatures against RSA-2048 keys
  have roughly 7 years of “shelf life” left in them. Version turnover in packaging ecosystems
  has accelerated over the last decade; let’s hope that applies here too!
- Some enterprising people are on the “bleeding edge”: they’re using
  EdDSA and a few different ECDSA curves. It’s hard to say whether this is good or bad: it’s
  good in the sense that these are almost certainly better than anything offered by
  strictly RFC 4880 PGP implementations, but pointless in the sense that support for verifying
  these signatures is limited to just a few clients. It’s also probably
  pointlessly slow (for P-521 and brainpoolP512r1 in particular).
And finally, the insane:
- Roughly 5% of all keys used to sign packages on PyPI are DSA. The majority
  of those are DSA-1024, which is roughly equivalent in strength to RSA-1024.
  DSA of any length is already very bad,
  and DSA-1024 is well outside of any acceptable security margin for signatures in
  2023, much less 2020 or even 2010.
- RSA-4064 and RSA-4032. I have no idea why anyone would do this. Maybe some
  misguided attempt to calculate a precise security margin, or a misreading of someone else’s
  recommendations?
- One of the RSA-2048 keys has a public exponent of 41, rather than 65537
  (which every other RSA key in the dataset uses). Again, I have no idea why anyone
  would do this: it’s pointlessly slower and opens up padding issues that e = 65537
  is resilient against.
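Oddities like these are easy to flag mechanically once the key parameters are in machine-readable form. The following is a minimal sketch: `rsa_oddities` is a hypothetical helper, and both the set of “standard” modulus sizes and the expected exponent are assumptions chosen to match the discussion above rather than rules from any specification.

```python
from typing import List

# Assumed conventions for this sketch, not normative values.
STANDARD_RSA_BITS = {2048, 3072, 4096}
COMMON_EXPONENT = 65537


def rsa_oddities(bits: int, exponent: int) -> List[str]:
    """Return human-readable notes about unusual RSA key parameters."""
    notes = []
    if bits < 2048:
        notes.append(f"modulus shorter than 2048 bits ({bits})")
    elif bits not in STANDARD_RSA_BITS:
        notes.append(f"non-standard modulus size {bits}")
    if exponent != COMMON_EXPONENT:
        notes.append(f"unusual public exponent {exponent}")
    return notes


for bits, exponent in [(4096, 65537), (4064, 65537), (2048, 41)]:
    print(bits, exponent, rsa_oddities(bits, exponent))
```

A check like this says nothing about whether a key is actually safe to trust; it only surfaces parameters that deserve a second look.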
Takeaways
To summarize: these results cover just the PGP signatures uploaded to PyPI in the last three years.
By all rights, these numbers represent the best possible case for PGP signatures on
PyPI. Expanding the audit to 2015 or even earlier would likely reveal far worse practices.
In one sense, none of this is a problem: the breadth and depth of issues here
suggests that nobody (thankfully!) is actually relying on these signatures,
and the continued presence of new signatures on PyPI is primarily a vestige of
forgotten automation and outdated tutorials.
On the other hand, these results present a strong case against attempting
to “rehabilitate” PGP signatures for PyPI, or any other packaging ecosystem:
all evidence points to end users (i.e., signers) being unable to distinguish
between the “good” and “bad” parts of PGP, much less use them at all (e.g. keyservers).
So, for final conclusions:
- Given how broken the PGP signatures and keys present on PyPI are, it’s unlikely that anybody
  is currently doing wide-scale verification against them.
- If anybody is (and I’d be interested to hear if you are!), then it’s almost certainly
  inadvisable: “verifying” these signatures is, on average, likely to provide a
  false degree of confidence in their value.
As with previous posts, I’ve tried to make my steps and data reproducible, and have
checked all of them into this repo. I welcome any discoveries of errors I’ve made, as
well as any attempts to improve the overall detail or fidelity of the results!