CQK Is The First Unused TLA · Gwern.net

2023-11-14 16:27:15

Curious what the first ‘unused’ alphabetic acronym is, I have GPT-4 write a script to check English Wikipedia. After 3 bugs, the first unused one turns out, as of 2023-09-29, to be the three-letter acronym ‘CQK’, with another 2.6k TLAs unused, and 393k four-letter acronyms unused. Exploratory analysis suggests alphabetical-order effects in addition to letter-frequency ones.

It sometimes seems as if everything that can be trademarked has been, and as if every possible three-letter acronym (TLA) has been used in some nontrivial way by someone. Is this true? No—actually, a fair number, starting with CQK, have no nontrivial use so far.

We can check this by defining ‘nontrivial’ as ‘has an English Wikipedia article, disambiguation page, or redirect’, and then writing a script which simply looks up every possible TLA Wikipedia URL to see which ones exist. This would be a little too easy, so I make it harder by having GPT-4 write a Bash shell script to do so (then Python to double-check).

GPT-4 does so semi-successfully, making self-repairable errors until it runs into its idiosyncratic ‘blind spot’ error. After it accidentally fixes that, the script appears to work successfully, revealing that—contrary to my expectation that every TLA exists—the first non-existent acronym is the TLA ‘CQK’, and that there are many unused TLAs (2,684 or 15% unused) and far more unused four-letter acronyms (392,884 or 86% unused). I provide the list of all unused TLAs & four-letter acronyms (as well as alphanumeric ones—the first unused alphanumeric one is AA0).

TLAs are not unused at random, with clear patterns enriched in letters like ‘J’ or ‘Z’ vs ‘A’ or ‘E’. Further GPT-4-powered analysis in R suggests that both letter-frequency & position in the alphabet predict unusedness to some extent, but leave much unexplained.

Verifying Wikipedia links in my essays, I always check acronyms by hand: there seems to always be another definition for any acronym, especially three-letter acronyms (TLAs)—and sometimes an absurd number. Trying a random TLA for this essay, “Zzzzzz”, I found it was used anyway! This makes me wonder: has every possible alphabetic TLA been used?

This cannot be true for too many sizes of acronyms, of course, but it might be possible for your usual three-letter acronym, because there are relatively few of them. You have to go to four-letter acronyms before they look inexhaustible: there are 26^1 = 26 possible single-letter ones, 26^2 = 676 two-letter ones, 26^3 = 17,576 three-letter ones, but then many more four-letter ones at 26^4 = 456,976. So I’d expect all TLAs to be exhausted, and to find the first unused acronym somewhere in the FLAs (similar to how every English word has been trademarked, forcing people to come up with increasingly nonsensical names to avoid existing trademarks & parasites like domain squatters).
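
To make the combinatorics concrete, here is a quick back-of-the-envelope count (a minimal Python sketch; the generation code used later in this essay is Haskell & Bash):

# How many n-letter alphabetic acronyms exist for n = 1..4?
for n in range(1, 5):
    print(n, 26 ** n)
# 1 26
# 2 676
# 3 17576
# 4 456976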

How do we define ‘used’? If we merely look for any use whatsoever, this would not be interesting. Surely they have all been used in a serial number or product number somewhere, or simply squatted on in various ways. I wouldn’t be surprised if someone has squatted on every TLA on Github or in domains or social-media user account names, for example—it’s free or cheap, and you only have to extort one whale to extract a rent. Similarly, ‘number of Google hits’ is a bad proxy because it will be inflated by technical garbage, and as search engines have evolved far from their roots in counting word frequencies in a text corpus, the number of Google hits appears to bear increasingly little resemblance to anything one might expect. Google Ngram is mostly historical data, and has many data-quality issues related to OCR & data selection which could affect acronyms particularly.

We want a comprehensive, curated, online database which reflects a human sense of ‘importance’. If there is no reason someone would have heard of a TLA use, then that doesn’t count: a use must be at least somewhat notable, in the sense that someone might look it up or it might be a notable use: ‘having a Wikipedia page’ comes to mind as a heuristic. Indeed, not just having a Wikipedia article, but having a Wikipedia disambiguation page is ideal, since it indicates multiple uses; having a Wikipedia article is also good; even having a redirect to another page seems reasonable to count as ‘used’ in some sense, because it suggests that someone used that TLA in a context where a human would want to look it up & there is a real meaning to the TLA. (Whereas if no editor could be bothered to even redirect a TLA to an existing page, that is a low bar to fail.) That is, simply checking for any Wikipedia page is a reasonable criterion.

And defining notability this way, we can check it simply by requesting the WP URL for a TLA and seeing if it returns an error.
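
For instance, a minimal Python sketch of that check (simply: does the article URL return HTTP 404?); the rest of this page has GPT-4 write the equivalent in Bash instead:

import requests

def tla_used(tla):
    """'Used' = the English Wikipedia URL for the acronym resolves to some page
    (article, disambiguation page, or redirect) rather than returning a 404 error."""
    return requests.head("https://en.wikipedia.org/wiki/" + tla).status_code != 404

print(tla_used("AAA"), tla_used("CQK"))  # True False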

Generating all possible acronyms isn’t that hard; the Haskell list monad, for example, can generate endless permutations or sequences in a line, so if we wanted all the acronyms, it’s just this:

take 100 [ s | n <- [1..], s <- sequence $ replicate n ['A'..'Z']]
-- ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N",
-- "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z",
-- "AA", "AB", "AC", "AD", "AE", "AF", "AG", "AH", "AI", "AJ", "AK", "AL",
-- "AM", "AN", "AO", "AP", "AQ", "AR", "AS", "AT", "AU", "AV", "AW", "AX",
-- "AY", "AZ", "BA", "BB", "BC", "BD", "BE", "BF", "BG", "BH", "BI", "BJ",
-- "BK", "BL", "BM", "BN", "BO", "BP", "BQ", "BR", "BS", "BT", "BU", "BV",
-- "BW", "BX", "BY", "BZ", "CA", "CB", "CC", "CD", "CE", "CF", "CG", "CH",
-- "CI", "CJ", "CK", "CL", "CM", "CN", "CO", "CP", "CQ", "CR", "CS", "CT",
-- "CU", "CV"]

We could then do a Network.HTTP request. But that would be too easy. We can use this as an excuse to try out the most advanced neural network I have access to: GPT-4.

GPT-3’s programming abilities were a bit of a surprise, but rarely worth using for anyone with reasonable expertise, and one had to use a highly-specialized model like Codex/Github Copilot for coding; GPT-3.5 was considerably better; and GPT-4 is better yet. I can’t compare GPT-4 to Github Copilot because I have not signed up nor figured out how to integrate it into my Emacs, but (as the early rumors promised) I have found GPT-4 good enough at programming in the main programming languages I use (Bash, Emacs Lisp, Haskell, Python, & R) to start turning over trickier tasks to it, and to make heavier use of the languages I don’t know well (Emacs Lisp & Python), since I increasingly trust that an LLM can help me maintain them.

However, GPT-4 is still far from perfect, and it doesn’t produce perfect code immediately; simply dumping large amounts of GPT-4-generated source code into your code base, “as long as it compiles and seems to work!”, seems like a good way to build up technical debt. (It also undermines future AIs, if you are dumping out buggy hot-mess code masquerading as correct, debugged, well-thought-out code—some GPT-4 code will be completely wrong as it confabulates solutions, due to problems like the “blind spot”.) You could try to track some ‘taint’ metadata, such as by segregating AI-generated code, and avoiding ever manually editing it or mixing it with human-written code; but this seems like a lot of work. My preferred approach is simply to make GPT-4 ‘git gud’—write sufficiently good code that I can check it into git without caring where it came from.

So, this section covers what I have learned from trying to prompt-engineer my programming tasks, using GPT-4 in the OpenAI Playground, up to November 2023.

I find it helpful generally to try to fight the worst mealy-mouthed bureaucratic tendencies of the RLHF by adding a ‘system prompt’:

The user is Gwern Branwen (gwern.net). To assist:

Be terse. Do not offer unprompted advice or clarifications. Speak in specific, topic-relevant terminology. Do NOT hedge or qualify. Do not waffle. Speak directly and be willing to make creative guesses. Explain your reasoning. If you don’t know, say you don’t know.

Remain neutral on all topics. Be willing to reference less reputable sources for ideas.

Never apologize.

Ask questions when unsure.

It helps to be more structured in how you write things: the more the LLM has to do at once, the more likely it is to screw things up and the harder error-correction becomes. GPT-4 is capable of fixing many errors in its code, as long as it only has to do so one at a time, in an inner-monologue-like sequence; you can feed it errors or outputs, but surprisingly often, it can fix errors if you simply say that there is an error.

So a waterfall-like approach works well, and I try to use GPT-4 like this:

  1. ask it to ask questions, which it rarely does by default when you’re prompting it to do a task

    Usually it has a few questions, which you can efficiently update your original prompt to cover.

    This avoids annoying cases where it will write an entirely valid solution to a somewhat different problem than the one you have, and I think a good statement upfront probably subtly helps guide the rest of the process.

  2. make it generate tests; have it iteratively generate new tests which don’t overlap with the old ones.

    This is also useful when starting to modify some existing code: first generate the test-cases, and verify that the code actually works the way you assumed it did, and flush out any hidden assumptions by either you or GPT-4! Then return to step #1.

  3. ask GPT-4 explicitly to make a list of ideas: edge-cases, bug-fixes, features, and stylistic rewrites/lints (in that order)

    It does not implement any of the suggestions; it merely lists them. If you instead tell it to implement the ideas, it will frequently trip over its own feet while trying to implement all of them simultaneously in a single pass through the new code. (Just like humans, it is best to do one thing, check it, and then do the next thing.)

    1. frequently, one or more of the items will be a bad idea, or too risky to ask GPT-4 to do. Go one at a time through the list, having it implement just that one, and then test. Try to fix ‘core’ things first.

    2. self-repair: not infrequently, a fancy rewrite will fail the test-suite (which we did generate in step #2, right?), but given the failing test-case and/or error pasted into the Playground, GPT-4 can usually fix it. (If GPT-4 can’t fix it given several tries and seems to be producing the same code fragments repeatedly or resorting to elaborate & extreme rewrites, even though the task doesn’t seem that hard, then you may have hit the blind spot and will need to fix it yourself—I have never seen GPT-4 escape the blind spot except by sheer accident.)

    3. cleanup: finally, you can ask it to rewrite the code for style/linting, but you should leave that to the end, because otherwise it risks adding bugs while changing the code in ways that may wind up being discarded anyway.

  4. once it’s clean and it has either finished the list or you’ve disapproved the suggestions, and the test-suite is passing, ask it to write a summary/design doc at the beginning and any additional code comments inside it.

    GPT-4 will usually add a few comments in the code body itself, but not good ones, and it won’t usually write an adequate overall summary doc unprompted. However, by this point, it has the context to do so should you ask it to.

With all this, you’re set up for maintainable code: with the test-suite and the up-front design doc, future LLMs can handle it natively (and will be able to learn from training on it), and you can easily add test-cases as you run into bugs; humans should be able to read the code easily after step #3 has finished, so you don’t have to care where it came from or try to track ‘taint’ through all future refactorings or usage—GPT-4 can write readable human-like code, it just doesn’t necessarily do it the best way the first time.

While you may not necessarily have saved time (at least, if it’s in a language you’re highly proficient in), you have saved yourself a lot of mental energy & irritation (and made it much easier just to get started) by making GPT-4 do the tedious work; it almost transforms programming from too-often-frustrating work full of papercuts & brokenness into spectator entertainment.

Some examples of nontrivial code I have written this way (ie. excluding the many little snippets or modifications I have used GPT-4 for, especially for the finer points of Bash syntax), with GPT-4 doing most (?) of the work, by language, in roughly chronological order:

  • Bash: tab completion for the upload script, so it tab-completes the file and then the remote destination directory.

    I have no interest in learning the guts of Bash tab-completion in order to set up more advanced positional tab-completion; but GPT-4 already knows how to do it.

  • Python: latex2unicode.py uses GPT-4 to convert LaTeX math fragments to HTML+CSS+Unicode, which is much easier to edit/style, renders faster, and looks more natural; as LaTeX is a full-blown and rather hard-to-parse language, this is extremely difficult to do in any standard formal sense.

    This is a good example of the loop: I wrote none of the Python, but seeded it with a few instructions & manual rewrites from my existing LaTeX → Unicode pipeline; then I prompted GPT-4 to ask for any LaTeX it could think of which it was unsure how to translate. After it gave a few examples, I would manually translate them or add a new instruction, and ask again. Most of the examples it asked about I would not have thought of, like playing-card suits (which are supported—clubsuit, diamondsuit etc).

  • Haskell:

    • add <poster> thumbnails for videos

      This is a frustrating one because, as far as I can tell from running it, the GPT-4 code is easy to read and works flawlessly: it parses the HTML as expected, creates the required thumbnail, and rewrites the HTML <video> appropriately. It’s just that for some reason, the rest of my Hakyll codebase doesn’t run it, or it somehow breaks however it’s actually called, and I have never figured out why. (The opacity of Hakyll Haskell and the sheer complexity of the Gwern.net codebase in operation means that when a rewrite pass goes awry, it is exceptionally difficult to figure out what goes wrong.)

    • link metadata handling: the finicky handling of how links on Gwern.net get assigned the various bits of metadata determining whether they will pop up annotations etc. had built up into a rat’s-nest of guards & if-thens over time. When yet another feature broke because I misunderstood what the handler would do, I resolved to rewrite it to clarify the logic. My first refactoring attempts failed, as I kept losing track mentally and adding in bugs.

      Then I threw up my hands and assigned the job to GPT-4, and it was able to cleanly refactor it after some iterations, and didn’t appear to introduce any bugs.

    • correct URL rewrites: a large refactoring of how URLs are rewritten to point to better URLs relied on GPT-4.

      URLs on Gwern.net can be rewritten in several ways, like to point to a mirrored version hosted locally or on a specialized website. For example, Medium.com has become extraordinarily reader-hostile, and so Medium links are rewritten to the equivalent Scribe.rip link. (See LinkArchive.hs & Archiving URLs for a fuller explanation of what & why we do all this.)

      In easy cases like that, it’s as simple as s/medium.com/scribe.rip/, but in some cases, it’s necessary to formally parse a URL into a URI data structure, extract the many complicated parts (like host, path, query, and fragment), and rewrite them to the new URL. Haskell’s Network.URI can do all this, but if one isn’t familiar with URI concepts and the library, it’s all so much gobbledegook and leaves one trapped in a maze of tiny types & functions, each alike. Every time I have gone near it before, I have been repelled by its force field.

      GPT-4 was able to handle all the parsing & reformatting, with special cases for each domain individually, and then refactor out the duplication, and make the final version look positively easy (including the later bug-fix when it turned out I had misunderstood how a particular URL argument was supposed to go).

    • printing out large numbers not in scientific notation: necessary for proper inflation-adjusted dollar amounts, but weirdly difficult in Haskell’s default libraries.

      After running into this issue several times, I resorted to the full workflow of test-suite and iterative revising. The pretty-printing is still more limited than I would like, but covers all numeric magnitudes it would be reasonable to inflation-adjust, and the test-suite gives me confidence that this time it is finally right.

    • compile-time location of internal cross-references, to set the arrow direction statically as a browser layout optimization

      A Gwern.net feature is to make internal cross-references between sections less cognitively-taxing by specifying whether the reference is before or after the current location. For example, in this document, in the abstract, many sections are linked, and each of them has a down-arrow (‘↓’) symbol: this tells you that the link target is below, and so you know you haven’t read the target yet, so you can decide whether you want to skip ahead or keep reading. In other cases, like a link later on in this page, the link instead has an up arrow (‘↑’), because it is pointing to earlier material before it: now you know you have already read what it is referring to, and can remember it, and you may decide to ignore it. This is better than a mere opaque link, or even an internal link symbol like a section sign (‘§’): “See §discussion of Z”—well, what discussion? There was some mention of Z before; is that ‘the discussion’? Is there a larger later ‘discussion’ I haven’t read yet, that maybe I want to pop up and read now? Is this even in the same essay? Or what? Opaque cross-reference links create friction, as the reader is left with few clues about whether they want to spend effort to follow the link.

      It’s easy enough to write some JavaScript to run over an HTML page, detect all internal anchor links, and set the before/after arrow direction, and that is what we did for a long time. But while easy to write, this isn’t quite so easy for the browser to run (especially on long or heavily-hyperlinked pages), and it was adding a small amount to the layout time. And it’s not necessary (or esthetic) to do this at runtime, because the locations of most links are known at compile-time. We knew all along that a Pandoc rewrite pass could take a document, look at all the links, decide whether they come before or after each other, and add the necessary arrow metadata. It’s just that this would be a stateful traverse requiring monadic operations, and I was unsure how to do all the tree-navigation operations to descend/ascend to find where something was. Because it was not a critical performance bottleneck, I put off this micro-optimization.

      Eventually, I had a eureka moment: all that complexity about locating pairs of elements was unnecessary. All you need is to traverse the AST in order while updating a Set data-structure to record whether you have seen a target link ID before; then at each cross-reference link, you have either seen the target link ID before, and therefore it must be earlier within the document, or you have not yet seen the target, and therefore it must be after within the document.

      Once I had that simplification, it was a piece of cake to instruct GPT-4 to define a Data.Set set & a State monad to do that walk, and set up a test-suite to verify correctness, which did catch a few edge-cases (like links in the same paragraph).

    • title-case formatting of text: my existing title-case code didn’t handle cases involving hyphens, so it would generate titles like “Foo Bar-bar Baz”, which I felt looked ugly compared to capitalizing after hyphens as well (ie. “Foo Bar-Bar Baz”).

      GPT-4 handled the somewhat finicky string-munging and set up a test-suite, which I would be glad for later when I ran into another case where punctuation made lowercase look bad.

    • detecting imbalanced brackets/quotes in documents

      A particularly insidious family of typos is imbalanced brackets/parentheses/quotes: authors often fail to close a parenthesis pair or get lost, particularly in medical abstracts. This is a concern because often it indicates a more serious syntactic error, like an HTML <a> where the href= is malformed. I had a simple check which tested whether the total count of each character was an even number, but this failed to catch many typos: [[]] is correct and has an even number of each bracket, but that is equally true of, say, the swapped equivalent ]][[. It’s a well-known case of needing a full stack, in order to push/pop each bracket in order, to detect not just numerical missingness but wrong order. (A minimal sketch of the idea appears after this list.)

      It’s not that hard, but tedious. It was something I did in an early CS course, and I felt that was enough for one lifetime, so I was happy to see if GPT-4 could do it. It could, and as I expected, it turned up scores of instances that had slipped through all my proofreading. (I didn’t ask it to set up a test-suite because the Gwern.net corpus is the test-suite in this case.)

    • checking that sets of rules don’t overlap

      Gwern.net configuration requires thousands of rewrite rules covering an endless army of special cases. Inevitably, the sets of rules will overlap or become redundant, especially as websites change domains or URLs get automatically updated. Overlap can cause bugs, or even kill the site compilation, if some update to either rules or essay content accidentally triggers a hidden infinite loop. So each config should ideally check for ‘redundancy’—but each set of (key, value) pairs tends to have a different need: some need the keys to be unique, some need the values to be unique, some need both to be unique, some need just the pairs to be unique, and heck, some are actually 3-tuples of (key, value 1, value 2), why not.

      GPT-4 wrote out all the necessary checks and refactored them, and I applied them to the existing configs, and indeed discovered hundreds of overlaps and several serious bugs of the ‘how did this ever work’ variety.

    • infinite loop (cycle) detection in rewrite rules

      In checking that rules don’t overlap with each other, there are nasty cases that can’t be detected simply on a (key, value) basis. Specifically, in doing rewrites, a rewrite can create an infinite loop even if there is no overlap whatsoever: for example, if we accidentally define a set of rewrite rules like [A → B, B → C, C → A], then all keys are unique, all values are unique, and all pairs are unique, but we have defined an infinite loop, and if our code ever encounters any of the values A–C, then it will loop forever or crash. This is especially bad because it will only happen at runtime, and will depend on the exact inputs (so it may not trigger immediately), and will be hard to debug or trace back to the responsible rule.

      And this is what happened now and then with Gwern.net updates; the Wikipedia URL rewrites were particularly error-prone, as Wikipedia editors sometimes change their minds about what URL an article should be at, so if it gets moved over a redirect, it is not hard to have a config which rewrites the old article title to the new article title, and then later on, discover that the new article title has been renamed back to the old article title and add a rule for that.

      To deal with this, we must treat the rules as defining a directed graph, and detect any cycles. Graph analysis is not something I have done much of, so even though Haskell’s Data.Graph should be fully capable of this, I didn’t know where to start, and put it off until a particularly annoying infinite loop made me reach for GPT-4 in anger. (A sketch of the cycle-detection idea appears after this list.)

      GPT-4 struggled with the problem, as its test-suite kept finding bugs in its strongly connected component approach—but it did it eventually with little help from me (not that I could help, because I know no more of how to use Data.Graph than when I started). Infinite loops haven’t been an issue since, and I am more free to add rules because I don’t worry about triggering such nastily fatal bugs.

  • Emacs Lisp: mostly helper functions for more conveniently editing abstracts & excerpts for use as annotation popups

    • call out to a GPT-based script for parsing a single block-paragraph abstract into a more readable multi-paragraph format and updating the buffer

    • do one search-and-replace: when adding a link to define a term, one only wants to do one search-and-replace; however, all the existing search-and-replace functions assume that you may want to replace any or all of the hits, and so you have to waste time exiting or canceling or skipping after doing the desired first replacement.

      This defines a query-replace-regexp-once which will, as the name suggests, do just one replacement and then exit without further ado.

    • remove hard-newlines in Markdown: hard newlines or line-wraps, while ignored by Markdown and equivalent to regular line-wrapping, are ugly to read and interfere with many editing operations like search-and-replace. (What if you want to replace a phrase, but that phrase is broken across a line?)

    • mark up various HTML/Markdown fragments: various shortcuts for surrounding a word or a particular region of text with various HTML/Markdown snippets like italics or smallcaps.

    • interactively add inflation-adjustment markup: going around editing each dollar amount foo into [foo]($2023) in a buffer is a bit tedious. This detects dollar amounts, asks for a date, and queries each possible mark-up.
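
As promised above, a minimal Python sketch of the stack-based check from the ‘imbalanced brackets/quotes’ item (the real implementation is in Haskell and also covers quotes & HTML contexts; this is only the core idea):

PAIRS = {')': '(', ']': '[', '}': '{'}

def balanced(text):
    """Push openers, pop on closers: ']][[' fails even though each bracket count is even."""
    stack = []
    for ch in text:
        if ch in '([{':
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    return not stack

print(balanced("[[]]"), balanced("]][["))  # True False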
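
And a minimal Python sketch of the cycle-detection idea from the ‘infinite loop’ item: treat the rewrite rules as a directed graph (here each key rewrites to exactly one value) and follow each chain, stopping if a key on the current path recurs. (The real version uses Haskell's Data.Graph.)

def find_cycle(rules):
    """Return a cyclic chain like ['A', 'B', 'C', 'A'] if the rewrite rules loop, else None."""
    for start in rules:
        path, node = [], start
        while node in rules:
            if node in path:
                return path[path.index(node):] + [node]
            path.append(node)
            node = rules[node]
    return None

print(find_cycle({"A": "B", "B": "C", "C": "A"}))  # ['A', 'B', 'C', 'A']
print(find_cycle({"A": "B", "B": "C"}))            # None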

Overall, GPT-4 has been extremely useful and has handled many tasks I simply wouldn’t have had the gumption to do.

So, to test out GPT-4’s coding abilities, I prompt it for a script, with a short prompt along the lines of:

Write a Bash shell script to generate all alphabetic acronyms up to 6 letters and check whether a Wikipedia article exists.

I specify ‘up to 6 letters’ both just to be safe (as 26^6 = 308,915,776, we surely will hit an unused TLA before then!), and also to test GPT-4: I’m sure that it is able to write a naive acronym-generation algorithm and then do a curl request, but is it good enough to write a reasonable incremental or streaming solution, avoiding performance problems from the combinatorial explosion—without explicitly being told to do so?
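
(For comparison, a streaming solution is easy in a language with lazy lists or generators; eg. a minimal Python sketch which yields acronyms one at a time instead of materializing all ~321 million of them:)

import itertools
from string import ascii_uppercase

def acronyms(max_len=6):
    # Yield 'A', 'B', ..., 'ZZZZZZ' lazily: by length, then alphabetically.
    for n in range(1, max_len + 1):
        for letters in itertools.product(ascii_uppercase, repeat=n):
            yield ''.join(letters)

print(list(itertools.islice(acronyms(), 30)))  # ['A', ..., 'Z', 'AA', 'AB', 'AC', 'AD']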

It was not. The first GPT-4 failure went like this:

for length in {1..6}
do
  # generate all possible acronyms of the given length
  printf "%${length}s\n" {A..Z}{A..Z}{A..Z}{A..Z}{A..Z}{A..Z} | cut -c1-$length |
  while read acronym
  do
      echo $acronym
  done
done
# ...

Leaving aside correctness or the IO part, the acronym-generation code here is unrunnable due to memory consumption: it generates all possible acronyms upfront, which is a problem because there are sum (map ((^) 26) [1..6]) = 321,272,406 such acronyms, and it would take a long time to generate them one by one and will OOM when handled naively by Bash as a bunch of strings in RAM.

I tell GPT-4 this code is wrong because it will OOM, and it obliges with a rewrite like this:

for length in {1..6}
do
  # generate all possible acronyms of the given length
  for acronym in $(eval echo {A..Z}{1..$length})
  do
    echo $acronym
  done
done

This version is more incremental and immediately starts printing out acronyms, but the generation of acronyms is immediately, blatantly wrong and looks like: A1 B1 C1 ... Z1 Z2 Z3 Z4 Z5 Z6. (It should instead look like A ... ZZZZZZ.) I provide GPT-4 with the error & examples, and it rewrites again:

for length in {1..6}
do
  # define a function to recursively generate acronyms
  generate_acronyms() {
    local current=$1
    local length=$2
    if (( length == 0 ))
    then
        echo "$current"
    else
      for letter in {A..Z}
      do
        generate_acronyms "$current$letter" $((length-1))
      done
    fi
  }
  # call the function to generate acronyms of the current length
  generate_acronyms "" $length
done

The output now looks reasonable, as it starts with A and is soon printing out acronyms like GEBO.

Now that the acronym generation is working, we can look at GPT-4’s revised whole script, which it prints out as follows:

Here is a Bash script that uses the Wikipedia API to check for the existence of a page with the acronym’s title. For simplicity, it only checks acronyms with uppercase letters. It outputs acronyms that do not have a corresponding Wikipedia article.

```bash
#!/bin/bash
# set the range of acronym lengths
for length in {1..6}
do
  # generate all possible acronyms of the given length
  printf "%${length}s\n" {A..Z}{A..Z}{A..Z}{A..Z}{A..Z}{A..Z} | cut -c1-$length |
  while read acronym
  do
    # query the Wikipedia API
    response=$(curl -s "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=$acronym")
    # check if the page exists
    if [[ $response == *'"missing"*' ]]
    then
      echo $acronym
    fi
  done
done
```
The script iterates over all possible acronyms of a given length (from 1 to 6 letters). For each acronym, it queries the Wikipedia API. If the response contains the string "missing", it means that no Wikipedia article with that title exists, so the script prints the acronym.

Note: Running this script may take a long time due to the large number of acronyms and the rate limits of the Wikipedia API. Consider adding delays or using parallel processing to speed up the process and avoid overloading the API.

GPT-4 chose to check the Wikipedia API (used by Gwern.net WP popups, incidentally) instead of doing what I would do, which would be a curl --head request; but that is probably a better strategy, as the API response might be smaller and will be more separate from the main website, so it is politer. (I would typically not want to bother looking up the exact API endpoints and URLs, because there are so many and the WP API documentation is confusing. If we were interested in doing this really properly, we would probably download a list of all page titles from the Wikipedia dumps to batch-query offline.) The API response looks like this for a used & an unused TLA:

$ curl --silent "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=AAA"
# {"batchcomplete":"",
#    "query":{"pages":{"149335":{"pageid":149335,"ns":0,"title":"AAA"}}}}

$ curl --silent "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=CQK"
# {"batchcomplete":"",
#   "query":{"pages":{"-1":{"ns":0,"title":"CQK","missing":""}}}}

This code looks reasonable, and appears to run successfully.

But it still has a serious bug: it will never print out any hits. This is because it has made a subtle error in the string glob matching a ‘missing’ response: if [[ $response == *'"missing"*' ]] will never match anything, because the second match-anything asterisk is inside the single-quotation marks, which forces Bash to match a literal asterisk, rather than matching any string. What it should be is a single-character difference, swapping the single-quote/asterisk: *'"missing"'*
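
(As an illustration of the failure mode, though not of the Bash quoting rules themselves, the analogous patterns in Python's fnmatch behave the same way, with [*] playing the role of the quoted, and therefore literal, asterisk:)

from fnmatch import fnmatch

response = '{"batchcomplete":"","query":{"pages":{"-1":{"ns":0,"title":"CQK","missing":""}}}}'
print(fnmatch(response, '*"missing"*'))    # True: both asterisks are wildcards
print(fnmatch(response, '*"missing"[*]'))  # False: [*] demands a literal '*', which never appears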

Blind Spot

The bug above is a surprising error because this is not how a human would have written the glob, and the glob (like almost all globs) is so simple that it is hard to imagine anyone being able to write the acronym generation & memorize the API URL correctly but then screw up a simple check of “does the response contain the string missing?” At least, this is surprising if you have not run into this problem with GPT-4 before, as I have repeatedly when writing Bash scripts; GPT-4 will not just make the error, but it seems totally unable to ‘see’ the error even when it is pointed out, and tends to thrash in confusion, making random guesses about what the problem could be.

I theorize that it is not a BPE tokenization issue (as so often), because this blind spot seems to happen in word-level problems as well, where tokenization couldn’t be a problem. The blind spot is, perhaps, related to internal sparsity of the GPT-4 model; I speculate that when the blind spot happens, it is because early layers have mistakenly erased apparently-irrelevant information in order to focus their attention on other, more important parts of the input, but this turns out to be wrong—so, the information gone irreversibly, the later layers are unable to see what is wrong with the input, which ‘looks as expected’, and are forced to speculate wildly and work on extremely unlikely hypotheses. Because it would happen early on in processing, it strikes me as analogous to the human eye’s blind spot—you cannot see what you cannot see, even if someone tells you about it and you think really hard. The agnosia is irreversible.

If this bug is the blind spot, then GPT-4 will not be able to diagnose it or fix it except by accident, and may go around in circles making the code worse and more complicated and implementing completely irrelevant solutions to confabulated problems. This is why the blind-spot bug is so pernicious, compared to the other errors GPT-4 can usually self-repair: it nearly guarantees failure, and the pathological response to it is extremely misleading, as it obscures the issue and can deeply confuse the user, who naturally expects the error to be something more spectacular, especially as GPT-4 has often been doing amazing things up until that point.

If you run into the blind spot, as often happens to me in Bash, Elisp, or regexps but much less often in Haskell or Python, there is nothing to do but acknowledge that you have hit the blind spot and that GPT-4 will never fix the bug on its own. If you are lucky, it will ‘only’ print out the same broken code repeatedly; more likely, GPT-4 will make things worse and break perfectly-good code, confabulating ever more crazy (and usually broken) ‘solutions’ to the problem it can’t see as it thrashes around in confusion (like some kind of hyper-advanced Magikarp). You have to fix the bug yourself.

We tell it that the code is wrong without specifying how (this works surprisingly often, as a sort of inner-monologue), and look at the next version:

Apologies for the confusion. Here is the corrected script:

```bash
#!/bin/bash
#set the range of acronym lengths
for length in {1..6}
do
  #define a function to recursively generate acronyms
  generate_acronyms() {
    local current=$1
    local length=$2
    if (( length == 0 ))
    then
      #query the Wikipedia API
      response=$(curl -s "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=$current")
      #check if the page exists
      if [[ $response == *'"missing"*' ]]
      then
        echo "$current"
      fi
    else
      for letter in {A..Z}
      do
        generate_acronyms "$current$letter" $((length-1))
      done
    fi
  }
  #call the function to generate acronyms of the current length
  generate_acronyms "" $length
done
```

This script uses a recursive function to generate acronyms one by one, reducing memory usage.
For each acronym, it queries the Wikipedia API. If the response contains the string "missing",
it means that no Wikipedia article with that title exists, so the script prints the acronym.

Again, running this script will take a long time due to the large number of acronyms and the
rate limits of the Wikipedia API.

Nope.

We tell it to try again. It confabulates a problem with the acronym generation. Try again! It confabulates a different problem, with curl. Also no. It guesses it needs to check for an omitted page ID in the JSON response. A 5th version cycles back to the acronym generation. So it cycled through 5 versions, unable to fix it (and adding in new problems along the way). I omit the broken variants here as being of no particular interest.

Getting tired of the blind-spot issue, I hint that there is a problem with the string munging, specifically. GPT-4 concludes that it is wrong about the formatting of the API response (which it is not; checking merely for ‘missing’ would be fine), and so this is why it needs to change the glob to:

    if [[ $response == *'"missing":""'* ]]; then echo $acronym; fi

As it happens, this glob is finally correct. It could use a bit of linting, according to ShellCheck, but it’s fine for a one-off. The biggest issue is that it doesn’t handle HTTP errors, so any network problems while requesting a TLA will result in it being skipped and treated as used. (GPT-4 is aware of this issue and will fix it if one asks simply “How can this be improved?”, producing a Python script which handles network errors and also does batched requests, which runs vastly faster than the Bash script does; see below.)

I ran the Bash script successfully overnight on 2023-09-29.

This revealed that the first unused TLA is “CQK”. Surprisingly, we didn’t get far through the TLA alphabet before finding the first unused TLA.

Additional unused TLAs early on include:

  • C: CQK CQQ CQZ CVY CWZ CXK CXU CXZ CYV CYY CZQ CZV

  • D: DKQ DKY DQJ DQQ DQW DUZ DVJ DVZ DXK DXQ DXW DYI DYJ DYQ DYV DYX DYY DYZ DZE DZK DZM DZV DZW DZX

  • E: EBZ ECY ECZ EDZ EEY EGK EGQ EHQ EHW EHY EHZ EIY EIZ EJD EJJ EJM EJQ EJX EJY EJZ EKJ EKQ EKX EKZ EOX EOZ EPY EQD EQJ EQK EQO EQQ EQZ ETQ ETY EUW EUY EUZ EVJ EVQ EWX EWY EWZ EXF EXG EXH EXJ EXQ EYG EYH EYI EYJ EYV EYX EYY EYZ EZB EZC EZJ EZK EZL EZM EZN EZP EZT

I provide the full 3–4 letter list as a newline-delimited text file:

Unused acronyms

Checking

Is this a false positive? We check to make sure—Wikipedia might simply have failed to make a redirect to “C.Q.K.” or something like that. But there is indeed nothing of importance in Wikipedia about a CQK: there are only OCR errors and the obscure abandoned name of an obscure hotel company which owns an obscure English manor. Googling ‘CQK’, I don’t see anything suggesting WP’s omission is wrong: there are no important CQKs, as the top hit is an obscure & dubious-sounding Chinese tea manufacturer, some restaurants, computer hardware part numbers, and miscellany, with hits quickly degenerating into spam/noise/variants. And there is no cqk.com (merely a domain squatter).

Similarly, ‘CQQ’ turns up nothing much in Wikipedia, and minor things in Google (Law Insider: “CQQ means a Certificate of Quality and Quantity for a consignment of EUP”, whatever an ‘EUP’ is, or the Free Dictionary’s “Courtepointe Québec Quilts (French; Canadian quilting association)”, which does at least seem to be a particular organization, however obscure).

So, these results look good.

Python

One possible error is that, as mentioned, the Bash script doesn’t handle network errors and checks for an affirmative response from the WP API about a page title being “missing”; if there was no response (because of network problems), then there would be no “missing” text in the response and the TLA would be skipped. So, if a TLA was not used and the check suffered a network error, it would be a false negative, and hypothetically, there could be an unused TLA earlier than CQK.
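
(The conceptual fix is simply to fail loudly rather than silently treating a failed request as ‘used’; a minimal Python sketch, assuming the requests library, which raises on network/HTTP errors instead of returning a misleading answer:)

import requests

def is_unused(acronym):
    """True if Wikipedia reports the title as missing; network/HTTP errors raise an
    exception instead of being silently counted as 'used'."""
    r = requests.get("https://en.wikipedia.org/w/api.php",
                     params={"action": "query", "format": "json", "titles": acronym},
                     timeout=30)
    r.raise_for_status()
    return any("missing" in page for page in r.json()["query"]["pages"].values())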

To double-check that ‘CQK’ is right and a network error didn’t spuriously drop an earlier unused TLA, after I finished this writeup, I went back to GPT-4 with the fixed Bash script and asked it, “How can this be improved?” GPT-4:

1. Use Wikipedia's batch API feature: Instead of making an API request for each acronym,
you can batch multiple titles in a single API request. This can significantly reduce
the number of requests and the total runtime. Wikipedia allows up to 50 titles per
API request for regular users.

2. Parallelism: The script is currently single-threaded.
You can use parallel processing to generate acronyms and query the API concurrently.
Tools like GNU Parallel can be useful for this purpose.

3. Caching: If you plan to
run this script multiple times, consider caching the results of the API requests to
avoid redundant queries.

4. Error Handling: The script lacks error handling. For
instance, it doesn't handle HTTP errors or JSON parsing errors. Add error handling
to make the script more robust.

5. Rate Limiting: Consider adding a delay between
API requests to avoid hitting Wikipedia's rate limit.

6. Use a programming language
better suited to this task: Bash has limitations when it comes to complex tasks like
this one. Consider using a language like Python or Ruby, which have better support
for tasks like JSON parsing, HTTP requests, and parallel processing. They also have
libraries for interacting with the Wikipedia API.

Here is a Python version that incorporates some of these suggestions:

```python
import requests
import itertools
import multiprocessing

def check_acronym(acronym):
    response = requests.get(f"https://en.wikipedia.org/w/api.php?action=query&format=json&titles={acronym}")
    if '"missing":""' in response.text:
        return acronym
    return None

def generate_acronyms(length):
    return [''.join(acronym) for acronym in itertools.product('ABCDEFGHIJKLMNOPQRSTUVWXYZ', repeat=length)]

def main():
    with multiprocessing.Pool() as pool:
        for length in range(1, 7):
            acronyms = generate_acronyms(length)
            for chunk in chunks(acronyms, 50):
                results = pool.map(check_acronym, chunk)
                for result in results:
                    if result is not None:
                        print(result)

if __name__ == "__main__":
    main()
```

This Python script uses the `itertools.product` function to generate acronyms, and
the `multiprocessing.Pool` class to check acronyms in parallel. It also divides the
acronyms into chunks of 50 to take advantage of Wikipedia's batch API feature.

This generates an error in Python 3: NameError: name 'chunks' is not defined, and GPT-4 corrects the script by adding in a definition for chunks:

def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

into the corrected version:

The error occurs because the `chunks` function isn't defined, and `f-string` formatting
isn't supported in Python 2.

Here is the corrected script:
import requests
import itertools
import multiprocessing

def check_acronym(acronym):
    response = requests.get("https://en.wikipedia.org/w/api.php?action=query&format=json&titles={}".format(acronym))
    if '"missing":""' in response.text:
        return acronym
    return None

def generate_acronyms(length):
    return [''.join(acronym) for acronym in itertools.product('ABCDEFGHIJKLMNOPQRSTUVWXYZ', repeat=length)]

def chunks(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

def main():
    with multiprocessing.Pool() as pool:
        for length in range(1, 7):
            acronyms = generate_acronyms(length)
            for chunk in chunks(acronyms, 50):
                results = pool.map(check_acronym, chunk)
                for result in results:
                    if result is not None:
                        print(result)

if __name__ == "__main__":
    main()

This version includes the `chunks` function, which divides a list into chunks of a
specified size. It also replaces the f-string with `str.format`, which is compatible
with Python 2. However, the script is still designed for Python 3 due to its use of
`multiprocessing.Pool` and `yield`.

This runs and, within a minute, generates the first hits for unused TLAs, starting with ‘CQK’:

$ python3 foo.py
# CQK
# CQQ
# CQZ
# CVY
# CWZ
# CXK
# CXU
# CXZ
# CYV
# CYY
# CZQ
# CZV
# DKQ
# DKY
# DQJ
# DQQ
# DQW
# DUZ
# DVJ
# DXK
# DXQ
# DYI
# DYJ
# DZE
# DZK
# DZM
# DZW
# DZX
# EBZ
# EDZ
# EEY
# EGQ
# EHQ
# EHW
# EIY
# EIZ
# EJD
# EJM
# EJQ
# EJX
# EJY
# EJZ
# EKX
# EKZ
# EOX
# EOZ
# EQK
# EQO
# EQQ
# ETY
# EUW
# EVJ
# EWZ
# EXF
# EXG
# EXH
# EYG
# EYH
# EYI
# EYJ
# EYX
# EYY
# EYZ
# EZB
# EZJ
# EZV
# ...
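
(Strictly speaking, despite GPT-4's commentary, this script still queries one title per request; it merely runs 50 requests in parallel. Genuinely using the batch API would mean joining up to 50 titles with '|' in a single query. A minimal sketch of that, using the same 'missing'-key convention shown earlier:)

import requests

def missing_titles(titles):
    """Check up to 50 titles in one Wikipedia API request; return those reported as missing."""
    assert len(titles) <= 50
    r = requests.get("https://en.wikipedia.org/w/api.php",
                     params={"action": "query", "format": "json", "titles": "|".join(titles)},
                     timeout=30)
    r.raise_for_status()
    return [p["title"] for p in r.json()["query"]["pages"].values() if "missing" in p]

print(missing_titles(["AAA", "CQK", "CQQ"]))  # ['CQK', 'CQQ'] (order may vary)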

Patterns

Sparsity

Some statistics:

$ wc --lines 2023-09-30-gwern-wikipedia-unusedacronyms-threeletterandfourletter.txt
# 395,568

$ grep -E '^[A-Z][A-Z][A-Z]$' 2023-09-30-gwern-wikipedia-unusedacronyms-threeletterandfourletter.txt | wc --lines
# 2,684
R> round(digits=2, 2684 / (26^3))
# [1] 0.15

$ grep -E '^[A-Z][A-Z][A-Z][A-Z]$' 2023-09-30-gwern-wikipedia-unusedacronyms-threeletterandfourletter.txt | wc --lines
# 392,884
R> round(digits=2, 392884 / (26^4))
# [1] 0.86

Interestingly, TLAs are not at all sparse, with <15% unused; but as expected, FLAs are sparse, with the vast majority unused.

Letter Frequency Effect

There are clear patterns with vowels vs consonants, and letter frequency in particular: even just looking at the C–E TLAs above, you can see that consonants and rare letters like W–Z are overrepresented.

Is this all that is going on? I investigated in R, using GPT-4 again. (This analysis is the kind of finicky string-munging & data-frame processing I find most tedious in R, and it is much more pleasant to leave it to GPT-4; GPT-4’s R code never seems to hit the ‘blind spot’, and it is usually able to fix code given an error message.)

We load the unused-TLA dataset, and turn it into a dataset of all TLAs, labeled by whether they are unused or not:

tla <- read.table("https://gwern.net/doc/wikipedia/2023-09-30-gwern-wikipedia-unusedacronyms-threeletterandfourletter.txt")
head(tla)
#    V1
# 1 CQK
# 2 CQQ
# 3 CQZ
# 4 CVY
# 5 CWZ
# 6 CXK
tla$V2 <- as.character(tla$V1)

letters <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M",
            "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z")

acronyms <- expand.grid(letters, letters, letters)
acronyms_vector <- apply(acronyms, 1, paste0, collapse = "")
head(acronyms_vector); length(acronyms_vector)
# [1] "AAA" "BAA" "CAA" "DAA" "EAA" "FAA"
# [1] 17576

# Function to generate acronyms of a given length:
generate_acronyms <- function(len) {
  acronyms <- expand.grid(rep(list(letters), len))
  apply(acronyms, 1, paste0, collapse = "")
}

# Generate 3- and 4-letter acronyms:
acronyms_vector <- unlist(lapply(3:4, generate_acronyms))

# Create data frame and update 'Missing':
acronyms_df <- data.frame(V1 = acronyms_vector, V2 = acronyms_vector, Missing = FALSE, stringsAsFactors = FALSE)
acronyms_df$Missing[acronyms_df$V2 %in% tla$V2] <- TRUE

## Add a 'Missing' column to 'tla'
tla$Missing <- TRUE

## Combine the two data-frames
result <- rbind(tla, acronyms_df[!acronyms_df$V2 %in% tla$V2, ])
result <- result[order(result$V2), ] # sort in alphabetical order
summary(result)
#       V1              V2             Missing
# AAAB   :     1   Length:410460      Mode :logical
# AAAG   :     1   Class :character   FALSE:14892
# AAAK   :     1   Mode  :character   TRUE :395568
# AAAO   :     1
# AAAQ   :     1
# AAAU   :     1
# (Other):410454
head(result); tail(result)
#           V1   V2 Missing
# 1100000  AAA  AAA   FALSE
# 2685    AAAB AAAB    TRUE
# 2686    AAAG AAAG    TRUE
# 2687    AAAK AAAK    TRUE
# 2688    AAAO AAAO    TRUE
# 2689    AAAQ AAAQ    TRUE
#          V1   V2 Missing
# 395563 ZZZT ZZZT    TRUE
# 395564 ZZZU ZZZU    TRUE
# 395565 ZZZV ZZZV    TRUE
# 395566 ZZZW ZZZW    TRUE
# 395567 ZZZX ZZZX    TRUE
# 395568 ZZZY ZZZY    TRUE
## Visualize missingness:
result_tla <- result[nchar(result$V2) == 3, ]
result_fla <- result[nchar(result$V2) == 4, ]

index <- which(result_tla$V2 == "CQK")
proportion <- index / nrow(result_tla) * 100; proportion
# [1] 10.1217569

dimensions <- round(sqrt(length(result_tla$Missing))); dimensions # 133
png(file="~/wiki/doc/wikipedia/2023-11-04-gwern-tla-missingness.png", width = 2400, height = 2400)
image(t(matrix(rev(result_tla$Missing), nrow=dimensions, ncol=dimensions, byrow=TRUE)), col=gray.colors(2))
invisible(dev.off())

dimensions <- round(sqrt(length(result_fla$Missing))); dimensions # 676
png(file="~/wiki/doc/wikipedia/2023-11-04-gwern-fla-missingness.png", width = 2400, height = 2400)
image(t(matrix(rev(result_fla$Missing), nrow=dimensions, ncol=dimensions, byrow=TRUE)), col=gray.colors(2))
invisible(dev.off())

Visualization of missingness of TLAs, A–Z (wrapped into a 133×133 grid; read: top to bottom, left to right); the first dot at top-left 10% of the way through is “CQK”.
Visualization of missingness of four-letter-acronyms, A–Z (wrapped into a 676×676 grid; read: top to bottom, left to right).

We would like to examine per-letter properties, like all TLAs with a ‘Z’ in them, so we set up 26 dummy variables for whether each letter is present:

library(stringr)
for (letter in letters) {
  result[[letter]] <- str_detect(result$V2, fixed(letter))
}
head(result, n=1)
#    V1  V2 Missing     A     B    C     D     E     F     G     H     I     J
# 1 CQK CQK    TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#      K     L     M     N     O     P    Q     R     S     T     U     V     W
#   TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#       X     Y     Z
#   FALSE FALSE FALSE

With a Missing boolean variable & the alphabetical dummy variables set up, we can do a logistic regression, where each acronym is a single datapoint and the letters it contains are the covariates:

## Define the formula for the logistic regression model, without an intercept (as an acronym always has letters):
formula <- as.formula(paste("Missing ~ 0 +", paste(letters, collapse = " + ")))

## Fit the logistic regression model
model_no_intercept <- glm(formula, data = result, family = binomial(link = "logit"))
summary(model_no_intercept)
# ...Deviance Residuals:
#        Min          1Q      Median          3Q         Max
# −2.8134706   0.0885086   0.1136063   0.1540683   2.7432638
#
# Coefficients:
#          Estimate Std. Error   z value   Pr(>|z|)
# AFALSE −6.5878352  0.0568884 −115.8028 < 2.22e-16
# ATRUE  −4.0755472  0.0479506  −84.9948 < 2.22e-16
# BTRUE   2.7764680  0.0322792   86.0142 < 2.22e-16
# CTRUE   2.4976112  0.0317669   78.6231 < 2.22e-16
# DTRUE   2.7903755  0.0324820   85.9053 < 2.22e-16
# ETRUE   2.8485807  0.0328888   86.6124 < 2.22e-16
# FTRUE   2.8527917  0.0327796   87.0295 < 2.22e-16
# GTRUE   2.9681415  0.0333202   89.0794 < 2.22e-16
# HTRUE   2.9743513  0.0333339   89.2290 < 2.22e-16
# ITRUE   2.8127265  0.0328846   85.5333 < 2.22e-16
# JTRUE   3.2605341  0.0352484   92.5016 < 2.22e-16
# KTRUE   2.8210890  0.0334714   84.2836 < 2.22e-16
# LTRUE   2.8388725  0.0328000   86.5511 < 2.22e-16
# MTRUE   2.7014280  0.0321802   83.9469 < 2.22e-16
# NTRUE   2.8776599  0.0330334   87.1137 < 2.22e-16
# OTRUE   2.9512577  0.0334268   88.2902 < 2.22e-16
# PTRUE   2.7394627  0.0324478   84.4268 < 2.22e-16
# QTRUE   3.6899056  0.0393598   93.7482 < 2.22e-16
# RTRUE   2.8297209  0.0329451   85.8919 < 2.22e-16
# STRUE   2.5292698  0.0319661   79.1235 < 2.22e-16
# TTRUE   2.7727895  0.0325522   85.1797 < 2.22e-16
# UTRUE   3.0611813  0.0338695   90.3817 < 2.22e-16
# VTRUE   3.1443269  0.0344698   91.2199 < 2.22e-16
# WTRUE   2.8428509  0.0337547   84.2208 < 2.22e-16
# XTRUE   3.5003099  0.0375402   93.2417 < 2.22e-16
# YTRUE   3.2994364  0.0356499   92.5510 < 2.22e-16
# ZTRUE   3.5370118  0.0378513   93.4448 < 2.22e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 569018.38  on 410460  degrees of freedom
# Residual deviance:  88916.78  on 410433  degrees of freedom
# AIC: 88970.78

As expected, the letter makes a difference, and rarer letters like ‘Z’ or ‘J’ are especially likely to correlate with unuse.
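
(As a rough illustration of how to read these coefficients, assuming the coding shown above where each TRUE term contributes its coefficient and AFALSE is the baseline term for acronyms without an ‘A’: the predicted log-odds of ‘CQK’ being unused would be about −6.59 + 2.50 + 3.69 + 2.82 ≈ 2.42, ie. a probability of roughly 0.92.)

from math import exp
logit = -6.5878 + 2.4976 + 3.6899 + 2.8211   # AFALSE + CTRUE + QTRUE + KTRUE for "CQK"
print(1 / (1 + exp(-logit)))                 # ≈ 0.918: predicted probability of being unused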

We could make an absolute-count plot, but that wouldn’t be a good visualization because unused/used must sum to 100%, so it would be better to do a relative-proportion plot, like this:

letter_df$V1 <- as.character(letter_df$V1)
## Filter out four-letter acronyms
three_letter_df <- letter_df[nchar(letter_df$V1) == 3, ]

## Calculate counts
counts <- table(three_letter_df$Letter, three_letter_df$Missing)
## Calculate relative percentages
percentages <- prop.table(counts, 1)
## Convert to data frame for plotting
percentages_df <- as.data.frame.table(percentages, responseName = "Percentage")

## Plot
library(ggplot2)
ggplot(percentages_df, aes(x = Var1, y = Percentage, fill = Var2)) +
  geom_col() +
  theme_minimal(base_size = 50) +
  theme(legend.position = "none") +
  labs(x = "Letter", y = "Percentage", fill = "Missing")

Bar plot of unused fraction, by alphabetic letter (A–Z): Rarer letters (eg. ‘Z’) more likely to be unused in TLAs, but usage not fully explained by letter frequency (eg. ‘A’).

The plot makes sense, but some things are anomalous: like the letter ‘A’—it is perhaps the rarest of all letters here, and yet, if any letter were to be rarest, it should be ‘E’, one would think, because everyone knows ‘E’ is the most common letter in the English language. What is the correlation with letter frequency? We can take a letter-frequency list (the first one I found in Google), and look at the correlation. We use a Kendall rank correlation because there is no particular reason to think that the magnitudes or distributions of either the logistic-regression coefficients or the letter frequencies are normally distributed, and we just think that there should be an inverse correlation when ordered: rarer letters = more-unused TLAs.

## https://pi.math.cornell.edu/~mec/2003-2004/cryptography/subs/frequencies.html
frequencies <- read.table(stdin(), header=TRUE, colClasses=c("factor","integer","character","numeric"))
Letter  Count       Letter  Frequency
E   21912       E   12.02
T   16587       T   9.10
A   14810       A   8.12
O   14003       O   7.68
I   13318       I   7.31
N   12666       N   6.95
S   11450       S   6.28
R   10977       R   6.02
H   10795       H   5.92
D   7874        D   4.32
L   7253        L   3.98
U   5246        U   2.88
C   4943        C   2.71
M   4761        M   2.61
F   4200        F   2.30
Y   3853        Y   2.11
W   3819        W   2.09
G   3693        G   2.03
P   3316        P   1.82
B   2715        B   1.49
V   2019        V   1.11
K   1257        K   0.69
X   315         X   0.17
Q   205         Q   0.11
J   188         J   0.10
Z   128         Z   0.07

## Put in alphabetic order:
frequencies <- frequencies[order(frequencies$Letter),]

letter_coefficients <- coef(model_no_intercept)
letter_coefficients[-1]
#       ATRUE       BTRUE       CTRUE       DTRUE       ETRUE       FTRUE
# −4.07554725  2.77646803  2.49761116  2.79037554  2.84858066  2.85279167
#       GTRUE       HTRUE       ITRUE       JTRUE       KTRUE       LTRUE
#  2.96814151  2.97435134  2.81272652  3.26053408  2.82108898  2.83887250
#       MTRUE       NTRUE       OTRUE       PTRUE       QTRUE       RTRUE
#  2.70142800  2.87765993  2.95125774  2.73946268  3.68990557  2.82972088
#       STRUE       TTRUE       UTRUE       VTRUE       WTRUE       XTRUE
#  2.52926976  2.77278951  3.06118128  3.14432686  2.84285087  3.50030988
#       YTRUE       ZTRUE
#  3.29943643  3.53701178
cor(letter_coefficients[-1], frequencies$Frequency)
# [1] −0.357640466

Order & Letter-Frequency Effects

What is omitted from our model? Going back and comparing the frequency list to the earlier bar plot, it looks suspiciously like letters early in the alphabet (not just ‘A’) are overrepresented, and then ‘Z’ is inflated (perhaps because it is the last letter and has many connotations).

How can we encode ‘early letters’, as contrasted with ‘letter frequency’? We can add into the logistic regression a variable for the earliest/‘smallest’ letter an acronym has, a MinIndex. (This would help pick up trends from people abusing ‘A’ to sort first in any list or sublist.) And we can encode the letter frequencies by simply averaging them, as an AvgFrequency. (Obviously imperfect, but it is also unclear what the right thing to do would be: instead of an arithmetic mean, a harmonic mean? Something else entirely?) Then we can add them to the regression as control variables to try to explain away these effects:

## Add a column 'MinIndex' to 'result' that contains the smallest index of the letters in each acronym
result$MinIndex <- apply(result[,letters], 1, function(x) min(which(x)))
## Map the letters to their frequencies
letter_to_frequency <- setNames(frequencies$Frequency / 100, frequencies$Letter)

## Add a column 'AvgFrequency' to 'result' that contains the average frequency of the letters in each acronym
## Compute the frequency of each letter in each acronym
acronym_frequencies <- lapply(strsplit(result$V2, ""), function(acronym) letter_to_frequency[acronym])
## Compute the average frequency for each acronym
result$AvgFrequency <- sapply(acronym_frequencies, mean)

## Archive results to `/doc/wikipedia/2023-10-01-gwern-wikipedia-unusedacronyms-processeddata.csv.xz`:
write.csv(result, file="doc/wikipedia/2023-10-01-gwern-wikipedia-unusedacronyms-processeddata.csv", row.names=FALSE)

## Update the formula to include 'MinIndex' & 'AvgFrequency' as covariates:
formulaControlled <- as.formula(paste("Missing ~ MinIndex + AvgFrequency +", paste(letters, collapse = " + ")))
## Fit the logistic regression model
modelControlled <- glm(formulaControlled, data = result, family = binomial(link = "logit"))
summary(modelControlled)
# ...Deviance Residuals:
#        Min          1Q      Median          3Q         Max
# −2.9069303   0.0891105   0.1128827   0.1500609   2.8642110
#
# Coefficients:
#                 Estimate  Std. Error   z value   Pr(>|z|)
# (Intercept)  −6.90634502  0.09250317 −74.66063 < 2.22e-16
# MinIndex      0.04029557  0.00370886  10.86468 < 2.22e-16
# AvgFrequency −4.25188554  1.42341896  −2.98709  0.0028164
# ATRUE         2.88138189  0.04677432  61.60179 < 2.22e-16
# BTRUE         3.01005967  0.04276745  70.38203 < 2.22e-16
# CTRUE         2.71712516  0.03922165  69.27616 < 2.22e-16
# DTRUE         2.99290423  0.03752173  79.76456 < 2.22e-16
# ETRUE         3.13077197  0.05264291  59.47187 < 2.22e-16
# FTRUE         2.97227342  0.03637453  81.71305 < 2.22e-16
# GTRUE         3.05866771  0.03637991  84.07573 < 2.22e-16
# HTRUE         3.09923276  0.03586679  86.40954 < 2.22e-16
# ITRUE         2.94019473  0.03734308  78.73466 < 2.22e-16
# JTRUE         3.26929982  0.04017641  81.37361 < 2.22e-16
# KTRUE         2.82703715  0.03688530  76.64401 < 2.22e-16
# LTRUE         2.87906510  0.03307471  87.04734 < 2.22e-16
# MTRUE         2.71268066  0.03290704  82.43467 < 2.22e-16
# NTRUE         2.94220314  0.03616480  81.35544 < 2.22e-16
# OTRUE         3.02043913  0.03804621  79.38869 < 2.22e-16
# PTRUE         2.72109474  0.03383860  80.41393 < 2.22e-16
# QTRUE         3.63855049  0.04328351  84.06320 < 2.22e-16
# RTRUE         2.86279675  0.03446914  83.05390 < 2.22e-16
# STRUE         2.56086404  0.03376525  75.84317 < 2.22e-16
# TTRUE         2.84462320  0.04070488  69.88409 < 2.22e-16
# UTRUE         3.04398849  0.03417759  89.06388 < 2.22e-16
# VTRUE         3.10148262  0.03683701  84.19474 < 2.22e-16
# WTRUE         2.81549885  0.03470352  81.13006 < 2.22e-16
# XTRUE         3.44003115  0.04137750  83.13773 < 2.22e-16
# YTRUE         3.26779943  0.03654490  89.41876 < 2.22e-16
# ZTRUE         3.47511758  0.04188641  82.96528 < 2.22e-16
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 128014.32  on 410459  degrees of freedom
# Residual deviance:  88784.22  on 410431  degrees of freedom
# AIC: 88842.22

letter_coefficients2 <- coef(modelControlled)
cor(letter_coefficients2[-c(1:3)], frequencies$Frequency, method="kendall")
# [1] −0.28

> 1 - (0.28 / 0.35)
# [1] 0.2

These variables do help and are tapping into letter-frequency somewhat (because the rank correlation of the ‘frequency-adjusted’ coefficients shrinks by ~20%), suggesting that both an ‘earlier letter’ effect & English letter-frequencies are at play in correlating with unused TLAs. But there is still much variance unexplained and a non-zero rank correlation, so either these are not good ways of quantifying the two effects or there are still important variables lurking.

Further Work

n-grams. The next step in an analysis might be to adopt the n-gram framework, and go from single-letter analysis (unigrams) to pairs of letters (bigrams), as that could help pick up subtler patterns (eg. grammatical patterns in the pairs of words that make up acronyms).
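
As a rough illustration (not something run for this essay), one way to do that would be to expand the design matrix from 26 letter indicators to all 676 bigram indicators and refit the same kind of logistic regression; the sketch below assumes the same result data frame as above, with the acronym string in V2 and the outcome in Missing:

## All 676 possible letter bigrams:
all_bigrams <- as.vector(outer(LETTERS, LETTERS, paste0))
## Adjacent letter pairs of one acronym, eg. "CQK" gives "CQ" and "QK":
get_bigrams <- function(s) substring(s, 1:(nchar(s)-1), 2:nchar(s))

## Indicator matrix: does acronym i contain bigram j? (simple but slow for ~400k rows)
bigram_matrix <- t(sapply(result$V2, function(s) as.integer(all_bigrams %in% get_bigrams(s))))
colnames(bigram_matrix) <- all_bigrams

## Drop bigrams that never occur, then refit a logistic regression on the bigram indicators:
bigram_df <- data.frame(Missing = result$Missing, bigram_matrix[, colSums(bigram_matrix) > 0])
modelBigram <- glm(Missing ~ ., data = bigram_df, family = binomial(link = "logit"))
summary(modelBigram)

Whether the bigrams add anything beyond the single letters could then be judged by comparing the residual deviances or AICs of the two models.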

Simulation? One could also try to find some way to simulate TLA datasets; I couldn’t figure out a way to generatively simulate, resample, or bootstrap this dataset, because all the obvious ways to do so either require extra information (like how many instances of a given TLA there were, or how many TLAs in total were ‘generated’), or are simply wrong (at least at the unigram level). If you try to simulate a set of hypothetical acronyms based on the letter frequencies, then you need to know how many acronyms in total are sampled, not merely whether ≥1 instance existed in Wikipedia; if you sampled n = 1 billion, then no TLA would be missing, and if you sampled n = 1, almost all of them would be missing, so that does not work. If you try to shuffle the set of missing/non-missing labels, that by definition recreates the same single-letter statistics. And if you try to resample a smaller dataset bootstrap-style, it is not clear what that would even mean, even if it were valid in some sense.
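
To make the first problem concrete, here is a toy sketch (hypothetical, not part of the analysis above) of the naive unigram simulation: the fraction of ‘missing’ TLAs it produces is driven almost entirely by the unknown number of draws n, which the Wikipedia existence data alone cannot tell us.

## Draw n random TLAs with letters sampled by English letter frequency
## (reusing the alphabetically-sorted frequencies table from earlier),
## and report what fraction of the 17,576 possible TLAs was never drawn:
simulate_missing_fraction <- function(n, letter_probs) {
  draws <- paste0(sample(LETTERS, n, replace = TRUE, prob = letter_probs),
                  sample(LETTERS, n, replace = TRUE, prob = letter_probs),
                  sample(LETTERS, n, replace = TRUE, prob = letter_probs))
  1 - length(unique(draws)) / 26^3
}
letter_probs <- frequencies$Frequency / 100
sapply(c(1e3, 1e4, 1e5, 1e6), simulate_missing_fraction, letter_probs = letter_probs)
## the missing fraction slides from nearly 1 down toward 0 purely as a function of n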

Trends? It would also be mildly interesting to revisit this in a few years to see how acronym usage increases over time. Because notability is irreversible, acronyms should accumulate over time: at what point will people be forced into four-letter acronyms? Or do they just bite the bullet of overlaps because four-letter ones are so much worse? Or perhaps ‘natural’ TLA behavior is that the remainder are so hard to think of (without deliberately targeting unused ones) that the trend is too slow to matter?

But I’ll leave it here.

So now I know: there are TLAs available for one to use, even if the specific TLA one wants is probably taken. As TLAs run out, there may be steadily more pressure to figure out how to use the remaining 15%.

You’ll have to look at the list of unused TLAs, and then figure out how to make the backronym work, while consulting the back pages of a dictionary… Perhaps one could ask GPT-4 for help?

I’m founding a startup for selling psychoactive cat treats like silvervine or catnip or Valerian root extract to pet owners. I want to name it ‘CQK’, because no one has used that acronym for anything yet. But I can’t come up with a good backronym for it. Please come up with as many possible backronyms for ‘CQK’ as you can.

  1. Catnip Quality Kingdom

  2. Cat Quirk Kit

  3. Catnip Quotient Key

  4. Cat Quality Kicks

  5. Catnip Quick Kit

  6. Cats’ Quixotic Knead

  7. Catnip Quest Kit

  8. Cats’ Quality Knead

  9. Cat Quake Kit

  10. Cats’ Quenching Knack

  11. Cats’ Quintessential Kicks

  12. Catnip Quarantine Kit

  13. Cats’ Quirk Kingdom

  14. Catnip Quality Keepsake

  15. Cat Quench Kit

  16. Catnip Quantum Key

  17. Cats’ Quirky Knick-Knacks

  18. Catnip Quirky Kingdom

  19. Cats’ Quality Keeps

  20. Catnip Quenching Kit

  21. Cat Quirk Kingdom

  22. Catnip Quality Kinetics

  23. Cat Quirk Kiosk

  24. Cats’ Quotient Kit

  25. Catnip Quantum Kiosk

  26. Cats’ Quintessential Knick-Knacks

  27. Catnip Quirk Key

  28. Cat Quotient Kit

  29. Catnip Quality Kiosk

  30. Cats’ Quake Kit

Come up with another list which doesn’t use ‘cat’ or ‘catnip’ anywhere.

  1. Cognitive Quirk Kit

  2. Companions’ Quality Kingdom

  3. Creature Quirk Kit

  4. Calm Quotient Key

  5. Companions’ Quenching Knack

  6. Cognitive Quantum Key

  7. Calming Quality Kicks

  8. Companion Quirk Kingdom

  9. Cognitive Quality Kinetics

  10. Calming Quench Kit

  11. Companions’ Quotient Kit

  12. Cognitive Quantum Kiosk

  13. Calm Quirk Kiosk

  14. Companion Quench Kit

  15. Calming Quantum Key

  16. Companions’ Quintessential Knick-Knacks

  17. Cognitive Quirk Key

  18. Calm Quality Kingdom

  19. Companion Quality Kinetics

  20. Calming Quirk Kingdom

  21. Companions’ Quake Kit

  22. Cognitive Quality Kicks

  23. Calm Quotient Kit

  24. Companion Quantum Key

  25. Calming Quotient Kit

  26. Companions’ Quirk Key

  27. Cognitive Quench Kit

  28. Calm Quirk Kingdom

  29. Companion Quantum Kiosk

  30. Calming Quake Kit

Which are the best ones that mention cats or catnip? Which are the best ones that don’t mention those?

Among the ones that mention cats or catnip:

  1. Catnip Quality Kingdom: Suggests a realm of high-quality catnip products.

  2. Cat Quirk Kit: Implies a kit to engage and entertain the unique behaviors of cats.

  3. Cats’ Quintessential Kicks: Evokes the essential fun and excitement for cats.

  4. Catnip Quantum Key: Suggests a critical or unlocking component in catnip-based products.

  5. Cats’ Quirk Kingdom: Implies a world dedicated to the idiosyncrasies of cats.

Among the ones that don’t mention cats or catnip:

  1. Cognitive Quirk Kit: Suggests a kit designed to engage and stimulate mental processes.

  2. Companions’ Quality Kingdom: Implies a realm of high-quality products for pets.

  3. Calm Quotient Key: Suggests a product that contributes to a calm and peaceful pet.

  4. Companion Quirk Kingdom: Evokes a world dedicated to the idiosyncrasies of pets.

  5. Cognitive Quality Kinetics: Suggests products designed for mental stimulation and physical activity.

(I like “Cat Quirk Kit” best, as one could complement the cat psychoactives with other things, like puzzle treats, or laser pointers, or optical illusion printouts triggering the illusory contour effect & the rotating-snake illusion.)

What if we do want to consider acronyms with numbers in them? We already saw the results for the only-alphabetical ones earlier, so we need to consider just the subset of number+letter acronyms which have at least 1 number in them. There are:

  1. For length 1, there are 10 acronyms (the digits 0-9).

  2. For length 2, there are 36² − 26² = 1,296 − 676 = 620 acronyms.

  3. For length 3, there are 36³ − 26³ = 46,656 − 17,576 = 29,080 acronyms.

  4. Adding these up: there are a total of 10 + 620 + 29,080 = 29,710 (a quick check follows this list).
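
As a quick sanity check on those totals, recomputed in R:

sum(sapply(1:3, function(n) 36^n - 26^n))
# [1] 29710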

Two slight modifications to the previous script will give us the set of 21,852 unused alphanumerical acronyms: adding numbers to the generator & skipping target acronyms without numbers:

## set the range of acronym lengths
for length in {1..3}
do
  ## define a function to recursively generate acronyms
  generate_acronyms() {
    local current=$1
    local length=$2
    if (( length == 0 ))
    then
      ## skip purely-alphabetical acronyms (those were already checked earlier)
      if [[ ! "$current" =~ [0-9] ]]
      then
        return
      fi
      ## query the Wikipedia API
      response=$(curl -s "https://en.wikipedia.org/w/api.php?action=query&format=json&titles=$current")
      ## check if the page exists
      if [[ $response == *'"missing"'* ]]
      then
        echo "$current"
      fi
    else
      for letter in {A..Z} {0..9}
      do
        generate_acronyms "$current$letter" $((length-1))
      done
    fi
  }
  ## call the function to generate acronyms of the current length
  generate_acronyms "" $length
done

The first unused one is the rather surprising AA0 (the full list begins AA0 AD0 AE0 AE5 AE6 AE7 AE8 AE9 AF0 AF3 AF5 AF6 AF7 AF8 AF9 AG1 AG2 AG4 AG6 AG7 AG8 AG9 AH0 AI0 AJ0 AJ1 AJ3 AJ5 AJ7 AJ8 AJ9, etc.). Really? No one has used such a short, simple TLA, which would sort in front of almost everything, even ones like ‘AAA’? Apparently! Neither WP nor Google shows anything significant for ‘AA0’.

So, ‘AA0’ would be a good startup name, if anyone needs one.
