Utilizing Arc to decode Venter’s secret DNA watermark

Lately Craig Venter (who decoded the human genome) created a synthetic bacterium. The J. Craig Venter Institute (JCVI) took a bacterium’s DNA sequence as a pc file, modified it, made bodily DNA from this sequence, and caught this DNA right into a cell, which then reproduced below management of the brand new DNA to create a brand new bacterium. This can be a actually cool outcome, because it exhibits you’ll be able to create the DNA of an organism totally from scratch. (I would not precisely name it artificial life although, because it does require an present cell to get going.) Though this venture took 10 years to finish, I am certain it is solely a matter of time earlier than it is possible for you to to ship an information file to some firm and get the ensuing cells despatched again to you.
One fascinating function of this artificial bacterium is it contains 4 “watermarks”, particular sequences of DNA that show this bacterium was created from the information file, and isn’t pure. Nonetheless, they did not reveal how the watermarks had been encoded. The DNA sequences had been printed (GTTCGAATATTT and so forth), however methods to get the which means out of this was left a puzzle. For detailed background on the watermarks, see Singularity Hub. I broke the code (as I described earlier) and located the names of the authors, some quotations, and the next hidden net web page. This looks like science fiction, however it’s true. There’s truly a brand new bacterium that has an internet web page encoded in its DNA:
I contacted the JCVI utilizing the hyperlink and discovered I used to be the thirty first particular person to interrupt the code, which is a bit disappointing. I have not seen particulars elsewhere on how the code works, so I am going to present the main points. (Spoilers forward if you wish to break the code your self.) I initially used Python to interrupt the code, however since that is nominally an Arc weblog, I am going to present how it may be performed utilizing the Arc language.
The oversimplified introduction to DNA
Maybe I ought to give the crash course in DNA at this level. The genetic directions for all organisms are contained in very, very lengthy molecules referred to as DNA. For our functions, DNA could be described as a sequence of 4 totally different letters: C, G, A, and T. The human genome is 3 billion base pairs in 23 chromosome pairs, so your DNA could be described as a pair of sequences of three billion C’s, G’s, A’s, and T’s.
Watson and Crick received the Nobel Prize for determining the construction of DNA and the way it encodes genes. Your physique is made up of vital issues comparable to enzymes, that are made up of carefully-constructed proteins, that are made up of particular sequences of 18 totally different amino acids. Your genes, that are elements of the DNA, specify these amino acid sequences. Every three DNA bases specify a specific amino acid within the sequence. As an illustration, the DNA sequence CTATGG specifies the amino acids leucine and tryptophan as a result of CTA signifies leucine, and TGG signifies tryptophan. Crick and Watson and Rosalind Franklin found the genetic code that associates a specific amino acid with every of the 64 doable DNA triples. I am omitting a number of fascinating particulars right here, however the important thing level is that every three DNA symbols specifies one amino acid.
Encoding textual content in DNA
Now, the puzzle is to determine how JCVI encoded textual content within the artificial bacterium’s DNA. The plain extension is as a substitute of assigning an amino acid to every of the 64 DNA triples, assign a personality to it. (I ought to admit that this was solely apparent to me on reflection, as I defined in my earlier blog posting.) Thus if the DNA sequence is GTTCGA, then GTT can be one letter and CGA can be one other, and so forth. Now we simply want to determine what letter goes with every DNA triple.
If we all know some textual content and the DNA that goes with it, it’s easy to crack the code that maps between the characters and the DNA triples. As an illustration, if we occur to know that the textual content “LIFE” corresponds to the DNA “AAC CTG GGC TAA”, then we will conclude that AAA goes to L, CTG goes to I, GGC goes to F, and TAA goes to E. However how will we get began?
Conveniently, the Singularity Hub article offers the quotes that seem within the DNA, as reported by JCVI:
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." "SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE." "WHAT I CANNOT BUILD, I CANNOT UNDERSTAND."
We will strive matching the quotes in opposition to every place within the DNA and see if there may be any place that works. A match can fail in two methods. First, if the identical DNA triple corresponds to 2 totally different letters, one thing should be flawed. As an illustration, if we attempt to match “LIFE” in opposition to “AAC CTG GGC AAC”, we might conclude that AAC means each L and E. Second, if the identical letter corresponds to 2 DNA triples, we will reject it. As an illustration, if we attempt to match “ERR” in opposition to “TAA CTA GTC”, then each CTA and GTC would imply R. (Apparently, the actual genetic code has this second kind of ambiguity. As a result of there are solely 18 amino acids and 64 DNA triples, many DNA triples point out the identical amino acid.)
Utilizing Arc to interrupt the code
To interrupt the code, first download the DNA sequences of the 4 watermarks and assign them to variables w1, w2, w3, w4. (Obtain full code here.)
(= w1 "GTTCGAATATTTCTATAGCTGTACA...") (= w2 "CAACTGGCAGCATAAAACATATAGA...") (= w3 "TTTAACCATATTTAAATATCATCCT...") (= w4 "TTTCATTGCTGATCACTGTAGATAT...")
Subsequent, lets break the 4 watermarks into triples by first changing to an inventory after which taking triples of size 3, which we name t1 by means of t4:
(def strtolist (s) (accum accfn (every x s (accfn (string x))))) (def tripleize (s) (map string (tuples (strtolist s) 3))) (= t1 (tripleize w1)) (= t2 (tripleize w2)) (= t3 (tripleize w3)) (= t4 (tripleize w4))
The subsequent operate is a very powerful. It takes triples and textual content, matches the triples in opposition to the textual content, and generates a desk that maps from every triple to the corresponding characters. If it runs into an issue (both two characters assigned to the identical triple, or the identical character assigned to 2 triples) then it fails. In any other case it returns the code desk. It makes use of tail recursion to scan by means of the triples and enter textual content collectively.
; Match the characters in textual content in opposition to the triples. ; codetable is a desk mapping triples to characters. ; offset is the offset into textual content ; Return codetable or nil if no match. (def matchtext (triples textual content (o codetable (desk)) (o offset 0)) (if (>= offset (len textual content)) codetable ; success (with (journey (automobile triples) ch (textual content offset)) ; if ch is assigned to one thing else within the desk ; then not a match (if (and (mem ch (vals codetable)) (isnt (codetable journey) ch)) nil (empty (codetable journey)) (do (= (codetable journey) ch) ; new char (matchtext (cdr triples) textual content codetable (+ 1 offset))) (isnt (codetable journey) ch) nil ; mismatch (matchtext (cdr triples) textual content codetable (+ 1 offset))))))
Lastly, the operate findtext finds a match anyplace within the triple record. In different phrases, the earlier operate matchtext assumes the beginning of the triples corresponds to the beginning of the textual content. However findtext tries to discover a match at any level within the triples. It calls matchtext, and if there isn’t any match then it drops the primary triple (utilizing cdr) and calls itself recursively, till it runs out of triples.
(def findtext (triples textual content) (if (< (len triples) (len textual content)) nil (let outcome (matchtext triples textual content) (if outcome outcome (findtext (cdr triples) textual content)))))
The Singularity Hub article mentioned that the DNA incorporates the quote:
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE." - JAMES JOYCE
Let’s begin by simply in search of “LIFE OUT OF LIFE”, since we’re unlikely to get LIFE to match twice simply by likelihood:
arc> (findtext t1 "LIFE OUT OF LIFE") nil
No match within the first watermark, so let’s strive the second.
arc> (findtext t2 "LIFE OUT OF LIFE") #hash(("CGT" . #O) ("TGA" . #T) ("CTG" . #I) ("ATA" . #area) ("TAA" . #E) ("AAC" . #L) ("TCC" . #U) ("GGC" . #F))
How about that! Appears to be like like a match in watermark 2, with DNA “AAC” similar to L, “CGT” similar to “I”, “GGC” similar to “F”, and so forth. (The Arc format for hash tables is fairly ugly, however hopefully you’ll be able to see that within the output.) Let’s strive matching the complete quote and retailer the desk in code2:
arc> (= code2 (findtext t2 "TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE.")) #hash(("TCC" . #U) ("TCA" . #H) ("TTG" . #V) ("CTG" . #I) ("ACA" . #P) ("CTA" . #R) ("CGA" . #.) ("ATA" . #area) ("CGT" . #O) ("TTT" . #C) ("TGA" . #T) ("TAA" . #E) ("AAC" . #L) ("CAA" . #M) ("GTG" . #,) ("TAG" . #A) ("GGC" . #F))
Likewise, we will strive matching the opposite quotes:
arc> (= code3 (findtext t3 "SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE.")) #hash(("CTA" . #R) ("TCA" . #H) ("CTG" . #I) ("GCT" . #S) ("CGA" . #.) ("ATA" . #area) ("CAT" . #Y) ("CGT" . #O) ("TCC" . #U) ("AGT" . #B) ("TGA" . #T) ("TAA" . #E) ("TGC" . #N) ("TAC" . #G) ("CAA" . #M) ("GTG" . #,) ("TAG" . #A)) arc> (= code4 (findtext t4 "WHAT I CANNOT BUILD, I CANNOT UNDERSTAND")) #hash(("TCC" . #U) ("TCA" . #H) ("ATT" . #D) ("CTG" . #I) ("GCT" . #S) ("TTT" . #C) ("ATA" . #area) ("TAA" . #E) ("CGT" . #O) ("CTA" . #R) ("TGA" . #T) ("AGT" . #B) ("AAC" . #L) ("TGC" . #N) ("GTG" . #,) ("TAG" . #A) ("GTC" . #W))
Fortunately, all of them decode the identical triples to the identical letters, or else we might have an issue. We will merge these three decode tables into one:
arc> (= code (listtab (be a part of (tablist code2) (tablist code3) (tablist code4))))#hash(("TCC" . #U) ("TCA" . #H) ("CTA" . #R) ("CGT" . #O) ("TGA" . #T) ("CGA" . #.) ("TAA" . #E) ("AAC" . #L) ("TAC" . #G) ("TAG" . #A) ("GGC" . #F) ("ACA" . #P) ("ATT" . #D) ("TTG" . #V) ("CTG" . #I) ("GCT" . #S) ("TTT" . #C) ("CAT" . #Y) ("ATA" . #area) ("AGT" . #B) ("GTG" . #,) ("TGC" . #N) ("CAA" . #M) ("GTC" . #W))
Now, let’s make a decode operate that may apply the string to the unknown DNA and see what we get. (We’ll go away unknown triples unconverted.)
(def decode (decodemap triples) (string (accum accfn (every triple triples (accfn (if (decodemap triple) (decodemap triple) (string "(" triple ")" ))))))) arc> (decode code t1) "(GTT). CRAIG VENTER INSTITUTE (ACT)(TCT)(TCT)(GTA)(GGG)ABCDEFGHI(GTT)(GCA)LMNOP(TTA)RSTUVW(GGT)Y(TGG)(GGG) (TCT)(CTT)(ACT)(AAT)(AGA)(GCG)(GCC)(TAT)(CGC)(GTA)(TTC)(TCG)(CCG)(GAC)(CCC)(CCT)(CTC)(CCA)(CAC)(CAG)(CGG)(TGT)(AGC)(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)(GGA)(ACG)(GAT)(GAG)(GAA).,(GGG)SYNTHETIC GENOMICS, INC.(GGG)(CGG)(GAG)DOCTYPE HTML(AGC)(CGG)HTML(AGC)(CGG)HEAD(AGC)(CGG)TITLE(AGC)GENOME TEAM(CGG)(CAC)TITLE(AGC)(CGG)(CAC)HEAD(AGC)(CGG)BODY(AGC)(CGG)A HREF(CCA)(GGA)HTTP(CAG)(CAC)(CAC)WWW.(GTT)CVI.ORG(CAC)(GGA)(AGC)THE (GTT)CVI(CGG)(CAC)A(AGC)(CGG)P(AGC)PROVE YOU(GAA)VE DECODED THIS WATERMAR(GCA) BY EMAILING US (CGG)A HREF(CCA)(GGA)MAILTO(CAG)MRO(TTA)STI(TGG)(TCG)(GTT)CVI.ORG(GGA)(AGC)HERE(GAG)(CGG)(CAC)A(AGC)(CGG)(CAC)P(AGC)(CGG)(CAC)BODY(AGC)(CGG)(CAC)HTML(AGC)"
It seems to be like we’re doing fairly effectively with the decoding, as there’s lots of recognizable textual content and a few HTML in there. There’s additionally conveniently the whole alphabet as a decoding help. From this, we will fill in lots of the gaps, e.g. GTT is J and GCA is Ok. From the HTML tags we will determine angle brackets, quotes, and slash. We will guess that there are numbers in there and determine that ACT TCT TCT GTA is 2009. These deductions could be added manually to the decode desk:
arc> (= (code "GTT") #J) arc> (= (code "GCA") #Ok) arc> (= (code "ACT") #2) ...
After a pair cycles of including lacking characters and searching on the decodings, we get the almost-complete decodings of the 4 watermarks:
The contents of the watermarks
J. CRAIG VENTER INSTITUTE 2009
ABCDEFGHIJKLMNOPQRSTUVWXYZ
0123456789(TTC)@(CCG)(GAC)-(CCT)(CTC)=/:<(TGT)>(ATC)(ACC)(AAG)(AAA)(ATG)(AGG)”(ACG)(GAT)!’.,
SYNTHETIC GENOMICS, INC.
<!DOCTYPE HTML><HTML><HEAD><TITLE>GENOME TEAM</TITLE></HEAD><BODY><A HREF=”HTTP://WWW.JCVI.ORG/”>THE JCVI</A><P>PROVE YOU’VE DECODED THIS WATERMARK BY EMAILING US <A HREF=”MAILTO:[email protected]“>HERE!</A></P></BODY></HTML>MIKKEL ALGIRE, MICHAEL MONTAGUE, SANJAY VASHEE, CAROLE LARTIGUE, CHUCK MERRYMAN, NINA ALPEROVICH, NACYRA ASSAD-GARCIA, GWYN BENDERS, RAY-YUAN CHUANG, EVGENIA DENISOVA, DANIEL GIBSON, JOHN GLASS, ZHI-QING QI.
“TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE.” – JAMES JOYCECLYDE HUTCHISON, ADRIANA JIGA, RADHA KRISHNAKUMAR, JAN MOY, MONZIA MOODIE, MARVIN FRAZIER, HOLLY BADEN-TILSON, JASON MITCHELL, DANA BUSAM, JUSTIN JOHNSON, LAKSHMI DEVI VISWANATHAN, JESSICA HOSTETLER, ROBERT FRIEDMAN, VLADIMIR NOSKOV, JAYSHREE ZAVERI.
“SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE.”CYNTHIA ANDREWS-PFANNKOCH, QUANG PHAN, LI MA, HAMILTON SMITH, ADI RAMON, CHRISTIAN TAGWERKER, J CRAIG VENTER, EULA WILTURNER, LEI YOUNG, SHIBU YOOSEPH, PRABHA IYER, TIM STOCKWELL, DIANA RADUNE, BRIDGET SZCZYPINSKI, SCOTT DURKIN, NADIA FEDOROVA, JAVIER QUINONES, HANNA TEKLEAB.
“WHAT I CANNOT BUILD, I CANNOT UNDERSTAND.” – RICHARD FEYNMAN
The primary watermark consists of a copyright-like assertion, an inventory of all of the characters, and the hidden HTML web page (which I confirmed above.) The second, third, and fourth watermarks include an inventory of the authors and three quotations.
Word that there are 14 undecoded triples that solely seem as soon as within the record of characters. They are not in ASCII order, keyboard order, or every other order I can determine, so I can not decide what they’re, however I assume they’re lacking punctuation and particular characters.
The DNA decoding desk
The next desk summarizes the ‘secret’ DNA to character code:
AAA ? | AAC L | AAG ? | AAT 3 |
ACA P | ACC ? | ACG ? | ACT 2 |
AGA 4 | AGC > | AGG ? | AGT B |
ATA area | ATC ? | ATG ? | ATT D |
CAA M | CAC / | CAG : | CAT Y |
CCA = | CCC – | CCG ? | CCT ? |
CGA . | CGC 8 | CGG < | CGT O |
CTA R | CTC ? | CTG I | CTT 1 |
GAA ‘ | GAC ? | GAG ! | GAT ? |
GCA Ok | GCC 6 | GCG 5 | GCT S |
GGA “ | GGC F | GGG newline | GGT X |
GTA 9 | GTC W | GTG , | GTT J |
TAA E | TAC G | TAG A | TAT 7 |
TCA H | TCC U | TCG @ | TCT 0 |
TGA T | TGC N | TGG Z | TGT ? |
TTA Q | TTC ? | TTG V | TTT C |
So far as I can inform, this desk is in random order. I analyzed it a bunch of how, from character frequency and DNA frequency to ASCII order and keyboard order, and could not determine any rhyme or cause to it. I hoped to seek out both some construction or one other coded message, however I could not discover something.
Extra on breaking the code
It was handy that the JCVI mentioned upfront what quotations had been within the DNA, making it simpler to interrupt the code. However may the code nonetheless be damaged in the event that they hadn’t?
One solution to break the code is to take a look at statistics of how typically totally different triples seem. We depend the triples, convert the desk to an inventory, after which kind the record. Within the kind we use a particular comparability operate that compares the counts, to kind on the counts, not on the triples.
arc> (kind (fn ((k1 v1) (k2 v2)) (> v1 v2)) (tablist (counts t2))) (("ATA" 41) ("TAG" 27) ("TAA" 25) ("CTG" 18) ("TGC" 16) ("GTG" 16) ("CTA" 15) ("CGT" 14) ("AAC" 13) ("TGA" 10) ("TTT" 10) ("GCT" 10) ("TAC" 10) ("TCA" 8) ("CAA" 7) ("CAT" 7) ("TCC" 7) ("TTG" 5) ("GGC" 4) ("GTT" 4) ("ATT" 4) ("CCC" 4) ("GCA" 3) ("AGT" 2) ("TTA" 2) ("ACA" 2) ("CGA" 2) ("GGA" 2) ("GTC" 1) ("TGG" 1) ("GGG" 1))
This tells us that ATA seems 41 occasions within the second watermark, TAG 27 occasions, and so forth. If we assume that this encodes English textual content, then the commonest characters can be area, E, T, A, O, and so forth. Then it is a matter of trial-and-error attempting the high-frequency letters for the high-frequency triples till we discover a mixture that yields actual phrases. (It is similar to fixing a newspaper cryptogram.) You may discover that the high-frequency triples do end up to match high-frequency letters, however not within the precise order. (I’ve written earlier than on simple cryptanalysis with Arc.)
One other doable technique is to guess {that a} phrase comparable to “CRAIG VENTER” seems within the resolution, and attempt to match that in opposition to the triples. This seems to be the case. It does not give lots of letters to work on, however it’s a begin.
Arc: nonetheless unhealthy for exploratory programming
A pair years in the past I wrote that Arc is bad for exploratory programming, which turned out to be hugely controversial. I did this DNA exploration utilizing each Python and Arc, and located Python to be significantly better for a lot of the causes I described in my earlier article.
- Libraries: Matching DNA triples was trivial in Python as a result of I may use common expressions. Arc does not have common expressions (until you employ a nonstandard variant comparable to Anarki). One other subject was once I wished to do graphical show of watermark contents; Arc has no graphics help.
- Efficiency: for the types of exploratory programming I do, efficiency is vital. As an illustration, one factor I did when attempting to determine the code was match all substrings of 1 watermark in opposition to one other, to see if there have been commonalities. That is O(N^3) and was tolerably quick in Python, however Arc can be too painful.
- Ease of and pace of programming. Quite a lot of the programming pace profit is that I’ve extra familiarity with Python, however whereas programming in Arc I’d continuously get derailed by cryptic error messages, or attempting to determine methods to do easy issues like breaking a string into triples. It does not assist that a lot of the Arc documentation I wrote myself.
To summarize, I began off in Arc, switched to Python once I realized it will take me manner too lengthy to determine the DNA code utilizing Arc, after which went again to Arc for this writeup after I discovered what I wished to do. In different phrases, Python was significantly better for the exploratory half.
Ideas on the DNA coding approach
I see some potential ways in which the DNA coding approach might be improved and prolonged. Because the the complete particulars of the DNA coding approach have not been printed by the JCVI but, they could have already thought-about these concepts.
Having the ability to embed textual content within the DNA of a residing organism is extraordinarily cool. Nonetheless, I am undecided how sensible it’s. The information density in DNA could be very excessive, possibly 500 occasions that of a tough drive (ref), however a lot of the DNA is dedicated to preserving the bacterium alive and solely about 1K of textual content is saved within the watermarks.
I obtained e mail from JCVI that mentioned the coding mechanism additionally helps Java, math (LaTeX?), and Perl in addition to HTML. Embedding Perl code in a residing organism appears much more loopy than textual content or HTML.
If encoding textual content in DNA catches on, I might count on that error correction can be essential to deal with mutations. An error-correction code like Reed-Solomon will not actually work as a result of DNA can endure deletions and insertions in addition to mutations, so that you’d want some form of generalized deletion/insertion correcting code.
I simply know {that a} 6-bit code is not going to be sufficient. In the end individuals will need lower-case, accented character, Japanese, and so on. So that you may as effectively give you a solution to put Unicode into DNA, in all probability as UTF-8. And other people will need arbitrary byte information. My strategy can be to make use of 4 DNA base pairs as a substitute of three, and encode bytes. Merely assign A=00, C=01, G=10, and T=11 (as an example), and encode your bytes.
If I had been writing an RFC for information storage in DNA, I might recommend that more often than not you’d wish to retailer the precise information on the net, and simply retailer a hyperlink within the DNA. (Much like how a QR code or tinyurl can hyperlink to a web site.) The encoded hyperlink might be surrounded by a particular DNA sequence that signifies a tiny DNA hyperlink code. This sequence may be used to seek out the hyperlink code through the use of PCR, so that you need not do a genome sequence on the whole organism to seek out the watermark. (The prevailing watermarks are in truth surrounded by mounted sequences, which can be for that goal.)
I feel there also needs to be some solution to tag and construction information that’s saved DNA, to point what’s identification, description, copyright, HTML, homeowners, and so forth. XML looks like an apparent alternative, however embedding XML in residing organisms makes me queasy.
Conclusions
Having the ability to fabricate a residing organism from a DNA pc file is an incredible accomplishment of Craig Venter and the JCVI. Embedding the textual content of an internet web page in a residing organism looks like science fiction, however it’s truly occurred. After all the bacterium does not truly serve the online web page… but.