Encoding technical information in GM organisms
nature.com, March 2003, Volume 21, Number 3, pp 224-226
[Weirdness is upon us... --fl]
1. Sylvestre Marillonnet is a senior scientist at Icon Genetics AG, Maximilianstr. 38/40, Munich 80539, Germany 2. Victor Klimyuk is a research director at Icon Genetics AG, Maximilianstr. 38/40, Munich 80539, Germany 3. Yuri Gleba is chief executive officer at Icon Genetics AG, Maximilianstr. 38/40, Munich 80539, Germany. Correspondence should be addressed to Y Gleba. e-mail: gleba@icongenetics.com
The recent decision of the European Union to curb the moratorium on transgenic plants 1 is conditional on the introduction of extra rules to ensure both the traceability of genetically modified (GM) organisms and the labeling of GM organism-derived products 2. Currently, most detection protocols rely on the detection of signature sequences (e.g., antibiotic-resistance markers or promoters) that are not unique to particular GM products and that may pose environmental and health concerns, although such concerns remain unproven. Here, we propose a standardized procedure whereby nontranscribed DNA-encoded technical information can be inserted next to a transgene in a recombinant organism's genome to allow rapid product identification. The message would be based on a non-genetic code or encryption protocol, would be biologically "neutral", would be easily retrieved and read, and would have sufficient capacity for storing the descriptive information required. We suggest the procedure could be adopted as a universal coding standard for GM organisms.
Coding protocol or languages
The base composition of DNA dictates that any information encrypted within the molecule must exploit a four-letter DNA "storage language". To make the actual "user language" as friendly as possible, we propose using alphanumeric characters. In this case, we would need an artificial code that translates words based on four-symbol letters into a language that has 26 Latin alphabet letters, 10 Arabic numerals, and one or two space/stop characters (i.e., a total of 37-38 characters). We are best served by an encryption protocol that translates nucleotide triplets, very much like nature's genetic code. Such a protocol has a capacity to code for 4^3 = 64 characters, a number that is more than sufficient for our purposes. (Theoretically, we could also use longer words such as quadruplets, but such a protocol would need 33% more nucleotides per character, that is, a 25% lower information density, and would consequently require longer label DNA inserts.)
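The capacity argument above can be checked with a few lines of arithmetic. This is a minimal sketch; the names `ALPHABET_SIZE`, `min_word_length` and `LABEL_CHARS` are illustrative and do not come from the article.

```python
BASES = 4            # A, C, G, T
ALPHABET_SIZE = 38   # 26 letters + 10 digits + 2 space/stop characters (upper bound in the text)
LABEL_CHARS = 100    # approximate label length discussed in the article


def min_word_length(n_symbols: int, n_bases: int = BASES) -> int:
    """Smallest word length n such that n_bases**n >= n_symbols."""
    n = 1
    while n_bases ** n < n_symbols:
        n += 1
    return n


doublet_capacity = BASES ** 2     # 16, too few for 38 characters
triplet_capacity = BASES ** 3     # 64, enough, with 26 spare triplets
n_min = min_word_length(ALPHABET_SIZE)

# Nucleotides needed for a 100-character label with triplets vs. quadruplets
nt_triplet = 3 * LABEL_CHARS
nt_quadruplet = 4 * LABEL_CHARS

print(n_min, triplet_capacity, nt_triplet, nt_quadruplet)
```

Doublets fall short and triplets leave spare capacity, which is what makes a degenerate code possible; quadruplets would lengthen every label by a third.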
Our proposed triplet-based code is, to a certain extent, degenerate, in that some characters will be coded by more than one triplet. This characteristic should be useful to avoid the creation of unwanted motifs, such as commonly used restriction sites or destabilizing repeats that would complicate the actual genetic engineering process or affect the genetic stability of the message.
Although we initially considered using a system that merely added artificial codons to the existing triplet code (which would then be translated in terms of single-letter alphabet designations for the 20 amino acids; see "A Proof of Principle"), we reasoned that defining a completely new artificial triplet code (Fig. 1) would allow us to optimize sequence variability in duplicate characters (nucleotide composition and GC content), thereby providing more flexibility when designing new messages. In contrast, codons in nature are usually degenerate at the third position only.
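To make the idea of a degenerate triplet code concrete, here is a minimal sketch of an encoder that picks among synonymous triplets to avoid unwanted motifs. The code table is NOT the one in Fig. 1: it is a hypothetical round-robin assignment of the 64 triplets to 38 characters, chosen only for illustration. The two forbidden motifs (the EcoRI and BamHI recognition sites) are likewise example choices.

```python
from itertools import product

# Hypothetical 38-character alphabet; the real table is defined in Fig. 1.
CHARS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ."
TRIPLETS = ["".join(t) for t in product("ACGT", repeat=3)]  # all 64 triplets

# Round-robin assignment: the first 26 characters receive two triplets each.
ENCODE = {c: [] for c in CHARS}
DECODE = {}
for i, triplet in enumerate(TRIPLETS):
    ch = CHARS[i % len(CHARS)]
    ENCODE[ch].append(triplet)
    DECODE[triplet] = ch

FORBIDDEN = ("GAATTC", "GGATCC")  # EcoRI and BamHI sites, as examples


def encode(message: str, forbidden=FORBIDDEN) -> str:
    """Encode text, preferring synonymous triplets that create no forbidden motif."""
    # A newly created motif must overlap the 3 new bases, so only the tail matters.
    window = max((len(m) for m in forbidden), default=0) + 2
    dna = ""
    for ch in message.upper():
        options = ENCODE[ch]
        chosen = options[0]  # fallback if every synonym creates a motif
        for triplet in options:
            tail = (dna + triplet)[-window:]
            if not any(m in tail for m in forbidden):
                chosen = triplet
                break
        dna += chosen
    return dna


def decode(dna: str) -> str:
    """Translate a label back to text, three nucleotides at a time."""
    return "".join(DECODE[dna[i:i + 3]] for i in range(0, len(dna), 3))


print(encode("IPQ", forbidden=()))  # naive encoding, contains GAATTC across triplets
print(encode("IPQ"))                # synonymous triplet chosen for Q instead
print(decode(encode("ICON 2003")))
```

With this toy table, the naive encoding of "IPQ" creates an EcoRI site that spans three triplets, and the encoder sidesteps it by switching to the second triplet for "Q". Characters with a single triplet offer no such freedom, so a real table would need to distribute the spare triplets with care.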
The message
To define a useful message content and length, analogies with other existing products on the market are worthwhile. On the back of a machine, for example a computer, one finds data such as the name of the maker or company, production date, place of production, product model, and serial number. Assuming that we would use the same classification for transgenic products (perhaps we would want to add a name of the registered product owner, which may or may not be the producer, and one additional message section that could be reserved for future information needs), we conclude that a sufficient message length for the foreseeable future could be five to seven words, each having sufficient coding capacity to allow adequate description of product specifications. Assuming the average length of the words currently used on the labels of technical products is six to seven letters, we arrive at a total message length of 100 or so characters (about 300 nucleotides). A hypothetical example of such a label is shown in Figure 2.
Over a prolonged period of time (multiple organism generations), the quality of the information encoded in a technical message may deteriorate as a result of spontaneous mutations. To avoid this problem, we propose to repeat the essential elements, or the whole message. Encoding this duplication on the antisense strand, downstream of the first message, would result in the creation of an inverted repeat if no degenerate triplets are used. Even if degenerate triplets were used wherever possible, an imperfect inverted repeat might still be created. Therefore, we propose to encode the duplicated message on the same strand, but in the opposite direction (Fig. 2).
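The difference between the two duplication strategies can be illustrated with string operations on the top strand. This is a sketch under simplifying assumptions: only perfect inverted repeats are tested, and the example sequence and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]


def antisense_copy(message_dna: str) -> str:
    """Copy written on the antisense strand, as it appears on the top strand."""
    return reverse_complement(message_dna)


def reversed_copy(message_dna: str) -> str:
    """Copy written on the same strand, but in the opposite direction."""
    return message_dna[::-1]


def is_perfect_inverted_repeat(first: str, second: str) -> bool:
    """True if the two segments could pair with each other base for base."""
    return second == reverse_complement(first)


message = "AGAATTTCG"  # hypothetical 3-character label
print(antisense_copy(message), is_perfect_inverted_repeat(message, antisense_copy(message)))
print(reversed_copy(message), is_perfect_inverted_repeat(message, reversed_copy(message)))
# Reading the reversed copy back only requires reversing it again.
print(reversed_copy(reversed_copy(message)) == message)
```

A reversed copy can never be a perfect inverted repeat of the original, because that would require the sequence to equal its own complement, and no base is its own complement. Short partial hairpins are still possible, so this check does not replace a screen of the final sequence.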
Reading initiation and termination
For technically simple and reliable reading of any DNA-encoded information, one needs convenient and universal initiation and termination signals. Taking into account the current state of the art of DNA manipulation, one is best served by unique sets of short (18-30 nucleotides) sequences that allow easy amplification of the message DNA in question by PCR and that are mutually accepted and declared a standard.
A PCR product amplified using primers homologous to the conserved sequences could be sequenced directly or after cloning. Considering that several nucleotides following a primer are often difficult to read using conventional dideoxy-sequencing methods, it would be useful to include a spacer region (in the standard conserved sequence) before the start of the information-containing region (see Fig. 2). Because of the extreme sensitivity of PCR, transgenes could be identified with this approach not only in tissues from transgenic organisms, but also in processed products derived from such organisms, because they often contain DNA in trace amounts that are sufficient for amplification.
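The reading step can be mimicked in silico: locate the two conserved sequences, take what lies between them, and discard the spacers. In this minimal sketch the flanking sequences, the 10-nucleotide spacer length, and the presence of a spacer on both sides are all assumptions made for illustration; none of them is a proposed standard.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical conserved sequences, written 5'->3' on the top strand.
UPSTREAM_SITE = "ACGTTGCAAGCTTGACCTGA"
DOWNSTREAM_SITE = "TCAGGCTAACGTTCGAGTCA"
SPACER_LEN = 10  # assumed spacer length on each side of the message

# Primers that would amplify the label: the reverse primer anneals to the top strand.
FORWARD_PRIMER = UPSTREAM_SITE
REVERSE_PRIMER = reverse_complement(DOWNSTREAM_SITE)


def extract_message(template: str):
    """Return the message between the conserved sites, or None if no label is found."""
    start = template.find(UPSTREAM_SITE)
    if start == -1:
        return None
    inner_start = start + len(UPSTREAM_SITE)
    end = template.find(DOWNSTREAM_SITE, inner_start)
    if end == -1:
        return None
    inner = template[inner_start:end]
    if len(inner) < 2 * SPACER_LEN:
        return None
    return inner[SPACER_LEN:len(inner) - SPACER_LEN]


message = "AGAATTTCG"
template = ("GGGCCC" + UPSTREAM_SITE + "T" * SPACER_LEN + message
            + "A" * SPACER_LEN + DOWNSTREAM_SITE + "TTTAAA")
print(extract_message(template))
```

Exact string matching stands in for primer annealing here; real primers tolerate some mismatches, and real sequence reads contain errors that this sketch ignores.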
Multiple messages
Whereas identification of an information label by direct sequencing of a PCR product will be possible for plants containing a single transgene, alternative strategies are required for plants with multiple transgenes. One general solution might be to clone the PCR product and randomly sequence several clones. Such a strategy will, however, become more laborious when multiple genes have been engineered into the same organism. To reduce the complexity of a mix of amplified sequences, or to conduct more specific searches, one might do PCR with primers designed to anneal within the variable region. For example, a PCR done with company-specific primers (primers that overlap the nine nucleotides of the company code label) would allow one to determine whether the analyzed plant contains constructs from a given company.
A more general solution would be to analyze the mix of amplified products with microarray hybridization 3,4. Sequencing by hybridization of a microarray would have the advantage of allowing the simultaneous determination of all messages in a single hybridization. This approach is, however, somewhat limited by the future availability and cost of sequencing microarrays. A similar but simpler approach would be to use specially designed microarrays that would contain sets of oligonucleotides specific for the variable region, for all constructs known to have been transformed in plants (or any other organism). This approach would not allow the complete determination of the entire technical message, but would at least identify all the constructs present in the biological sample.
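The logic of screening a mix of amplified labels against a set of construct-specific oligonucleotides can be sketched as follows. Exact substring matching on either strand is used here as a crude stand-in for hybridization, and the probe names and sequences are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical probes, one per known construct, taken from the variable region.
PROBES = {
    "construct_A": "AGAATTTCGACGTACGGA",
    "construct_B": "TTGACCGTAGGCATCAAT",
    "construct_C": "CCATGGTTAACGCGTATA",
}


def detect_constructs(amplicons, probes=PROBES):
    """Return the sorted names of probes that match any amplicon on either strand."""
    found = []
    for name, probe in probes.items():
        rc = reverse_complement(probe)
        if any(probe in amp or rc in amp for amp in amplicons):
            found.append(name)
    return sorted(found)


# A sample carrying two labels, one of them sequenced from the opposite strand.
sample = [
    "GGG" + PROBES["construct_A"] + "CCC",
    "AA" + reverse_complement(PROBES["construct_C"]) + "TT",
]
print(detect_constructs(sample))
```

As the text notes, such a screen identifies which known constructs are present but does not recover the full message; a construct missing from the probe set would go undetected.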
Biological neutrality
It is evident that adding a noncoding, technical label to a transgene is unlikely to have important biological consequences for the GM organism, in that there should be no increased mutability (in particular no recombination hotspots within the technical label or the "linker" between the label and the transgene), no effect of the label on the stability or expression of the transgene per se, and no changes in overall biological fitness of the organism.
To avoid a functional conflict defined early by von Neumann in his theory of self-replicating automata 5, the proposed information message would be placed outside of the transgene itself; alternatively, it can be placed within an intron inside the linear molecule of the transgene. To remain linked to the transgene during organism reproduction, the DNA encoding the accompanying technical message has to be closely or directly linked to the transgene DNA, a requirement that is easy to accommodate without complicating the engineering process. Given the rate of genome evolution of multicellular organisms such as animals and plants, such a linkage will certainly be maintained for a technologically relevant period of time, provided the added segment has been engineered so as not to cause increased recombination.
To avoid any serious effect of the label on the neighboring transgene or the organism itself, the introduced segment should be engineered without the use of sequences that are known as mutation hotspots (repeats, palindromes, recombination sites, etc.) or transcription elements ("hairpins" affecting transcription, enhancers, etc.). Ultimate proof will most likely be gained only through a practical evaluation of the specific label sequences.
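A candidate label could be screened computationally for two of the features named above, direct repeats and palindromes or inverted repeats, before any practical evaluation. This is a minimal sketch; the k-mer length of 8 is an arbitrary illustrative threshold, not a validated one, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def direct_repeats(seq: str, k: int = 8) -> set:
    """k-mers that occur more than once in seq."""
    seen, repeated = set(), set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in seen:
            repeated.add(kmer)
        seen.add(kmer)
    return repeated


def inverted_repeats(seq: str, k: int = 8) -> set:
    """k-mers whose reverse complement also occurs in seq (palindromes included)."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {kmer for kmer in kmers if reverse_complement(kmer) in kmers}


def passes_screen(seq: str, k: int = 8) -> bool:
    """True if no direct or inverted repeat of length k is present."""
    return not direct_repeats(seq, k) and not inverted_repeats(seq, k)


print(direct_repeats("AAACCCGGGAAACCC", k=6))
print(inverted_repeats("GAATTC", k=6))
print(passes_screen("AAAAAACCCC", k=6))
```

A screen of this kind only flags sequence features; it says nothing about recombination sites or transcription elements, which would need curated motif lists and, as the text says, practical evaluation.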
With regard to the potential biological implications of introducing an additional DNA fragment into an organism, we see no obvious negative consequences. The added DNA fragment of proposed size (300 or so nucleotides) will increase the total length of the DNA construct to be integrated into an organism's genome by no more than 10-20%. Unlike accompanying selectable or phenotypic markers used for genetic transformation, it does not contain information expressed by the organism, and in this regard, it should be as harmless as other noncoding DNA that is abundant in genomes of most eukaryotes. For example, in plants, the amount of such DNA can vary between 70 Mb (Arabidopsis thaliana) and 100 Gb (onion); the latter, in principle, is sufficient to store in a single onion cell as much information as 10,000 books the size of the Bible.
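The figures in this paragraph can be cross-checked with simple arithmetic. The sketch below assumes the triplet code (three nucleotides per character) and deliberately does not assume a character count for the Bible; it only computes what size of book the article's comparison implies.

```python
LABEL_NT = 300                    # label size from the text
ONION_GENOME_NT = 100 * 10**9     # 100 Gb, from the text
NT_PER_CHAR = 3                   # triplet code
BOOKS = 10_000                    # number of books quoted in the text

# A 300-nt label adds 20% to a 1,500-nt construct and 10% to a 3,000-nt one,
# so "no more than 10-20%" corresponds to constructs of at least this size.
construct_nt_for_20_percent = LABEL_NT * 100 // 20
construct_nt_for_10_percent = LABEL_NT * 100 // 10

# Characters an onion-sized genome could hold, and the implied size of one book.
total_chars = ONION_GENOME_NT // NT_PER_CHAR
chars_per_book = total_chars // BOOKS

print(construct_nt_for_20_percent, construct_nt_for_10_percent)
print(total_chars, chars_per_book)
```

The implied book size comes out at roughly 3.3 million characters, which is the right order of magnitude for a very long book, so the comparison is internally consistent under the triplet code.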
Last thoughts
An alternative to the coding protocol proposed here would be the use of "universal" recognition sequences to flank random sequences, which would then be used as a "password" allowing an authorized reader entry to a specific computer database, thus providing necessary information. We don't favor such a solution, for several reasons: first, such an unstructured coding based on unique writing-meaning relationships would be much more sensitive to information corruption; second, it would be less user-friendly; third, it would pose many more problems if changes or upgrades in standards and technologies that define the labeling process were ever introduced.
Would an expressible message be useful as a part of the label? To facilitate further the detection of a GM organism, one might consider adding a segment of the label that is expressible by the living organism's genetic machinery. Such information, after transcription and translation, can produce a short "universal" polypeptide that would be easily detectable by fast and sensitive immunological techniques and would be indicative of the transgenic nature of the biological material in question. Of course, such a marker would have to be a protein with a proven health and environmental safety record. Given the general unpredictability of the plant expression machinery (in particular anomalies arising from silencing phenomena) and the likelihood that any new expressed polypeptide will cause added safety concerns, we don't consider such an approach as an a priori useful solution.
Agencies interested in adopting our proposal of course would have to address the problem of coordinating and standardizing terminology used to describe GM products. One means of identifying companies, for example, would be to use the three- to four-letter tickers that designate companies on stock exchanges (although this nomenclature would obviously have to be adapted to allow inclusion of private companies).
For our proposal to be feasible, such agencies would also likely have to coordinate the building and hosting of a central database repository for GM product information. Such databases would include a much more extensive description and annotation of products and/or cross-references or links to product information stored on other databases (e.g., those of the producer or of other governmental agencies).
We believe our scheme provides a working process and code that could make the labeling of GM organisms feasible.
REFERENCES
1. Schiermeier, Q. Nature 409, 967 (2001).
2. Dorey, E. Nat. Biotechnol. 19, 795 (2001).
3. Pease, A.C. et al. Proc. Natl. Acad. Sci. USA 91, 5022-5026 (1994).
4. Tilliv, S.V. & Mirzabekov, A.D. Curr. Opin. Biotechnol. 12, 53-58 (2001).
5. Von Neumann, J. & Burks, A.W. Theory of Self-Reproducing Automata (Univ. of Illinois Press, Urbana, IL, 1966).
6. Thomas, R.F. (ed.) Virgil: The Georgics (Cambridge Univ. Press, Cambridge, 1988).