Saturday, April 26, 2014

The GGAGG Reflex

A common myth in biology is that genes coding for proteins need to have a Shine Dalgarno sequence upstream of the start codon. Students sometimes spout this as an inarguable fact, a kind of molecular biology catechism. I call it the GGAGG reflex.

In fact, no SD sequence is required. None at all. It's important to be clear on this.

In case you're not a biogeek: In the 1970s, Australian scientists John Shine and Lynn Dalgarno were the first to notice that the tail end of the bacterial 16S ribosomal RNA contains a short nucleotide sequence whose reverse complement is often found immediately upstream of a protein gene's start codon. The exact sequence varies from organism to organism, but the rRNA trailer sequence is usually pyrimidine-rich. In E. coli, the sequence is CACCTCCTTA. (Here, I am of course talking about the DNA sequence. In RNA it's CACCUCCUUA.) The reverse complement of that sequence is TAAGGAGGTG. Some portion of the latter is often found a few nucleotides upstream of a start codon: not 100% of the time, but too often to be by chance.
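The reverse-complement step is easy to check by hand or with a few lines of code. Here is a minimal Python sketch (the function name is my own, chosen for illustration) that takes the E. coli trailer quoted above and returns the sequence expected upstream of a start codon:

```python
def reverse_complement(dna: str) -> str:
    """Return the Watson-Crick reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    # Read the sequence backwards and swap each base for its partner.
    return "".join(complement[base] for base in reversed(dna.upper()))


ecoli_trailer = "CACCTCCTTA"  # E. coli 16S rRNA tail, written as DNA
print(reverse_complement(ecoli_trailer))  # TAAGGAGGTG
```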

The key intuition here is that Watson-Crick binding of the tail end of the 16S rRNA to the complementary sequence ahead of the start codon helps stabilize the ribosome so that it is more likely to translate the gene. The degree of binding depends, of course, on the fidelity of the SD sequence ahead of the gene. Usually, the purine-rich SD area is not an exact complement of the 16S rRNA trailer, and in fact the region upstream of the start codon quite often has no detectable SD signature whatsoever.

How often is "quite often"? In 2002, Ma et al. undertook a survey of 30 organisms representing bacteria from all major taxonomic groups. Somewhat surprisingly, they found that in 17 out of 30 organisms, a Shine Dalgarno sequence was present at fewer than half of all CDS (protein-encoding) genes. Among the bacteria most likely to use SD sequences were Bacillus subtilis and Thermotoga thermophilus, in which 90% of known protein genes have an upstream SD signal. Among those least likely to use SD sequences were low-GC/small-genome organisms (intracellular parasites, Mycoplasmas, and pathogens), with many groups, like the Actinobacteria (47%), falling somewhere in the middle.

Before taking these findings to heart, though, it's worth noting some serious weaknesses in the Ma et al. study. In obtaining the above numbers, Ma et al. used a rather permissive definition of "SD sequence," based on a minimum binding-energy cutoff (∆G) of -4.4 kcal/mol, which means they counted GAGG as an SD sequence (and also GGAG and AGGA). If one were to count only GGAGG (length 5) and longer motifs, the percentages given by Ma et al. would be much lower. (I present some data of my own on this further below.) The reason this is a very serious issue is that the probability of random occurrence of short (length-4) sequences like GAGG is substantial. Ma et al. failed to report the chance expectation for the various "signals" they looked for. Hence, for short motifs, we have no way of knowing what the expected rate of random occurrence was in each organism. If an organism with genomic GC content of 66% has a putative SD motif of GAGG, GGAG, or AGGA in the 20-bp target region for 20% of its genes, how does that compare with the random occurrence rate for those sequences, given the organism's DNA base composition? We're not told.
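To get a rough feel for the size of that random background, here is a back-of-the-envelope sketch in Python. It is not the method used by Ma et al. (they report none), and it rests on simplifying assumptions of my own: bases are independent, G and C are equally frequent, A and T are equally frequent, and a motif can start at any position that lets it fit inside the 20-bp window. The function names are hypothetical.

```python
def motif_probability(motif: str, gc_content: float) -> float:
    """Chance that a random site spells out `motif`, assuming independent
    bases with P(G) = P(C) = gc/2 and P(A) = P(T) = (1 - gc)/2."""
    base_prob = {
        "G": gc_content / 2, "C": gc_content / 2,
        "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
    }
    prob = 1.0
    for base in motif:
        prob *= base_prob[base]
    return prob


def expected_hits_per_window(motifs, gc_content: float, window: int = 20) -> float:
    """Expected number of chance hits (summed over all motifs) in one window."""
    return sum(
        (window - len(m) + 1) * motif_probability(m, gc_content)
        for m in motifs
    )


short_motifs = ["GAGG", "GGAG", "AGGA"]
print(round(expected_hits_per_window(short_motifs, gc_content=0.66), 2))
```

Under these assumptions, the 66%-GC organism in the example comes out at roughly 0.26 chance hits per 20-bp window. That is an expected count rather than the fraction of genes with at least one hit, but it is in the same range as the hypothetical 20% figure, which is exactly why the missing baseline matters.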

Bearing in mind the weaknesses of the study, a number of interesting findings nonetheless came out of the Ma et al. survey, including:
  • An SD sequence is rarely long or canonical; many times it's just GAGG or GGAG or AGGA (putatively) or a corruption of the expected form (e.g., GGTGG instead of GGAGG)
  • SD sequences occur more often with highly expressed genes (such as genes for ribosomal proteins and core energy metabolism genes) than with low-expression genes
  • In some (not all) organisms, the SD sequence is more likely to occur in conjunction with an ATG start codon and less likely to occur with GTG or TTG
  • Vanishingly few SD signals are located further than 14 bases or closer than 4 bases away from a start codon

Anybody with modest JavaScript skills can write scripts that verify some of these findings against public genomes. I took a quick look at the genome for Rothia mucilaginosa DY-18 (a member of the Actinobacteria and a common inhabitant of the human mouth). First, I determined the most likely SD sequence for Rothia based on the 16S rRNA trailer of CCTCCTTTCT (implying an SD sequence of AGAAAGGAGG). Then I had my script scan the genome in both directions, looking for any of the six possible length-5 substrings of that full-length sequence (so, AGAAA, GAAAG, etc.) in the 20 base pairs upstream of every annotated open reading frame start codon. In total, I found 686 putative SD sequences within 4 to 14 bases of an annotated start codon. Since Rothia mucilaginosa has 1905 CDS genes, this means 36.0% of protein genes carry a putative length-5 Shine Dalgarno signal. When I re-ran the check using all possible length-4 SD signal variants (using the relaxed criteria of Ma et al.), I found 1160 positives. Thus, 60.9% of CDS genes in R. mucilaginosa have a length-4 SD signal per Ma et al.
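For readers who want to try this themselves, here is a minimal sketch of that kind of scan, written in Python rather than JavaScript. It is an illustration, not the actual script behind the numbers above. All function names are hypothetical, gene coordinates are assumed to be 0-based positions of the first base of the start codon (for minus-strand genes, the forward-strand position that pairs with it), and "4 to 14 bases" is interpreted as the gap between the end of the motif and the start codon. Circular genomes and genes at the very edge of the sequence are not handled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(dna):
    return "".join(COMPLEMENT[b] for b in reversed(dna))


def sd_windows(full_sd, k):
    """All length-k substrings of the full-length SD sequence."""
    return [full_sd[i:i + k] for i in range(len(full_sd) - k + 1)]


def upstream_region(genome, start, strand, length=20):
    """The `length` bases just upstream of the start codon, read 5'->3'."""
    if strand == "+":
        return genome[max(0, start - length):start]
    # Minus strand: "upstream" lies to the right on the forward strand.
    return reverse_complement(genome[start + 1:start + 1 + length])


def has_sd(region, motifs, min_gap=4, max_gap=14):
    """True if any motif ends min_gap..max_gap bases before the start codon."""
    for motif in motifs:
        idx = region.find(motif)
        while idx != -1:
            gap = len(region) - (idx + len(motif))
            if min_gap <= gap <= max_gap:
                return True
            idx = region.find(motif, idx + 1)
    return False


def count_genes_with_sd(genome, genes, motifs):
    """genes is a list of (start, strand) pairs; each gene counts once."""
    return sum(
        has_sd(upstream_region(genome, start, strand), motifs)
        for start, strand in genes
    )


# Toy example: GGAGG sits 6 bases ahead of an ATG at position 20.
toy_genome = "T" * 9 + "GGAGG" + "T" * 6 + "ATGAAATAA"
motifs = sd_windows("AGAAAGGAGG", 5)  # the six length-5 Rothia motifs
print(count_genes_with_sd(toy_genome, [(20, "+")], motifs))  # 1
```

Pointing this at a real genome would mean loading the sequence and the annotated CDS start coordinates from a GenBank or FASTA/GFF file, which is left out here.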

Given Rothia's actual base composition, we would expect to see about 203 length-5 SD motifs by pure chance in the genome's 1905 20-bp upstream regions. The actual number (686) is more than three times that, tending to validate the notion that these are, indeed, SD motifs we're looking at. For length-6 motifs, the trend is even sharper: the expectation is 40 occurrences by chance; the actual number is 340. So at a length of 6, a motif has high odds (~90% chance) of being real.

By contrast, the statistical expectation for length-4 motifs works out to 957, which is not far below the number actually found (1160). Therefore, in dealing with Shine Dalgarno sequences, at least in Rothia, it's meaningful to deal with length-5 and longer motifs, but probably not meaningful to deal with length-4. When you spot a length-4 motif, odds are very high you're looking at a randomly occurring pattern.
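The arithmetic behind those comparisons is simple enough to lay out explicitly. The sketch below uses the Rothia counts quoted above and makes one assumption of my own: that the share of hits likely to be "real" can be approximated as (observed - expected) / observed. The function names are hypothetical.

```python
# (observed, expected-by-chance) counts for Rothia, keyed by motif length.
counts = {4: (1160, 957), 5: (686, 203), 6: (340, 40)}


def enrichment(observed, expected):
    """How many times more hits were found than chance predicts."""
    return observed / expected


def excess_fraction(observed, expected):
    """Share of observed hits in excess of the chance expectation."""
    return (observed - expected) / observed


for length in sorted(counts):
    observed, expected = counts[length]
    print(
        f"length {length}: {enrichment(observed, expected):.1f}x chance, "
        f"{excess_fraction(observed, expected):.1%} of hits above background"
    )
```

On this rough measure, about 88% of the length-6 hits and about 70% of the length-5 hits exceed the random background, versus only about 17.5% of the length-4 hits.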

If you enjoyed this post, please share the URL with a friend. Thank you!