Lecture Six
The Genetic Code (I)
In the previous lectures, we have
used Perl to search for motifs, simulate DNA mutations, generate random sequences,
and transcribe DNA to RNA. In this lecture, we will write Perl programs
to simulate how the genetic code directs the translation of DNA into protein.
We will start with the hash data type. In the next lecture, we will write
a program to translate DNA to protein. We will also continue
exploring regular expressions and writing code to handle FASTA files. Finally,
we will examine all six open reading frames (ORFs) of a DNA sequence during
the translation of DNA into protein.
Hashes

There are three main data types in Perl. You have already seen the first two: scalar variables and arrays. Now we will start to use the third: hashes. A hash in Perl is a collection of zero or more pairs of scalar values, called keys and values. It provides very fast lookup of the value associated with a key. Hash variable names begin with a percent sign (%). For example, you may have a hash called %genes:

    %genes = ( 'gene1', 'ATTCGT', 'gene2', 'CTGCCATGA');

The values are indexed by the keys, so that given a key, the hash returns the corresponding value:

    $seq = $genes{'gene2'};

- $seq now contains 'CTGCCATGA'
- Note that $genes{'gene2'} is a scalar, so it starts with $

Hashes can also be assigned values using key=>value notation. The following two assignments are equivalent:

    %genes = ( 'gene1', 'ATTCGT', 'gene2', 'CTGCCATGA');
    %genes = ( gene1=>'ATTCGT', gene2=>'CTGCCATGA');

The keys function returns a list of all keys in a hash:

    @key_list = keys(%genes);
    foreach $key (@key_list) {
        print "The value of $key is $genes{$key}\n";
    }

It produces the following output (the order of the lines may vary, because hash keys are not stored in any particular order):

    The value of gene1 is ATTCGT
    The value of gene2 is CTGCCATGA

As you may notice from this example, hashes (like arrays) change their leading character to a dollar sign when you access a single element, because the value returned from a hash lookup is a scalar value. You can tell a hash lookup from an array element by the type of brackets they use: arrays use square brackets [ ]; hashes use curly braces { }.

Initializing a hash with key-value pairs works much like initializing an array, except that each consecutive pair of elements becomes a key and its value, as in the two %genes assignments above.

Because hashes are built into Perl as a basic data type, they are easy to use, and you will not have to do much programming to accomplish your goal. However, hashes do not store their elements in sorted order, so if you need to look at the data in sorted order, you have to sort it explicitly:

    @sorted_keys = sort keys %my_hash;

Later in this lecture, we will develop programs that use hashes to retrieve information about a gene. The gene name is the key; the information about the gene is the value of that key.

Gene Expression Data Using Hashes

You can use hashes to find a gene in your data. To do so, load the hash so that the keys are the gene names and the values are the expression measurements. A single lookup in the hash, with the name of the desired gene as the key, then returns the result of the experiment for that gene.
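The gene expression lookup described above might be sketched as follows. The hash name %expression, the gene names, and the measurement values are all made up for illustration; real data would normally be read from a file.

```perl
use strict;
use warnings;

# Hypothetical expression measurements keyed by gene name (made-up values).
my %expression = (
    gene1 => 2.5,
    gene2 => 0.8,
    gene3 => 11.2,
);

# A single lookup returns the measurement for one gene.
my $level = $expression{'gene2'};
print "gene2 expression: $level\n";

# Hashes are unordered, so sort the keys to get a stable report.
my @report;
foreach my $gene (sort keys %expression) {
    push @report, "$gene\t$expression{$gene}";
}
print "$_\n" for @report;
```

Sorting the keys before printing is only needed for the report; the lookup itself does not depend on any ordering.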
|
|
The genetic code is about how a cell
translates the information contained in its DNA into amino acids and then
proteins, which do the real work in the cell.
Background

DNA encodes the primary structure (i.e., the amino acid sequence) of proteins. DNA has four nucleotides, and proteins have 20 amino acids. The encoding works by taking each group of three nucleotides from the DNA and "translating" them to an amino acid or a stop signal. Each group of three nucleotides is called a codon.

Actually, transcription first uses DNA to make RNA, and then translation uses RNA to make proteins. This is called the central dogma of molecular biology. But in this course, we will abbreviate the process and somewhat inaccurately call the entire process from DNA to protein "translation." The reason for this cavalier simplification is that the whole business is much easier to simulate on a computer using strings to represent the DNA, RNA, and proteins.

Note that with four kinds of bases, each group of three bases of DNA can represent as many as 4 x 4 x 4 = 64 possible amino acids. Since there are only 20 amino acids plus a stop signal, the genetic code has evolved some redundancy, so that some amino acids are represented by more than one codon. The chart in Figure 6.1 shows which codons encode which amino acids.

There are many interesting things to note about the genetic code. For our purposes, the most important is redundancy: the way more than one codon translates to the same amino acid. We will program this using character classes and regular expressions.

Figure 6.1 The Genetic Code
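The 4 x 4 x 4 = 64 count can be checked with a short sketch that enumerates every three-base combination. The variable names @bases and @codons are chosen here for illustration only.

```perl
use strict;
use warnings;

# Build every possible codon from the four DNA bases.
my @bases = ('A', 'C', 'G', 'T');
my @codons;
for my $first (@bases) {
    for my $second (@bases) {
        for my $third (@bases) {
            push @codons, "$first$second$third";
        }
    }
}

# Three nested loops over four bases give 4 * 4 * 4 combinations.
print scalar(@codons), " codons, from $codons[0] to $codons[-1]\n";
```

With 64 codons and only 21 meanings (20 amino acids plus stop), several codons must share the same meaning, which is the redundancy discussed above.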
Translating Codons to Amino Acids

We will look at three different versions of a subroutine that translates a codon using the genetic code. Here is the first:

    # codon2aa
    #
    # A subroutine to translate a DNA 3-character codon to an amino acid

    sub codon2aa {
        my($codon) = @_;

        if    ( $codon =~ /TCA/i ) { return 'S' } # Serine
        elsif ( $codon =~ /TCC/i ) { return 'S' } # Serine
        elsif ( $codon =~ /TCG/i ) { return 'S' } # Serine
        elsif ( $codon =~ /TCT/i ) { return 'S' } # Serine
        elsif ( $codon =~ /TTC/i ) { return 'F' } # Phenylalanine
        elsif ( $codon =~ /TTT/i ) { return 'F' } # Phenylalanine
        elsif ( $codon =~ /TTA/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /TTG/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /TAC/i ) { return 'Y' } # Tyrosine
        elsif ( $codon =~ /TAT/i ) { return 'Y' } # Tyrosine
        elsif ( $codon =~ /TAA/i ) { return '_' } # Stop
        elsif ( $codon =~ /TAG/i ) { return '_' } # Stop
        elsif ( $codon =~ /TGC/i ) { return 'C' } # Cysteine
        elsif ( $codon =~ /TGT/i ) { return 'C' } # Cysteine
        elsif ( $codon =~ /TGA/i ) { return '_' } # Stop
        elsif ( $codon =~ /TGG/i ) { return 'W' } # Tryptophan
        elsif ( $codon =~ /CTA/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /CTC/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /CTG/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /CTT/i ) { return 'L' } # Leucine
        elsif ( $codon =~ /CCA/i ) { return 'P' } # Proline
        elsif ( $codon =~ /CCC/i ) { return 'P' } # Proline
        elsif ( $codon =~ /CCG/i ) { return 'P' } # Proline
        elsif ( $codon =~ /CCT/i ) { return 'P' } # Proline
        elsif ( $codon =~ /CAC/i ) { return 'H' } # Histidine
        elsif ( $codon =~ /CAT/i ) { return 'H' } # Histidine
        elsif ( $codon =~ /CAA/i ) { return 'Q' } # Glutamine
        elsif ( $codon =~ /CAG/i ) { return 'Q' } # Glutamine
        elsif ( $codon =~ /CGA/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /CGC/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /CGG/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /CGT/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /ATA/i ) { return 'I' } # Isoleucine
        elsif ( $codon =~ /ATC/i ) { return 'I' } # Isoleucine
        elsif ( $codon =~ /ATT/i ) { return 'I' } # Isoleucine
        elsif ( $codon =~ /ATG/i ) { return 'M' } # Methionine
        elsif ( $codon =~ /ACA/i ) { return 'T' } # Threonine
        elsif ( $codon =~ /ACC/i ) { return 'T' } # Threonine
        elsif ( $codon =~ /ACG/i ) { return 'T' } # Threonine
        elsif ( $codon =~ /ACT/i ) { return 'T' } # Threonine
        elsif ( $codon =~ /AAC/i ) { return 'N' } # Asparagine
        elsif ( $codon =~ /AAT/i ) { return 'N' } # Asparagine
        elsif ( $codon =~ /AAA/i ) { return 'K' } # Lysine
        elsif ( $codon =~ /AAG/i ) { return 'K' } # Lysine
        elsif ( $codon =~ /AGC/i ) { return 'S' } # Serine
        elsif ( $codon =~ /AGT/i ) { return 'S' } # Serine
        elsif ( $codon =~ /AGA/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /AGG/i ) { return 'R' } # Arginine
        elsif ( $codon =~ /GTA/i ) { return 'V' } # Valine
        elsif ( $codon =~ /GTC/i ) { return 'V' } # Valine
        elsif ( $codon =~ /GTG/i ) { return 'V' } # Valine
        elsif ( $codon =~ /GTT/i ) { return 'V' } # Valine
        elsif ( $codon =~ /GCA/i ) { return 'A' } # Alanine
        elsif ( $codon =~ /GCC/i ) { return 'A' } # Alanine
        elsif ( $codon =~ /GCG/i ) { return 'A' } # Alanine
        elsif ( $codon =~ /GCT/i ) { return 'A' } # Alanine
        elsif ( $codon =~ /GAC/i ) { return 'D' } # Aspartic Acid
        elsif ( $codon =~ /GAT/i ) { return 'D' } # Aspartic Acid
        elsif ( $codon =~ /GAA/i ) { return 'E' } # Glutamic Acid
        elsif ( $codon =~ /GAG/i ) { return 'E' } # Glutamic Acid
        elsif ( $codon =~ /GGA/i ) { return 'G' } # Glycine
        elsif ( $codon =~ /GGC/i ) { return 'G' } # Glycine
        elsif ( $codon =~ /GGG/i ) { return 'G' } # Glycine
        elsif ( $codon =~ /GGT/i ) { return 'G' } # Glycine
        else {
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }

This piece of code is clear and simple, and the layout makes it obvious what is happening.
Recall filehandles from previous lectures and how they access data in files. There is also the special filehandle STDIN, which reads user input from the keyboard. STDOUT and STDERR are special filehandles too, and all three are always available to Perl programs. STDOUT directs output to the screen (usually) or another standard display device. The print statement accepts a filehandle as an optional argument; when the filehandle is missing from a print statement, STDOUT is assumed. Here, error messages are directed to STDERR, which usually prints to the screen, but on many computer systems it can be redirected to a special error file or other location. (Cf. Appendix B of the textbook.)

The Redundancy of the Genetic Code

The genetic code is redundant, and the next subroutine displays this redundancy clearly. Notice that groups of redundant codons almost always have the same first and second bases and vary in the third.

    # codon2aa
    #
    # A subroutine to translate a DNA 3-character codon to an amino acid
    # Version 2

    sub codon2aa {
        my($codon) = @_;

        if    ( $codon =~ /GC./i)        { return 'A' } # Alanine
        elsif ( $codon =~ /TG[TC]/i)     { return 'C' } # Cysteine
        elsif ( $codon =~ /GA[TC]/i)     { return 'D' } # Aspartic Acid
        elsif ( $codon =~ /GA[AG]/i)     { return 'E' } # Glutamic Acid
        elsif ( $codon =~ /TT[TC]/i)     { return 'F' } # Phenylalanine
        elsif ( $codon =~ /GG./i)        { return 'G' } # Glycine
        elsif ( $codon =~ /CA[TC]/i)     { return 'H' } # Histidine
        elsif ( $codon =~ /AT[TCA]/i)    { return 'I' } # Isoleucine
        elsif ( $codon =~ /AA[AG]/i)     { return 'K' } # Lysine
        elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine
        elsif ( $codon =~ /ATG/i)        { return 'M' } # Methionine
        elsif ( $codon =~ /AA[TC]/i)     { return 'N' } # Asparagine
        elsif ( $codon =~ /CC./i)        { return 'P' } # Proline
        elsif ( $codon =~ /CA[AG]/i)     { return 'Q' } # Glutamine
        elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine
        elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine
        elsif ( $codon =~ /AC./i)        { return 'T' } # Threonine
        elsif ( $codon =~ /GT./i)        { return 'V' } # Valine
        elsif ( $codon =~ /TGG/i)        { return 'W' } # Tryptophan
        elsif ( $codon =~ /TA[TC]/i)     { return 'Y' } # Tyrosine
        elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop
        else {
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }

Using character classes and regular expressions, this code clearly shows the redundancy of the genetic code.
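The serine pattern from version 2 can be tried in isolation. This is a minimal sketch; the list of test codons and the variable names are chosen here for illustration, and the anchored variant at the end is an addition that does not appear in the lecture's subroutine.

```perl
use strict;
use warnings;

# Test a few codons against the serine pattern used in version 2.
my @serine;
foreach my $codon (qw(TCA tcg AGT AGC AGA TTT)) {
    if ($codon =~ /TC.|AG[TC]/i) {
        push @serine, $codon;
    }
}
print "Serine codons: @serine\n";

# The unanchored pattern matches anywhere in the string, so it also accepts
# a longer string that merely contains a serine codon. An anchored variant
# (an assumption added here, not part of the lecture code) rejects it.
my $loose  = ('GGTCAGG' =~ /TC.|AG[TC]/i)       ? 1 : 0;
my $strict = ('GGTCAGG' =~ /^(?:TC.|AG[TC])$/i) ? 1 : 0;
print "loose=$loose strict=$strict\n";
```

The subroutine in the lecture relies on its argument being exactly three bases long, so the unanchored patterns are sufficient there.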
A character class such as [TC] matches a single character, either T or C. The period "." is the regular expression that matches any character except a newline. The /GT./i expression for valine matches GTA, GTC, GTG, and GTT, all of which are codons for valine. (Certainly, the period matches any other character as well, but $codon is assumed to contain only the characters A, C, G, or T.) The i after the regular expression means match uppercase or lowercase; for instance, /T/i matches T or t.

The new feature in these regular expressions is the use of the vertical bar or pipe ( | ) to separate two choices. Thus for serine, /TC.|AG[TC]/ matches either /TC./ or /AG[TC]/.

Using Hashes for the Genetic Code

Now let us use a hash for this translation; you will see it is a natural way to proceed. For each codon key, the amino acid value is returned. Here is the code:
    #
    # codon2aa
    #
    # A subroutine to translate a DNA 3-character codon to an amino acid
    # Version 3, using hash lookup

    sub codon2aa {
        my($codon) = @_;

        $codon = uc $codon;

        my(%genetic_code) = (
            'TCA' => 'S', # Serine
            'TCC' => 'S', # Serine
            'TCG' => 'S', # Serine
            'TCT' => 'S', # Serine
            'TTC' => 'F', # Phenylalanine
            'TTT' => 'F', # Phenylalanine
            'TTA' => 'L', # Leucine
            'TTG' => 'L', # Leucine
            'TAC' => 'Y', # Tyrosine
            'TAT' => 'Y', # Tyrosine
            'TAA' => '_', # Stop
            'TAG' => '_', # Stop
            'TGC' => 'C', # Cysteine
            'TGT' => 'C', # Cysteine
            'TGA' => '_', # Stop
            'TGG' => 'W', # Tryptophan
            'CTA' => 'L', # Leucine
            'CTC' => 'L', # Leucine
            'CTG' => 'L', # Leucine
            'CTT' => 'L', # Leucine
            'CCA' => 'P', # Proline
            'CCC' => 'P', # Proline
            'CCG' => 'P', # Proline
            'CCT' => 'P', # Proline
            'CAC' => 'H', # Histidine
            'CAT' => 'H', # Histidine
            'CAA' => 'Q', # Glutamine
            'CAG' => 'Q', # Glutamine
            'CGA' => 'R', # Arginine
            'CGC' => 'R', # Arginine
            'CGG' => 'R', # Arginine
            'CGT' => 'R', # Arginine
            'ATA' => 'I', # Isoleucine
            'ATC' => 'I', # Isoleucine
            'ATT' => 'I', # Isoleucine
            'ATG' => 'M', # Methionine
            'ACA' => 'T', # Threonine
            'ACC' => 'T', # Threonine
            'ACG' => 'T', # Threonine
            'ACT' => 'T', # Threonine
            'AAC' => 'N', # Asparagine
            'AAT' => 'N', # Asparagine
            'AAA' => 'K', # Lysine
            'AAG' => 'K', # Lysine
            'AGC' => 'S', # Serine
            'AGT' => 'S', # Serine
            'AGA' => 'R', # Arginine
            'AGG' => 'R', # Arginine
            'GTA' => 'V', # Valine
            'GTC' => 'V', # Valine
            'GTG' => 'V', # Valine
            'GTT' => 'V', # Valine
            'GCA' => 'A', # Alanine
            'GCC' => 'A', # Alanine
            'GCG' => 'A', # Alanine
            'GCT' => 'A', # Alanine
            'GAC' => 'D', # Aspartic Acid
            'GAT' => 'D', # Aspartic Acid
            'GAA' => 'E', # Glutamic Acid
            'GAG' => 'E', # Glutamic Acid
            'GGA' => 'G', # Glycine
            'GGC' => 'G', # Glycine
            'GGG' => 'G', # Glycine
            'GGT' => 'G', # Glycine
        );

        if(exists $genetic_code{$codon}) {
            return $genetic_code{$codon};
        }else{
            print STDERR "Bad codon \"$codon\"!!\n";
            exit;
        }
    }

This subroutine is simple: it initializes a hash and then performs a single lookup of its single argument in the hash. The hash has 64 keys, one for each codon.
Notice the exists function, which returns true if the key $codon exists in the hash. The else branch taken when this test fails plays the same role as the final else clause in the two previous versions of the codon2aa subroutine: it reports a bad codon.

A key might exist in a hash while its value is undefined. The defined function checks for defined values. Also, the value might be 0 or the empty string, in which case it fails a test such as if ($hash{$key}) because, even though the key exists and the value is defined, the value evaluates to false in a conditional test.

Also notice that to make this subroutine work on lowercase DNA as well as uppercase, you need to translate the incoming argument into uppercase to match the data in the %genetic_code hash. You cannot give a regular expression to a hash as a key; a key must be a simple scalar value, such as a string or a number, so the case translation must be done first. Alternatively, you could make the hash twice as big by also including the lowercase codons as keys. Similarly, character classes do not work in hash keys, so you have to specify each one of the 64 codons individually.

Now that we have a satisfactory way to translate codons to amino acids, we will start to use it in the next section and in the examples.
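The difference between a key that exists, a value that is defined, and a value that is true can be seen in a small sketch. The hash %h and its keys are invented here purely for illustration.

```perl
use strict;
use warnings;

# A hash with one ordinary value, two false-but-defined values,
# and one key whose value is undefined. The key 'missing' is never stored.
my %h = (
    present => 'ATG',
    zero    => 0,
    empty   => '',
    undef_v => undef,
);

my @rows;
foreach my $key (qw(present zero empty undef_v missing)) {
    my $exists  = exists($h{$key})  ? 1 : 0;   # is the key in the hash?
    my $defined = defined($h{$key}) ? 1 : 0;   # is its value defined?
    my $true    = $h{$key}          ? 1 : 0;   # is its value true?
    push @rows, "$key: exists=$exists defined=$defined true=$true";
}
print "$_\n" for @rows;
```

Only exists distinguishes a stored key from one that was never stored, which is why it is the appropriate test for detecting a bad codon.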
|
|
The following example is intended
to show how the new codon2aa subroutine translates a whole DNA sequence into
protein.
Example 1. Translate DNA into protein

    #!/usr/bin/perl
    # Translate DNA into protein

    use strict;
    use warnings;
    use BeginPerlBioinfo;

    # Initialize variables
    my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
    my $protein = '';
    my $codon;

    # Translate each three-base codon into an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $codon = substr($dna,$i,3);
        $protein .= codon2aa($codon);
    }

    print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";

    exit;

To make this work, you will need the BeginPerlBioinfo.pm module, which holds your subroutines, in a separate file that the program can find. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in the example and remove the reference to the BeginPerlBioinfo.pm module.
Here is the output from Example 1:

    I translated the DNA

    CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

     into the protein

    RRLRTGLARVGR

You have seen all the elements in Example 1 before, except for the way it loops through the DNA with this statement:

    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter:

    my $i=0

This statically scopes the $i variable so it is visible only inside this block, and any other $i elsewhere in the code is invisible inside the block. (There is no other $i in this program, but in larger programs there can be.)

The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:

    $i += 3

Since you are marching through the DNA three bases at a time, you increment by three.

The second, middle part of the for loop tests whether the loop should continue:

    $i < (length($dna) - 2)

The point is that if there are zero, one, or two bases left, you should quit, because there are not enough to make a codon. The positions in a string of DNA of a certain length are numbered from 0 to length-1. So, if the position counter $i has reached length-2, there are only two more bases (at positions length-2 and length-1), and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon.

The line of code:

    $codon = substr($dna, $i, 3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna starting at position $i with length 3, and saves it in the variable $codon.
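The effect of the loop test on sequences whose length is not a multiple of three can be checked with a short sketch. The helper subroutine codons_of is hypothetical (it is not part of BeginPerlBioinfo.pm) and simply collects the codons that the loop from Example 1 would visit.

```perl
use strict;
use warnings;

# Split a DNA string into complete codons using the same loop as Example 1.
# Leftover bases at the end (one or two) are ignored.
sub codons_of {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

my @full    = codons_of('CGACGTCTT');     # 9 bases: three complete codons
my @partial = codons_of('CGACGTCTTCG');   # 11 bases: 'CG' is left over
my @none    = codons_of('AT');            # too short for even one codon

print "@full\n";
print "@partial\n";
print scalar(@none), " codons in 'AT'\n";
```

Both the 9-base and the 11-base strings yield the same three codons, because the test stops the loop as soon as fewer than three bases remain.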
|
FASTA format is basically just lines of
sequence data with newlines at the end so it can be printed on a page or displayed
on a computer screen.
GenBank is a collection of all publicly released genetic data. It includes lots of information in addition to the DNA sequence.
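As a rough sketch of handling FASTA data, the following reads one record from a string through an in-memory filehandle. The record itself is made up, and the sketch assumes the common convention that a FASTA record begins with a description line starting with ">" followed by lines of sequence; a real program would open a file instead.

```perl
use strict;
use warnings;

# A made-up FASTA record held in a string for illustration.
my $fasta = ">sample1 hypothetical sequence\nCGACGTCTTCGT\nACGGGACTAGCT\n";

# Open the string as if it were a file.
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";

my ($header, $sequence) = ('', '');
while (my $line = <$fh>) {
    chomp $line;
    if ($line =~ /^>/) {
        $header = $line;       # description line
    } else {
        $sequence .= $line;    # join the sequence lines, dropping newlines
    }
}
close $fh;

print "$header\n$sequence\n";
```

Joining the lines gives a single string of sequence data, which is the form the codon loop in Example 1 expects.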
|
|
|
1. What are the three main data types in Perl?
2. What is the central dogma of molecular biology?
3. Which data structure is a natural way to represent the genetic code?

Answer:
1. Scalar, array, and hash.
2. Transcription first uses DNA to make RNA, and then translation uses RNA to make proteins. This is called the central dogma of molecular biology.
3. A hash is a convenient data structure to represent the genetic code.
|
|
1. A hash has two components: key and value.
   A) True
   B) False
2. The following initialization of a hash is correct:
       %classification = ( 'dog', 'mammal', 'robin', 'bird', 'asp', 'reptile', );
   A) True
   B) False
3. There is only one codon translated to each amino acid.
   A) True
   B) False
4. In the lecture notes, which version of the subroutine codon2aa is fastest?
   A) Version 1
   B) Version 2
   C) Version 3

Answer: 1. A 2. A 3. B 4. C
|
|
|
Chapter 8. |