CIS524 Bioinformatics I

Lecture Six
The Genetic Code
(I)


About this lecture

    In the previous lectures, we have used Perl to search for motifs, simulate DNA mutations, generate random sequences, and transcribe DNA to RNA. In this lecture, we will write Perl programs to simulate how the genetic code directs the translation of DNA into protein. We will start with the hash data type. In the next lecture, we will write a program to translate DNA to protein. In addition, we will also continue exploring regular expressions and writing code to handle FASTA files. Finally, we will examine all six open reading frames (ORFs) of a DNA sequence during the translation of DNA into protein.


James Tisdall
Beginning Perl for Bioinformatics
O'Reilly
ISBN: 0596000804

Learning Objectives          

  •  Learn basics of hashes.
  •  Learn basics of the genetic code.
  •  Use hashes for the genetic code.
  •  Read a DNA FASTA file.
  •  Translate DNA into proteins.
  •  Translate reading frames.

Note: The material in this lecture, and the figures linked to from it, are based in part on material supplied in conjunction with the required and supplementary texts for the course. The copyright for such material is held by O'Reilly, 2001: Beginning Perl for Bioinformatics by James Tisdall, ISBN 0596000804.

Hashes

Perl has three main data types. You have already seen the first two: scalar variables and arrays. Now we will start to use the third: hashes. A hash in Perl is a collection of zero or more pairs of scalar values, called keys and values. It provides very fast lookup of the value associated with a key. Hash variable names begin with a percent sign (%). For example, you may have a hash called %genes:

%genes = ( 'gene1', 'ATTCGT', 'gene2', 'CTGCCATGA');

The values are indexed by the keys, so that
- Given a key, the hash returns the corresponding value
   $seq = $genes{'gene2'};
- $seq now contains 'CTGCCATGA'
- Note that $genes{'gene2'} is a scalar, so it starts with $

Hashes can also be assigned values using key=>value notation. The following two assignments are equivalent:
%genes = ( 'gene1', 'ATTCGT', 'gene2', 'CTGCCATGA');
%genes = ( gene1=>'ATTCGT', gene2=>'CTGCCATGA');

The keys function returns a list of all keys in a hash:

@key_list = keys(%genes);

foreach $key (@key_list) {
             print "The value of $key is $genes{$key}\n";
}

It renders the output as follows:

The value of gene1 is ATTCGT
The value of gene2 is CTGCCATGA

As you may notice from this example, hashes (like arrays) change their leading character to a dollar sign when you access a single element, because the value returned from a hash lookup is a scalar value. You can tell a hash lookup from an array element by the type of braces they use: arrays use square brackets [ ]; hashes use curly braces { }.

Initializing a hash with key-value pairs works much like initializing an array, except that each successive pair of elements in the list becomes a key and its value:

%genes = ( 'gene1', 'ATTCGT', 'gene2', 'CTGCCATGA');

which initializes the key 'gene1' with the value 'ATTCGT', and so on. The following does exactly the same thing as the preceding code, while showing the key-value relationship more clearly:

%genes = ( gene1=>'ATTCGT', gene2=>'CTGCCATGA');

Because hashes are built into Perl as a basic data type, they are easy to use, and you will not have to do much programming to accomplish your goal. Note, however, that hashes do not store their elements in sorted order, so if you need to look at the data in sorted order, you have to sort it explicitly:

@sorted_keys = sort keys %my_hash;
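
As a minimal sketch of sorted traversal, the following reuses the %genes hash from earlier in this lecture and prints its entries in alphabetical key order. The third entry, gene3, is invented purely for illustration.

```perl
use strict;
use warnings;

# %genes as used earlier in this lecture; gene3 is a made-up extra entry
my %genes = (
    gene2 => 'CTGCCATGA',
    gene3 => 'GGGTTT',
    gene1 => 'ATTCGT',
);

# keys returns the keys in no particular order, so sort them explicitly
my @sorted_keys = sort keys %genes;

foreach my $key (@sorted_keys) {
    print "The value of $key is $genes{$key}\n";
}
```

Without the call to sort, the three lines could appear in any order, and the order may differ from one run to the next.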

Later in this lecture, we will develop programs that use hashes to retrieve information about a gene. The gene name is the key; the information about the gene is the value of that key.


Gene Expression Data Using Hashes

You can use hashes to find a gene in your data. To do so, you can load the hash so that the keys are the gene names and the values are the expression measurement. Thus, a single call on the hash, with the name of the desired gene as a key, returns the results of the experiment for that gene, and you will have gotten your answer.
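
The lookup described above might be sketched as follows. The gene names, the measurements, and the subroutine name report_expression are all hypothetical, chosen only to illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical expression measurements, keyed by gene name
my %expression = (
    geneA => 2.5,
    geneB => 0,
    geneC => 13.1,
);

# Return a printable report line for one gene
# (report_expression is an illustrative name)
sub report_expression {
    my ($gene) = @_;
    if (exists $expression{$gene}) {
        return "$gene: $expression{$gene}";
    }
    return "$gene: not found in the data";
}

print report_expression('geneC'), "\n";
print report_expression('geneX'), "\n";
```

The sketch tests with exists rather than testing the value itself, so that a gene whose measurement is 0 (geneB here) is still reported as present.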
 

The Genetic Code

The genetic code is about how a cell translates the information contained in its DNA into amino acids and then proteins, which do the real work in the cell.


Background

DNA encodes the primary structure (i.e., the amino acid sequence) of proteins. DNA has four nucleotides, and proteins have 20 amino acids. The encoding works by taking each group of three nucleotides from the DNA and "translating" them to an amino acid or a stop signal. Each group of three nucleotides is called a codon.

Actually, transcription first uses DNA to make RNA, and then translation uses RNA to make proteins. This is called the central dogma of molecular biology. But in this course, we will abbreviate the process and somewhat inaccurately call the entire process from DNA to protein "translation." The reason for this cavalier distinction is that the whole business is much easier to simulate on a computer using strings to represent the DNA, RNA, and proteins.

Note that with four kinds of bases, each group of three bases of DNA can represent as many as 4 x 4 x 4 = 64 possible amino acids. Since there are only 20 amino acids plus a stop signal, the genetic code has evolved some redundancy, so that some amino acids are represented by more than one codon.
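
The 4 x 4 x 4 count can be checked with a short sketch that builds every possible codon from the four bases. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;

my @bases = ('A', 'C', 'G', 'T');
my @codons;

# Three nested loops, one per codon position: 4 x 4 x 4 combinations
foreach my $first (@bases) {
    foreach my $second (@bases) {
        foreach my $third (@bases) {
            push @codons, $first . $second . $third;
        }
    }
}

print "Number of possible codons: ", scalar(@codons), "\n";
print "First codon: $codons[0], last codon: $codons[-1]\n";
```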

The chart in Figure 6.1 shows how the various bases combine into codons that encode amino acids. There are many interesting things to note about the genetic code. For our purposes, the most important is redundancy: the way more than one codon translates to the same amino acid. We will program this using character classes and regular expressions.


Figure 6.1  The Genetic Code



Translating Codons to Amino Acids

We will look at three different versions of translating DNA using the genetic code:
  1. Look up the codon using if-then-else.
  2. Same as above, but use patterns to reflect redundancy of genetic code.
  3. Use a hash to look up each codon.
The first task is to enable the following programs to do the translation from the three-nucleotide codons to the amino acids. This is the most important step in implementing the genetic code, which is the encoding of amino acids by three-nucleotide codons. Here is a subroutine that returns an amino acid (represented by a one-letter abbreviation) given a three-letter DNA codon:

# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid

sub codon2aa {
my($codon) = @_;

if ( $codon =~ /TCA/i ) { return 'S' } # Serine
elsif ( $codon =~ /TCC/i ) { return 'S' } # Serine
elsif ( $codon =~ /TCG/i ) { return 'S' } # Serine
elsif ( $codon =~ /TCT/i ) { return 'S' } # Serine
elsif ( $codon =~ /TTC/i ) { return 'F' } # Phenylalanine
elsif ( $codon =~ /TTT/i ) { return 'F' } # Phenylalanine
elsif ( $codon =~ /TTA/i ) { return 'L' } # Leucine
elsif ( $codon =~ /TTG/i ) { return 'L' } # Leucine
elsif ( $codon =~ /TAC/i ) { return 'Y' } # Tyrosine
elsif ( $codon =~ /TAT/i ) { return 'Y' } # Tyrosine
elsif ( $codon =~ /TAA/i ) { return '_' } # Stop
elsif ( $codon =~ /TAG/i ) { return '_' } # Stop
elsif ( $codon =~ /TGC/i ) { return 'C' } # Cysteine
elsif ( $codon =~ /TGT/i ) { return 'C' } # Cysteine
elsif ( $codon =~ /TGA/i ) { return '_' } # Stop
elsif ( $codon =~ /TGG/i ) { return 'W' } # Tryptophan
elsif ( $codon =~ /CTA/i ) { return 'L' } # Leucine
elsif ( $codon =~ /CTC/i ) { return 'L' } # Leucine
elsif ( $codon =~ /CTG/i ) { return 'L' } # Leucine
elsif ( $codon =~ /CTT/i ) { return 'L' } # Leucine
elsif ( $codon =~ /CCA/i ) { return 'P' } # Proline
elsif ( $codon =~ /CCC/i ) { return 'P' } # Proline
elsif ( $codon =~ /CCG/i ) { return 'P' } # Proline
elsif ( $codon =~ /CCT/i ) { return 'P' } # Proline
elsif ( $codon =~ /CAC/i ) { return 'H' } # Histidine
elsif ( $codon =~ /CAT/i ) { return 'H' } # Histidine
elsif ( $codon =~ /CAA/i ) { return 'Q' } # Glutamine
elsif ( $codon =~ /CAG/i ) { return 'Q' } # Glutamine
elsif ( $codon =~ /CGA/i ) { return 'R' } # Arginine
elsif ( $codon =~ /CGC/i ) { return 'R' } # Arginine
elsif ( $codon =~ /CGG/i ) { return 'R' } # Arginine
elsif ( $codon =~ /CGT/i ) { return 'R' } # Arginine
elsif ( $codon =~ /ATA/i ) { return 'I' } # Isoleucine
elsif ( $codon =~ /ATC/i ) { return 'I' } # Isoleucine
elsif ( $codon =~ /ATT/i ) { return 'I' } # Isoleucine
elsif ( $codon =~ /ATG/i ) { return 'M' } # Methionine
elsif ( $codon =~ /ACA/i ) { return 'T' } # Threonine
elsif ( $codon =~ /ACC/i ) { return 'T' } # Threonine
elsif ( $codon =~ /ACG/i ) { return 'T' } # Threonine
elsif ( $codon =~ /ACT/i ) { return 'T' } # Threonine
elsif ( $codon =~ /AAC/i ) { return 'N' } # Asparagine
elsif ( $codon =~ /AAT/i ) { return 'N' } # Asparagine
elsif ( $codon =~ /AAA/i ) { return 'K' } # Lysine
elsif ( $codon =~ /AAG/i ) { return 'K' } # Lysine
elsif ( $codon =~ /AGC/i ) { return 'S' } # Serine
elsif ( $codon =~ /AGT/i ) { return 'S' } # Serine
elsif ( $codon =~ /AGA/i ) { return 'R' } # Arginine
elsif ( $codon =~ /AGG/i ) { return 'R' } # Arginine
elsif ( $codon =~ /GTA/i ) { return 'V' } # Valine
elsif ( $codon =~ /GTC/i ) { return 'V' } # Valine
elsif ( $codon =~ /GTG/i ) { return 'V' } # Valine
elsif ( $codon =~ /GTT/i ) { return 'V' } # Valine
elsif ( $codon =~ /GCA/i ) { return 'A' } # Alanine
elsif ( $codon =~ /GCC/i ) { return 'A' } # Alanine
elsif ( $codon =~ /GCG/i ) { return 'A' } # Alanine
elsif ( $codon =~ /GCT/i ) { return 'A' } # Alanine
elsif ( $codon =~ /GAC/i ) { return 'D' } # Aspartic Acid
elsif ( $codon =~ /GAT/i ) { return 'D' } # Aspartic Acid
elsif ( $codon =~ /GAA/i ) { return 'E' } # Glutamic Acid
elsif ( $codon =~ /GAG/i ) { return 'E' } # Glutamic Acid
elsif ( $codon =~ /GGA/i ) { return 'G' } # Glycine
elsif ( $codon =~ /GGC/i ) { return 'G' } # Glycine
elsif ( $codon =~ /GGG/i ) { return 'G' } # Glycine
elsif ( $codon =~ /GGT/i ) { return 'G' } # Glycine
else {
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
}

This piece of code is clear and simple, and the layout makes it obvious what is happening.

Recall filehandles from previous lectures and how they access data in files. The special filehandle STDIN reads user input from the keyboard. STDOUT and STDERR are also special filehandles that are always available to Perl programs. STDOUT directs output to the screen (usually) or another standard display device. The print statement accepts a filehandle as an optional argument; when the filehandle is missing from a print statement, STDOUT is assumed. (Cf. Appendix B of the textbook).

Here, error messages are directed to STDERR, which usually prints to the screen, but on many computer systems they can be re-directed to a special error file or other location. (Cf. Appendix B of the textbook).


The Redundancy of the Genetic Code

Note the redundancy of the genetic code, which the next subroutine displays clearly. Notice that groups of redundant codons almost always have the same first and second bases and vary in the third.

# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
# Version 2

sub codon2aa {
my($codon) = @_;

if ( $codon =~ /GC./i) { return 'A' } # Alanine
elsif ( $codon =~ /TG[TC]/i) { return 'C' } # Cysteine
elsif ( $codon =~ /GA[TC]/i) { return 'D' } # Aspartic Acid
elsif ( $codon =~ /GA[AG]/i) { return 'E' } # Glutamic Acid
elsif ( $codon =~ /TT[TC]/i) { return 'F' } # Phenylalanine
elsif ( $codon =~ /GG./i) { return 'G' } # Glycine
elsif ( $codon =~ /CA[TC]/i) { return 'H' } # Histidine
elsif ( $codon =~ /AT[TCA]/i) { return 'I' } # Isoleucine
elsif ( $codon =~ /AA[AG]/i) { return 'K' } # Lysine
elsif ( $codon =~ /TT[AG]|CT./i) { return 'L' } # Leucine
elsif ( $codon =~ /ATG/i) { return 'M' } # Methionine
elsif ( $codon =~ /AA[TC]/i) { return 'N' } # Asparagine
elsif ( $codon =~ /CC./i) { return 'P' } # Proline
elsif ( $codon =~ /CA[AG]/i) { return 'Q' } # Glutamine
elsif ( $codon =~ /CG.|AG[AG]/i) { return 'R' } # Arginine
elsif ( $codon =~ /TC.|AG[TC]/i) { return 'S' } # Serine
elsif ( $codon =~ /AC./i) { return 'T' } # Threonine
elsif ( $codon =~ /GT./i) { return 'V' } # Valine
elsif ( $codon =~ /TGG/i) { return 'W' } # Tryptophan
elsif ( $codon =~ /TA[TC]/i) { return 'Y' } # Tyrosine
elsif ( $codon =~ /TA[AG]|TGA/i) { return '_' } # Stop
else {
print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
}

Using character classes and regular expressions, this code clearly shows the redundancy of the genetic code.

A character class such as [TC] matches a single character, either T or C. The period "." is the regular expression that matches any character except a newline. The /GT./i expression for valine matches GTA, GTC, GTG, and GTT, all of which are codons for valine. (Of course, the period also matches any other character, but $codon is assumed to contain only the characters A, C, G, or T.) The i after the regular expression means match uppercase or lowercase; for instance, /T/i matches T or t. The new feature in these regular expressions is the use of the vertical bar or pipe ( | ) to separate two choices. Thus for serine, /TC.|AG[TC]/ matches either /TC./ or /AG[TC]/.
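
To see the serine pattern at work, here is a small sketch that tests a handful of codons against it. The helper name is_serine and the list of test codons are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Return 1 if the codon matches the serine pattern from Version 2, else 0
# (is_serine is an illustrative name)
sub is_serine {
    my ($codon) = @_;
    return ($codon =~ /TC.|AG[TC]/i) ? 1 : 0;
}

foreach my $codon ('TCA', 'tcg', 'AGT', 'AGC', 'AGA', 'ATG') {
    printf "%s %s\n", $codon, is_serine($codon) ? 'matches' : 'does not match';
}
```

The first four codons match, including the lowercase tcg because of the i modifier. AGA does not match because its third base is outside the character class [TC], and ATG matches neither alternative.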


Using Hashes for the Genetic Code

Now let us use a hash for this translation; you will see that it is a natural way to proceed. For each codon key, the amino acid value is returned. Here is the code:

#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
# Version 3, using hash lookup

sub codon2aa {
my($codon) = @_;

$codon = uc $codon;

my(%genetic_code) = (

'TCA' => 'S', # Serine
'TCC' => 'S', # Serine
'TCG' => 'S', # Serine
'TCT' => 'S', # Serine
'TTC' => 'F', # Phenylalanine
'TTT' => 'F', # Phenylalanine
'TTA' => 'L', # Leucine
'TTG' => 'L', # Leucine
'TAC' => 'Y', # Tyrosine
'TAT' => 'Y', # Tyrosine
'TAA' => '_', # Stop
'TAG' => '_', # Stop
'TGC' => 'C', # Cysteine
'TGT' => 'C', # Cysteine
'TGA' => '_', # Stop
'TGG' => 'W', # Tryptophan
'CTA' => 'L', # Leucine
'CTC' => 'L', # Leucine
'CTG' => 'L', # Leucine
'CTT' => 'L', # Leucine
'CCA' => 'P', # Proline
'CCC' => 'P', # Proline
'CCG' => 'P', # Proline
'CCT' => 'P', # Proline
'CAC' => 'H', # Histidine
'CAT' => 'H', # Histidine
'CAA' => 'Q', # Glutamine
'CAG' => 'Q', # Glutamine
'CGA' => 'R', # Arginine
'CGC' => 'R', # Arginine
'CGG' => 'R', # Arginine
'CGT' => 'R', # Arginine
'ATA' => 'I', # Isoleucine
'ATC' => 'I', # Isoleucine
'ATT' => 'I', # Isoleucine
'ATG' => 'M', # Methionine
'ACA' => 'T', # Threonine
'ACC' => 'T', # Threonine
'ACG' => 'T', # Threonine
'ACT' => 'T', # Threonine
'AAC' => 'N', # Asparagine
'AAT' => 'N', # Asparagine
'AAA' => 'K', # Lysine
'AAG' => 'K', # Lysine
'AGC' => 'S', # Serine
'AGT' => 'S', # Serine
'AGA' => 'R', # Arginine
'AGG' => 'R', # Arginine
'GTA' => 'V', # Valine
'GTC' => 'V', # Valine
'GTG' => 'V', # Valine
'GTT' => 'V', # Valine
'GCA' => 'A', # Alanine
'GCC' => 'A', # Alanine
'GCG' => 'A', # Alanine
'GCT' => 'A', # Alanine
'GAC' => 'D', # Aspartic Acid
'GAT' => 'D', # Aspartic Acid
'GAA' => 'E', # Glutamic Acid
'GAG' => 'E', # Glutamic Acid
'GGA' => 'G', # Glycine
'GGC' => 'G', # Glycine
'GGG' => 'G', # Glycine
'GGT' => 'G', # Glycine
);

if(exists $genetic_code{$codon}) {
return $genetic_code{$codon};
}else{

print STDERR "Bad codon \"$codon\"!!\n";
exit;
}
}

This subroutine is simple: it initializes a hash and then performs a single lookup of its single argument in the hash. The hash has 64 keys, one for each codon.

Notice the exists function, which returns true if the key $codon exists in the hash. The else branch of this test plays the same role as the final else clause in the two previous versions of the codon2aa subroutine: it handles any codon that is not in the genetic code.

A key might exist in a hash, but its value can be undefined. The defined function checks for defined values. Also, the value might be 0 or the empty string, in which case, it fails a test such as if ($hash{$key}) because, even though the key exists and the value is defined, the value evaluates to false in a conditional test.
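
The difference between these three tests can be sketched as follows. The hash %count, its keys, and the subroutine name describe are invented for this illustration.

```perl
use strict;
use warnings;

# Illustrative hash: the keys and values are invented for this sketch
my %count = (
    present => 5,
    zero    => 0,
    unset   => undef,
);

# Describe one key using the three different tests
sub describe {
    my ($key) = @_;
    my $exists  = exists  $count{$key} ? 'yes' : 'no';
    my $defined = defined $count{$key} ? 'yes' : 'no';
    my $true    = $count{$key}         ? 'yes' : 'no';
    return "$key: exists=$exists defined=$defined true=$true";
}

print describe($_), "\n" for ('present', 'zero', 'unset', 'missing');
```

Only the key present passes all three tests. The key zero exists and is defined but evaluates to false, unset exists but is undefined, and missing fails all three.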

Also notice that to make this subroutine work on lowercase DNA as well as uppercase, you need to convert the incoming argument to uppercase to match the keys in the %genetic_code hash. You cannot give a regular expression to a hash as a key; a key must be a simple scalar value, such as a string or a number, so the case conversion must be done first. Alternatively, you can make the hash twice as big by adding lowercase keys as well. Similarly, character classes do not work in the keys for hashes, so you have to specify each one of the 64 codons individually. Now that we have a satisfactory way to translate codons to amino acids, we will start to use it in the next section and in the examples.


Translating DNA into Proteins

The following example is intended to show how the new codon2aa subroutine translates a whole DNA sequence into protein.

Example 1. Translate DNA into protein

#!/usr/bin/perl
# Translate DNA into protein

use strict;
use warnings;
use BeginPerlBioinfo;

# Initialize variables
my $dna = 'CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC';
my $protein = '';
my $codon;

# Translate each three-base codon into an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
$codon = substr($dna,$i,3);
$protein .= codon2aa($codon);
}
print "I translated the DNA\n\n$dna\n\n into the protein\n\n$protein\n\n";

exit;

To make this work, you will need the BeginPerlBioinfo.pm module for your subroutines in a separate file that the program can find. You also have to add the codon2aa subroutine to it. Alternatively, you can add the code for the subroutine codon2aa directly to the program in the example and remove the reference to the BeginPerlBioinfo.pm module.

Here is the output from Example 1:

I translated the DNA

CGACGTCTTCGTACGGGACTAGCTCGTGTCGGTCGC

into the protein

RRLRTGLARVGR

You have seen all the elements in Example 1 before, except for the way it loops through the DNA with this statement:
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {

Recall that a for loop has three parts, delimited by the two semicolons. The first part initializes a counter: my $i=0 statically scopes the $i variable so that it is visible only inside this block, and any other $i elsewhere in the code is invisible inside the block. (There is no other $i in this program, but it can happen.) The third part of the for loop increments the counter after all the statements in the block are executed and before returning to the beginning of the loop:
$i += 3
Since you are trying to march through the DNA three bases at a shot, you increment by three.

The second, middle part of the for loop tests whether the loop should continue:
$i < (length($dna) - 2)

The point is that if there are zero, one, or two bases left, you should quit, because there are not enough to make a codon. Now, the positions in a string of DNA of a certain length are numbered from 0 to length-1. So, if the position counter $i has reached length-2, there are only two bases left (at positions length-2 and length-1), and you should quit. Only if the position counter $i is less than length-2 will you still have at least three bases left, enough for a codon. So, the test succeeds only if:
$i < (length($dna) -2)

The line of code:
$codon = substr($dna, $i, 3);

actually extracts the 3-base codon from the DNA. The call to the substr function specifies a substring of $dna at position $i of length 3, and saves it in the variable $codon.
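
As a rough check of the loop test, the following sketch applies the same loop to a sequence whose length is not a multiple of three. The subroutine name split_codons and the 8-base test sequence are assumptions made for this illustration; they do not come from the textbook.

```perl
use strict;
use warnings;

# Split a DNA string into complete codons, ignoring leftover bases
# (split_codons is an illustrative name, not from the textbook)
sub split_codons {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

# 8 bases: two complete codons plus two leftover bases (GG)
my @codons = split_codons('CGACGTGG');
print "@codons\n";
```

With 8 bases, the loop runs for $i = 0 and $i = 3. At $i = 6 the test 6 < (8 - 2) fails, so the two leftover bases are never passed to substr.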

 

Summary

  • A hash is a collection of zero or more pairs of scalar values, called keys and values.
  • The genetic code is about how a cell translates the information contained in its DNA into amino acids and then proteins.
  • A hash is a natural data structure for defining the genetic code.
  • FASTA and GenBank are by far the most widely used formats:
FASTA format is basically just lines of sequence data with newlines at the end so it can be printed on a page or displayed on a computer screen.

GenBank is a collection of all publicly released genetic data. It includes lots of information in addition to the DNA sequence.
  • Long stretches of DNA that don't contain any stop codons are called open reading frames (ORFs). Because translation can start at any of three offsets on either of the two strands, it is quite common to examine all six reading frames of a DNA sequence.

 

Review Questions

1. What are the three main data types in Perl?

2. What is the central dogma of molecular biology?

3. Which data structure is a natural way to represent the genetic code?


Answer:
1. Scalar, array, and hash.
2. Transcription first uses DNA to make RNA, and then translation uses RNA to make proteins. This is called the central dogma of molecular biology.
3. A hash is a convenient data structure to represent the genetic code.

 

Practice Test

1. A hash has two components: key and value.
A) True
B) False

2. The following initialization of a hash is correct:

%classification = (
'dog', 'mammal', 'robin', 'bird', 'asp', 'reptile',
);

A) True
B) False


3. Each amino acid is encoded by only one codon.
A) True
B) False


4. In the lecture notes, which version of the subroutine codon2aa is fastest?
A) Version 1
B) Version 2
C) Version 3



Answer:
1. A   2. A   3. B   4. C

 

Required Readings

Chapter 8.

 

Assignment