Note: The material in this lecture, including the figures linked from it,
has been developed partially from material supplied in conjunction with
the required and supplementary texts for the course. The copyright for
such material is held by O'Reilly, 2001: Beginning Perl for Bioinformatics
by James Tisdall, ISBN 0596000804.
2.1 What Is a Module
A Perl module is a library file that uses package declarations to create
its own namespace. Perl modules provide an extra level of protection
from name collisions. They also serve as the basic mechanism for
defining object-oriented classes.
2.2 Subroutines and Software Engineering
Subroutines divide a large programming job into more manageable pieces. A
subroutine lets you write a piece of code that performs some part of a
desired computation (e.g., determining the length of DNA sequence).
This code is written once and then can be called frequently throughout
the main program. Using subroutines speeds the time it takes to write
the main program, makes it more reliable by avoiding duplicated
sections, and makes the entire program easier to test.
A useful subroutine can be used by other programs
as well, saving you development time in the future. As long as the
inputs and outputs to the subroutine remain the same, its internal
workings can be altered and improved without worrying about how the
changes will affect the rest of the program. This is known as
encapsulation. The benefits of subroutines that we have just outlined
also apply to other approaches in software engineering. Perl modules
are a technique within a larger umbrella of techniques known as
software encapsulation and reuse. Software encapsulation and reuse are
fundamental to object-oriented programming.
A related design principle is abstraction, which
involves writing code that is usable in many different situations.
Let's say you write a subroutine that adds the fragment TTTTT to the
end of a string of DNA. If you then want to add the fragment AAAAA to
the end of a string of DNA, you have to write another subroutine. To
avoid writing two subroutines, you can write one that's more abstract
and adds to the end of a string of DNA whatever fragment you give it as
an argument. Using the principle of abstraction, you've saved yourself
half the work.
Here is an example of a Perl subroutine that takes two strings of DNA as
inputs and returns the second one appended to the end of the first:
sub DNAappend {
    my ($dna, $tail) = @_;
    return($dna . $tail);
}
This subroutine can be used as follows:
my $dna = 'ACCGGAGTTGACTCTCCGAATA';
my $polyT = 'TTTTTTTT';
print DNAappend($dna, $polyT);
2.2.1 Modules and Libraries
Subroutine definitions can be gathered into separate files called
libraries, or modules, which let me collect subroutine definitions for
use in other programs. Then, instead of copying the subroutine
definitions into the new program, I can just insert the name of the
library or module into a program, and all the subroutines are available
in their original unaltered form. This is an example of software reuse
in action.
Perl libraries were traditionally put in files
ending with .pl, which stands for perl library; the term library is
also used to refer to a collection of Perl modules. The common
denominator is that a library is a collection of reusable subroutines.
To fully understand and use modules, you need to understand the simple
concepts of namespaces and packages. From here on, think of a Perl
module as any Perl library file that uses package declarations to
create its own namespace.
2.3 Namespaces
Large programs often accidentally use the same variable name for different
variables in different parts of the program. These identically named
variables may unintentionally interact with each other and cause
serious, hard-to-find errors. This situation is called namespace
collision. Separate namespaces are one way to avoid namespace collision.
A namespace is implemented as a table containing the
names of the variables and subroutines in a program. The table itself
is called a symbol table and is used by the running program to keep
track of variable values and subroutine definitions as the program
evolves. A namespace and a symbol table are essentially the same thing.
The package declaration described in the next
section is one way to assign separate namespaces to different parts of
your code. It gives strong protection against accidentally using a
variable name that's used in another part of the program and having the
two identically-named variables interact in unwanted ways.
2.4 Packages
Packages are a different way to protect a program's variables from
interacting unintentionally. In Perl, you can easily assign separate
namespaces to entire sections of your code, which helps prevent namespace
collisions and lets you create modules.
Packages are very easy to use. A one-line package declaration puts a new
namespace in effect. Here's a simple example:
$dna = 'AAAAAAAAAA';
package Mouse;
$dna = 'CCCCCCCCCC';
package Celegans;
$dna = 'GGGGGGGGGG';
In this snippet, there are three variables, each with the same name, $dna.
However, they are in three different packages, so they appear in three
different symbol tables and are managed separately by the running Perl
program.
The first line of the code is an assignment of a
poly-A DNA fragment to a variable $dna. Because no package is
explicitly named, this $dna variable appears in the default namespace
main.
The second line of code introduces a new namespace
for variable and subroutine definitions by declaring package Mouse;. At
this point, the main namespace is no longer active, and the Mouse
namespace is brought into play. Note that the name of the namespace is
capitalized; it's a well-established convention you should follow. The
only noncapitalized namespace you should use is the default main.
Now that the Mouse namespace is in effect, the third
line of code, which declares a variable, $dna, is actually declaring a
separate variable unrelated to the first. It contains a poly-C fragment
of DNA. Finally, the last two lines of code declare a new package
called Celegans and a new variable, also called $dna, that stores a
poly-G DNA fragment.
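The separate symbol tables can be inspected directly. The following is a minimal sketch, not part of the original lecture: it assumes the variables are declared with our (so that the code also runs under use strict) and lists the names stored in each package's symbol table, which Perl exposes as a hash named after the package followed by two colons.

```perl
use strict;
use warnings;

# Assumption for this sketch: package variables are declared with "our"
# so that the code compiles under "use strict".
our $dna = 'AAAAAAAAAA';      # stored in the default package, main

package Mouse;
our $dna = 'CCCCCCCCCC';      # a separate variable, $Mouse::dna

package Celegans;
our $dna = 'GGGGGGGGGG';      # a separate variable, $Celegans::dna

package main;

# %Mouse:: and %Celegans:: are the symbol tables of the two packages;
# their keys are the names defined in each package.
print "Names in Mouse:    ", join(', ', sort keys %Mouse::), "\n";
print "Names in Celegans: ", join(', ', sort keys %Celegans::), "\n";
```

Each table holds its own entry for dna, which is why the three variables never interfere with one another.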
To use these three $dna variables, you need to
explicitly state which packages you want the variables from, as the
following code fragment demonstrates:
print "The DNA from the main package:\n\n";
print $main::dna, "\n\n";
print "The DNA from the Mouse package:\n\n";
print $Mouse::dna, "\n\n";
print "The DNA from the Celegans package:\n\n";
print $Celegans::dna, "\n\n";
This gives the following output:
The DNA from the main package:
AAAAAAAAAA
The DNA from the Mouse package:
CCCCCCCCCC
The DNA from the Celegans package:
GGGGGGGGGG
As you can see, a variable name can be tied to a particular package by
putting the package name and two colons before the variable name (but
after the $, @, or % that specifies the type of variable). If you don't
specify a package in this way, Perl assumes you want the current
package, which may not necessarily be the main package, as the
following example shows:
# Define the variables in the packages
$dna = 'AAAAAAAAAA';
package Mouse;
$dna = 'CCCCCCCCCC';
# Print the values of the variables
print "The DNA from the current package:\n\n";
print $dna, "\n\n";
print "The DNA from the Mouse package:\n\n";
print $Mouse::dna, "\n\n";
This produces the following output:
The DNA from the current package:
CCCCCCCCCC
The DNA from the Mouse package:
CCCCCCCCCC
Both print $dna and print $Mouse::dna reference the same variable. This is because
the last package declaration was package Mouse;, so the print $dna
statement prints the value of the variable $dna as defined in the
current package, which is Mouse. The rule is, once a package has been
declared, it becomes the current package until the next package
declaration or until the end of the file.
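A package declaration placed inside a block lasts only until the end of that block, which can be a convenient way to limit its reach. The sketch below is an illustration under that assumption rather than code from the lecture; it uses the built-in __PACKAGE__ token to report the current package and declares the variables with our so that it runs under use strict.

```perl
use strict;
use warnings;

our $dna = 'AAAAAAAAAA';           # $main::dna

{
    package Mouse;                 # in effect only until the closing brace
    our $dna = 'CCCCCCCCCC';       # $Mouse::dna
    print "Inside the block, the current package is ", __PACKAGE__, "\n";
}

# The block has ended, so the current package is main again.
print "After the block, the current package is ", __PACKAGE__, "\n";
print "main: $main::dna\n";
print "Mouse: $Mouse::dna\n";
```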
2.5 Defining Modules
To begin, take a file of subroutine
definitions and call it something like Celegans.pm. Now, edit the file
and give it a new first line: package Celegans; and a new last line 1;.
You've now created a Perl module. Adding “1” in the last line just
ensures that the library returns a true value when it's read in. It's
annoying, but necessary.
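Put together, a file Celegans.pm built this way might look like the sketch below. The subroutine species_name is a hypothetical placeholder invented for this illustration; any file of subroutine definitions would do.

```perl
package Celegans;        # new first line: gives the file its own namespace

use strict;
use warnings;

# Hypothetical example subroutine, invented for this sketch.
sub species_name {
    return 'Caenorhabditis elegans';
}

1;                       # new last line: a true value, so loading succeeds
```

A program that has loaded the module can then call the subroutine as Celegans::species_name().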
2.6 Storing Modules
Once you start using multiple files for your program code, which happens
if you're defining and using modules, Perl needs to be able to find these
various files; it provides a few different ways to do so.
The simplest method is to put all your program files, including your
modules, in the same directory and run your programs from that directory.
(On Perl 5.26 and later, the current directory is no longer in the
default search path, so this method also needs a use lib '.' line or an
equivalent.) Here's how the module file Celegans.pm is loaded from
another program:
use Celegans;
However, it's often not so simple. Perl uses
modules extensively; many are built-in when you install Perl, and many
more are available from CPAN. Some modules are used frequently, some
rarely; many modules call other modules, which in turn call still other
modules. To organize the many modules a Perl program might need, you
should place them in certain standard directories or in your own
development directories. Perl needs to know where these directories are
so that when a module is called in a program, it can search the
directories, find the file that contains the module, and load it in.
When Perl was installed on your computer, a
list of directories in which to find modules was configured. Every time
a Perl program on your computer refers to a module, Perl looks in those
directories. To see those directories, you only need to run a Perl
program and examine the built-in array @INC, like so:
print join("\n", @INC), "\n";
@INC is simply an array whose entries are
directories on your computer. The way it looks depends on how your
computer is configured and your operating system.
When you develop Perl software that uses modules, you should put all the
modules together in a certain directory. In order for Perl to find this
directory and load the modules, you need to add a line before the use
MODULE directives, telling Perl to additionally search your own module
directory for any modules requested in your program.
For instance, if I put a module I'm
developing for my program into a file named Celegans.pm, and put the
Celegans.pm file into my directory
/home/tisdall/MasteringPerlBio/development/lib, I need to add a use lib
directive to my program, like so:
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Celegans;
Perl then adds my development module
directory to the @INC array and searches there for the Celegans.pm
module file. The following code demonstrates this:
use lib "/home/tisdall/MasteringPerlBio/development/lib";
print join("\n", @INC), "\n";
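A hard-coded absolute path works only on one machine. As a possible alternative, the sketch below uses the standard FindBin module to locate the directory of the running script and adds a lib subdirectory next to it. The directory layout (modules kept in ./lib beside the script) is an assumption made for this example, not something the lecture prescribes.

```perl
use strict;
use warnings;

use FindBin qw($Bin);     # $Bin holds the directory containing this script
use lib "$Bin/lib";       # assumed layout: modules live in ./lib beside it

# Show the resulting search path; the added directory should be listed.
print "Module search path:\n";
print "  $_\n" for @INC;
```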
There's one other detail about modules that's important. You'll sometimes see
modules in Perl programs with names such as
Genomes::Modelorganisms::Celegans, in which the name is two or more
words separated by two colons. This is how Perl looks into
subdirectories of directories named in the @INC built-in array. In the
example, Perl looks for a subdirectory named Genomes in one of the @INC
directories; then for a subdirectory named Modelorganisms within the
Genomes subdirectory; finally, for a file named Celegans.pm within the
Modelorganisms subdirectory. That is, my module is in the file:
/home/tisdall/MasteringPerlBio/development/lib/Genomes/Modelorganisms/Celegans.pm
and it's called in my Perl program like so:
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Genomes::Modelorganisms::Celegans;
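The mapping from a module name to the file Perl looks for can be written out in a few lines. The helper module_to_path below is hypothetical and exists only to illustrate the rule: each pair of colons becomes a directory separator and .pm is appended.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a module name into the relative path that
# Perl searches for inside each directory listed in @INC.
sub module_to_path {
    my ($module) = @_;
    (my $path = $module) =~ s{::}{/}g;   # every :: becomes a /
    return "$path.pm";
}

print module_to_path('Genomes::Modelorganisms::Celegans'), "\n";
# prints: Genomes/Modelorganisms/Celegans.pm
```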
For all the details, consult the perlmod,
perlrun and the perlmodlib parts of the Perl documentation at
http://www.perldoc.org. You can also type ‘perldoc perlmod’ or ‘perldoc
perlmodlib’ at a shell prompt or in a command window.
2.7 Writing Your First Perl Module
Now that you've been introduced to the basic ideas of modules,
it's time to actually examine a working example of a module. In this
section, we'll write a module called Geneticcode.pm, which implements
the genetic code that maps DNA codons to amino acids and then
translates a string of DNA sequence data to a protein fragment.
An Example: Geneticcode.pm
Let's start by creating a file called Geneticcode.pm and using it to
define the mapping of codons to amino acids in a hash variable called
%genetic_code. We'll also discuss a subroutine called codon2aa that
uses the hash to translate its codon arguments into amino acid return
values.
Here are the contents of the first module file
Geneticcode.pm:
package Geneticcode;
use strict;
use warnings;
my(%genetic_code) = (
    'TCA' => 'S', # Serine
    'TCC' => 'S', # Serine
    'TCG' => 'S', # Serine
    'TCT' => 'S', # Serine
    'TTC' => 'F', # Phenylalanine
    'TTT' => 'F', # Phenylalanine
    'TTA' => 'L', # Leucine
    'TTG' => 'L', # Leucine
    'TAC' => 'Y', # Tyrosine
    'TAT' => 'Y', # Tyrosine
    'TAA' => '_', # Stop
    'TAG' => '_', # Stop
    'TGC' => 'C', # Cysteine
    'TGT' => 'C', # Cysteine
    'TGA' => '_', # Stop
    'TGG' => 'W', # Tryptophan
    'CTA' => 'L', # Leucine
    'CTC' => 'L', # Leucine
    'CTG' => 'L', # Leucine
    'CTT' => 'L', # Leucine
    'CCA' => 'P', # Proline
    'CCC' => 'P', # Proline
    'CCG' => 'P', # Proline
    'CCT' => 'P', # Proline
    'CAC' => 'H', # Histidine
    'CAT' => 'H', # Histidine
    'CAA' => 'Q', # Glutamine
    'CAG' => 'Q', # Glutamine
    'CGA' => 'R', # Arginine
    'CGC' => 'R', # Arginine
    'CGG' => 'R', # Arginine
    'CGT' => 'R', # Arginine
    'ATA' => 'I', # Isoleucine
    'ATC' => 'I', # Isoleucine
    'ATT' => 'I', # Isoleucine
    'ATG' => 'M', # Methionine
    'ACA' => 'T', # Threonine
    'ACC' => 'T', # Threonine
    'ACG' => 'T', # Threonine
    'ACT' => 'T', # Threonine
    'AAC' => 'N', # Asparagine
    'AAT' => 'N', # Asparagine
    'AAA' => 'K', # Lysine
    'AAG' => 'K', # Lysine
    'AGC' => 'S', # Serine
    'AGT' => 'S', # Serine
    'AGA' => 'R', # Arginine
    'AGG' => 'R', # Arginine
    'GTA' => 'V', # Valine
    'GTC' => 'V', # Valine
    'GTG' => 'V', # Valine
    'GTT' => 'V', # Valine
    'GCA' => 'A', # Alanine
    'GCC' => 'A', # Alanine
    'GCG' => 'A', # Alanine
    'GCT' => 'A', # Alanine
    'GAC' => 'D', # Aspartic Acid
    'GAT' => 'D', # Aspartic Acid
    'GAA' => 'E', # Glutamic Acid
    'GAG' => 'E', # Glutamic Acid
    'GGA' => 'G', # Glycine
    'GGC' => 'G', # Glycine
    'GGG' => 'G', # Glycine
    'GGT' => 'G', # Glycine
);
#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
# Version 3, using hash lookup
sub codon2aa {
    my($codon) = @_;
    $codon = uc $codon;
    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }else{
        die "Bad codon '$codon'!!\n";
    }
}
1;
Now, let's examine the code. First, the module declares its package
with a name (Geneticcode) that is the same as the file it is in
(Geneticcode.pm), but minus the file extension .pm.
The directives:
use strict;
use warnings;
will appear in all the code. The use strict directive requires every
variable to be declared (for example, with my) before it is used. The use
warnings directive produces useful messages about potential problems in
your code. (It is possible to turn both directives off when required, for
instance to avoid annoying warnings in your program output. See the
perldiag, perllexwarn, and perlmodlib sections of the Perl manual.)
Finally, there is a subroutine definition for
codon2aa. As an argument,
this subroutine takes a codon represented as a string of three DNA
bases and returns the amino acid code corresponding to the codon. It
accomplishes this by a simple lookup in the hash %genetic_code and
returns the result from the subroutine using the return built-in
function. The codon2aa subroutine calls die and exits the program when
it encounters an undefined codon.
The hash %genetic_code is not defined within the subroutine codon2aa, so
it only has to be initialized once rather than on every call, which
results in a significant speedup. The definition of the hash is outside
of the subroutine definition, but in the namespace of the Geneticcode
package. The hash is initialized when the Geneticcode.pm module is
loaded by this statement:
use Geneticcode;
Here's an example that uses the new Geneticcode module, which is saved
in a file called testGeneticcode and run by typing perl testGeneticcode:
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Geneticcode;
my $dna = 'AACCTTCCTTCCGGAAGAGAG';
# Initialize variables
my $protein = '';
# Translate each three-base codon to an amino acid, and append to a protein
for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
    $protein .= Geneticcode::codon2aa( substr($dna,$i,3) );
}
print $protein, "\n";
Recall that the Perl built-in function substr can extract a portion of
a string. In this case, substr extracts from $dna the three characters
beginning at the position given in the counter variable $i; this
three-character codon is then passed as the argument to the subroutine
codon2aa. This program produces the output:
NLPSGRE
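The loop condition $i < (length($dna) - 2) means that only complete codons are ever passed to codon2aa; one or two leftover bases at the end of the sequence are skipped. The sketch below isolates that behavior in a hypothetical helper, split_into_codons, which is not part of Geneticcode.pm and is shown only to make the loop bounds concrete.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Geneticcode.pm: split a DNA string into
# complete three-base codons using the same loop bounds as the program above.
sub split_into_codons {
    my ($dna) = @_;
    my @codons;
    for (my $i = 0; $i < (length($dna) - 2); $i += 3) {
        push @codons, substr($dna, $i, 3);
    }
    return @codons;
}

# Eight bases: two complete codons; the trailing "CC" is skipped.
print join(' ', split_into_codons('AACCTTCC')), "\n";   # prints: AAC CTT
```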
Expanding Geneticcode.pm
Modules are a great way to organize code into logical collections of
interacting parts. When you create modules, you need to decide how to
organize your code into the appropriate collection of modules. Here, we
have some subroutines that translate codons into amino acids; others
read sequence data from files and print it to the screen. We'll expand
the Geneticcode module with the translation subroutines and create a
separate SequenceIO module for the input and output subroutines. Of
course, the new module will be created in a file called SequenceIO.pm,
and that file will be placed in a directory that Perl can find; in this
case, the same directory in which we've placed the Geneticcode module.
Here's the code for Geneticcode.pm:
package Geneticcode;
use strict;
use warnings;
my(%genetic_code) = (
    'TCA' => 'S', # Serine
    'TCC' => 'S', # Serine
    'TCG' => 'S', # Serine
    'TCT' => 'S', # Serine
    'TTC' => 'F', # Phenylalanine
    ... as before ...
    'GAG' => 'E', # Glutamic Acid
    'GGA' => 'G', # Glycine
    'GGC' => 'G', # Glycine
    'GGG' => 'G', # Glycine
    'GGT' => 'G', # Glycine
);
#
# codon2aa
#
# A subroutine to translate a DNA 3-character codon to an amino acid
# Version 3, using hash lookup
sub codon2aa {
    my($codon) = @_;
    $codon = uc $codon;
    if(exists $genetic_code{$codon}) {
        return $genetic_code{$codon};
    }else{
        die "Bad codon '$codon'!!\n";
    }
}
#
# dna2peptide
#
# A subroutine to translate DNA sequence into a peptide
sub dna2peptide {
    my($dna) = @_;
    # Initialize variables
    my $protein = '';
    # Translate each three-base codon to an amino acid, and append to a protein
    for(my $i=0; $i < (length($dna) - 2) ; $i += 3) {
        $protein .= codon2aa( substr($dna,$i,3) );
    }
    return $protein;
}
# translate_frame
#
# A subroutine to translate a frame of DNA
sub translate_frame {
    my($seq, $start, $end) = @_;
    my $protein;
    # To make the subroutine easier to use, you won't need to specify
    # the end point; it will just go to the end of the sequence
    # by default.
    unless($end) {
        $end = length($seq);
    }
    # Finally, calculate and return the translation
    return dna2peptide ( substr ( $seq, $start - 1, $end -$start + 1) );
}
1;
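The substr arithmetic in translate_frame converts 1-based start and end positions into the 0-based offset and length that substr expects. The sketch below pulls that arithmetic into a hypothetical helper, frame_substring, so that the effect of different start positions can be seen without the translation step; it is an illustration only and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper that isolates the substr arithmetic of translate_frame.
# $start and $end are 1-based positions; $end defaults to the sequence end.
sub frame_substring {
    my ($seq, $start, $end) = @_;
    $end = length($seq) unless $end;
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = 'ACGTACGT';
print frame_substring($seq, 1), "\n";      # prints: ACGTACGT
print frame_substring($seq, 2), "\n";      # prints: CGTACGT
print frame_substring($seq, 3), "\n";      # prints: GTACGT
print frame_substring($seq, 2, 4), "\n";   # prints: CGT
```

Starting at positions 1, 2, and 3 gives the three forward reading frames of a sequence.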
Now, we have in one module the code that accomplishes a translation
from the genetic code. However, we also need to read sequence in from
FASTA sequence files, and print out sequence (the translated protein)
to the screen. Because these needs are likely to repeat in many
programs, it makes sense to make a separate module for just the file
reading, sequence extraction, and sequence printing operations.
Here's the code for the second module SequenceIO.pm, which handles
reading from a file, extracting FASTA sequence data, and printing
sequence data:
package SequenceIO;
use strict;
use warnings;
# get_file_data
#
# A subroutine to get data from a file given its filename
sub get_file_data {
    my($filename) = @_;
    # Initialize variables
    my @filedata = ( );
    open(GET_FILE_DATA, $filename) or die "Cannot open file '$filename':$!\n\n";
    @filedata = <GET_FILE_DATA>;
    close GET_FILE_DATA;
    return @filedata;
}
# extract_sequence_from_fasta_data
#
# A subroutine to extract FASTA sequence data from an array
sub extract_sequence_from_fasta_data {
    my(@fasta_file_data) = @_;
    # Declare and initialize variables
    my $sequence = '';
    foreach my $line (@fasta_file_data) {
        # discard blank line
        if ($line =~ /^\s*$/) {
            next;
        # discard comment line
        } elsif($line =~ /^\s*#/) {
            next;
        # discard fasta header line
        } elsif($line =~ /^>/) {
            next;
        # keep line, add to sequence string
        } else {
            $sequence .= $line;
        }
    }
    # remove non-sequence data (in this case, whitespace) from $sequence string
    $sequence =~ s/\s//g;
    return $sequence;
}
# print_sequence
#
# A subroutine to format and print sequence data
sub print_sequence {
    my($sequence, $length) = @_;
    # Print sequence in lines of $length
    for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
        print substr($sequence, $pos, $length), "\n";
    }
}
1;
Before we discuss the code, let's see a small program that uses it:
# Translate a DNA sequence into one of the six reading frames
use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";
use Geneticcode;
use SequenceIO;
# Initialize variables
my @file_data = ( );
my $dna = '';
my $revcom = '';
my $protein = '';
# Read in the contents of the file "sample.dna"
@file_data = SequenceIO::get_file_data("sample.dna");
# Extract the sequence data from the contents of the file "sample.dna"
$dna = SequenceIO::extract_sequence_from_fasta_data(@file_data);
# Translate the DNA to protein in one of the six reading frames
# and print the protein in lines 70 characters long
print "\n -------Reading Frame 1--------\n\n";
$protein = Geneticcode::translate_frame($dna, 1);
SequenceIO::print_sequence($protein, 70);
exit;
Here's the input file:
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgat
cgggtgtgacaactgcaatgagtggttccatggggactgcatccggatca
ctgagaagatggccaaggccatccgggagtggtactgtcgggagtgcaga
gagaaagaccccaagctagagattcgctatcggcacaagaagtcacggga
gcgggatggcaatgagcgggacagcagtgagccccgggatgagggtggag
ggcgcaagaggcctgtccctgatccagacctgcagcgccgggcagggtca
gggacaggggttggggccatgcttgctcggggctctgcttcgccccacaa
atcctctccgcagcccttggtggccacacccagccagcatcaccagcagc
agcagcagcagatcaaacggtcagcccgcatgtgtggtgagtgtgaggca
tgtcggcgcactgaggactgtggtcactgtgatttctgtcgggacatgaa
gaagttcgggggccccaacaagatccggcagaagtgccggctgcgccagt
gccagctgcgggcccgggaatcgtacaagtacttcccttcctcgctctca
ccagtgacgccctcagagtccctgccaaggccccgccggccactgcccac
ccaacagcagccacagccatcacagaagttagggcgcatccgtgaagatg
agggggcagtggcgtcatcaacagtcaaggagcctcctgaggctacagcc
acacctgagccactctcagatgaggaccta
Here's the output of the program:
-------Reading Frame 1--------
RWRR_GVLGALGRPPTGLQRRRRMGPAQ_EYAAWEA_LEAEVVVGAFATAW
DAAEWSVQVRGSLAGVVRECAGSGDMEGDGSDPEPPDAGEDSKSENGENAP
IYCICRKPDINCFMIGCDNCNEWFHGDCIRITEKMAKAIREWYCRECREKDPK
LEIRYRHKKSRERDGNERDSSEPRDEGGGRKRPVPDPDLQRRAGSGTGVGAML
ARGSASPHKSSPQPLVATPSQHHQQQQQQIKRSARMCGECEACRRTEDCGHC
DFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKYFPSSLSPVTPSESLPRPRRP
LPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATPEPLSDEDL
A few comments are in order. First, the subroutines for translating
codons are in the Geneticcode module. They include the hash
%genetic_code and the subroutines codon2aa, dna2peptide, and
translate_frame, which are involved with translating DNA data to
peptides. The subroutines for reading sequence data in from files, and
for formatting and printing it to the screen, are in the SequenceIO
module. They are the subroutines get_file_data,
extract_sequence_from_fasta_data, and print_sequence.
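The get_file_data subroutine uses a bareword filehandle and the two-argument form of open, as was common when this code was written. A variant that many current Perl programs would use is sketched below; the name get_file_data_lexical is hypothetical, and the sketch is offered as one possible alternative rather than as a change to the module above.

```perl
use strict;
use warnings;

# Hypothetical variant of get_file_data, not part of the original SequenceIO.pm.
sub get_file_data_lexical {
    my ($filename) = @_;

    # Three-argument open with a lexical filehandle: the '<' mode is explicit,
    # and $fh cannot clash with a filehandle of the same name elsewhere.
    open(my $fh, '<', $filename)
        or die "Cannot open file '$filename': $!\n";
    my @filedata = <$fh>;
    close $fh;

    return @filedata;
}

# Usage: perl this_script.pl sample.dna
if (@ARGV) {
    my @lines = get_file_data_lexical($ARGV[0]);
    print "Read ", scalar(@lines), " lines from $ARGV[0]\n";
}
```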
2.8 Using Modules
So far, the benefit of modules may seem questionable. You may be
wondering what the advantage is over simple libraries (without package
declarations), since the main result seems to be the necessity to refer
to subroutines in the modules with longer names!
Exporting Names
There's a way to avoid lengthy module names and still use the short
ones if you place a call to the special Exporter module in the module
code and modify the use MODULE declaration in the calling code. Going
back to the first example Geneticcode.pm module, recall it began with
this line:
package Geneticcode;
and included the definition for the hash %genetic_code and the
subroutine codon2aa.
If you add the following lines to the beginning of the file, you can
export the symbol names of variables or subroutines in the module into
the namespace of the calling program. You can then use the convenient
short names for things (e.g., codon2aa instead of Geneticcode::codon2aa).
Here's a short example of how it works (try typing perldoc Exporter to
see the whole story):
package Geneticcode;
require Exporter;
our @ISA = qw(Exporter);        # "our" is needed because the module uses strict
our @EXPORT_OK = qw(...);       # symbols to export on request
Here's how to export the name codon2aa from the module only when
explicitly requested:
our @EXPORT_OK = qw(codon2aa); # symbols to export on request
The calling program then has to explicitly request the codon2aa symbol
like so:
use Geneticcode qw(codon2aa);
If you use this approach, the calling program can just say:
codon2aa($codon);
instead of:
Geneticcode::codon2aa($codon);
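The whole mechanism can be tried out in a single file. The sketch below makes several assumptions for the sake of a self-contained example: the package MiniCode is a hypothetical stand-in for Geneticcode, its hash holds only three codons, and because the package lives in the same file as the program, the import is requested by calling MiniCode->import directly instead of writing use MiniCode qw(codon2aa).

```perl
use strict;
use warnings;

package MiniCode;                 # hypothetical stand-in for Geneticcode

require Exporter;
our @ISA       = qw(Exporter);
our @EXPORT_OK = qw(codon2aa);    # exported only when explicitly requested

# Deliberately tiny subset of the genetic code, for illustration only.
my %genetic_code = ( 'ATG' => 'M', 'TGG' => 'W', 'TAA' => '_' );

sub codon2aa {
    my ($codon) = @_;
    $codon = uc $codon;
    return $genetic_code{$codon} if exists $genetic_code{$codon};
    die "Bad codon '$codon'!!\n";
}

package main;

# In a separate program file this would be: use MiniCode qw(codon2aa);
MiniCode->import('codon2aa');

my $short = codon2aa('atg');             # short name, available after import
my $long  = MiniCode::codon2aa('TGG');   # the full name still works
print "$short $long\n";                  # prints: M W
```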
The Exporter module that's included in the standard Perl distribution
has several other optional behaviors, but the way just shown is the
safest and most useful. As you'll see, the object-oriented programming
style of using modules doesn't use the Export facility, but it is a
useful thing to have in your bag of tricks. For more information about
exporting (such as why exporting is also known as "polluting your
namespace"), see the Perl documentation for the Exporter module (by
typing perldoc Exporter at a command line or by going to the
http://www.perldoc.com web page).
2.9 CPAN Modules
The Comprehensive Perl Archive Network (CPAN, http://www.cpan.org) is
an impressively large collection of Perl code (mostly Perl modules).
CPAN is easily accessible and searchable on the Web, and you can use
its modules for a variety of programming tasks.
There are two important points about CPAN. First, a large number of the
things you might want your programs to do have already been programmed
and are easily obtained in downloadable modules. You just have to go
find them at CPAN, install them on your computer, and call them from
your program. Second, all code on CPAN is free of charge and available
for use under a very unrestrictive copyright declaration.
CPAN includes convenient ways to search for useful modules, and there's
a CPAN.pm module built in with Perl that makes downloading and
installing modules quite easy (when things work well, which they
usually do).
You can find more information by typing the following at the command line:
perldoc CPAN
Searching CPAN
CPAN's main web page has a few ways to search the contents. Let's say
you need to perform some statistics and are looking for code that's
already available. At the main CPAN page, look for "searching" and
click on search.cpan.org. If you search for "statistics" in all
locations, you'll get over 300 hits, so you should restrict your search
to modules with the pull-down menu. You'll get 25 hits (more by the
time you read this). Suppose you pick one of them, the
Statistics::ChiSquare module shown below; you then download and install
it, and use the module in a program.
Here's the subroutine definition part of the module for your reference:
package Statistics::ChiSquare;
# ChiSquare.pm
#
# Jon Orwant, orwant@media.mit.edu
#
# 31 Oct 95, revised Mon Oct 18 12:16:47 1999, and again November 2001
# to fix an off-by-one error
#
# Copyright 1995, 1999, 2001 Jon Orwant. All rights reserved.
# This program is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# Version 0.3. Module list status is "Rdpf"
use strict;
use vars qw($VERSION @ISA @EXPORT);
require Exporter;
require AutoLoader;
@ISA = qw(Exporter AutoLoader);
# Items to export into callers namespace by default. Note: do not export
# names by default without a very good reason. Use EXPORT_OK instead.
# Do not simply export all your public functions/methods/constants.
@EXPORT = qw(chisquare);
$VERSION = '0.3';
my @chilevels = (100, 99, 95, 90, 70, 50, 30, 10, 5, 1);
my %chitable = ( );
# assume the expected probability distribution is uniform
sub chisquare {
    my @data = @_;
    @data = @{$data[0]} if @data == 1 and ref($data[0]);
    my $degrees_of_freedom = scalar(@data) - 1;
    my ($chisquare, $num_samples, $expected, $i) = (0, 0, 0, 0);
    if (! exists($chitable{$degrees_of_freedom})) {
        return "I can't handle ", scalar(@data), " choices without a better table.";
    }
    foreach (@data) { $num_samples += $_ }
    $expected = $num_samples / scalar(@data);
    return "There's no data!" unless $expected;
    foreach (@data) {
        $chisquare += (($_ - $expected) ** 2) / $expected;
    }
    foreach (@{$chitable{$degrees_of_freedom}}) {
        if ($chisquare < $_) {
            return
            "There's a >$chilevels[$i+1]% and <$chilevels[$i]% chance that this data is random.";
        }
        $i++;
    }
    return "There's a <$chilevels[$#chilevels]% chance that this data is random.";
}
$chitable{1} = [0.00016, 0.0039, 0.016, 0.15, 0.46, 1.07, 2.71, 3.84, 6.64];
$chitable{2} = [0.020, 0.10, 0.21, 0.71, 1.39, 2.41, 4.60, 5.99, 9.21];
$chitable{3} = [0.12, 0.35, 0.58, 1.42, 2.37, 3.67, 6.25, 7.82, 11.34];
$chitable{4} = [0.30, 0.71, 1.06, 2.20, 3.36, 4.88, 7.78, 9.49, 13.28];
$chitable{5} = [0.55, 1.14, 1.61, 3.00, 4.35, 6.06, 9.24, 11.07, 15.09];
$chitable{6} = [0.87, 1.64, 2.20, 3.83, 5.35, 7.23, 10.65, 12.59, 16.81];
$chitable{7} = [1.24, 2.17, 2.83, 4.67, 6.35, 8.38, 12.02, 14.07, 18.48];
$chitable{8} = [1.65, 2.73, 3.49, 5.53, 7.34, 9.52, 13.36, 15.51, 20.09];
$chitable{9} = [2.09, 3.33, 4.17, 6.39, 8.34, 10.66, 14.68, 16.92, 21.67];
$chitable{10} = [2.56, 3.94, 4.86, 7.27, 9.34, 11.78, 15.99, 18.31, 23.21];
$chitable{11} = [3.05, 4.58, 5.58, 8.15, 10.34, 12.90, 17.28, 19.68, 24.73];
$chitable{12} = [3.57, 5.23, 6.30, 9.03, 11.34, 14.01, 18.55, 21.03, 26.22];
$chitable{13} = [4.11, 5.89, 7.04, 9.93, 12.34, 15.12, 19.81, 22.36, 27.69];
$chitable{14} = [4.66, 6.57, 7.79, 10.82, 13.34, 16.22, 21.06, 23.69, 29.14];
$chitable{15} = [5.23, 7.26, 8.55, 11.72, 14.34, 17.32, 22.31, 25.00, 30.58];
$chitable{16} = [5.81, 7.96, 9.31, 12.62, 15.34, 18.42, 23.54, 26.30, 32.00];
$chitable{17} = [6.41, 8.67, 10.09, 13.53, 16.34, 19.51, 24.77, 27.59, 33.41];
$chitable{18} = [7.00, 9.39, 10.87, 14.44, 17.34, 20.60, 25.99, 28.87, 34.81];
$chitable{19} = [7.63, 10.12, 11.65, 15.35, 18.34, 21.69, 27.20, 30.14, 36.19];
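The statistic that chisquare compares against these table rows can be computed on its own. The sketch below is a standalone illustration, assuming a uniform expected distribution as the module does; the name chisquare_statistic is hypothetical and is not part of Statistics::ChiSquare.

```perl
use strict;
use warnings;

# Hypothetical helper: compute the chi-square statistic for observed counts,
# assuming every category is expected to occur equally often.
sub chisquare_statistic {
    my @data = @_;
    die "There's no data!\n" unless @data;

    my $num_samples = 0;
    $num_samples += $_ for @data;
    my $expected = $num_samples / scalar(@data);
    die "There's no data!\n" unless $expected;

    my $chisquare = 0;
    $chisquare += (($_ - $expected) ** 2) / $expected for @data;
    return $chisquare;
}

# Two categories with counts 5 and 15: the expected count is 10 for each,
# so the statistic is (25 + 25) / 10.
printf "%.2f\n", chisquare_statistic(5, 15);   # prints: 5.00
```

With two categories there is one degree of freedom, so this value would be compared against the row $chitable{1}.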