CIS526 Bioinformatics III




 
Lecture One

BLAST Tutorial

Students will need to finish the required readings for this lecture.

The menu to the right provides links to the major topics of the lecture as well as to the assignment page for this lecture. You may also scroll down to begin reading the lecture material.

 

Lecture Menu

Learning Objectives

Introduction

Review Questions

Required  Readings

Assignment

Learning Objectives

After completing this lecture, you should be able to use NCBI's BLAST to conduct similarity searches. The key aspects of this competency expectation include:

  • Learn the basic concepts of similarity search.
  • Know how to enter query sequence and how to choose NCBI BLAST programs and databases to conduct the search.
  • Learn how to interpret search results.
  • Get a general idea of WU BLAST.
  • Learn what MegaBLAST is.
  • Get a general idea how PSI-BLAST works.

 

 


Note: The materials in this lecture have been developed partially based on materials and figures from NCBI and Washington University at St. Louis (WU). The copyright for such material is held by NCBI and WU.

1.1 Introduction

This lesson is designed to help you learn NCBI's BLAST. It will teach you how to input a sequence into the Basic BLAST web page, choose a program and database, and examine the results. In addition, it will introduce some basic concepts in similarity search.

Sequence analysis plays a key role in biological research. Researchers often need to know the composition of a sequence, that is, which nucleotides constitute a nucleotide sequence, or which amino acids make up a protein sequence, and in what order. Having learned the components of a nucleotide sequence, researchers can then identify the components of the corresponding protein sequence and further understand the 3-D structure of the protein. The identification of sequence components is usually done by sequencing. Of course, sequencing is not the only tool in sequence analysis. By observing how genetic makeup and functions are passed from one generation to the next, scientists can identify which chromosome hosts a particular gene, and which gene can enhance or suppress a phenotype.

Ultimately, a major goal of biological research is to find treatments for human diseases. However, most biological experiments focus on species other than Homo sapiens (human). This is not only because of safety considerations, but also because of the shorter life cycles of certain species, such as the fruit fly and the mouse. After the function of a fly gene or mouse gene is identified, we need to find the corresponding human gene. Only after this step can we start to design drugs that target the diseases caused by the genes of interest. This cross-species gene identification is done by similarity search. Specifically, given a genetic sequence from fly (or worm, or fish), you are asked which genes are contained in the sequence and what the corresponding human genes are. To answer these questions, you can use tools that compare this sequence with genes from different species. One of the most widely used computer algorithms is BLAST (Basic Local Alignment Search Tool). We will first focus on NCBI's BLAST and then introduce another variant, WU-BLAST, developed at Washington University in St. Louis.

The core of NCBI's BLAST services is BLAST 2.0, otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases. The BLAST algorithm was written to balance speed against increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs), BLAST emphasizes regions of local alignment to detect relationships among sequences that share only isolated regions of similarity (Altschul et al., 1990). Therefore, BLAST is not merely a tool for viewing aligned sequences or calculating percent homology; it is a program for locating regions of sequence similarity with a view to comparing structure and function.

NCBI's BLAST page can be accessed via http://www.ncbi.nlm.nih.gov. Students are strongly encouraged to take a quick look at a simplified tutorial on NCBI's BLAST to get an idea of how BLAST runs. The simplified tutorial can be accessed at http://www4.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html. You do not need to understand every concept on the page. We will address all important concepts later.



1.2 Choose BLAST Program and Database

NCBI's sequence search consists of a number of BLAST programs. Some, such as blastn, accept a nucleotide sequence and search only nucleotide sequence databases; others, such as blastp, can only compare an amino acid sequence against protein sequences. Therefore, before a BLAST job is started, an appropriate search program must be specified. A list of BLAST programs and descriptions of their functions is shown in Figure 1.1.

Program Description
blastp Compares an amino acid query sequence against a protein sequence database.
blastn Compares a nucleotide query sequence against a nucleotide sequence database.
blastx Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.

Figure 1.1 Various BLAST Programs.
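
The choice in Figure 1.1 depends only on the type of the query and the type of the database. As a minimal sketch (the dictionary and function names are hypothetical helpers written for this lecture, not part of any NCBI software), the table can be turned into a lookup:

```python
# Query type and database type for each BLAST program, taken from Figure 1.1.
# BLAST_PROGRAMS and candidate_programs are illustrative names only.
BLAST_PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def candidate_programs(query_type, database_type):
    """Return the programs that accept this query/database combination."""
    return sorted(
        name
        for name, types in BLAST_PROGRAMS.items()
        if types == (query_type, database_type)
    )


print(candidate_programs("nucleotide", "protein"))
print(candidate_programs("nucleotide", "nucleotide"))
```

A nucleotide query against a nucleotide database yields two candidates because blastn compares the sequences directly, whereas tblastx compares their six-frame translations.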

In addition, users should also choose appropriate databases. The databases are indexed sequence collections. Each database consists of only one type of sequence. Some frequently used peptide (protein) sequence databases include:

Database Description
nr All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF 
month All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days. 
swissprot The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents Protein sequences derived from the Patent division of GenBank.
yeast Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all Yeast protein sequences. It is a database of the protein translations of the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic CDS translations.
pdb Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
kabat [kabatpro] Kabat's database of sequences of immunological interest. For more information http://immuno.bme.nwu.edu/
alu Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).


Figure 1.2 Various Peptide (Protein) Sequence Databases.

 

Nucleotide sequence databases include:

 

Database Description
nr All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.
dbsts Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.
mouse ests The non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.
human ests The Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.
other ests The non-redundant database of GenBank+EMBL+DDBJ EST Divisions all organisms except mouse and human.
yeast Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all Yeast nucleotide sequences, but the sequence fragments from the Yeast complete genome.
E. coli E. coli (Escherichia coli) genomic nucleotide sequences.
pdb Sequences derived from the 3-dimensional structure of proteins.
kabat [kabatnuc] Kabat's database of sequences of immunological interest. For more information http://immuno.bme.nwu.edu/
patents Nucleotide sequences derived from the Patent division of GenBank.
vector Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd Eukaryotic Promoter Database, ISREC in Epalinges s/Lausanne (Switzerland).
gss Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs High Throughput Genomic Sequences.

Figure 1.3 Various Nucleotide Sequence Databases.

Some other sequences are derived from raw sequence trace files. These data are not trimmed for quality or vector sequences. The trace files in the Trace Archive are from a variety of projects and strategies, including Whole Genome Shotgun (WGS), Clone by Clone Strategies, BAC end sequencing, and EST sequencing.

 

1.3 BLAST Search

The search interface has three parts: the basic search interface, options for advanced search, and format configuration for displaying the results.

 

1.3.1 Basic Search Interface

Figure 1.4a shows the basic search interface designed to retrieve similar protein sequences, and Figure 1.4b shows the interface for entering nucleotide sequences. Because the two interfaces are so similar, we will focus on the nucleotide search.

Figure 1.4a The Basic Search Interface for Protein Sequences.

Figure 1.4b The Basic Search Interface for Nucleotide Sequences.

 

The BLAST 'Search' box accepts a number of different types of input and automatically determines the format. To allow this feature there are certain conventions required with regard to the input of identifiers (e.g., accessions or gi's). These are described in 3.) below. Accepted input types are:

1.) FASTA:

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

Blank lines are not allowed in the middle of FASTA input.
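
To make the FASTA conventions concrete, here is a minimal sketch of a reader that applies the rules just described: a ">" in the first column starts a description line, the following lines are sequence data, and a blank line in the middle of the input is rejected. The function name parse_fasta is a hypothetical helper, and the checks reflect only the rules stated in this lecture, not the exact behaviour of the NCBI server.

```python
def parse_fasta(text):
    """Split FASTA text into (description, sequence) pairs.

    Illustrative only: leading and trailing blank lines are ignored,
    but a blank line in the middle of the input raises ValueError.
    """
    records = []
    description = None
    chunks = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            raise ValueError("blank line in the middle of FASTA input")
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description, chunks = line[1:], []
        else:
            if description is None:
                raise ValueError("sequence data found before a '>' line")
            # Lower-case letters are accepted and mapped to upper case.
            chunks.append(line.upper())
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


example = """>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
"""
for description, sequence in parse_fasta(example):
    print(description)
    print(len(sequence), "residues")
```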

Lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue). For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:

A --> alanine                    P --> proline
B --> aspartate or asparagine    Q --> glutamine
C --> cysteine                   R --> arginine
D --> aspartate                  S --> serine
E --> glutamate                  T --> threonine
F --> phenylalanine              U --> selenocysteine
G --> glycine                    V --> valine
H --> histidine                  W --> tryptophan
I --> isoleucine                 Z --> glutamate or glutamine
K --> lysine                     X --> any
L --> leucine                    * --> translation stop
M --> methionine                 - --> gap of indeterminate length
N --> asparagine

Figure 1.5a Accepted Amino Acid Codes.

The nucleic acid codes supported are:

A --> adenosine              M --> A C (amino)
C --> cytidine               S --> G C (strong)
G --> guanine                W --> A T (weak)
T --> thymidine              B --> G T C
U --> uridine                D --> G A T
R --> G A (purine)           H --> A C T
Y --> T C (pyrimidine)       V --> G C A
K --> G T (keto)             N --> A G C T (any)
- --> gap of indeterminate length

Figure 1.5b Accepted Nucleotide Codes.
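
The ambiguity codes in Figure 1.5b can be read as small sets of bases. The following sketch (with hypothetical names NUCLEOTIDE_CODES, invalid_letters and code_matches) encodes the table as a dictionary, checks a query for characters that are not accepted, and tests whether a given code covers a given base:

```python
# Each accepted nucleotide code and the bases it stands for (Figure 1.5b).
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA", "N": "AGCT",
}


def invalid_letters(sequence):
    """Return the characters of a query that are not accepted codes."""
    allowed = set(NUCLEOTIDE_CODES) | {"-"}  # '-' marks a gap
    return {ch for ch in sequence.upper() if ch not in allowed}


def code_matches(code, base):
    """True if the (possibly ambiguous) code covers the given base."""
    return base.upper() in NUCLEOTIDE_CODES[code.upper()]


print(invalid_letters("ACGTNRY-"))   # empty set: every character is accepted
print(invalid_letters("ACGT1X"))     # the digit and the X are rejected
print(code_matches("R", "g"))        # R (purine) covers G
```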

2.) Bare Sequence.

A query sequence can be just lines of data, without the FASTA definition line, e.g.:

QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP

It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:


1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek
121 rrvkvylpqm kieekynlts vlmalgmtdl fipsanltgi ssaeslkisq avhgafmels
181 edgiemagst gviedikhsp eseqfradhp flflikhnpt ntivyfgryw sp

Blank lines are not allowed in the middle of bare sequence input.
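
If you want to clean up a flatfile-style sequence yourself before pasting it, the position numbers and spaces can be stripped locally. This is only a sketch of such a cleanup step (normalize_bare_sequence is a hypothetical helper); it is not a description of what the BLAST server does internally.

```python
def normalize_bare_sequence(text):
    """Remove digits and whitespace from a bare sequence and upper-case it.

    Letters, '-' (gap) and '*' (translation stop) are kept; any other
    character raises ValueError so that stray symbols are not hidden.
    """
    cleaned = []
    for ch in text:
        if ch.isdigit() or ch.isspace():
            continue
        if not (ch.isalpha() or ch in "-*"):
            raise ValueError(f"unexpected character {ch!r} in sequence")
        cleaned.append(ch.upper())
    return "".join(cleaned)


flatfile = """1 qikdllvsss tdldttlvlv naiyfkgmwk tafnaedtre mpfhvtkqes kpvqmmcmnn
61 sfnvatlpae kmkilelpfa sgdlsmlvll pdevsdleri ektinfeklt ewtnpntmek"""
print(normalize_bare_sequence(flatfile))
```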


3.) Identifiers.

Normally these are simply accession, accession.version or gi's (e.g., p01013, AAA68881.1, 129295), but a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted. These NCBI sequence identifiers have a very specific syntax described in Appendix B of ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.txt. The identifier may consist of only one token (i.e., word). Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier are allowed). Examples of illegal input are:

ACCESSION P01013
AAA68881. 1
gi| 129295

In the first example, "ACCESSION" must be removed; in the second example, there is a space before the version number of the accession; in the third example, there is a space after the bar ("|").

For MegaBlast, where more than one query may be specified, each identifier should be on a separate line.
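
The single-token rule is easy to check before submitting. The sketch below uses two hypothetical helpers: is_single_token applies the rule to one input, and split_identifiers handles the MegaBLAST case of one identifier per line. Note that the check cannot tell an identifier from a short bare sequence typed without spaces; it only detects the embedded spaces that make the examples above illegal.

```python
def is_single_token(text):
    """True if the input is one word; surrounding spaces are allowed."""
    return len(text.split()) == 1


def split_identifiers(text):
    """Return one identifier per non-empty line, rejecting embedded spaces."""
    identifiers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if not is_single_token(line):
            raise ValueError(f"not a single token: {line!r}")
        identifiers.append(line.strip())
    return identifiers


for candidate in ["P01013", "AAA68881.1", " gi|129295 ",
                  "ACCESSION P01013", "AAA68881. 1", "gi| 129295"]:
    print(repr(candidate), is_single_token(candidate))
```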

 

Set Subsequence

A region of the query sequence can be selected for BLAST searching. You can enter the range in nucleotides or protein residues in the "From" and "To" boxes provided under "Set Subsequence". For example, to limit matches to the region from nucleotide 24 to nucleotide 200 of a query sequence, you would enter From = 24 and To = 200. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
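
The interval rule for out-of-range limits can be written in a few lines. In this sketch, effective_range and subsequence are hypothetical helper names; positions are 1-based and inclusive, as in the From and To boxes.

```python
def effective_range(start, end, length):
    """Intersect [start, end] with [1, length]; return None if empty."""
    low = max(start, 1)
    high = min(end, length)
    if low > high:
        return None
    return low, high


def subsequence(sequence, start, end):
    """Return the part of the sequence that would actually be searched."""
    bounds = effective_range(start, end, len(sequence))
    if bounds is None:
        return ""
    low, high = bounds
    return sequence[low - 1:high]  # convert 1-based inclusive to a slice


print(effective_range(24, 200, 150))   # To is clipped to the query length
print(subsequence("ABCDEFGH", 3, 5))
```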

 

1.3.2 Options for Advanced Search

Users can control how the search should proceed by specifying some BLAST parameters in the Options window.

 

Figure 1.6 Options for Advanced BLAST

 

Limit by Entrez Query

BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. Any terms can be entered that would normally be allowed in an Entrez search session. For example:

protease NOT hiv1[Organism]

This will limit a BLAST search to all proteases, except those in HIV 1. This can also be used to limit searches to a particular molecule type:

biomol_mrna[PROP] AND brain

To limit the search to a specific organism, you can either select it from the pulldown menu, which lists the most common organisms in the databases, or enter the name of the organism in the Entrez Query field with the [Organism] qualifier. For example:

Mus musculus[Organism]

For help in constructing Entrez queries, please see the "Writing Advanced Search Statements" section of the Entrez Help document.

Filter (Low-complexity)

Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the blast output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs.

It is not unusual for nothing at all to be masked by SEG, when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.

Filter (Human repeats)

This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases that contain a large number of repeats (htgs). For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions. Human repeat filtering is still experimental and under development, so it may change in the near future.

Filter (Mask for lookup table only)

This option masks only for purposes of constructing the lookup table used by BLAST. BLAST searches consist of two phases, finding hits based upon a lookup table and then extending them. The option to "Mask for lookup table only" masks only for the lookup table so that no hits are found based upon low-complexity sequence. The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence. This option is still experimental and may change in the near future.

Mask Lower Case

With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
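
The effect of lower-case masking can be pictured as replacing every lower-case letter with an ambiguity code before the search. The helper below is a hypothetical illustration, and the choice of "N" for nucleotides and "X" for amino acids as the masking character is an assumption based on the usual filter convention, not a statement about the server's internal representation.

```python
def mask_lower_case(sequence, mask_char="N"):
    """Replace lower-case letters with mask_char; keep upper case as is."""
    return "".join(mask_char if ch.islower() else ch for ch in sequence)


print(mask_lower_case("ACGTacgtACGT"))                 # nucleotide query
print(mask_lower_case("MKVLaaaaGHT", mask_char="X"))   # protein query
```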

Expect

The statistical significance threshold for reporting matches against database sequences; the default value is 10, meaning that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches. Fractional values are acceptable.

Word Size

The size of each word, or fragment of sequence. For shorter sequences, such as primers, a shorter word size should be chosen. Word sizes determine how the indexes on the sequence databases are built. In other words, different word sizes correspond to different indexes. Therefore, only a few different word sizes are available for users to select.
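
To see why the word size matters, consider a much simplified sketch of the first phase of a search: every word of the query is stored in a lookup table, and the other sequence is scanned for the same words. The function names are hypothetical, only exact word matches are considered, and the extension phase is omitted, so this is an illustration of the idea rather than the BLAST implementation.

```python
from collections import defaultdict


def build_lookup_table(query, word_size):
    """Map every word of length word_size to its start positions in the query."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        table[query[i:i + word_size]].append(i)
    return table


def find_word_hits(query, subject, word_size):
    """Return (query_position, subject_position) pairs of identical words."""
    table = build_lookup_table(query, word_size)
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


print(find_word_hits("ACGTAC", "TTACGT", 4))
print(len(find_word_hits("ACGTAC", "TTACGT", 2)), "hits with word size 2")
```

A shorter word size produces more seed hits, which is why it suits short queries such as primers, at the cost of more work in the extension phase.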

 

1.3.3 Format

The Format window controls how the results are returned and displayed.

 

Figure 1.7 Format Configuration.

 

Graphical Overview

An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.

NCBI-gi

Causes NCBI gi identifiers to be shown in the output, in addition to the accession and/or locus name.

Descriptions

Restricts the number of short descriptions of matching sequences reported to the number specified; default limit is 100 descriptions. See also EXPECT.

Alignments

Restricts the number of database sequences for which high-scoring segment pairs (HSPs) are reported to the number specified; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT above), only the matches ascribed the greatest statistical significance are reported.

Database LinkOuts

Enabling this option provides cross-reference links from the BLAST results to other specialized NCBI databases. If a database sequence matches your query and is also found in LocusLink or UniGene (more databases are to be included in the future), there will be links from the BLAST search results to these resources.

Alignment Views

  • Pairwise: Standard BLAST alignment in pairs of query sequence and database match.
  • Query-anchored with identities: The database alignments are anchored to (shown in relation to) the query sequence. Identities are displayed as dashes, with mismatches displayed as single-letter nucleotide abbreviations.
  • Query-anchored without identities: Identities are shown as single-letter nucleotide abbreviations.
  • Flat query-anchored with identities: The 'flat' display shows inserts as deletions on the query. Identities are displayed as dashes, with mismatches displayed as single-letter nucleotide abbreviations.
  • Flat query-anchored without identities: The 'flat' display shows inserts as deletions on the query. Identities are shown as single-letter nucleotide abbreviations.

1.3.4 Command Line Options for Advanced Search

The BLAST programs also have command-line options. Users can specify the following parameters to better control how the search proceeds:

-G Cost to open a gap [Integer]; default = 5 for nucleotides, 11 for proteins
-E Cost to extend a gap [Integer]; default = 2 for nucleotides, 1 for proteins
-q Penalty for a nucleotide mismatch [Integer]; default = -3
-r Reward for a nucleotide match [Integer]; default = 1
-e Expect value [Real]; default = 10
-W Word size [Integer]; default = 11 for nucleotides, 3 for proteins
-y Dropoff (X) for BLAST extensions in bits (default if zero); default = 20 for blastn, 7 for other programs
-X X dropoff value for gapped alignment (in bits); default = 15 for all programs except blastn, to which it does not apply
-Z Final X dropoff value for gapped alignment (in bits); default = 50 for blastn, 25 for other programs
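
As a sketch of how these options might be assembled for a locally installed copy of BLAST, the helper below builds an argument list and adds only the parameters that differ from the defaults listed above. It assumes the legacy blastall executable with its -p (program), -d (database) and -i (query file) options; the file name query.fa, the function names, and the treatment of every program other than blastn as using the protein defaults are assumptions made for illustration. Check the documentation of your own installation before relying on it.

```python
# Defaults copied from the option list above.
NUCLEOTIDE_DEFAULTS = {"-G": 5, "-E": 2, "-q": -3, "-r": 1,
                       "-e": 10, "-W": 11, "-y": 20, "-Z": 50}
PROTEIN_DEFAULTS = {"-G": 11, "-E": 1, "-e": 10, "-W": 3,
                    "-y": 7, "-X": 15, "-Z": 25}


def defaults_for(program):
    """Assumption: blastn uses nucleotide defaults, other programs protein ones."""
    return NUCLEOTIDE_DEFAULTS if program == "blastn" else PROTEIN_DEFAULTS


def build_command(program, database, query_file, overrides=None):
    """Return an argument list, adding only options that differ from defaults."""
    defaults = defaults_for(program)
    command = ["blastall", "-p", program, "-d", database, "-i", query_file]
    for flag, value in (overrides or {}).items():
        if flag not in defaults:
            raise ValueError(f"{flag} does not apply to {program}")
        if value != defaults[flag]:
            command += [flag, str(value)]
    return command


# A stricter expect value and a shorter word size for a short nucleotide query.
print(" ".join(build_command("blastn", "nr", "query.fa",
                             {"-e": 0.001, "-W": 7})))
```

The resulting list can be passed to subprocess.run on a machine where the program and the database are actually installed.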

All BLAST programs produce a similar output consisting of program introduction, a schematic distribution of the ordered alignments of the query sequence to those in the databases, sequence alignments, scores and E values. We will introduce each element shortly.

 

1.4.1 Request ID

BLAST results are returned in either text format (the default) or HTML format (for which you must supply an e-mail address and select the HTML results format option). After a query sequence is submitted for BLAST, a Request ID number is issued so that the results can be obtained at a later time; see Figure 1.8. Most results will be held for up to 24 hours; very large result files will be deleted after 30 minutes. If you want the results immediately, click on the "Format Results" button.

Formatting items, such as the results format option and the number of descriptions and alignments in the results output, are needed only for formatting. These items may be specified on the BLAST query form or at the time you request your results.

 

Figure 1.8 The Search Result Interface with Request ID. 

 

1.4.2 Conserved Domain Search

If conserved domains are detected, you can retrieve the conserved domain search results by clicking the red bar marked "serpin" in Figure 1.8. This search compares protein sequences to the Conserved Domain Database. The CDD is a database containing a collection of functional and/or structural domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI.


Figure 1.9 Conserved Domain Search

 

In Figure 1.9, four proteins in a conserved domain are found. You can move the mouse over any of the four red bars to see the name of the corresponding protein. By clicking the red bar or the sequence identifier underneath it, you can retrieve the detailed alignment information about the protein, as shown in Figure 1.10.

 

Figure 1.10 Detailed Alignment Information in the Conserved Domain Search.

 

1.4.3 Sequence Search Results

You can click the "Format" button or follow the "Retrieve results for an RID" link in the search result interface (see Figure 1.8) to view the BLAST results, as shown in Figure 1.11.

 

Figure 1.11 Graphical Alignments of the Search Results.

 

The sequence results consist of graphical alignments and text alignments. Colored bars are laid out to reflect the region of the query sequence that each alignment covers. The color legend (color key) represents the significance of the alignment scores. Holding the mouse over a given bar will display a description of that specific aligned sequence in the window above; clicking on a specific bar will cause the browser to jump down to that particular alignment (see Figure 1.12).

 

Figure 1.12 Detailed Text Alignment of a Selected Sequence.

 

Identifiers for the database sequences appear at the top of the detailed text alignments and are hyperlinked to the associated GenBank entry.

The Score (bits) is a sum value calculated for alignments using the scoring matrix; the higher the score value, the better the alignment.

The E value, or expect value, is the number of matches with a score at least this high that would be expected to occur merely by chance in a search of this database; the lower the E value, the more significant the match. That is why the sequence alignments are listed in ascending order of E value.

The percent identity (shown as "Identities" and given as a percentage) is the percentage of exact matches between your query sequence and the database sequence. This field also gives the number of nucleotide bases or amino acid residues that match between the database sequence and the query sequence.

Alignments are gapped unless the user specifies otherwise on the BLAST search submission page. The gap value is the percentage of positions in the particular alignment that are gaps.
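
The identity and gap figures can be recomputed from any pair of aligned strings. The sketch below (alignment_summary is a hypothetical helper) assumes the two strings have equal length, that "-" marks a gap, and that both percentages are taken over the full alignment length, which is how the text above describes them.

```python
def alignment_summary(aligned_query, aligned_subject):
    """Count identities and gaps in a pairwise alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(aligned_query)
    pairs = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
        "percent_gaps": 100.0 * gaps / length,
    }


# Eight columns: six identical, one mismatch, one gap in the query.
print(alignment_summary("ACGT-ACG", "ACGTTAGG"))
```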

Besides the graphical bar-chart representation, the result page also contains an ordered set of definition lines for the database sequences that have been significantly aligned to the query sequence; see Figure 1.13. Sequence alignments and their corresponding one-line descriptions are listed in order from lowest to highest E value.

Figure 1.13 List of Similar Sequences Found, Ordered by E-Value
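
The relationship between bit scores, E values and the ordering of the hit list can be sketched numerically. The formula used here, E = m × n × 2^(−S') for a bit score S', query length m and database length n, is the standard Karlin-Altschul relation and is not derived in this lecture; the sketch also ignores the edge-effect corrections that real BLAST applies to m and n. The function names and the example hits are invented for illustration.

```python
import math


def expected_chance_hits(bit_score, query_length, database_length):
    """E = m * n * 2**(-S') for bit score S' (no edge-effect correction)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def probability_of_chance_hit(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


def reportable_hits(hits, expect=10.0):
    """Drop hits whose E value exceeds the threshold; list lowest E first."""
    kept = [hit for hit in hits if hit[1] <= expect]
    return sorted(kept, key=lambda hit: hit[1])


# Invented (name, E value) pairs, not real search results.
hits = [("seqA", 2e-30), ("seqB", 15.0), ("seqC", 0.004)]
for name, e_value in reportable_hits(hits):
    print(name, e_value, probability_of_chance_hit(e_value))
```

For small E values the probability and the E value are nearly equal, which is why the two are often used interchangeably; for larger values they diverge, since an E value can exceed 1 while a probability cannot.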

 

You can follow the "Related Structures" link or the red "S" icon to retrieve structure information, or follow the blue "L" icon to get more information from LocusLink. Figure 1.14 is a screenshot of the related structures.

 

Figure 1.14 Related Structure Information.

1.5 Other BLAST Variants

There are several other widely used sequence search tools, including WU-BLAST, MegaBLAST, and PSI-BLAST. We will briefly discuss these applications here.

 

1.5.1 WU BLAST

Washington University BLAST (WU BLAST) is a powerful software package for gene and protein identification, using sensitive, selective and rapid similarity searches of protein and nucleotide sequence databases. It is widely believed that WU BLAST is more sensitive than NCBI's basic BLAST. In other words, it can identify similar sequences that would otherwise be missed by NCBI's basic BLAST because of insertions, deletions, and other causes. Therefore, quite a few biotech companies license and maintain a local copy of WU BLAST for internal BLAST needs. Although many scientists still use NCBI's web-based BLAST tools, locally installed WU BLAST has been playing an increasingly important role in batch sequence analysis.

The recent version, WU BLAST 2.0, builds upon WU BLAST 1.4, which in turn was based on the public domain NCBI BLAST version 1.4 (Gish, unpublished, 1994; Altschul et al., 1990; Gish and States, 1993). While NCBI BLAST and WU BLAST 1.4 are in the public domain, WU BLAST 2.0 contains significant new features and extended capabilities, the development of which began in late 1994, at Washington University in Saint Louis. First released in May 1996, or more than a year ahead of the NCBI, WU BLAST 2.0 is the original gapped BLAST with statistics and is known for setting higher standards for sensitivity, speed, correctness and accuracy, scalability and reliability than competing programs and implementations. WU BLAST is not a re-hash of NCBI BLAST and essentially shares no code with it, except for small portions that both packages derived from ungapped NCBI BLAST 1.4.

Key features of WU BLAST include:

  • Potentially multiple regions of similarity are identified and reported for each database sequence, thus yielding increased sensitivity and selectivity. This feature is essential for finding: all exons in a multi-exon gene sequence, not just the longest or best-matching exon; all complete or partial copies of a repetitive element in a genomic sequence, not just the best matching one; and multiple, discrete domains of similarity between sequences, not just the highest-scoring one.

  • Karlin and Altschul (1993) "Sum statistics" are available (and used by default) in all search modes, to evaluate the joint probability of multiple regions of similarity, as described by Altschul and Gish (1996). By this technique, sets of similar regions are often found to be statistically significant that individually would be insignificant and go unreported. The combination of well-chosen heuristics and statistics in WU BLAST is often more sensitive/selective than: the full dynamic programming approach of Smith and Waterman (1981), that finds and evaluates the significance of only the highest scoring alignment with each database sequence; and other approaches or BLAST implementations that identify multiple regions of local similarity which are then evaluated individually for statistical significance rather than jointly.

  • Poisson statistics are available as an option to Karlin-Altschul Sum statistics in all search modes. Simpler Karlin-Altschul (1990) statistics, that do not involve joint probability calculations, are also available as an option. Using the postsw option, a full Smith-Waterman alignment is performed on query-subject pairs of sequences that will be reported by BLASTP. The Smith-Waterman scores and alignments are combined with the initial BLAST results and redundancy is removed. This may alter the relative ranking of database matches before output. Use of this option is recommended, although it may be supplanted in the future by other option(s) or by a redefined default behavior.

  • Word lengths (re: the W parameter) as short as 1 have been supported continuously by WU blastn, as are nucleotide neighborhood words, using the neighborhood word score threshold parameter, T. Using neighborhood words, nucleotide sequence similarity can be detected even in the absence of any identical residues between two sequences. Users are cautioned, however, that careless use of the T parameter can result in vast and overwhelming amounts of memory being requested by the software; T should likely be used only in conjunction with very short word lengths.

  • Licensed WU BLAST 2.0 supports the eXtended Database Format (XDF), a power user's dream in so many ways for working with peptide and nucleotide sequences. Both the NCBI BLAST 2.0 database format and the NCBI implementation of the BLAST search algorithm are restricted to sequences under 16 Mbp in length, whereas human genome contigs exceeded 25 Mbp in the last century (Hattori et al., 2000) and extend to several tens of megabytes today. In contrast, XDF can accurately store individual sequences of up to 1 Gbp (billion bp) with ambiguity codes intact. Other BLAST software, such as the NCBI's, limits database files to 2 gigabytes, whereas WU BLAST's XDF supports databases (and database files) of virtually unlimited size -- provided of course that the underlying operating system supports these so-called "large files", which most modern operating systems do.

  • One or more word masks can be specified on the command line, using the "wordmask=<mask>" option, where <mask> may be a classical filter program such as seg, xnu, or dust. Whereas sequence filters convert certain letters in the query sequence into ambiguity codes (X for amino acid and N for nucleotide), word masks do not alter the sequence. Word masks instead cause the indicated portion(s) of the query sequence to be skipped during BLAST neighborhood word generation. This leaves the query sequence intact for generating alignments that are seeded by word hits arising in flanking, unmasked regions of the sequence.

  • WU BLAST 2.0 reliably supports parallel processing on a variety of SMP (symmetric multiprocessing) computing platforms. WU BLAST is the only BLAST that threads properly across multiple CPUs on dual-processor Apple PowerMacs running MacOS X and does not require a G4 processor. POSIX threads are used under Compaq Tru64 UNIX 4.0+, Linux for X86 and Alpha processors, IBM AIX, MacOS X, IRIX 6.5, and HP/UX 11. While POSIX threads are available under Solaris 2+ (SPARC and X86), Solaris threads are specifically used instead for slightly better performance. The IRIX m_fork() system call provides parallel processing under older versions of IRIX 4 through 6.4; and DCE threads are used under Digital UNIX 3.2.
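
The difference between a sequence filter and a word mask, as described in the list above, can be illustrated with a small sketch. This is not WU BLAST code: the function names, the example query, and the masked region are hypothetical, and the sketch enumerates only exact query words rather than full neighborhood words.

```python
def filter_query(query, masked, ambiguity="X"):
    """Sequence filter: replace masked residues with an ambiguity code."""
    return "".join(ambiguity if i in masked else c for i, c in enumerate(query))


def seed_words(query, w, masked=frozenset()):
    """Word mask: yield (offset, word) only for words that avoid masked positions.

    The query string itself is never modified.
    """
    for start in range(len(query) - w + 1):
        if any(pos in masked for pos in range(start, start + w)):
            continue  # skip any word that overlaps the masked region
        yield start, query[start:start + w]


query = "MKTAAAAAAGLV"          # hypothetical peptide query
masked = set(range(3, 9))       # hypothetical low-complexity run "AAAAAA"

print(filter_query(query, masked))        # residues replaced by X
print(list(seed_words(query, 3, masked))) # seeds come from flanking regions only
print(query)                              # the query is still intact
```

With a word mask, alignments seeded in the flanking regions can still extend through the original residues of the masked region, which a filter would have replaced.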

1.5.2 Mega BLAST

Mega BLAST uses a greedy algorithm for nucleotide sequence alignment search. The program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used (see the explanation below), it is up to 10 times faster than more common sequence similarity programs. Mega BLAST also handles much longer DNA sequences more efficiently than the blastn program of the traditional BLAST algorithm. Its main parameters, with their default behavior, are:

  • Word size: This is roughly the minimal length of an identical match that an alignment must contain if the algorithm is to find it. Mega BLAST is most efficient with word sizes of 16 and larger, although a word size as low as 8 can be used. If the word size W is divisible by 4, all perfect matches of length W + 3 are guaranteed to be found and extended by a Mega BLAST search. Perfect matches as short as W may also be found, but this is not guaranteed. Any value of W not divisible by 4 is treated as the nearest value divisible by 4 (with 4i+2 treated as 4i).

  • Gapping parameters: By default, non-affine gapping parameters are assumed. This means that the gap opening penalty is 0, and the gap extension penalty E can be computed from the match reward r and the mismatch penalty q by the formula: E = r/2 - q. The non-affine version of Mega BLAST requires significantly less memory and is also significantly faster. Affine gapping parameters can also be used, preferably with larger word sizes. Non-affine gapping parameters tend to yield alignments with more gaps, but the gaps are shorter.

  • X-dropoff value: As in BLAST, this value provides a cutoff threshold for the tree exploration performed by the extension algorithm. When the score of a given branch drops below the current best score minus the X-dropoff, the exploration of this branch stops. However, the actual X-dropoff values for Mega BLAST and for the traditional nucleotide BLAST algorithms are not necessarily compatible. That is, with the same word size, match, mismatch, and gapping penalties and the same X-dropoff, the two algorithms might produce different results. This can be remedied by changing the X-dropoff value for one of the algorithms.
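
The word-size rule and the non-affine gap formula above can be written out as a short sketch. The function names are hypothetical, the mismatch penalty q is assumed to be given as a negative score, and the example values are illustrative rather than the program's documented defaults.

```python
def effective_word_size(w):
    """Round W to a multiple of 4, with 4i+2 treated as 4i."""
    remainder = w % 4
    if remainder == 3:
        return w + 1          # 4i+3 is nearest to 4(i+1)
    return w - remainder      # 4i, 4i+1 and 4i+2 all map to 4i


def guaranteed_match_length(w):
    """Length of perfect matches that are guaranteed to be found: W + 3."""
    return effective_word_size(w) + 3


def nonaffine_gap_extension(r, q):
    """Gap extension penalty E = r/2 - q (gap opening penalty is 0)."""
    return r / 2 - q


for w in (16, 17, 18, 19, 28):
    print(w, effective_word_size(w), guaranteed_match_length(w))

# Illustrative values: match reward 1, mismatch penalty -3
print(nonaffine_gap_extension(1, -3))
```

For example, a requested word size of 18 behaves like 16, so only perfect matches of length 19 or more are guaranteed to be found.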

1.5.3 PSI-BLAST

The Position-Specific Iterated BLAST, or PSI-BLAST, program performs an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In PSI-BLAST the algorithm is not tied to a specific score matrix. Traditionally, BLAST has been implemented using an AxA substitution matrix, where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position, the cost of a letter depends on the position with respect to the query and on the letter in the subject sequence.

Because PSI-BLAST re-iterates the BLAST search and builds a defined profile as it goes, this tool can be used when your BLAST search gives you very few matches. Upon re-iteration (you just click on the button to re-iterate), you may reveal significant alignment matches that you would not have found using BLAST alone. PSI-BLAST generates "on-the-fly" a scoring matrix specific to your BLAST search, and continues to refine this matrix upon each re-iteration.
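
To make the QxA idea concrete, the sketch below builds a small position-specific score matrix from a toy ungapped alignment. It is a simplification under stated assumptions: a uniform background frequency, a fixed pseudocount, and no sequence weighting. The actual PSI-BLAST scoring scheme is more elaborate, and the function names and example sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # A = 20 letters


def build_pssm(aligned, pseudocount=1.0):
    """Return Q rows (one per query position), each mapping residue -> log-odds score."""
    q_len = len(aligned[0])
    background = 1.0 / len(AMINO_ACIDS)          # assumption: uniform background
    pssm = []
    for pos in range(q_len):
        column = [seq[pos] for seq in aligned]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        row = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            row[aa] = math.log2(freq / background)
        pssm.append(row)
    return pssm


def score(pssm, subject_segment):
    """Score an ungapped subject segment aligned to query positions 0..Q-1."""
    return sum(pssm[i][aa] for i, aa in enumerate(subject_segment))


# Toy alignment: the query (first row) plus two hits from a previous round
aligned = ["GDGIG", "GDGVG", "GEGIG"]
pssm = build_pssm(aligned)

print(len(pssm), len(pssm[0]))                  # Q rows, A columns
print(round(pssm[0]["G"], 2), round(pssm[1]["G"], 2))
print(score(pssm, "GDGIG") > score(pssm, "WWWWW"))
```

The same letter (here "G") receives a different score at position 0 than at position 1, which a single AxA substitution matrix cannot express.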

 

1.5.4 PHI-BLAST

The Pattern-Hit Initiated BLAST, or PHI-BLAST, program searches protein databases for sequences that contain a pattern you specify. The pattern designates the amino acid sequence you are searching for, e.g. [RG]-[M]-[X]-[YWF]-5[X]-[A]. This pattern would yield a search for sequences having "R" (Arginine) or "G" (Glycine) at position 1 of the pattern (not necessarily position 1, or the N-terminus, of the amino acid sequences in the databases), followed by "M" (Methionine), followed by any amino acid "X", followed by any one of three amino acids, "Y" (Tyrosine), "W" (Tryptophan), or "F" (Phenylalanine), followed by any 5 amino acids "X", followed by "A" (Alanine).
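
The pattern notation described above can be translated into a regular expression, which makes its meaning easy to check on a test sequence. This is only a sketch of the notation as written in this lecture: the converter, its name, and the test sequence are hypothetical, and the syntax accepted by the actual search server may differ.

```python
import re

# One pattern element: optional repeat count, then bracketed residue letters
ELEMENT = re.compile(r"^(\d*)\[([A-Z]+)\]$")


def pattern_to_regex(pattern):
    """Convert e.g. '[RG]-[M]-[X]-[YWF]-5[X]-[A]' into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.match(element.strip())
        if m is None:
            raise ValueError(f"unrecognized pattern element: {element!r}")
        count, letters = m.groups()
        residue = "." if letters == "X" else f"[{letters}]"   # X = any amino acid
        parts.append(residue + (f"{{{count}}}" if count else ""))
    return re.compile("".join(parts))


regex = pattern_to_regex("[RG]-[M]-[X]-[YWF]-5[X]-[A]")
sequence = "MKRMAYQQQQQAL"            # hypothetical test sequence
hit = regex.search(sequence)

print(regex.pattern)
print(hit.start(), hit.group())
```

Note that the hit starts at offset 2 of the test sequence, not at its first residue: position 1 of the pattern can fall anywhere in a database sequence.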


1.5.5 BLAST 2 Sequences

This tool aligns two given sequences using the BLAST engine for local alignment.
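
To show what "local alignment" of two sequences means, here is a minimal Smith-Waterman scoring sketch. It is not the BLAST engine, which uses word seeding and heuristic extension. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # shared core "ACGT"
print(local_alignment_score("AAAA", "TTTT"))          # nothing in common
```

Only the best-matching region contributes to the score; the dissimilar flanks of the two sequences are ignored, which is the property that distinguishes local from global alignment.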

Review Questions
 
1. Predict the function of the following protein from Methanobacterium thermoautotrophicum.
MYRITVIPGD GIGVEVMEAA LHVLQALEIE FEFTHAEAGN ECFRRCGDTL PEETLKLVRK
ADATLFGAVT TVPGQKSAII TLRRELDLFA NLRPVKSLPG VPCLYPDLDF VIVRENTEDL
YVGDEEYTPE GAVAKRIITR TASRRISQFA FQYAQKEGMQ KVTAVHKANV LKKTDGIFRD
FYKVASEYPQ MEANDYYVDA TAMYLITQPQ EFQTIVTTNL FGDILSDEAA GLIGGLGLAP
SANIGEKNAL FEPVHGSAPQ IAGKNIANPT AMILTTTLML KHLNKKQEAQ KIEKALQKTL
MRGIMTPDLG GTASTMEMAE AIKEEIVKGE


a What other similar proteins can you find?
b Which functions have been described in this family of proteins? Which aspect of the protein function is conserved between the different functions? Which aspect is the least conserved?

2. The following proteins have been annotated as histidine kinases.
> AF1483 * AF1483 * ~(1338676..1341402) * signal-transducing histidine kinase * PID:g2649082
MVLEEMRIRI DISNEQNRKM LVDFLGKRYE IAEDNFDLLI IDGVTLKRKW REIERIKAES
RAFLPVLLVT TRKDLKIAEK HLWKRVDELI IEPVDKLELL ARIEILLRAR KQALQLEEHA
RIMEIELGTL FETIAHPIVV ISPEFEILHA NRYAQKIFRE QGIENAIGKK CYKVFHGREE
PAENCPCVAT FRNHKPETRE IEIFGRMYAV STTPIFIDGE LRKVVHLAFD ITDFKRMERR
LERLYKANLL LHEVERAILS ADETEEILKM TAEKLAEMLP VRGVGITVFE NGRARVVAVT
DKKMPGFREG EMIAGEDVAK VMQTLSQGKP WVKRVEGRGE GERRLMELGI KSYALIPIVS
DSLLGSINVP SEEEDAFDEE TIQILMEVAH SVALAIRSAR MREELEESEE KFRKLAEHSQ
VGIDIIQEGV FVYVNEKFAE ILGYEREELI GKSPVDFIHP DDREKFERNY RARILGEKNH
VNYRLRVLTK SGEVRIIDAY GSRVILRGKP AIVGVSVDIT EREKMRQELE KYTQELEKLV
EERTKQLAES EKRYRLLVES PIVAFWEADS NGVFRFVNDR LLEMSGYSRD EVVGKMTMFD
PIAPEQREWL AERIRLHKEH KLYGDVVEAE LVKKDGSRFH VLVSPAPIYD EKGNLVRIVG
AMIDITDRKM AEEKLKQTLE ELRKANEELE AYVHAISHDL RAPLRNLQGY VSALVEDYGE
KLEEDARFYL SRLKALTEKM DGLINDLLEY ARVSKAKAEV RRVDLNLIVE DVLDYLKDEI
RGKSAVIEIE KLPAVKGDRK LLFTVMLNLI SNAIKFVEEG VRPEVKVWAE DVNGKVRVYV
KDNGIGIPEE YHEKIFNIFE RLHGEEVYPG TGVGLAIVKK AMEVMGGRYG VRSKPGEGSI
FWIELERG

>AF0277 * AF0277 * 256572..258461 * signal-transducing histidine kinase, putative * PID:g2650366
MSQFILLKLN QRGKIIDVTG LPGLKGKYFH EVFTVENGTA SPVARYNGMS FKMAKVDFDD
GQVCVLIPEQ DESCLDYLPI GLAVVRDGEI VYSNNMFREL LGENYYSERF SNFIRSLNEK
IGEGVVRDEV AITSELGEER KLEIMASKGY YNGKEAILCT LRDATRDREF ENLFLTLTSK
AFVVVYIIQD GKFAFVNEMA TGLGYSIEEL YRMNPFDLVH PDDKEQVIDN YVRRLAGEHV
EVPYRFRLVA RDGRLLYVDA IAARVIFRGR PAVMGMLIDR TDEMKNQEKL KMYERFFRRS
KDMFFILDRY GRFIDVNPRY AEILGYAKED LLGRTSRIIA HEDDLEILRE NFGKVLRGES
VKFSFRAKSR DGGVRFVEVV EWPVFKNGEV AGAEGVIRDI TDRVTTEEEL KKKNQLLRII
GEINELILKE RDEYALLQKV CRFFSKIRDT DSWTWILDGN RLIKATPLAP ECHLAEKTKD
GVLRFEDCHC PLSKAKSLAV PIRHNGNVFG VLVLCNVGSL AEDEMTIIEE LGINLGFAVS
SYRAERDRKI AFNLLLENLK QLESLADRLR NPVAIISGFL EIKDDIGYER AFREIENQIE
RINRILDDLR LQETLTYFIL KGGFGTKFL


a Obtain for both proteins the predicted domain organization, and write down the differences.
d Compare the results from basic BLAST, Mega BLAST, and PSI-BLAST.





Answers to Review Questions:
1. Answer: BLAST is used to search for sequence similarity.

2. Answer: One reason is to see whether your DNA has any new matches against the DNA stored in the constantly growing database.

3. Answer: One is at the National Center for Biotechnology Information. The other is at Washington University.

4. What does homology between two sequences mean?
Answer: It means the sequences are evolutionarily related.


Required Readings

The reading assigned from the text is intended to support and supplement the material covered in class.

You will be held responsible for the material covered in the reading assignments. You should read the material in the text before the corresponding lecture. And remember that the goal of the class is to help you understand the basic tools in bioinformatics, rather than to cover as much material as possible.

TEXTBOOK: (Go to the following web sites to use online textbooks)

1. David W. Mount, Bioinformatics: Sequence and Genome Analysis, Chapter 1, Chapter 2 (P. 24, P.29-35), Chapter 3.

2. NCBI BLAST Tutorial, http://www4.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html.


Assignment

1. Draw a chemical structure of double-stranded DNA to show the sugar-phosphate backbone and base-paired bases with hydrogen bonds between A-T and G-C base pairs.

2. Sketch a diagram of the translation steps in prokaryotes, based on your understanding of the mechanism of protein synthesis.