| Lecture One BLAST Tutorial
Students will need to finish the required readings for this lecture. The menu to the right provides links to the major topics of the lecture as well as to the assignment page for this lecture. You may also scroll down to begin reading the lecture material.
|
Lecture Menu |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Learning Objectives
After completing this lecture, you should be able to use NCBI's BLAST to conduct a similarity search. The key aspects of this competency expectation include:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Please use Internet Explorer.
1.1 Introduction
This lesson is designed to help you learn NCBI's BLAST. It will teach you how to input a sequence into the Basic BLAST web page, choose a program and database, and examine the results. In addition, it will introduce some basic concepts in similarity search.
Sequence analysis plays a key role in biological research. Researchers often need to know the composition of a sequence, that is, which nucleotides constitute a nucleotide sequence, or which amino acids make up a protein sequence, and in what order. Having learned the components of a nucleotide sequence, researchers can then identify the components of the corresponding protein sequence and further understand the 3-D structure of the protein. The identification of sequence components is usually done by sequencing.
Of course, sequencing is not the only tool in sequence analysis. By observing how genetic makeup and functions are passed from one generation to the next, scientists can identify which chromosome hosts a particular gene, and which gene can enhance or suppress a phenotype.
Ultimately, biological research aims to find treatments for human diseases. However, most biological experiments focus on species other than Homo sapiens, or human. This is not only because of safety considerations, but also because of the shorter life cycles of certain species, such as the fruit fly and the mouse. After the function of a fly gene or mouse gene is identified, we need to find the corresponding human gene. Only after this step can we start to design drugs that target the diseases caused by the genes of interest.
This cross-species gene identification is done by similarity search. Specifically, given a genetic sequence from a fly (or worm, or fish), you are asked which genes the sequence contains and what the corresponding human genes are. To answer these questions, you can use tools that compare this sequence with genes from different species.
One of the most widely used algorithms for this purpose is BLAST (Basic Local Alignment Search Tool). We will first focus on NCBI's BLAST and then introduce another variant, WU-BLAST, developed at Washington University in St. Louis. The core of NCBI's BLAST services is BLAST 2.0, otherwise known as "Gapped BLAST". This service is designed to take protein and nucleic acid sequences and compare them against a selection of NCBI databases. The BLAST algorithm was written to balance speed against increased sensitivity for distant sequence relationships. Instead of relying on global alignments (commonly seen in multiple sequence alignment programs), BLAST emphasizes regions of local alignment to detect relationships among sequences which share only isolated regions of similarity (Altschul et al., 1990). Therefore, BLAST is more than a tool to view sequences aligned with each other or to calculate percent homology; it is a program to locate regions of sequence similarity with a view to comparing structure and function.
NCBI's BLAST page can be accessed via http://www.ncbi.nlm.nih.gov. Students are strongly encouraged to take a quick look at a simplified tutorial on NCBI's BLAST to get an idea of how BLAST runs. The simplified tutorial can be accessed at http://www4.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html. You do not need to understand every concept on the page. We will address all important concepts later.
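To make the idea of local alignment more concrete, the sketch below computes the best local alignment score between two sequences with the classic Smith-Waterman dynamic programming recurrence. This is not the BLAST heuristic itself, which is much faster and only approximates this result; the function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b (linear gap cost)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the DP table
    best = 0
    for ch_a in a:
        curr = [0]              # column 0 is always 0 for local alignment
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


if __name__ == "__main__":
    # The two sequences share only the isolated region ACGTAC.
    print(local_alignment_score("GGGGACGTACGG", "TTTACGTACTTT"))
```

Because the score is reset to zero whenever it would become negative, the shared region is found even though the flanking regions are unrelated, which is the property a global alignment lacks.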
Figure 1.1 Various BLAST Programs.
In addition, users should also choose appropriate databases. The databases are indexed sequence collections. Each database contains only one type of sequence. Some frequently used peptide (protein) sequence databases include:
Figure 1.2 Various Peptide (Protein) Sequence Databases.
Nucleotide sequence databases include:
Figure 1.3 Various Nucleotide Sequence Databases.
Some other sequences are derived from raw sequence trace files. These data are not trimmed for quality or vector sequences. The trace files in the Trace Archive are from a variety of projects and strategies, including Whole Genome Shotgun (WGS), Clone by Clone Strategies, BAC end sequencing, and EST sequencing.
1.3 BLAST Search
The search interface has three parts: the basic search interface, options for advanced search, and format configuration for displaying the results.
1.3.1 Basic Search Interface
Figure 1.4a shows the basic search interface designed to retrieve similar protein sequences, and Figure 1.4b shows the interface that enables users to enter nucleotide sequences. Because the two interfaces are so similar, we will focus on the nucleotide search.
Figure 1.4a The Basic Search Interface for Protein Sequences.
Figure 1.4b The Basic Search Interface for Nucleotide Sequences.
The BLAST 'Search' box accepts a number of different types of input and automatically determines the format. To allow this feature there are certain conventions required with regard to the input of identifiers (e.g., accessions or gi's). These are described in 3.) below. Accepted input types are:
1.) FASTA. A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:
>gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
Blank lines are not allowed in the middle of FASTA input. Lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
For those programs that use amino acid query sequences (BLASTP and TBLASTN), the accepted amino acid codes are:
Figure 1.5a Accepted Amino Acid Codes.
The nucleic acid codes supported are:
Figure 1.5b Accepted Nucleotide Codes.
2.) Bare Sequence. A query sequence can be just lines of sequence data, without the FASTA definition line, e.g.:
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
It can also be sequence interspersed with numbers and/or spaces, such as the sequence portion of a GenBank/GenPept flatfile report:
Blank lines are not allowed in the middle of bare sequence input.
3.) Identifiers. Normally these are simply accession, accession.version or gi's (e.g., p01013, AAA68881.1, 129295), but a bar-separated NCBI sequence identifier (e.g., gi|129295) will also be accepted. These NCBI sequence identifiers have a very specific syntax described in Appendix B of ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.txt. The identifier may consist of only one token (i.e., word). Spaces between letters in the input will cause it to be treated as bare sequence (spaces before or after the identifier are allowed). Examples of illegal input are:
ACCESSION P01013
AAA68881. 1
gi| 129295
In the first example "ACCESSION" must be removed; in the second example there is a space before the version number of the accession; in the third example there is a space after the bar ("|"). For MegaBlast, where more than one query may be specified, each identifier should be on a separate line.
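The conventions for FASTA and bare sequence input can be sketched in a few lines of Python. This is only an illustration of the rules stated above, not the parser NCBI uses: the function name read_query is hypothetical, and as a simplification digits and spaces are dropped in both input types rather than rejected.

```python
def read_query(text):
    """Return (description, sequence) from FASTA or bare-sequence input.

    description is None for bare sequence input.
    """
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty query")
    # Blank lines are not allowed in the middle of the input.
    if any(not line.strip() for line in lines):
        raise ValueError("blank lines are not allowed inside the query")
    description = None
    if lines[0].startswith(">"):        # FASTA definition line
        description = lines[0][1:].strip()
        lines = lines[1:]
    # Keep letters plus '-' (gap) and '*'; drop digits and spaces (an assumption).
    residues = [c for line in lines for c in line if c.isalpha() or c in "-*"]
    # Lower-case letters are mapped into upper-case.
    return description, "".join(residues).upper()


if __name__ == "__main__":
    print(read_query(">gi|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN\nqikdll\nVSSS\n"))
    print(read_query("  1 acgt acgt\n 11 ttga"))
```

Identifier input (accessions and gi numbers) is deliberately left out, because telling a one-word identifier apart from a one-word sequence requires rules that are not described in this lecture.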
Set Subsequence
A region of the query sequence can be selected for BLAST searching. You can enter the range in nucleotides or protein residues in the "From" and "To" boxes provided under "Set Subsequence". For example, to limit matches to the region from nucleotide 24 to nucleotide 200 of a query sequence, you would enter From=24 To=200. If one of the limits you enter is out of range, the intersection of the [From,To] and [1,length] intervals will be searched, where length is the length of the whole query sequence.
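The interval rule for "Set Subsequence" can be written out as a short Python sketch. The function name is hypothetical, positions are 1-based and inclusive as in the web form, and returning an empty string when the two intervals do not overlap is an assumption, since the text does not say what happens in that case.

```python
def set_subsequence(sequence, start, end):
    """Return the part of sequence inside the intersection of [start, end] and [1, length]."""
    lo = max(start, 1)                 # clamp From to the first residue
    hi = min(end, len(sequence))       # clamp To to the last residue
    if lo > hi:                        # intervals do not overlap (assumed behavior)
        return ""
    return sequence[lo - 1:hi]         # convert 1-based inclusive to a Python slice


if __name__ == "__main__":
    query = "ACGTACGTAC"
    print(set_subsequence(query, 3, 6))     # both limits in range
    print(set_subsequence(query, 8, 200))   # To is out of range and gets clamped
```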
1.3.2 Options for Advanced Search
Users can control how the search should proceed by specifying some BLAST parameters in the Options window.
Figure 1.6 Options for Advanced BLAST
Limit by Entrez Query
BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. Any terms can be entered that would normally be allowed in an Entrez search session. For example:
protease NOT hiv1[Organism]
This will limit a BLAST search to all proteases, except those in HIV 1. This can also be used to limit searches to a particular molecule type:
biomol_mrna[PROP] AND brain
To limit the search to a specific organism, you can either select it from the pulldown menu, which lists the most common organisms in the databases, or enter the name of the organism in the Entrez Query field with the [Organism] qualifier. For example:
Mus musculus[Organism]
For help in constructing Entrez queries please see the "Writing Advanced Search Statements" section of the Entrez Help document.
Filter (Low-complexity)
Mask off segments of the query sequence that have low compositional complexity, as determined by the SEG program of Wootton & Federhen (Computers and Chemistry, 1993) or, for BLASTN, by the DUST program of Tatusov and Lipman (in preparation). Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Default filtering is DUST for BLASTN, SEG for other programs. It is not unusual for nothing at all to be masked by SEG when applied to sequences in SWISS-PROT, so filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.
Filter (Human repeats)
This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (htgs). For more information please see "Why does my search timeout on the BLAST servers?" in the BLAST Frequently Asked Questions. Human Repeat Filtering is still experimental and under development, so it may change in the near future.
Filter (Mask for lookup table only)
This option masks only for purposes of constructing the lookup table used by BLAST.
BLAST searches consist of two phases: finding hits based upon a lookup table and then extending them. The option to "Mask for lookup table only" masks only for the lookup table so that no hits are found based upon low-complexity sequence. The BLAST extensions are performed without masking and so they can be extended through low-complexity sequence. This option is still experimental and may change in the near future.
Mask Lower Case
With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
Expect
The statistical significance threshold for reporting matches against database sequences; the default value is 10, meaning that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches. Fractional values are acceptable.
Word Size
The size of each word, or fragment of sequence. For shorter sequences, like primers, a shorter word should be chosen. Word sizes determine how the indexes on the sequence databases should be built. In other words, different word sizes correspond to different indexes. Therefore, only a few different word sizes are available for users to select.
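The first of the two phases described above, finding hits through a lookup table of words, can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual BLAST implementation: the function names are hypothetical, the table is built from the query, only exact word matches are considered (as for nucleotide searches), and lower-case letters in the query are treated as masked for the lookup table only.

```python
from collections import defaultdict


def build_lookup_table(query, word_size):
    """Map each unmasked word of the query to the 0-based positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - word_size + 1):
        word = query[i:i + word_size]
        if word.isupper():          # skip words that touch lower-case (masked) letters
            table[word].append(i)
    return table


def find_word_hits(query, subject, word_size):
    """Phase one: list (query_pos, subject_pos) pairs where a query word occurs in the subject."""
    table = build_lookup_table(query, word_size)
    subject = subject.upper()
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(find_word_hits("ACGTTGCA", "TTACGTAA", 4))   # one shared word, ACGT
    print(find_word_hits("acgtTGCA", "TTACGTAA", 4))   # the same word is masked out
```

A smaller word size produces more words and therefore more hits to extend, which is why shorter words suit short queries such as primers but cost more search time.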
1.3.3 Format
The Format window controls how the results are returned and displayed.
Figure 1.7 Format Configuration.
Graphical Overview
An overview of the database sequences aligned to the query sequence is shown. The score of each alignment is indicated by one of five different colors, which divide the range of scores into five groups. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top; clicking on a hit sequence takes the user to the associated alignments.
NCBI-gi
Causes NCBI gi identifiers to be shown in the output, in addition to the accession and/or locus name.
Descriptions
Restricts the number of short descriptions of matching sequences reported to the number specified; the default limit is 100 descriptions. See also EXPECT.
Alignments
Restricts database sequences to the number specified for which high-scoring segment pairs (HSPs) are reported; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting (see EXPECT), only the matches ascribed the greatest statistical significance are reported.
Database LinkOuts
Enabling this option provides cross-reference links from the BLAST results to other NCBI specialized databases. If a database sequence matches your query and it is also found in LocusLink or UniGene (more databases to be included in the future), there will be links from the BLAST search results to these resources.
Alignment Views
1.3.4 Command Line Options for Advanced Search
The BLAST program also has command line options. Users can specify the following parameters to better control how the search proceeds:
-G Cost to open gap [Integer], default = 5 for nucleotides, 11 for proteins
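As a hedged illustration of how such an option might be passed, the sketch below assembles a command line for the legacy NCBI blastall executable, where -p selects the program, -d the database, and -i the query file. It assumes a local BLAST installation; the helper name, the database name "nt", and the file name "query.fa" are placeholders, and the command is only built here, not run.

```python
def blastall_command(program, database, query_file, gap_open=None):
    """Build an argument list for a legacy blastall run (nothing is executed)."""
    cmd = ["blastall", "-p", program, "-d", database, "-i", query_file]
    if gap_open is not None:
        # -G sets the cost to open a gap; leaving it out keeps the program default.
        cmd += ["-G", str(gap_open)]
    return cmd


if __name__ == "__main__":
    # Placeholder database and file names, for illustration only.
    print(" ".join(blastall_command("blastn", "nt", "query.fa", gap_open=5)))
```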
All BLAST programs produce a similar output consisting of program introduction, a schematic distribution of the ordered alignments of the query sequence to those in the databases, sequence alignments, scores and E values. We will introduce each element shortly.
1.4.1 Request ID
BLAST results are returned in either text format (default) or HTML format (for which you must supply an e-mail address and select the HTML results format option). After a query sequence is submitted for BLAST, a Request ID number is given so that the results can be obtained at a later time (see Figure 1.8). Most results will be held for up to 24 hours; very large result files will be deleted after 30 minutes. If you want the results immediately, click on the "Format Results" button. Formatting items such as the results format option and the number of descriptions and alignments in the results output are needed only for formatting; these items may be specified from the BLAST query form or at the time you request your results.
Figure 1.8 The Search Result Interface with Request ID.
1.4.2 Conserved Domain Search
If conserved domains are detected, you can retrieve the conserved domain search results by clicking the red bar marked "serpin" in Figure 1.8. This search compares protein sequences to the Conserved Domain Database (CDD). The CDD is a database containing a collection of functional and/or structural domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI.
Figure 1.9 Conserved Domain Search
In Figure 1.9, four proteins in a conserved domain are found. You can move the mouse over any of the four red bars to see the name of the corresponding protein. By clicking the red bar or the sequence identifier underneath it, you can retrieve the detailed alignment information about the protein, as shown in Figure 1.10.
Figure 1.10 Detailed Alignment Information in the Conserved Domain Search.
1.4.3 Sequence Search Results
You can click the "Format" button or follow the "Retrieve results for an RID" link in the search result interface (see Figure 1.8) to view the BLAST results, as shown in Figure 1.11.
Figure 1.11 Graphical Alignments of the Search Results.
The sequence results consist of graphical alignments and text alignments. Colored bars are distributed so as to reflect the region of the query sequence covered by each alignment. The color legend (color key) represents the significance of the alignment scores. Holding the mouse over a given bar will display a description of that specific aligned sequence in the window above; clicking on a specific bar will cause the browser to jump down to that particular alignment (see Figure 1.12).
Figure 1.12 Detailed Text Alignment of a Selected Sequence.
Identifiers for the database sequences appear at the top of the detailed text alignments and are hyperlinked to the associated GenBank entry. The Score (bits) is a sum value calculated for the alignment using the scoring matrix; the higher the score, the better the alignment. The E value, or expect value, is the number of matches with an equal or better score that are expected to occur by chance in a search of this database; the lower the E value, the more specific/significant the match. That is why the sequence alignments are listed in ascending order of E value. The percent identity (reported as "Identities" and given as a percent) is the percentage of exact matches between your query sequence and the database sequence. This field also gives the number of nucleotide bases or amino acid residues in the database sequence that match the query sequence. Alignments are gapped unless otherwise specified by the user at the BLAST search submission page. The Gaps value is the percentage of the alignment that consists of gaps.
Besides the graphical bar-chart representation, the result page also contains an ordered list of the biological definition lines of the database sequences that have been significantly aligned to the query sequence (see Figure 1.13). Sequence alignments and their corresponding one-line descriptions are listed in order of lowest to highest E value.
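The "Identities" and "Gaps" values of a text alignment can be recomputed from the two aligned strings. The sketch below assumes that both rows of the alignment are given as equal-length strings with "-" marking a gap and that percentages are taken over the full alignment length; the function name is hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Return (identities, gaps, length, percent_identity, percent_gaps) for one alignment."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    length = len(aligned_query)
    if length == 0:
        raise ValueError("alignment is empty")
    pairs = list(zip(aligned_query, aligned_subject))
    # An identity is a column where both rows carry the same residue.
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    # A gap column has "-" in either row.
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return identities, gaps, length, 100.0 * identities / length, 100.0 * gaps / length


if __name__ == "__main__":
    print(alignment_stats("ACGT-ACGTA", "ACGTTACCTA"))
```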
Figure 1.13 List of Similar Sequences Found, Ordered by E-Value
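The relationship between bit score, E value, and the ordering of the hit list can be sketched with the standard Karlin-Altschul relation E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. BLAST itself uses effective (edge-corrected) lengths, so the numbers below are only approximate; the function names and the example hits are made up for illustration.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value for a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def report(hits, query_length, database_length, expect=10.0):
    """Keep hits whose E value does not exceed the threshold, lowest E value first.

    hits is a list of (name, bit_score) pairs.
    """
    scored = [(name, expect_value(score, query_length, database_length))
              for name, score in hits]
    kept = [hit for hit in scored if hit[1] <= expect]
    return sorted(kept, key=lambda hit: hit[1])


if __name__ == "__main__":
    hits = [("hit_a", 30.0), ("hit_b", 20.0), ("hit_c", 50.0)]   # made-up bit scores
    for name, e_value in report(hits, query_length=100, database_length=1_000_000):
        print(name, e_value)
```

With these made-up numbers the 20-bit hit has an E value well above the default threshold of 10 and is dropped, while the 50-bit hit is listed first.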
You can follow the "Related Structures" link or the red "S" icon to retrieve structure information, or follow the blue "L" icon to get more information from LocusLink. Figure 1.14 is a screenshot of the related structures.
Figure 1.14 Related Structure Information.
1.5 Other BLAST Variants
There are several other widely used sequence search tools, including WU-BLAST, MegaBLAST, and PSI-BLAST. We will briefly discuss these applications here.
1.5.1 WU BLAST
Washington University BLAST (WU BLAST) is a powerful software package for gene and protein identification, using sensitive, selective and rapid similarity searches of protein and nucleotide sequence databases. It is widely believed that WU BLAST is more sensitive than NCBI's basic BLAST. In other words, it can identify similar sequences that would otherwise be missed by NCBI's basic BLAST due to insertions, deletions, and other causes. Therefore, quite a few biotech companies license and maintain a local copy of WU BLAST for internal BLAST needs. Although many scientists still use NCBI's web-based BLAST tools, locally installed WU BLAST has been playing an increasingly important role in batch sequence analysis.
The recent version, WU BLAST 2.0, builds upon WU BLAST 1.4, which in turn was based on the public domain NCBI BLAST version 1.4 (Gish, unpublished, 1994; Altschul et al., 1990; Gish and States, 1993). While NCBI BLAST and WU BLAST 1.4 are in the public domain, WU BLAST 2.0 contains significant new features and extended capabilities, the development of which began in late 1994 at Washington University in Saint Louis. First released in May 1996, more than a year ahead of NCBI's gapped BLAST, WU BLAST 2.0 is the original gapped BLAST with statistics and is known for setting higher standards for sensitivity, speed, correctness and accuracy, scalability and reliability than competing programs and implementations. WU BLAST is not a re-hash of NCBI BLAST and essentially shares no code with it, except for small portions that both packages derived from ungapped NCBI BLAST 1.4. Key features of WU BLAST include:
1.5.2 Mega BLAST
Mega BLAST uses a greedy algorithm for nucleotide sequence alignment search. This program is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used (see explanation below), it is up to 10 times faster than more common sequence similarity programs. Mega BLAST is also able to efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm. Default parameters include:
1.5.3 PSI-BLAST
The Position-Specific Iterated BLAST, or PSI-BLAST, program performs an iterative search in which sequences found in one round of searching are used to build a score model for the next round of searching. In PSI-BLAST the algorithm is not tied to a specific score matrix. Traditionally, BLAST has been implemented using an AxA substitution matrix, where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the length of the query sequence; at each position the cost of a letter depends on the position with respect to the query and on the letter in the subject sequence.
Since PSI-BLAST re-iterates the BLAST search and builds a profile as it goes, this tool can be used when your BLAST search results give you very few matches. Upon re-iteration (you just click on the button to re-iterate) you may reveal significant alignment matches that you would not have found using BLAST alone. PSI-BLAST generates "on-the-fly" a scoring matrix specific to your BLAST search, and continues to refine this matrix upon each re-iteration.
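The idea of a QxA position-specific matrix can be illustrated with a deliberately simplified sketch. It builds log-odds scores from the residues observed at each query position in a set of hits, assuming the hits are already aligned to the query without gaps, a uniform background frequency, and a fixed pseudocount. PSI-BLAST's real sequence weighting and pseudocount scheme is more elaborate; all names below are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids


def position_specific_matrix(aligned_hits, pseudocount=1.0):
    """Build a Q x A log-odds matrix (list of dicts) from equal-length aligned sequences."""
    q_len = len(aligned_hits[0])
    background = 1.0 / len(ALPHABET)            # uniform background (an assumption)
    matrix = []
    for pos in range(q_len):
        column = [seq[pos] for seq in aligned_hits if seq[pos] in ALPHABET]
        total = len(column) + pseudocount * len(ALPHABET)
        row = {}
        for letter in ALPHABET:
            freq = (column.count(letter) + pseudocount) / total
            row[letter] = math.log2(freq / background)
        matrix.append(row)
    return matrix


def score(matrix, segment):
    """Score a segment of length Q: the cost of a letter depends on its position."""
    return sum(matrix[i][letter] for i, letter in enumerate(segment))


if __name__ == "__main__":
    pssm = position_specific_matrix(["MKV", "MKI", "MRV"])
    print(score(pssm, "MKV"), score(pssm, "WWW"))
```

In an iterated search, the hits found with this matrix would be added to the alignment and the matrix rebuilt for the next round.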
1.5.4 PHI-BLAST
The Pattern-Hit Initiated BLAST, or PHI-BLAST, program searches the databases for protein sequences that contain a pattern specified by the user. The pattern designates the amino acid sequence you are searching for, e.g. [RG]-[M]-[X]-[YWF]-5[X]-[A]; this submission pattern would yield a search for sequence patterns having "R" (Arginine) or "G" (Glycine) at position 1 of the pattern (not necessarily position 1 or the N-terminus of the amino acid sequences in the databases), followed by an "M" (Methionine), followed by any amino acid "X", followed by any one of three amino acids: "Y" (Tyrosine), "W" (Tryptophan) or "F" (Phenylalanine); followed by any 5 amino acids "X", followed by an "A" (Alanine).
1.5.5 BLAST 2 Sequences
This tool produces the alignment of two given sequences using the BLAST engine for local alignment. |
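The pattern notation used in the example of Section 1.5.4 can be translated into a regular expression to see what it matches. This sketch handles only the notation shown in that example (bracketed sets separated by hyphens, an optional repeat count in front of a set, and X for any amino acid); it does not cover full PROSITE syntax, and the function name is hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a pattern such as [RG]-[M]-[X]-[YWF]-5[X]-[A] into a compiled regex."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(r"(\d*)\[([A-Z]+)\]", element.strip())
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        count, letters = m.groups()
        body = "." if letters == "X" else f"[{letters}]"     # X means any amino acid
        parts.append(body + (f"{{{count}}}" if count else ""))
    return re.compile("".join(parts))


if __name__ == "__main__":
    rx = pattern_to_regex("[RG]-[M]-[X]-[YWF]-5[X]-[A]")
    print(rx.pattern)
    # The pattern may match anywhere in a sequence, not only at the N-terminus.
    print(rx.search("AAGMKWQQQQQAAA"))
```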
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Review Questions
1. Predict the function of the following protein from Methanobacterium thermoautotrophicum.
MYRITVIPGD GIGVEVMEAA LHVLQALEIE FEFTHAEAGN ECFRRCGDTL PEETLKLVRK ADATLFGAVT TVPGQKSAII TLRRELDLFA NLRPVKSLPG VPCLYPDLDF VIVRENTEDL YVGDEEYTPE GAVAKRIITR TASRRISQFA FQYAQKEGMQ KVTAVHKANV LKKTDGIFRD FYKVASEYPQ MEANDYYVDA TAMYLITQPQ EFQTIVTTNL FGDILSDEAA GLIGGLGLAP SANIGEKNAL FEPVHGSAPQ IAGKNIANPT AMILTTTLML KHLNKKQEAQ KIEKALQKTL MRGIMTPDLG GTASTMEMAE AIKEEIVKGE
2. The following proteins have been annotated as histidine kinases.
>AF0277 * AF0277 * 256572..258461 * signal-transducing histidine kinase, putative * PID:g2650366
Answers to Review Questions:
1. Answer: BLAST is used to search for sequence similarity.
2. Answer: One reason is to see if your DNA has any new matches against the DNA stored in the constantly growing database.
3. Answer: One is at the National Center for Biotechnology Information. The other is at Washington University.
4. What does homology between two sequences mean? Answer: It means the sequences are related evolutionarily. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Required Readings
The reading assigned from the text is intended to support and supplement the material covered in class. You will be held responsible for the material covered in the reading assignments. You should read the material in the text before the corresponding lecture. And remember that the goal of the class is to help you understand the basic tools in bioinformatics, rather than to cover as much material as possible.
TEXTBOOK: (Go to the following web sites to use online textbooks)
1. David W. Mount, Bioinformatics: Sequence and Genome Analysis, Chapter 1, Chapter 2 (P. 24, P. 29-35), Chapter 3.
2. NCBI BLAST Tutorial, http://www4.ncbi.nlm.nih.gov/Education/BLASTinfo/tut1.html. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Assignment
1. Draw a chemical structure of double-stranded DNA to show the sugar-phosphate backbone and base-paired bases with hydrogen bonds between A-T and G-C base pairs.
2. Sketch a diagram of the translation steps in prokaryotes based on your understanding of the mechanism of protein synthesis. |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||