8/2/2019 Bioinformatics

Brought to you by molecularsciences.org. This work is licensed under a Creative Commons Attribution-Share Alike 3.0 License. This publication may not be redistributed without this notice.

Bioinformatics

Computers and the internet have revolutionized everything from agriculture and architecture to research. Biological research is no exception. Biology is the science of studying living beings. Bioinformatics is the use of techniques from applied mathematics, informatics, statistics, and computer science to solve biological problems. It is the science of using information to understand biology.

Long before the invention of the word bioinformatics, researchers tried to use computers to assist in their research. These researchers recognized three concepts which are still fundamental to bioinformatics today:

data representation
the concept of similarity
bioinformatics is a data-driven science as opposed to a theoretical science

For a computer to work on a problem, the problem must be abstracted into a computer-understandable format. This often requires simplification of the problem and coding. Computers can be very good at detecting similarity, and similarity allows us to infer that two seemingly different entities share a certain property. Bioinformatics is a data-driven science, meaning that we require lots of data. Fortunately, the biggest problem is not the lack of data but the quality of the data, its meaningful classification, and our insufficient capacity to interpret it.

Biological Data

Biology is now a data-intensive science and fortunately most of the data is freely available over the Internet. Before beginning, one needs to know what kind of data is available, where, in what format, and how it can be accessed. Most databases provide very useful and powerful tools to help their users access, manipulate, and analyze the data. Knowing and using these tools helps the user avoid a lot of unnecessary work.

Bioinformatics Research Centers

Several research centers are dedicated to bioinformatics research. The following are the most significant:

NCBI National Center for Biotechnology Information
EBI European Bioinformatics Institute
SIB Swiss Institute of Bioinformatics
ANGIS Australian National Genome Information Service
CBR Canadian Bioinformatics Resource
CBI Peking Center for Bioinformatics
BIC Singapore Bioinformatics Centre
SANBI South African National Bioinformatics Institute
Sanger Institute

Biological Databases

The invention of various techniques and instruments for analyzing living beings at the molecular level has led to an explosion of data generated by the scientific community. This data cannot be stored on paper. It must be stored, organized, and indexed in an electronic database. In addition, we need tools to view, verify, and analyze this data and to interface it with other databases.


An electronic biological database is a large, organized body of persistent data that can be queried to add, update, extract, and remove data. Biological databases have to respond to the needs of their various users. The same biological data often means very different things to different researchers. For example, a physicist, a biochemist, and a biologist sitting in the same room would be interested in different aspects of the same protein. They might even use a different taxonomy to refer to the same protein. Even two biologists would be interested in looking at the protein from different perspectives.

Biological data is often highly connected, and these connections are essential for comprehension and discovery. A nucleotide sequence is linked to the protein it codes for. Nucleotide sequences are grouped into genes. A gene may code for one protein, several proteins, or none at all. This protein might have different names in different species. A protein belongs to a protein family and must be linked to its evolutionary relatives. We would also like to have links to scientific publications related to our protein, and to find out the methods and instruments used for its discovery, and even the parameters of the instrument used. Researchers frequently repeat experiments conducted by others to verify and improve their processes.

Why do we need biological databases?

Back in the 70s, researchers referred to the "Atlas of Protein Sequence and Structure" by Margaret Dayhoff to find information on their protein of interest. Since then biological data has exploded to the point that we can no longer imagine publishing all of it on paper. One of the earliest electronic databases was PIR (http://pir.georgetown.edu), which was essentially run by a group of researchers. This was a significant improvement since it offered the advantage of adding, updating, deleting and, most importantly, searching the data in a much more efficient manner. Today PIR is no longer actively maintained: the site is live but serves only as an archive. It could not cope with the growing demands, whereas databases such as Swiss-Prot are built to cope with them.

Today, biology is a data-rich science where each experiment generates enormous amounts of data. We can no longer analyze all this data by eye. We need powerful data analysis tools to help us interpret and understand the significance of this data. Biological databases offer data storage facilities and various tools which help us understand and analyze the data.

Nucleotide Sequence Databases

Each database is different; however, a nucleotide sequence entry is expected to contain at least the following:

id and/or accession number
taxonomic data
references
annotation/curation
keywords
cross references
sequences
documentation

Annotation refers to adding extra information regarding a certain record in a database.
Curation refers to evaluating what goes into the database and what is not fit to go into the database.

First Generation Nucleotide Sequence Databases


The first generation nucleotide sequence databases are essentially sequence archives. The data is present in the database as it was determined and interpreted by its publisher. The original authors retain full control of the information they submitted. As one can imagine, this results in a multitude of problems such as:

data of varying quality and lengths
highly redundant data
errors in sequence, annotations, etc.
lack of consistency

Second Generation Nucleotide Sequence Databases

The second generation nucleotide sequence databases were built with an eye on lessons learned from the first generation nucleotide sequence databases. The goal is to have one sequence entry for every naturally occurring molecule. In RefSeq, a second generation database, chromosome, gene, mRNA, and protein data are curated. Other data such as contigs, model mRNA, and model protein are calculated. A gene can result in multiple products. In such a case, separate RefSeq ids are used for each product and all are linked by a Locus Id. Second generation nucleotide sequence databases are essentially gene-centric databases.

Gene-Centric Databases

In a gene-centric database, all information relevant to a given gene is made accessible at once. Entrez Gene and RefSeq are the most commonly used. Entrez Gene is tightly linked to RefSeq. RefSeq, the Reference Sequence collection, aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript RNA, and protein products.

Gene-centric databases contain gene-specific information and focus on genomes that have been completely sequenced, that have an active research community to contribute gene-specific information, or that are scheduled for intense analysis. The content of Entrez Gene represents the result of curation and automated integration of data from NCBI's RefSeq and other collaborating databases.

Genome-Centric Databases

Genome-centric databases contain information about the gene sequence, relative position, strand orientation, biochemical functions, etc. Ensembl and TIGR are information management systems that are able to connect specialized sequence collections and browsing tools.

GenBank: case study

GenBank is a comprehensive public database of nucleotide sequences built and distributed by the NCBI. GenBank is primarily built from the sequence data submissions from authors and from the bulk submission of ESTs, GSS and other high-throughput data from sequencing centers.

EST: Expressed Sequence Tags are produced by one-shot sequencing of a cloned cDNA.
GSS: Genome Survey Sequences are similar to ESTs with the exception that most of the sequences are genomic in origin.

GenBank doubles in size every 18 months. WGS and environmental sequences now occupy a significant space in the databases.

WGS: Whole Genome Shotgun sequences are contigs of a sequencing project. WGS data can contain annotation and should be updated as sequencing progresses.
Contig: A contig is a DNA sequence assembled from DNA fragments of 100-300 base pairs.
Environmental Sequences: These are all DNA sequences present in a sample. The sample often contains many different organisms and these organisms are very often unknown and unidentified.

Each GenBank entry includes a concise description of:

the sequence
the scientific name and taxonomy of the source organism
bibliographic references
a listing of areas of biological significance such as coding regions and their protein translations, transcription units, repeat regions and sites of mutations or modifications.

GenBank partitions sequences into divisions that roughly correspond to:

taxonomic groups such as bacteria (BCT), viruses (VRL), and rodents (ROD)
sequencing strategies such as EST, GSS, HTG, HTC and environmental sample (ENV) sequences

HTC: High throughput cDNA
HTG: High throughput genomic sequences, single-pass, unfinished genomic sequences

EST and HTC are RNA or cDNA. GSS, HTG, WGS, and ENV are DNA.

The data in GenBank, and the collaborating databases EMBL and DDBJ, are submitted primarily by individual authors to one of the three databases, or by sequencing centers as batches of EST, STS, GSS, HTC, WGS or HTG sequences. Data are exchanged daily with DDBJ and EMBL so that the daily updates from NCBI servers incorporate the most recently available sequence data from all sources. Virtually all records enter GenBank as direct electronic submissions.

EMBL, GenBank, DDBJ and Swiss-Prot use both identifiers and accession numbers to identify each entry. To make things more complicated, identifiers and accession numbers mean different things in different databases. In Swiss-Prot, identifiers are alphanumeric terms that are meaningful to a human being. For example, HBA_HUMAN refers to the human haemoglobin alpha chain. Identifiers can change but they rarely do. The accession number for HBA_HUMAN is P69905. Accession numbers are primary keys so they never change. If two entries are merged, the new entry will have both accession numbers. One would be the primary key and the other would be the secondary key. When an entry is split, new accession numbers are assigned to each resulting entry and the old accession number is noted as the secondary key.
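The merge and split rules for accession numbers can be sketched as a small data structure. This is only an illustration of the bookkeeping described above, not how any database actually implements it. The `Entry` class, the `merge` and `split` helpers, and every identifier or accession other than HBA_HUMAN / P69905 are hypothetical. It also assumes that secondary accessions already attached to an entry are carried along, which the text does not state.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    identifier: str          # human-readable name; may change (rarely)
    primary_accession: str   # stable primary key
    secondary_accessions: list = field(default_factory=list)


def merge(kept, absorbed, identifier):
    """Merge two entries: the kept entry's accession stays primary,
    the absorbed entry's accessions become secondary."""
    secondary = (kept.secondary_accessions
                 + [absorbed.primary_accession]
                 + absorbed.secondary_accessions)
    return Entry(identifier, kept.primary_accession, secondary)


def split(entry, new_entries):
    """Split an entry: each part gets a new primary accession and the
    old accession(s) are recorded as secondary keys."""
    old = [entry.primary_accession] + entry.secondary_accessions
    return [Entry(ident, acc, list(old)) for ident, acc in new_entries]


# HBA_HUMAN / P69905 come from the text; X00001..X00003 are made up.
merged = merge(Entry("HBA_HUMAN", "P69905"),
               Entry("HYPO_HUMAN", "X00001"),
               "HBA_HUMAN")
parts = split(merged, [("PART1_HUMAN", "X00002"), ("PART2_HUMAN", "X00003")])
```

Either way, a search by an old accession number can still find the current entry or entries, which is the point of keeping secondary keys.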

GenBank data can be retrieved by Entrez. Entrez covers over 30 biological databases containing DNA and protein sequence data, genome mapping data, population sets, phylogenetic sets, environmental sample sets, gene expression data, the NCBI taxonomy, protein domain information, protein structures from the Molecular Modeling Database (MMDB), and MEDLINE references via PubMed. Entrez is a very good system to use since it returns much more information than is available in GenBank alone.

Biological databases often come with useful tools. BLAST is a very powerful tool which allows sequence-similarity comparisons.

The GenBank database can be downloaded by FTP from ftp.ncbi.nih.gov.

This page is a brief summary of descriptions of Swiss-Prot, GenBank, and EMBL available on their websites.

Protein Sequences

There are two major protein sequence resources:

UniProt = Swiss-Prot + TrEMBL + PIR

NCBI-nr = Swiss-Prot + GenPept + PIR + RefSeq + PDB + PRF

In addition, there are several different specialized protein databases.

UniProt

UniProt is a central resource for protein sequence and function. The UniProt consortium (since 2003) consists of EMBL, SIB, and PIR. PIR is no longer being updated. It now only functions as an archive. UniProt itself is divided into several components.

UniProtKB/TrEMBL



UniProtKB/TrEMBL contains computer-annotated protein sequences. TrEMBL entries are produced by translating nucleic acid coding sequences (CDS) in EMBL using computer tools. In addition, it includes data from PIR. TrEMBL suffers from poor submission of annotated CDS.

TrEMBL is a platform for the improvement of automated annotation tools. A TrEMBL entry is created after applying many annotation tools such as SignalP, TMHMM, REP, etc. Then evidence tags are added to any part of a TrEMBL entry not derived from the original EMBL entry.

UniProtKB/Swiss-Prot

UniProtKB/Swiss-Prot contains manually annotated protein sequences. Swiss-Prot entries are produced by manually annotating TrEMBL entries. Before creating a Swiss-Prot entry, the sequence is checked and analyzed. The data is cross-checked with literature and external scientific expertise. Once an entry is moved to Swiss-Prot, it is deleted from TrEMBL. Data in Swiss-Prot does not migrate to TrEMBL. Together, Swiss-Prot and TrEMBL provide all known protein sequences in the public domain.

The goals of Swiss-Prot are:

Non-redundant: (one entry - one gene - one species)
Maximum manual annotation: maximum annotation of protein diversity
Maximum links to other databases

A Swiss-Prot entry contains:

ID and accession number
names and taxonomy
references
comments
cross-references
keywords
features
sequence

UniRef

One UniRef100 entry contains all identical sequences, including fragments.
One UniRef90 entry contains sequences that have at least 90% identity.
One UniRef50 entry contains sequences that have at least 50% identity.
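To make the thresholds concrete, here is a minimal sketch that computes percent identity for two already-aligned, equal-length sequences and reports which UniRef-style levels the pair would satisfy. It is a deliberate simplification: the function names are hypothetical, the toy sequences are made up, and the real UniRef clusters are built with dedicated clustering software rather than pairwise checks like this.

```python
def percent_identity(a, b):
    """Percentage of positions with the same residue in two
    equal-length (pre-aligned) sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    same = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * same / len(a)


def shared_levels(a, b):
    """Return the UniRef-style identity levels this pair satisfies."""
    pid = percent_identity(a, b)
    levels = []
    if pid == 100.0:
        levels.append("UniRef100")
    if pid >= 90.0:
        levels.append("UniRef90")
    if pid >= 50.0:
        levels.append("UniRef50")
    return levels


print(shared_levels("AAAAAAAAAA", "AAAAAAAAAC"))  # 9 of 10 positions agree
```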

UniParc

UniParc contains raw archived protein sequences.

Sequences and information in UniProt are accessible via text search, BLAST similarity search, and FTP.

Non-coding DNA

A remarkable variability exists in genome size among eukaryotes that has little correlation with organismal complexity, size or number of coding genes. Even a unicellular organism can have a larger genome than a mammal! This striking disparity is due to non-coding DNA.

Non-coding DNA describes DNA which does not contain instructions for making cell products. It constitutes a large portion of the genome of eukaryotes. Some of this non-coding DNA is involved in regulating the coding regions of DNA. Functions of the remaining non-coding DNA are still unknown.



The genome contains several types of non-coding regions (regions not coding for proteins). Non-coding regions can be found in three areas: genic DNA, genic DNA coding for ncRNA, and intergenic DNA.

Genic DNA is involved directly in gene expression. UTRs (untranslated regions of mRNA) and introns are genic DNA.

The intergenic region contains mostly repetitive regions. Functional regions, which constitute about 15% of intergenic regions, contain SARs (scaffold attachment regions), telomeres, and centromeres. The functions of the remaining 85% are unknown.

A SAR (scaffold attachment region) is an AT-rich segment of a eukaryotic genome that acts as an attachment point to the nuclear matrix. The nuclear matrix is a proteinaceous scaffold-like network that permeates the nucleus.

A telomere is a region of highly repetitive DNA at the end of a chromosome that functions as a disposable buffer. Every time linear eukaryotic chromosomes are replicated, the DNA polymerase complex is incapable of replicating all the way to the end of the chromosome; if it were not for telomeres, this would quickly result in the loss of useful genetic information.

The centromere is the site where spindle fibers of the mitotic spindle attach to the chromosome during mitosis. In most eukaryotes, the centromere has no defined DNA sequence. It typically consists of large arrays of repetitive DNA where the sequence within individual repeat elements is similar but not identical.

Repetitive DNA sequence classes

Much of this variation in genome size is due to non-coding, tandemly repeated DNA. A substantial fraction of eukaryote genomes is often composed of repetitive DNA.

1. Simple Repeats

Simple repeats are duplications of simple sets of DNA bases, typically 1-5 bp. CpG sites are among the most important simple repeats. A CpG island is a short stretch of DNA in which the frequency of the dinucleotide sequence CG is higher than in other regions. The p simply indicates that C and G are connected by a phosphodiester bond. To be classified as a CpG island, a sequence must be at least 200 bases long.

DNA methylation occurs at CG-rich sites. Methylated cytosines may be converted to thymine by deamination over the course of evolution (CpG -> TpG). Methylated (inactive) regions are thus poor in CpG. CpG islands are unmethylated regions of the genome that are associated with the 5' ends of genes which are frequently switched on. Often CpG islands overlap the promoter and extend about 1000 base pairs downstream into the transcription unit.
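A rough sketch of how CpG enrichment can be measured in a sequence follows. The 200-base minimum comes from the text above. The observed/expected ratio and the default cut-offs (GC fraction above 0.5, ratio above 0.6) are commonly used criteria that are assumed here, not taken from this document, and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CG dinucleotides cannot overlap each other
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count if C and G were placed independently: c * g / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply the length rule from the text plus the assumed cut-offs."""
    if len(seq) < min_len:
        return False
    gc_fraction, obs_exp = cpg_stats(seq)
    return gc_fraction > min_gc and obs_exp > min_obs_exp


# Artificial extremes, used only to exercise the function.
print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

In practice such a test is applied in a sliding window along a chromosome rather than to one isolated string.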

2. Tandem Repeats - DNA satellites

Tandem repeats are typically found at the centromeres and telomeres of chromosomes. These are duplications of more complex 100-200 base sequences. DNA satellites can further be divided into satellites, minisatellites, and microsatellites, based on the number of nucleotides involved.

3. Segmental Duplications

Segmental duplications are large blocks of 10-300 kbp which have been copied to another region of the genome.

4. Interspersed Repeats (Transposons)

Interspersed repeats are repeated DNA sequences located at dispersed regions in a genome. They are also known as mobile elements or transposable elements. LINEs are long interspersed elements. SINEs are short interspersed elements.

5. Pseudogenes



Pseudogenes are defined as nonfunctional sequences of DNA originally derived from functional genes (evolutionary relics). There are 2 major classes:

unprocessed pseudogenes, derived from gene duplication
processed pseudogenes, derived from retrotransposition of mRNA

Pseudogenes may be transcribed but not translated. Their chromosomal distributions appear random and dispersed. Pseudogenes can be considered as potogenes, i.e. DNA sequences with a probability of becoming new genes.

Processed pseudogenes are very similar to their closest corresponding human gene, being 94% complete in coding regions, with sequence similarity of 75% for amino acids and 86% for nucleotides.

Pseudogene.org is an organization which concentrates on pseudogenes.

Protein Coding DNA

In prokaryotes, one gene codes for one protein. Eukaryotes use a much more elaborate mechanism to increase sequence diversity and to enable themselves to produce new proteins.

Alternative promoter usage

Several exons are involved in coding for a single protein. Any one of several exons can be used to initiate expression. The choice of the initiating exon can generate a different isoform of the same protein. In other words, alternative promoter usage results in different isoforms of a protein.

Alternative splicing

RNA splicing is a precisely regulated co- and post-transcriptional process (occurring prior to mRNA translation) that removes introns and joins exons in a primary transcript.

During RNA splicing, exons can either be retained in the mature message or targeted for removal in different combinations to create a diverse array of mRNAs from a single pre-mRNA, a process referred to as alternative RNA splicing (tissue and cell specific).

There are four known modes of alternative splicing:

1. Alternative selection of promoters:
This is the only method of splicing which can produce an alternative N-terminus domain in proteins. In this case, different sets of promoters can be spliced with certain sets of other exons.

2. Alternative selection of cleavage/polyadenylation sites:
This is the only method of splicing which can produce an alternative C-terminus domain in proteins. In this case, different sets of polyadenylation sites can be spliced with the other exons.

3. Intron retaining mode:
In this case, instead of splicing out an intron, the intron is retained in the mRNA transcript. However, the intron must properly encode amino acids. The intron's code must be properly expressible, otherwise a stop codon or a shift in the reading frame will cause the protein to be non-functional.

4. Exon cassette mode:
In this case, certain exons are spliced out to alter the sequence of amino acids in the expressed protein.

mRNA editing

~15% of disease-causing mutations involve misregulation of alternative splicing (missplicing).

Exon order is not conserved. It can be scrambled, a technique used in alternative promoter usage.

Trans-splicing vs. Cis-splicing

Splicing prepares pre-mRNA in eukaryotes to produce mature mRNA. This mature messenger RNA is then prepared to undergo translation as part of protein synthesis to produce proteins. When the exons are in the SAME RNA transcript, it is called cis-splicing.

Trans-splicing is a form of splicing that joins two exons that are not within the same RNA transcript.



Exonic splicing enhancers (ESEs) pre-mRNA cis-acting elements

ESEs are discrete sequences within exons that promote both constitutive and regulated splicing. The precise mechanism by which ESEs facilitate the assembly of splicing complexes has been controversial. However, recent studies have provided insights into this question and have led to a new model for ESE function. Other recent work has suggested that ESEs are comprised of diverse sequences and occur frequently within exons. Ominously, these latter studies predict that many human genetic diseases linked to mutations within exons might be caused by the inactivation of ESEs.

Exon sequence enhancers prediction - http://rulai.cshl.edu/tools/ESE/

Alternative splicing database project - http://www.ebi.ac.uk/asd/index.html

Non-coding RNA

Non-coding RNAs represent ~10% of the genes but ~98% of all human transcripts. snRNA participates in post-transcriptional chemical modification or processing of different RNAs.

Micro RNAs (miRNAs) are a class of non-coding RNA gene. They play an important role in the regulation of translation and degradation of mRNAs through base pairing to partially complementary sites in the untranslated regions (UTRs) of the messenger.

Antisense transcription is transcription from the strand opposite to a protein-coding or sense strand. Computational analysis suggests that between 15 and 25% of mammalian genes overlap, giving rise to pairs of sense and antisense RNA. Antisense transcripts are almost universally associated with candidate imprinted loci, and also occur on the autosomes. They play roles in gene regulation involving degradation of the corresponding sense transcripts (RNA interference) as well as gene silencing at the chromatin level. The challenge is to determine the correct orientation for an expressed sequence, especially an expressed sequence tag (EST).

Antisense mRNA is an mRNA transcript that is complementary to endogenous mRNA. It is the noncoding strand complementary to the coding sequence of mRNA. Introducing a transgene coding for antisense mRNA is a strategy used to block expression of a gene of interest. A strand of antisense mRNA can also be introduced into the cytosol by microinjection. Radioactively-labelled antisense mRNA can be used to hybridise to endogenous sense mRNA, which can show the level of transcription of genes in various cell types.

ncRNA genes are found in genomic sequences by their sequence or structural homology.

tRNAs have conserved sequence elements. Programs use a combination of pattern searches, probabilistic methods, and (for eukaryotes) searches for Pol III promoters. tRNAscan is a very good program for finding tRNAs.

Pairwise Alignment

Much of bioinformatics involves sequences. Sequences are represented with strings of letters in an alphabet. DNA has an alphabet of 4 letters while proteins have an alphabet of 20 letters.
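As a small illustration of representing sequences as strings over an alphabet, the sketch below defines the two alphabets as Python sets and checks whether a string uses only permitted letters. The names `DNA_ALPHABET`, `PROTEIN_ALPHABET` and `is_valid` are hypothetical, and ambiguity codes (such as N for an unknown base) are deliberately ignored.

```python
DNA_ALPHABET = set("ACGT")                      # 4 letters
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids


def is_valid(seq, alphabet):
    """True if every symbol of seq belongs to the given alphabet."""
    return set(seq.upper()) <= alphabet


print(is_valid("GCGCATGGATTGAGCGA", DNA_ALPHABET))
print(is_valid("VKASQRTTVVK", DNA_ALPHABET))
print(is_valid("VKASQRTTVVK", PROTEIN_ALPHABET))
```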

The most basic sequence analysis is to ask if two sequences are related. This involves aligning two sequences and then deciding whether the sequences are related or whether the similarity is just due to chance. The key issues to ponder are:

1. what sorts of alignments should be considered
2. the scoring system used to rank alignments
3. the algorithm used to find optimal (or good) scoring alignments
4. the statistical methods used to evaluate the significance

Finding similarity between sequences is important for many biological questions. Some examples:

Finding similar proteins allows us to predict their function and structure.
Locating similar subsequences in DNA allows us to identify pockets of interest, such as regulatory elements.
Locating overlapping DNA sequences helps us in sequence assembly.

Two similar sequences are probably biologically similar. Very often similar sequences have similar 3D structures. This is important since the 3D structure of a protein defines its functions. In addition, similar sequences can come from two species which share a common ancestor, thereby indicating their evolutionary relationship. In other words, the residues occupying similar positions could have similar functional roles. Evolution tends to conserve the more efficient functional units. Therefore, important sequences which code for important proteins are conserved among organisms in nature.

In the absence of an understanding of the biological mechanisms, it is indispensable to compare a new unknown sequence to known sequences that we understand better. Therefore, the discovery of efficient and reliable algorithms is becoming more and more important as the number of sequences increases exponentially.

Similar, Identical, Homologous

Understanding the difference between similar and identical is crucial for sequence alignment. An identical pair is a pair of the same amino acid. A similar pair is a pair of amino acids which can be considered chemically similar at that position. Two amino acids are considered similar if one can be substituted for the other with a positive log-odds score from a scoring matrix.

VKASQRTTVVK
   ++RTTVVK
  PNKRTTV

In this example, T, V, R, and K are identical pairs while S,N and Q,K are similar pairs.
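The distinction between identical and similar pairs can be expressed as a simple rule on top of a scoring table. The sketch below is only illustrative: `SCORES` holds three hand-picked entries chosen to agree with the example above, any pair missing from the table is assumed to score -1, and a real analysis would use a complete substitution matrix instead.

```python
# Toy log-odds scores for unordered residue pairs (illustrative values only).
SCORES = {
    frozenset("SN"): 1,
    frozenset("QK"): 1,
    frozenset("AP"): -1,
}


def classify(a, b, scores=SCORES, default=-1):
    """Label an aligned pair as identical, similar, or different."""
    if a == b:
        return "identical"
    if scores.get(frozenset((a, b)), default) > 0:
        return "similar"   # substitution with a positive log-odds score
    return "different"


for pair in [("R", "R"), ("S", "N"), ("Q", "K"), ("A", "P")]:
    print(pair, classify(*pair))
```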

Similarity can often be misleading. It can reveal evolutionarily related sequences or it can align two sequences with completely different function and structure. The challenge is to differentiate between the former and the latter.

Sequence alignment

A sequence alignment takes two sequences over the same alphabet as input and outputs an alignment of the two sequences. Alignment simply refers to placing one symbol against another. It does not involve judging the quality of the alignment. An alignment consists of writing the two sequences one above the other and inserting gap symbols such that the two sequences have the same length. All methods are permitted as long as the order of the symbols in the sequences is not modified.

Let's look at the following two sequences:

GCGCATGGATTGAGCGA

TGCGCCATTGATGACCA

A possible alignment could be:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

The string GCGC is a perfect match. The G in column 9 of the first sequence is a mismatch since it is aligned with T. The - symbols are indels (insertions or deletions); they allow a better match to occur. Many different alignments are possible. The trick is to choose the most likely alignment. This is accomplished by scoring alignments and is covered in the next section.
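The column-by-column reading of an alignment can be checked mechanically. The following sketch, with a hypothetical helper name `summarize`, counts matches, mismatches and gap columns for the two aligned rows shown above. It only describes the alignment; it makes no judgment about its quality.

```python
def summarize(row1, row2):
    """Count match, mismatch and gap columns of two aligned rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            counts["gap"] += 1
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


row1 = "-GCGC-ATGGATTGAGCGA"
row2 = "TGCGCCATTGAT-GACC-A"
print(summarize(row1, row2))  # {'match': 13, 'mismatch': 2, 'gap': 4}
```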

Sequence identity refers to the occurrence of exactly the same nucleic acid or amino acid in the same position in two aligned sequences. Sequence similarity is meaningful only when possible substitutions are scored according to the probability with which they occur. Sequence homology indicates evolutionary relatedness among sequences. Two sequences are said to be homologous if they are both derived from a common ancestral sequence. Similarity refers to the presence of identical and similar sites in two sequences, while homology reflects a stronger claim that the two sequences share a common ancestor.

Similarity is not defined in a unique and exact manner. It is a mix of biological knowledge and mathematical and heuristic concepts. Sequence similarity is not about comparing two texts to state whether they are similar or different. A sequence similarity measure must be capable of tolerating gaps and substitutions. This is an optimization problem which can be formulated as a dynamic programming problem. The idea is to give a score to each pair of residues using a substitution matrix, and then to search for the insertions and deletions which maximize the global score. In addition, the degree of similarity must be validated biologically and statistically. It is also important to be able to distinguish between accidental similarity and similarity based on biological factors.

Note: Parts of this post are a summary of Durbin.

Scoring

Scoring Model

When we compare sequences, we are looking for evidence that they have diverged from a common ancestor by a process of mutation and selection. Basic mutational processes are:

Substitutions: Residue changes in the sequence
Insertions: Addition of a residue
Deletions: Removal of a residue

Insertions and deletions, together, are called gaps.

The total score we assign to an alignment is the sum of terms for each aligned pair of residues, plus terms for each gap. In a probabilistic interpretation, this corresponds to the log of the relative likelihood that the sequences are related, compared to being unrelated. In other words, it is the logarithm of the ratio between the probability of the alignment if the sequences are related and its probability if they are unrelated.

The easiest scoring method is to assume that each element of the sequence evolved independently and that the probability of a mutation is 1/20. However, this is an erroneous assumption since some changes are more plausible than others. The plausibility depends on the properties of the amino acids. Substitutions which are likely to preserve the structure and function of the protein are more likely to be retained over evolution than ones which modify them. It is, therefore, expected that identities and conservative substitutions are more likely to occur in related sequences than by chance. Thus true positives are more likely to have a positive score while random substitutions are expected to contribute towards a negative score.

Using an additive scoring scheme corresponds to an assumption that mutations at different sites of the sequence have occurred independently, with each gap also treated as a mutation. This seems to be a reasonable assumption for DNA and protein sequences. However, it is seriously inaccurate for structural RNAs, where base pairing introduces long-range dependencies between sites.

The additive scoring function is defined as follows:

σ(x,y) is the score of replacing x by y
σ(x,-) is the score of deleting x
σ(-,x) is the score of inserting x

The score of an alignment is the sum of its position scores.

The optimal or maximal score between two sequences is the maximal score of all alignments of these sequences, namely:

d(s1,s2) = max(alignment score)

The additive form of the score allows us to perform dynamic programming to compute optimal score efficiently.

Substitution Matrices / Scoring Matrices

What you really want to learn when evaluating a sequence alignment is whether a given alignment is random or meaningful. To assess the meaningfulness of an alignment we construct a scoring matrix.

A scoring matrix is a table of values that describe how likely each residue pair is to occur in an alignment. The values in a scoring matrix are logarithms of ratios of two probabilities. One is the probability that the two residues occur together by chance, given their background frequencies. The other is the probability of a meaningful occurrence of the pair of residues in an alignment of related sequences.

In order to score an alignment, the alignment program needs to know whether it is more likely or less likely that a given amino acid pair occurred randomly. A negative log-odds score means that the pair is more likely to have arisen by chance, while a positive score indicates an evolutionary relationship. It is important to note that the scores are logarithms, so adding the scores of aligned pairs corresponds to multiplying their odds ratios.

Notation

sequences: x, y
xi is the ith symbol in x, yj is the jth symbol in y
A is the alphabet, e.g. A = {A, C, G, T} for DNA
symbols from the alphabet are a, b, ...

The unrelated or random model R is the simplest. It assumes that symbol a occurs independently with some frequency qa, so the probability of two sequences is just the product of the probabilities of each residue:

$$P(x,y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$

This is the product of the frequencies of each element of sequence x multiplied by the product of the frequencies of each element of sequence y.

In the alternative match model M, aligned pairs of residues occur with a joint probability pab. pab can be thought of as the probability that the residues a and b have each independently been derived from some unknown original residue c which was present in their common ancestor. This gives:

$$P(x,y \mid M) = \prod_i p_{x_i y_i}$$

Joint probability is the probability of two or more things happening at once. In our case, this is the probability of finding residue a in one sequence aligned with residue b in the other. In this model, we take the product of these joint probabilities over all aligned positions.

The ratio between the values computed by these two formulas is called the odds ratio. When we take the log of this ratio, we arrive at the log-odds ratio. The log likelihood ratio of a residue pair is computed with:

$$s(a,b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$$

This is the log of the joint probability of a pair divided by the product of the frequencies of each member of the pair. The sum of this value over all aligned pairs gives us the log-odds ratio of the alignment:

$$S = \sum_i s(x_i, y_i)$$

The log-odds ratio can also be looked at as the log of P(x,y|M) / P(x,y|R).

There are several ways to derive substitution scores; however, substitution scoring based on probabilistic models seems to be the most accurate.

In order to arrive at an additive scoring system, we take the log of this ratio. The log likelihood ratios can be arranged in a matrix. DNA has a 4 x 4 matrix while proteins have a 20 x 20 matrix. This matrix is called the score matrix or substitution matrix. The BLOSUM (e.g. Blosum50) and PAM families are the most commonly used matrices.

Substitution matrices essentially make a statement about the probability of observing ab pairs in real alignments.
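As a rough illustration of how log-odds scores could be computed from pair probabilities and background frequencies, here is a minimal Python sketch. The two-letter alphabet and all probabilities are made up for illustration, and the function names `log_odds_matrix` and `alignment_score` are hypothetical, not taken from any library.

```python
import math

def log_odds_matrix(pair_probs, background):
    """Return s(a,b) = log(p_ab / (q_a * q_b)) for every pair in pair_probs."""
    return {
        (a, b): math.log(p_ab / (background[a] * background[b]))
        for (a, b), p_ab in pair_probs.items()
    }

def alignment_score(x, y, scores):
    """Sum s(x_i, y_i) over an ungapped alignment of two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs sequences of equal length")
    return sum(scores[(a, b)] for a, b in zip(x, y))

# Hypothetical two-letter alphabet with made-up probabilities (illustration only).
q = {"A": 0.5, "B": 0.5}                      # background frequencies q_a
p = {("A", "A"): 0.4, ("B", "B"): 0.4,        # joint probabilities p_ab
     ("A", "B"): 0.1, ("B", "A"): 0.1}
s = log_odds_matrix(p, q)

print(s[("A", "A")])   # positive: identities are more common than chance
print(s[("A", "B")])   # negative: this pair is rarer than chance
print(alignment_score("AAB", "ABB", s))
```

With these toy numbers identical pairs receive a positive score and mismatched pairs a negative one, which is the behaviour the text describes for real substitution matrices.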

Gap Penalties

DNA sequences change not only by point mutation, but by insertion and deletion of residues as well. Consequently, it is often necessary to introduce gaps into one or both of the sequences being aligned to produce a meaningful alignment between them.

Gaps have to be penalized. The standard cost associated with a gap of length g is given either by a linear score or an affine score:

$$\gamma(g) = -gd \quad \text{(linear)}$$

$$\gamma(g) = -d - (g-1)e \quad \text{(affine)}$$

where d is called the gap-open penalty and e is called the gap-extension penalty.

Most sequence alignment models use affine gap penalties, where the cost of opening a gap in a sequence is different from the cost of extending a gap that has already been started. The extension penalty e is usually set to a number less than the gap-open penalty d. This allows long insertions and deletions to be penalized less than they would be under a linear gap cost. This is desirable when gaps of a few residues are expected almost as often as gaps of a single residue. [1]

Gap penalties also correspond to a probabilistic model of alignment, although this is less widely recognized than the probabilistic basis of substitution matrices. We assume that the probability of a gap occurring at a particular site in a given sequence is the product of a function f(g) of the length of the gap, and the combined probability of the set of inserted residues. In other words, the length of a gap is not correlated to the residues it contains. Here the gap penalties correspond to the log probability of a gap of that length. [1]

On the other hand, if there is evidence for a different distribution of residues in gap regions then there should be residue-specific scores for the unaligned residues in gap regions. These scores should be equal to the logs of the ratio of their frequencies in gapped versus aligned regions. For example, this would apply if polar amino acids were expected to occur in gaps more often than hydrophobic ones. [1]

Gap penalties are intimately tied to the scoring matrix used to align the sequences. The best pair of gap opening and extension penalties for one scoring matrix doesn't necessarily work with another.

Linear Gap Penalty

Linear gap penalties are the simplest type of gap penalty. The only parameter, d, is a penalty per gap position. Expressed as a score contribution it is almost always negative, so that the alignment with fewer gapped positions is favored over the alignment with more. Under a linear gap penalty, the overall penalty for one large gap is the same as for many small gaps of the same total length.

Affine Gap Penalty

Affine gap penalties attempt to overcome this problem. In biological sequences, for example, it is much more likely that one big gap of length 10 occurs in one sequence, due to a single insertion or deletion event, than it is that 10 small gaps of length 1 are made. Therefore, affine gap penalties have a gap opening penalty, c, and a gap extension penalty, e. A gap of length l is then given a penalty c + (l-1)e. So that gaps are discouraged, c and e are almost always negative.

Since a few large gaps are better than many small gaps, e is almost always smaller in magnitude than c.
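The difference between the two gap models can be checked with a few lines of Python. This sketch follows the convention in which d and e are positive penalties and the function returns a negative score; the values d = 8 and e = 2 are arbitrary choices for illustration.

```python
def linear_gap(g, d):
    """Score of a gap of length g under a linear penalty d per gap position."""
    return -g * d

def affine_gap(g, d, e):
    """Score of a gap of length g with gap-open penalty d and gap-extension penalty e."""
    return -d - (g - 1) * e

d, e = 8, 2  # arbitrary illustrative penalties

# One gap of length 10 versus ten separate gaps of length 1.
print(linear_gap(10, d), 10 * linear_gap(1, d))        # identical under the linear model
print(affine_gap(10, d, e), 10 * affine_gap(1, d, e))  # one long gap is much cheaper
```

Under the linear model both scenarios cost the same, whereas the affine model charges far less for the single long gap, which matches the biological intuition described above.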

Source

[1] Durbin

Significance of Scores

Once we have an optimal alignment, we need to assess the significance of its score. This would permit us to decide if it is a biologically meaningful alignment or not.

We look at the distribution of the maximum of N match scores to independent random sequences. If the probability of this maximum being greater than the observed best score is small, then the observation is considered significant.

Alignment Algorithms

Given a scoring system, we need to have an algorithm for finding an optimal alignment for a pair of sequences. When both sequences have the same length and no gaps are allowed, there is only one possible global alignment of the complete sequences, but things get complicated once gaps are allowed or when we look for local alignment between subsequences of two sequences. It is not computationally feasible to enumerate all possible alignments. For two sequences of length n, there are

$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

possible global alignments between the two [1]. Clearly, the number of alignments grows exponentially with n, so exhaustive enumeration is impractical.
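To get a feeling for how quickly the number of alignments grows, the following sketch evaluates the count for a few lengths. It assumes the count is the central binomial coefficient C(2n, n), as given in Durbin [1], and compares it with the approximation 2^(2n) / sqrt(pi n). The function names are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math

def count_global_alignments(n):
    """Number of global alignments of two length-n sequences, assumed to be C(2n, n)."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Stirling-based approximation 2^(2n) / sqrt(pi * n) of the same count."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (5, 10, 50, 100):
    print(n, count_global_alignments(n), f"{stirling_estimate(n):.3e}")
```

Even for n = 100 the count has around 59 decimal digits, which is why enumeration is replaced by dynamic programming.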

Dynamic Programming

A dynamic programming algorithm is an algorithm for finding optimal alignments under additive alignment scores. Dynamic programming is crucial for computational sequence analysis. Unlike heuristic methods, dynamic programming algorithms are guaranteed to find the optimal scoring alignment or set of alignments. Dynamic programming involves dividing the problem into smaller problems and storing the results in a table. It is like recursion with memory. In the previous section, we defined the additive scoring function σ as:

σ(x,y) is the score of replacing x by y
σ(x,-) is the score of deleting x
σ(-,x) is the score of inserting x

The score of an alignment is the sum of its position scores. The optimal score between two sequences is the score of the alignment which gives the maximal score. As we have just seen, enumerating all possible alignments is not feasible. In a log-odds ratio scoring scheme, a better alignment produces a higher score. To find the optimal alignment, we would like to maximize the score. In terms of a Blosum50 matrix, we want to collect as many positive entries and as few negative entries as possible. Since dynamic programming is recursion with memory, let's look at how the recursion argument is constructed.

Suppose we have two sequences:

s[1..n+1] and t[1..m+1]

The best alignment must be one of three cases:

1. Last match is (s[n+1], t[m+1])
2. Last match is (s[n+1], -)
3. Last match is (-, t[m+1])

Thus:

1. d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m]) + σ(s[n+1], t[m+1])
2. d(s[1..n+1], t[1..m+1]) = d(s[1..n], t[1..m+1]) + σ(s[n+1], -)
3. d(s[1..n+1], t[1..m+1]) = d(s[1..n+1], t[1..m]) + σ(-, t[m+1])

where σ(a,b) is the score of aligning a with b, and a pair containing - is charged the gap cost. The optimal score is the maximum of these three values.
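The three-case argument above translates almost literally into a recursive function with a cache ("recursion with memory"). The sketch below is only an illustration: the scores (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and `optimal_score` is a hypothetical name. Python strings are 0-based, so `d(i, j)` refers to the prefixes `s[:i]` and `t[:j]`.

```python
from functools import lru_cache

def optimal_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t under a simple additive scoring."""

    def sigma(a, b):
        # Additive position score; "-" denotes a gap.
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    @lru_cache(maxsize=None)
    def d(i, j):
        # Best score for aligning the prefixes s[:i] and t[:j].
        if i == 0 and j == 0:
            return 0
        options = []
        if i > 0 and j > 0:   # case 1: last column is (s[i], t[j])
            options.append(d(i - 1, j - 1) + sigma(s[i - 1], t[j - 1]))
        if i > 0:             # case 2: last column is (s[i], -)
            options.append(d(i - 1, j) + sigma(s[i - 1], "-"))
        if j > 0:             # case 3: last column is (-, t[j])
            options.append(d(i, j - 1) + sigma("-", t[j - 1]))
        return max(options)

    return d(len(s), len(t))

print(optimal_score("AAAC", "AGC"))
```

The cache ensures that each pair (i, j) is evaluated only once, so the work is proportional to the product of the two lengths. For long sequences an explicit table is preferable, since deep recursion can exceed Python's recursion limit.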

Global Alignment: Needleman & Wunsch Algorithm

We now construct a matrix F indexed by i and j, one index for each sequence, where F(i,j) is the score of the best alignment between the initial segments s[1..i] and t[1..j] of each sequence. F(i,j) is built recursively.

F(i,j) = d(s[1..i], t[1..j])

Using our recursive argument, we get the following recurrence:

$$F(i+1,j+1) = \max \begin{cases} F(i,j) + \sigma(s[i+1], t[j+1]) \\ F(i,j+1) + \sigma(s[i+1], -) \\ F(i+1,j) + \sigma(-, t[j+1]) \end{cases}$$

Graphically, this means that each cell F(i+1,j+1) is computed from three neighbouring cells: the diagonal cell F(i,j), the cell F(i,j+1) and the cell F(i+1,j).

Certain texts write this algorithm from the perspective of F(i-1,j-1), but I find this method more intuitive. This, of course, makes no difference in the algorithm.

We first need to handle the base cases of the recursion.

F(0,0) = 0
F(i+1,0) = F(i,0) + σ(s[i+1],-)
F(0,j+1) = F(0,j) + σ(-,t[j+1])

This allows us to fill the first column and the first row. Since we are using linear gaps, we need to assign a gap cost. Here, I have assigned a gap cost of 2. So the values for the first row and column would be 0, -2, -4, etc.

Now we need to find F(1,1) for the sequences s = AAAC and t = AGC. We know that A and A are a perfect match. Therefore, we add 1 to the first term since it represents a perfect match. The other two terms represent the (A,-) and (-,A) cases. To fill in a value for F(1,1), and to fill the rest of the table, we need to find the maximum of the three. The values below assume a mismatch score of -1.

F(1,1) = max(0+1, -2-2, -2-2) = 1
F(1,2) = max(-2-1, -4-2, 1-2) = -1
...

Remember that (A,-) and (-,A) are penalized by the gap cost.

Thus the conclusion is that d(AAAC, AGC) = -1. To find the best alignment, we need to trace back to F(0,0). In this step, we start from the last cell and follow arrows back through the cells that were used to derive each value.

The traceback gives us the best alignment. In this case, one optimal alignment is:

AAAC
AG-C

We chose an arbitrary gap cost for our example. If we had chosen a different value such as 8, we would still have gotten the same traceback.

This algorithm has both space and time complexity of O(mn), since filling the table requires O(mn) and the traceback requires O(m+n).
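A table-based version with traceback might look like the following sketch. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions for the example, and `needleman_wunsch` is a hypothetical function name. Several alignments can tie for the best score; which one is reported depends on the order in which the three moves are tried during traceback (here: diagonal first, then up, then left).

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Return (optimal score, aligned s, aligned t) for a global alignment."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]

    # Base cases: first column and first row contain accumulated gap costs.
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + gap
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + gap

    # Fill the table with the maximum of the three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,  # align s[i] with t[j]
                          F[i - 1][j] + gap,      # align s[i] with a gap
                          F[i][j - 1] + gap)      # align t[j] with a gap

    # Traceback from the last cell to F[0][0].
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + sub:
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

score, a, b = needleman_wunsch("AAAC", "AGC")
print(score)
print(a)
print(b)
```

The table uses (n+1) x (m+1) cells, which matches the O(mn) space and time stated above, and the traceback visits at most n+m cells.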

In programming terms, N&W involves an iterative matrix method of calculation. All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array. All possible alignments (comparisons) are represented by pathways through this array.

The following four steps are necessary to align sequence1 of N positions with sequence2 of M positions:

1. Build a matrix of size (N+1) * (M+1), with one extra row and column for the base cases;
2. Assign similarity values;
3. For each cell, look at all possible pathways back to the beginning of the sequence and give that cell the value of the maximum scoring pathway;
4. Construct an alignment (pathway) back from the final cell to give the highest scoring alignment.

Try out graphical alignment at http://www.itu.dk/people/sestoft/bsa/graphalign.html

Local Alignment: Smith-Waterman Algorithm


Global alignment is useful when we want to align two sequences completely. Very often, however, two sequences do not align completely. In fact we are usually more interested in finding the best alignment of subsequences. For example, we would like to find out if human and mouse haemoglobins are homologous. The highest scoring alignment of subsequences of sequence s and sequence t is called the best local alignment.

The algorithm for finding local alignments is similar to the global alignment algorithm with two notable differences.

1. F(i,j) can take the value 0 if all other options are less than 0. A value of 0 corresponds to starting a new local alignment.
2. The traceback can start from anywhere in the matrix. It starts at the maximum value and ends at 0.

The algorithm is as follows:

$$F(i+1,j+1) = \max \begin{cases} 0 \\ F(i,j) + \sigma(s[i+1], t[j+1]) \\ F(i,j+1) + \sigma(s[i+1], -) \\ F(i+1,j) + \sigma(-, t[j+1]) \end{cases}$$

The base cases are:

F(0,0) = 0
F(i+1,0) = max(0, F(i,0) + σ(s[i+1],-))
F(0,j+1) = max(0, F(0,j) + σ(-,t[j+1]))

If we have two sequences, s=TAATA and t=TACTAA, we would get the following alignments:

TAATA_

TACTAA

___TAATA

TACTAA

For local alignment to work, the expected score for a random match must be negative. If that is not true, then long matches between entirely unrelated sequences will have high scores, just based on their length. As a consequence, although the algorithm is local, the maximal scoring alignments would be global or nearly global. A true subsequence alignment would be likely to be masked by a longer but incorrect alignment, just because of its length. Similarly, there must be some σ(a,b) greater than 0, otherwise the algorithm won't find any alignment at all. [1]

The expected score of a random match is required to be negative. In the ungapped case, only the expected value of a fixed length alignment needs to be considered, and in the random model all residues are independent. This gives the following condition:

$$\sum_{a,b} q_a q_b \, s(a,b) < 0$$

where qa is the probability that symbol a occurs at any given position in a sequence. When s(a,b) is derived as a log likelihood ratio, using the same qa as for the random probabilities, the condition above is satisfied. No equivalent analysis for optimal gapped alignments exists. There is no analytical method for predicting which gap scores will result in local vs. global alignment behavior.
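A minimal sketch of the local alignment procedure is given below, again under assumed scores (match +1, mismatch -1, gap -2) and with a hypothetical function name. Every cell is floored at 0, the best cell is remembered while the table is filled, and the traceback stops as soon as it reaches a cell with value 0. If several cells share the maximum, this sketch reports the first one found.

```python
def smith_waterman(s, t, match=1, mismatch=-1, gap=-2):
    """Return (best local score, aligned part of s, aligned part of t)."""
    n, m = len(s), len(t)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay at 0
    best, best_pos = 0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            F[i][j] = max(0,                      # start a new local alignment
                          F[i - 1][j - 1] + sub,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback from the maximum cell until a 0 is reached.
    top, bottom = [], []
    i, j = best_pos
    while F[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

score, a, b = smith_waterman("TAATA", "TACTAA")
print(score)
print(a)
print(b)
```

With these scores a mismatch costs as much as a match gains, so the expected score of a random pair over a four-letter alphabet is negative, as the text requires.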



Repeated Matches

If one or both sequences are long enough, we would most probably find several different local alignments with a significant score. For example, we might find several copies of a repeated domain or motif in a protein. Here we are interested in an asymmetric method which finds one or more non-overlapping copies of sections of one sequence (e.g. domain or motif) in the other. [1]

In our algorithm, we are interested in sequence matches with a score higher than a certain threshold T. The reason behind defining T is that we would always find small subsequences with small positive scores, which would quite likely match unrelated sequences.

The following notation is used for this algorithm:

y is a sequence containing some domain or motif
x is the sequence in which we are looking for multiple matches
T is some threshold score value

Once again, we use the F(i,j) matrix but the recurrence is now different, as is the meaning of F(i,j). In the final alignment, x will be partitioned into regions that match parts of y in gapped alignments, and regions that are unmatched. The score of a completed match region is the standard gapped alignment score minus the threshold T. [1]

The algorithm obtains all local alignments in one pass. Changing the value of T changes what the algorithm finds.



Overlap Matches



Source:

[1] Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin.

[2] Dr. Nir Friedman's lectures: www.cs.huji.ac.il/~nir
[3] http://www.itu.dk/people/sestoft/bsa/graphalign.html
[4] http://thor.info.uaic.ro/~ciortuz/SLIDES/pairAlign.pdf

Dynamic Programming with more complex models


The linear gap scoring scheme is not ideal for biological sequences since gaps are often longer than one residue. If we are given a general gap function γ(g) then we can still use all the dynamic programming algorithms described in the previous section with adjustments to the recurrence relation. This, however, requires O(n^3) operations, which is impractical for long sequences.

Alignment with affine gap scores

The alternative is to assume an affine gap cost structure. This brings us back to an O(n^2) implementation of dynamic programming. This, however, requires us to keep track of multiple values for each pair of residue indices (i,j) in place of the single value F(i,j). This corresponds to a finite state automaton (FSA). An alignment corresponds to a path through the states, with symbols from the underlying pair of sequences being transferred to the alignment according to the (i,j) values in the states.
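One common way to keep "multiple values per (i,j)" is to maintain three tables: M for positions ending in an aligned pair, Ix for positions ending in a gap in the second sequence, and Iy for positions ending in a gap in the first. The sketch below follows this three-state formulation for global alignment and returns only the score. The scores (match +1, mismatch -1, d = 4, e = 1) and the function name are assumptions for illustration, and, as in the simplest form of this recurrence, a gap in one sequence is not allowed to be followed directly by a gap in the other.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, d=4, e=1):
    """Global alignment score with affine gaps: a gap of length g costs d + (g-1)*e."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x[i] aligned to y[j]
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x[i] aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y[j] aligned to a gap

    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e   # leading gap of length i
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e   # leading gap of length j

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + sub
            Ix[i][j] = max(M[i - 1][j] - d,    # open a new gap
                           Ix[i - 1][j] - e)   # extend an existing gap
            Iy[i][j] = max(M[i][j - 1] - d,
                           Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])

# Six matches and one gap of length 3: 6 - (4 + 2*1) = 0
print(affine_global_score("AAACCCGGG", "AAAGGG"))
```

Each cell is still computed from a constant number of neighbours, so the running time stays quadratic, only with three tables instead of one.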

Heuristic Algorithms

Dynamic programming is guaranteed to find the best solution but has a complexity of O(mn). Heuristic algorithms do not guarantee the best solution but are very fast in comparison with exact algorithms such as dynamic programming.

BLAST

BLAST finds high scoring local alignments between a query sequence and a target database, both of which can be either DNA or protein. The idea is that true match alignments are very likely to contain short stretches of high scoring identities. We use these as seeds and extend the alignments from them. BLAST makes a list of all neighborhood words of a fixed length that would match the query sequence somewhere with a score higher than some threshold.

FASTA

FASTA uses a multistep approach to finding local high scoring alignments:

1. Use a lookup table to locate all identically matching words of length ktup between the 2 sequences.
2. Look for diagonals with many mutually supporting word matches.
3. Pursue the best diagonals, extending the exact word matches to find maximal scoring ungapped regions. This is analogous to hit extension in BLAST.
4. Check to see if any of these ungapped regions can be joined by a gapped region, allowing for gap costs.
5. The highest scoring candidate matches in a search database are realigned using the full dynamic programming algorithm, but restricted to a subregion of the dynamic programming matrix forming a band around the candidate heuristic match.
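The first two steps of this procedure (word lookup and diagonal counting) can be sketched in a few lines. This is a simplified illustration, not the actual FASTA implementation: the function names are hypothetical, and a diagonal is identified here by the offset i - j between the word position in the query and in the target.

```python
from collections import Counter, defaultdict

def word_index(seq, ktup):
    """Lookup table mapping each word of length ktup to its start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - ktup + 1):
        index[seq[i:i + ktup]].append(i)
    return index

def diagonal_hits(query, target, ktup=2):
    """Count exact word matches per diagonal (offset = query position - target position)."""
    index = word_index(target, ktup)
    diagonals = Counter()
    for i in range(len(query) - ktup + 1):
        for j in index.get(query[i:i + ktup], []):
            diagonals[i - j] += 1
    return diagonals

hits = diagonal_hits("ACGTT", "GGACGTT", ktup=2)
print(hits.most_common(1))  # the diagonal with the most supporting word matches
```

A diagonal with many hits indicates a region where the two sequences agree without gaps, which is the candidate that the later steps extend and rescore.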

Hidden Markov Models

Hidden Markov Models (HMMs) have many applications in bioinformatics. They are, for example, used to search for patterns in a sequence. Here pattern refers to a particular chain of characters arranged in a particular order, e.g. a TATA box or a CpG island.

Patterns

Patterns can be deterministic or non-deterministic at initial inspection. For example, traffic lights follow a predictable pattern. Yellow follows green and red follows yellow. Weather, however, does not follow a predictable pattern in most parts of the planet. A sunny day can be followed by a rainy day, a cloudy day or even another sunny day.

Markov Chains

It is necessary to learn Markov chains before one can understand hidden Markov models.

Suppose we are looking for CpG islands in a sequence. If we are using a probabilistic model, we would want a model where the probability of a symbol depends on the previous symbol. A Markov chain is such a model.

A Markov chain is a set of states connected by arrows called transitions. Each transition has a probability parameter associated with it. The probability parameter on an arrow from C to G represents the probability of a G following a C.

A finite Markov chain is an integer-time stochastic process, consisting of:

1. a domain D of m states {s1,...,sm}
2. an m-dimensional initial distribution vector (p(s1),...,p(sm))
3. an m x m transition probability matrix M = (a_{s_i s_j})

For DNA,

1. D = {A,C,G,T}
2. p(A) is the probability of A being the first letter in the sequence. aAG is the probability that G would follow A in a sequence.
3. The matrix M is a 4 x 4 table with one row and one column for each nucleotide.

This matrix represents the transition probabilities, M = (ast). Note that the sum of each row of the matrix is 1.

The transition probability is represented as follows:

$$a_{st} = P(x_i = t \mid x_{i-1} = s)$$

Note that this is simply the conditional probability P(t|s), the probability of t occurring given that s has occurred immediately before it. A key property of Markov chains is that the probability of each symbol xi depends only on the preceding symbol xi-1 rather than the entire sequence, so the probability of a sequence x of length L is: [1]

$$P(x) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i}$$

This equation shows that we need to specify P(x1), the probability of starting in a particular state, in addition to specifying the transition probabilities. We now add a begin state and an end state to our model to ensure that the beginning and end are modeled. $\mathcal{B}$ is the begin state and $\mathcal{E}$ is the end state.

$$P(x_1 = s) = a_{\mathcal{B}s}, \qquad P(\mathcal{E} \mid x_L = t) = a_{t\mathcal{E}}$$

We do not need to associate any symbol with the begin and end states. They can just serve as points where transitions begin and end. The end state is useful in modeling the distribution of lengths of sequences. The distribution over lengths decays exponentially.

Using Markov chains for discrimination

In human genomes the C in the pair CG is often methylated, and methyl-C has a high chance of mutating into T, turning CG into TG. Hence the pair CG appears less often than expected from the independent frequencies of C and G alone. For biological reasons, this process is sometimes suppressed in short stretches of the genome, such as the start regions of many genes. These areas are called CpG islands.

To be able to discriminate between CpG islands and non-CpG regions, we need to model strings with and without CpG islands as Markov chains over the same states {A,C,G,T} but with different transition probabilities:

+ model: Use transition matrix a+, where a+st is the probability that t follows s in a CpG island
- model: Use transition matrix a-, where a-st is the probability that t follows s in a non-CpG island

We produce a matrix for the + model and another for the - model. To use these models for discrimination, we calculate the log-odds ratio (with x0 denoting the begin state):

$$S(x) = \log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}}$$

The individual terms log(a+st / a-st) form another matrix, which should ideally discriminate between CpG islands and other regions by positive and negative scores. If the likelihood ratio is greater than 1, that is, if the log-odds score is positive, a CpG island is more likely.
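The discrimination idea can be sketched as follows. Transition probabilities for each model are estimated from counts of adjacent pairs in training sequences, and a query is scored by summing log ratios over its adjacent pairs. The training sequences below are tiny made-up strings, not real genomic data; the pseudocount of 1 and the function names are assumptions; and for simplicity the sketch ignores the begin-state term and scores transitions only.

```python
import math

ALPHABET = "ACGT"

def estimate_transitions(sequences, pseudocount=1.0):
    """Estimate a_st from adjacent-pair counts; the pseudocount avoids zero probabilities."""
    counts = {s: {t: pseudocount for t in ALPHABET} for s in ALPHABET}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {
        s: {t: counts[s][t] / sum(counts[s].values()) for t in ALPHABET}
        for s in ALPHABET
    }

def log_odds(seq, plus, minus):
    """Sum of log(a+_st / a-_st) over adjacent pairs; positive favours the + model."""
    return sum(math.log(plus[s][t] / minus[s][t]) for s, t in zip(seq, seq[1:]))

# Made-up toy training data (illustration only, not real CpG island sequences).
plus_model = estimate_transitions(["CGCGCGCGGCGC", "GCGCGCCGCG"])
minus_model = estimate_transitions(["ATATTAATTGCA", "TTAACATGTTAA"])

print(log_odds("CGCGCG", plus_model, minus_model))  # positive score
print(log_odds("ATTATA", plus_model, minus_model))  # negative score
```

In a real application the two models would be trained on annotated island and non-island regions, and the score would usually be normalised by sequence length before comparing sequences of different sizes.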

Hidden Markov Models

Using Markov chains, we had to build two models, + and -. Now we would like to use one model to do both. To do this, we need to combine both the + and - chains into one model. Thus we end up with two states corresponding to each nucleotide symbol. To avoid confusion, we rename our states from A, C, G, T to A+, C+, G+, T+ and A-, C-, G-, T-.

The transition probabilities in this model are set so that within each group they are close to the transition probabilities of the original component model, but there is also a small but finite chance of switching into the other component. Overall there is more chance of switching from + to - than vice versa, so if left to run free, the model will spend more of its time in the - non-island states than in the island states. [1]

A hidden Markov model (HMM) is a statistical model where the system being modeled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters from the observable parameters. The extracted model parameters can then be used to perform further analysis.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, the state is not directly visible, but variables influenced by the state are visible. Each state has a probability distribution over the possible output tokens. Therefore the sequence of tokens generated by an HMM gives some information about the sequence of states. The essential difference between a Markov chain and a hidden Markov model is that while there is a 1-1 correspondence between states and symbols in Markov chains, the same isn't true for hidden Markov models.

Definition: An HMM is a triplet M = (S, Q, T) where:

S is an alphabet of symbols
Q is a finite set of states, capable of emitting symbols from the alphabet S
T is a set of probabilities, comprised of:

State transition probabilities, denoted by akl for each k, l ∈ Q.
Emission probabilities, denoted by ek(b) for each k ∈ Q and b ∈ S.

We now need to distinguish the sequence of states from the sequence of symbols. The sequence of states is called the path π. The path itself follows a simple Markov chain, so the probability of a state depends only on the previous state. As in the Markov chain model, we can define the state transition probabilities in terms of π:

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$$

This is the probability of going to l given that we are at k. Since we have decoupled the symbols b from the states k, there is no longer a 1-1 correspondence between states and symbols. Thus we must introduce a new set of parameters for the model:

$$e_k(b) = P(x_i = b \mid \pi_i = k)$$

ek(b) is the probability that the symbol b is seen in state k. These are known as emission probabilities. The joint probability of an observed sequence and a path is then:

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i) \, a_{\pi_i \pi_{i+1}}$$

where for our convenience we denote π0 = begin and πL+1 = end.

First a state π1 is chosen according to the probabilities a0i. In that state a symbol is emitted according to the distribution eπ1 for that state. Then a new state π2 is chosen according to the transition probabilities aπ1i and so forth.

P(x,π) is the joint probability of an observed sequence x and a state sequence π.

In practice, this is not very useful since very often we do not know the path. So we have to estimate the path either by finding the most likely path or by using an a posteriori distribution over states.
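When the path is known, the joint probability is just a product of transition and emission terms, which is easiest to compute in log space. The sketch below uses a hypothetical two-state toy model ("+" for GC-rich, "-" for AT-rich) with made-up parameters rather than the eight-state CpG model from the text, and it omits the end state for brevity.

```python
import math

def joint_log_prob(x, path, start, trans, emit):
    """log P(x, path) = log of start * emission * (transition * emission) * ... (no end state)."""
    logp = math.log(start[path[0]]) + math.log(emit[path[0]][x[0]])
    for i in range(1, len(x)):
        logp += math.log(trans[path[i - 1]][path[i]])  # move to the next state
        logp += math.log(emit[path[i]][x[i]])          # emit the next symbol
    return logp

# Hypothetical toy HMM with made-up parameters (illustration only).
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

print(math.exp(joint_log_prob("CG", "++", START, TRANS, EMIT)))  # 0.5*0.4*0.9*0.4
print(math.exp(joint_log_prob("CG", "+-", START, TRANS, EMIT)))  # 0.5*0.4*0.1*0.2
```

The same sequence gets very different joint probabilities under different paths, which is why the choice of path matters.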

Using HMM

There are 3 canonical problems associated with HMMs:

1. Given the parameters of the model, compute the probability of a particular output sequence. This problem is solved by the forward algorithm.
2. Given the parameters of the model, find the most likely sequence of hidden states that could have generated a given output sequence. This problem is solved by the Viterbi algorithm.
3. Given an output sequence or a set of such sequences, find the most likely set of state transition and output probabilities. In other words, train the parameters of the HMM given a dataset of sequences. This problem is solved by the Baum-Welch algorithm.

Viterbi Algorithm

Although it is no longer possible to tell what state the system is in by looking at the corresponding symbol, it is often the sequence of underlying states that we are interested in. Decoding is the act of finding out the meaning of a sequence by considering the underlying states. A commonly used method is the Viterbi algorithm, a dynamic programming algorithm.

In general, many different state paths can give rise to any particular sequence of symbols. For example, the following 3 state paths all result in the symbol sequence CGCG.

1. C+, G+, C+, G+
2. C-, G-, C-, G-
3. C+, G-, C+, G-

The probability of the first is larger than that of the second, which is larger than that of the third. So if we are to choose only one path, the first one is the one to choose.

We can calculate the most probable path in a hidden Markov model using a dynamic programming algorithm.

The most probable path π* can be found recursively. Suppose the probability Vk(i) of the most probable path ending in state k with observation xi is known for all states k. Then these probabilities can be calculated for observation xi+1 as:

$$V_l(i+1) = e_l(x_{i+1}) \max_k \left( V_k(i) \, a_{kl} \right)$$

All sequences have to start in state 0 (the begin state), so the initial condition is that V0(0) = 1. By keeping pointers backwards, the actual state sequence can be found by backtracking.

We use logs to convert products to sums to avoid underflow problems.

The Viterbi algorithm makes three assumptions:

1. Viterbi operates on state machine assumptions
2. Transition from a previous state to a new state is marked by an incremental metric
3. Events are cumulative
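The recursion and backtracking can be sketched in log space as below. The two-state model and its parameters are hypothetical toy values, the begin state is folded into a start distribution, no end state is modelled, and all probabilities are assumed to be non-zero so that the logarithms are defined.

```python
import math

def viterbi(x, states, start, trans, emit):
    """Return (most probable state path, its log joint probability) for sequence x."""
    # V[i][k]: log probability of the best path ending in state k after emitting x[0..i]
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    back = []  # back[i-1][l]: best predecessor of state l at position i
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for l in states:
            best_k = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            row[l] = V[-1][best_k] + math.log(trans[best_k][l]) + math.log(emit[l][x[i]])
            ptr[l] = best_k
        V.append(row)
        back.append(ptr)

    # Backtrack from the best final state using the stored pointers.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, V[-1][last]

# Hypothetical toy HMM with made-up parameters (illustration only).
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

path, logp = viterbi("CGCGCGATATAT", STATES, START, TRANS, EMIT)
print("".join(path), logp)
```

Working with sums of logs instead of products of probabilities is what keeps the values from underflowing on long sequences.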

Forward Algorithm

Since many different state paths can give rise to the same sequence x, we must add the probabilities for all possible paths to obtain the full probability of x:

$$P(x) = \sum_{\pi} P(x, \pi)$$

The number of possible paths increases exponentially with the length of the sequence, so evaluation by enumerating all paths is not practical. Neither approximation nor enumeration is necessary, as the full probability can itself be calculated by a dynamic programming procedure similar to the Viterbi algorithm, replacing the maximization steps with sums. This is called the forward algorithm.

The quantity corresponding to the Viterbi variable Vk(i) in the forward algorithm is:

$$f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$$
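Replacing the maximisation in the Viterbi recursion by a sum gives the forward recursion f_l(i+1) = e_l(x_{i+1}) * sum_k f_k(i) a_kl. The sketch below applies it to a hypothetical two-state toy model with made-up parameters and no explicit end state, so that P(x) is the sum of the final forward values. It works with plain probabilities, which is fine for short sequences but would underflow on long ones without scaling or logs.

```python
def forward(x, states, start, trans, emit):
    """Return P(x), summed over all state paths, using the forward recursion."""
    # f[k]: probability of emitting the prefix seen so far and ending in state k
    f = {k: start[k] * emit[k][x[0]] for k in states}
    for symbol in x[1:]:
        f = {
            l: emit[l][symbol] * sum(f[k] * trans[k][l] for k in states)
            for l in states
        }
    return sum(f.values())

# Hypothetical toy HMM with made-up parameters (illustration only).
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}
TRANS = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

# For "CG" the four paths contribute 0.072 + 0.004 + 0.004 + 0.018.
print(forward("CG", STATES, START, TRANS, EMIT))
```

The cost is proportional to the sequence length times the square of the number of states, instead of growing exponentially with the length as path enumeration would.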

Note: This post is a summary of chapter 3 of Durbin.

Multiple Sequence Alignment

Multiple sequence alignment techniques are most commonly applied to protein sequences; ideally they are a statement of both evolutionary and structural similarity among the proteins encoded by each sequence in the alignment.

Multiple alignments must usually be inferred from primary sequences alone. Biologists produce high quality multiple sequence alignments by hand using expert knowledge of protein sequence evolution. This knowledge comes from experience. Important factors include:

specific sorts of columns in alignments, such as highly conserved residues or buried hydrophobic residues
the influence of secondary or tertiary structure, such as the alternation of hydrophobic and hydrophilic columns in exposed beta sheets
expected patterns of insertions and deletions, which tend to alternate with blocks of conserved sequence

The phylogenetic relationships between sequences dictate constraints on the changes that occur in columns and in the patterns of gaps.

Manual alignment is tedious. Automating the process is difficult because it is hard to define exactly what an optimal multiple sequence alignment is, and it is impossible to set a standard for a single correct multiple alignment. In theory, there is one underlying evolutionary process and one evolutionarily correct alignment generated from any group of sequences. However, the differences between sequences can be so great in parts of an alignment that there isn't an apparent, unique solution to be found by an alignment algorithm. Those same divergent regions are often structurally unalignable as well. Most of the insight that we derive from multiple alignments comes from analyzing the regions of similarity, not from attempting to align highly diverged regions.

In general, an automatic method must have a way to assign a score so that better multiple alignments get better scores. We should carefully distinguish the problem of scoring a multiple alignment from the problem of searching over possible multiple alignments to find the best one.

To automate multiple alignment, we need to do the following:

look at what we need to do for automatic multiple alignment structurally and evolutionarily
consider how to turn the biological criteria into a numerical scoring scheme, so that a program will recognize a good multiple alignment
examine various approaches used by different multiple alignment programs
describe a full probabilistic multiple alignment approach based on profile HMMs

What does a multiple alignment mean?

In a multiple sequence alignment, homologous residues among a set of sequences are aligned together in columns.Homologous is meant for both structural and evolutionary sense. Ideally, a column of aligned residues occupy

similar 3D structural positions and all diverge from a common ancestral residue.

Except for trivial cases of highly identical sequences, it is not possible to unambiguously identify structurally orevolutionarily homologous positions and create a single correct multiple alignment. Since protein structures also

evolve, we do not expect 2 protein structures with different sequences to be entirely superposable. Even thedefinition of structurally superposable is subjective and can be expected to vary among experts.

In principle, there is always an unambiguously correct evolutionary alignment even if the structures diverge. In practice, however, an evolutionarily correct alignment can be even more difficult to infer than a structural alignment. Structural alignment has an independent point of reference: superposition of x-ray crystallography or NMR structures. The evolutionary history of the residues of a sequence family cannot be independently known from any source. It must be inferred from sequence alignment.

An automatic alignment program should not be asked to reproduce exactly the same alignment as a reference alignment. Instead, it should be judged on the subset of columns corresponding to key residues and core structural elements that can be aligned with confidence.

Summary

multiple alignment is an alignment of more than two sequences

it usually gives more information about conserved regions

it gives a better estimate of significance when using a sequence of unknown function

multiple alignments must be used when establishing phylogenetic relationships

Note: This post is a summary of chapter 6.1 of Durbin.

Scoring MSA

The scoring system should take 2 important features into account:

1. some positions are more conserved than others
2. sequences are not independent, but instead are related by a phylogenetic tree



Thus, in return for giving up the evolutionary tree and assuming independence between sequences, we gain the ability to straightforwardly estimate a position-specific model of both residue probabilities in columns and insertions and deletions. This assumption, however, is only reasonable if representative sequences of a sequence family are chosen carefully. In practice, sample sequences are often biased, with some subfamilies under- or over-represented.

Several tree-based weighting schemes have been devised to deal with this.

Sum of pairs: SP scores

Sum-of-Pairs scores sum all possible pairwise match scores between amino acids in an aligned column; entropic scores use Shannon's information theoretical entropy to measure the diversity of symbols (amino acids) in a column; matrix scores employ a substitution matrix to evaluate stereochemical diversity in a column; sequence weighted scores normalize against redundancy of sequences in the alignment.
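As a small illustration of the entropic score mentioned above, the following sketch computes the Shannon entropy of a single alignment column. It assumes the column is given as a string of one-letter symbols and that a gap character, if present, is simply counted as one more symbol; both choices are assumptions made for this example, not something the text prescribes.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column.

    0 means the column is perfectly conserved; larger values mean
    more diverse symbols.
    """
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    for col in ("AAAA", "AACC", "ACGT"):
        print(col, column_entropy(col))
```

A fully conserved column has entropy 0, and the value grows as the column becomes more diverse.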

The standard method of scoring multiple alignments is not the HMM formulation, but is similar in that it does NOT use a phylogenetic tree and it assumes statistical independence for the columns. Columns are scored by an SP function using a substitution scoring matrix. The SP score for a column $m_i$ is defined as:

$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$

where $m_i^k$ is the symbol of sequence $k$ in column $i$ and the scores s(a,b) come from a substitution scoring matrix such as a PAM or BLOSUM matrix. For simple linear gap costs, gaps are handled by defining s(a,-) and s(-,a) to be the gap cost, and s(-,-) to be zero. Otherwise gap costs are scored separately (e.g. affine gap cost).
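The SP column score with linear gap costs can be sketched in a few lines. This is a minimal sketch, not a reference implementation: `toy_s` is a made-up match/mismatch function standing in for a real PAM or BLOSUM matrix, and the gap penalty of 8 is an arbitrary value chosen for the example.

```python
from itertools import combinations


def toy_s(a, b):
    """Hypothetical substitution score: NOT a real PAM/BLOSUM matrix."""
    return 2 if a == b else -1


def sp_column_score(column, s=toy_s, gap_penalty=8):
    """Sum-of-pairs score of one column under linear gap costs.

    s(a,-) = s(-,a) = -gap_penalty and s(-,-) = 0.
    """
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue                 # gap against gap costs nothing
        if a == "-" or b == "-":
            total -= gap_penalty     # residue against gap
        else:
            total += s(a, b)         # residue against residue
    return total


def sp_alignment_score(rows, s=toy_s, gap_penalty=8):
    """Columns are assumed independent, so the alignment score is a sum."""
    return sum(sp_column_score(col, s, gap_penalty) for col in zip(*rows))


if __name__ == "__main__":
    print(sp_alignment_score(["AC-A", "ACGA", "A-GA"]))
```

Because each column is scored over all pairs of rows, the cost of scoring one column grows quadratically with the number of sequences.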

Since substitution scores are derived as log-odds scores for pairwise comparisons, the extension to MSA would be, for instance for a three-way alignment of residues a, b, and c:

$$\log\frac{p_{abc}}{q_a q_b q_c}$$

The relative difference between correct and incorrect alignment decreases with the number of sequences in the alignment.

Note: This post is a summary of chapter 6.2 of Durbin.

Multidimensional dynamic programming

The dynamic programming algorithms used for pairwise sequence alignment can theoretically be extended to any number of sequences. However, the time and memory requirements of this algorithm increase exponentially with the number of sequences.

The only assumption necessary to make multidimensional dynamic programming work is that column scores are independent.
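To get a feeling for the exponential growth, the sketch below counts the cells of the dynamic programming lattice and the predecessor moves to evaluate. It assumes N sequences of equal length L, which gives (L+1)^N cells, and up to 2^N - 1 predecessors per cell (every combination of "residue or gap" per sequence except the all-gap column). The function name is hypothetical.

```python
def dp_lattice_size(num_sequences, length):
    """Return (cells, cell_updates) for full multidimensional DP.

    cells        : number of lattice points, (L + 1) ** N
    cell_updates : cells times the 2 ** N - 1 possible predecessor moves
    """
    cells = (length + 1) ** num_sequences
    moves_per_cell = 2 ** num_sequences - 1
    return cells, cells * moves_per_cell


if __name__ == "__main__":
    for n in (2, 3, 5, 10):
        print(n, dp_lattice_size(n, 100))
```

Even for sequences of length 100, the numbers become impractical after a handful of sequences, which is why the heuristics described next are used.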

A common approach to multiple sequence alignment is to progressively align pairs of sequences. The general strategyis:

1. A starting pair of sequences is selected and aligned.
2. Each subsequent sequence is aligned to the previous alignment.

This is a greedy heuristic algorithm. A greedy algorithm decomposes a problem into pieces, and then chooses the best solution to each piece without paying attention to the problem as a whole. Since it is a heuristic algorithm, progressive alignment is not guaranteed to find the best solution. In practice, however, progressive alignment methods are efficient and produce biologically meaningful results.

The program MSA uses a clever heuristic multidimensional dynamic programming algorithm. It assumes an SP scoring system for both residues and gaps. We assume that the score of a multiple alignment is the sum of the scores of all pairwise alignments defined by the multiple alignment. The score of the complete alignment $a$ is given by:

$$S(a) = \sum_{k<l} S(a^{kl})$$

where $a^{kl}$ is the pairwise alignment of sequences $k$ and $l$ induced by $a$. Let $\hat{a}^{kl}$ be the optimal pairwise alignment of $k,l$, which we can calculate in $O(L^2)$ time by standard dynamic programming. Obviously, $S(a^{kl}) \le S(\hat{a}^{kl})$.

Combining this simple observation and the definition of the SP scoring system, we obtain a lower bound on the score of any pairwise alignment that can occur in the optimal multiple alignment. Now we only need to consider pairwise alignments that score better than the lower bound. This significantly reduces the complexity.
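The lower bound can be written out explicitly. The derivation below is a sketch under stated assumptions: $a$ is the optimal multiple alignment, $a^{kl}$ the pairwise alignment of sequences $k$ and $l$ that it induces, $\hat{a}^{kl}$ the optimal pairwise alignment of $k$ and $l$, and $\sigma(a)$ any known lower bound on the score of $a$ (for example, the score of an alignment found by a fast heuristic; this choice and the symbols $\sigma$ and $\beta$ are assumptions of the sketch).

```latex
\begin{align*}
\sigma(a) \le S(a)
  &= \sum_{k'<l'} S(a^{k'l'}) \\
  &= S(a^{kl}) + \sum_{(k',l') \ne (k,l)} S(a^{k'l'}) \\
  &\le S(a^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'}) - S(\hat{a}^{kl}) \\[4pt]
\Rightarrow\quad S(a^{kl})
  &\ge \beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})
\end{align*}
```

The third line uses the fact that no induced pairwise alignment can score better than the optimal pairwise alignment of the same two sequences. Only pairwise alignments of $k$ and $l$ scoring at least $\beta^{kl}$ need to be considered.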

Note: This post is a summary of chapter 6.3 of Durbin

Progressive alignment methods

Progressive alignment works by constructing a succession of pairwise alignments.

1. Choose two sequences and align them by pairwise alignment.
2. Choose a third sequence and align it to the previous alignment.
3. Repeat this until no more sequences are left.

Initially, you align two sequences and then align the third sequence to the alignment, and so on. There are several progressive alignment strategies. They differ in the following ways:

in the way they choose the order of alignment

in whether the progression involves only alignment of sequences to a single growing alignment, or whether subfamilies are built on a tree structure which is used for alignments

in the procedure used to align and score sequences or alignments against existing alignments

Progressive alignment is heuristic:

It does not separate the process of scoring an alignment from the optimization algorithm.

It does not directly optimize any global scoring function of alignment correctness.

It is relatively fast and efficient.

Despite being heuristic, in many cases the resulting multiple alignment is quite reasonable.

The most important heuristic of progressive alignment algorithms is to align the most similar pairs of sequences first. These are the most reliable alignments. Most algorithms build a binary guide tree whose leaves represent sequences and whose interior nodes represent alignments. The root node represents a complete multiple alignment and the nodes furthest from the root represent the most similar pairs.

            MA
           /  \
        AL1    AL2
       /  \    /  \
    AL3    S1 S2   S3
   /  \
  S4   S5
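The guide tree drawn above can be turned into an alignment schedule with a post-order traversal. The sketch below assumes the tree is encoded as nested 2-tuples with sequence names as leaves (an encoding chosen for this example). A post-order walk only guarantees that every node is aligned after both of its children; real programs additionally order the nodes by similarity.

```python
def leaves(node):
    """All sequence names below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def alignment_order(node):
    """List the pairwise alignment steps implied by a guide tree.

    Each step is (left_group, right_group): the two sets of sequences
    whose (sub)alignments are merged at that interior node.
    """
    if isinstance(node, str):
        return []                       # a leaf needs no alignment
    left, right = node
    return (alignment_order(left) + alignment_order(right)
            + [(leaves(left), leaves(right))])


if __name__ == "__main__":
    # MA = (AL1, AL2), AL1 = (AL3, S1), AL3 = (S4, S5), AL2 = (S2, S3)
    tree = ((("S4", "S5"), "S1"), ("S2", "S3"))
    for step in alignment_order(tree):
        print(step)
```

For the tree above, the schedule is AL3 (S4 with S5), then AL1, then AL2, and finally the root MA.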

Feng-Doolittle Algorithm

1. Calculate a diagonal matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment, converting raw alignment scores to pairwise distances.
2. Construct a guide tree from the distance matrix using the clustering algorithm by Fitch & Margoliash.
3. Starting from the first node added to the tree, align the child nodes. Repeat in the order they were added to the tree, i.e. from most similar to least similar, until all sequences have been aligned.

To convert alignment scores to distance values, we use the following formula:

$$D = -\log S_{\mathrm{eff}} = -\log \frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\mathrm{max}} - S_{\mathrm{rand}}}$$

where

$S_{\mathrm{obs}}$: observed pairwise alignment score

$S_{\mathrm{max}}$: maximum score, the average of the scores of aligning each of the two sequences to itself

$S_{\mathrm{rand}}$: score of a random alignment

$S_{\mathrm{eff}}$: can be seen as a normalized percentage similarity, decreasing to 0 with increasing evolutionary distance.
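The score-to-distance conversion is easy to express in code. This sketch assumes the usual form D = -log S_eff with S_eff = (S_obs - S_rand) / (S_max - S_rand) and a natural logarithm. Returning infinity when S_eff is not positive is a choice made for this example, not part of the method as described.

```python
import math


def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into a distance.

    s_obs  : observed alignment score of the two sequences
    s_max  : average of the two self-alignment scores
    s_rand : score expected for a random alignment
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        # Assumption of this sketch: no better than random -> "infinitely" far.
        return math.inf
    return -math.log(s_eff)


if __name__ == "__main__":
    print(feng_doolittle_distance(100, 100, 0))   # identical sequences
    print(feng_doolittle_distance(50, 100, 0))    # halfway to random
```

Identical sequences get distance 0, and the distance grows without bound as the observed score approaches the random score.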

The method for converting alignment scores to distances does not need to be very accurate as the goal is to create an approximate guide tree. Fitch & Margoliash is a fast clustering algorithm that builds evolutionary trees from distance matrices. Before adding a sequence to an existing group, an alignment to each of the group members is tried. The highest score for such an alignment determines how the new sequence will be aligned to the group. Group-to-group alignments are done by comparing all possible pairwise alignments of the members of one group with the members of the other group. The best of these alignments determines the alignment of the two groups. Generally, a PAM substitution matrix with an affine gap penalty is used.

The symbol X is used to denote gaps after alignments. Rule: once a gap, always a gap. This rule is put in place to ensure consistency. It encourages gaps to occur in the same columns in subsequent pairwise alignments, because aligning against X carries no cost: s(X, anything) = 0.

A problem with Feng-Doolittle is that all alignments are determined by pairwise sequence alignments. Once an aligned group has been built up, it is advantageous to use position specific information from the group's multiple alignment to align a new sequence to it. The degree of sequence conservation at each position should be taken into account and mismatches at highly conserved positions penalized more stringently than mismatches at variable positions. Gap penalties in positions might be reduced where lots of gaps occur in the cluster alignment, and increased where no gaps occur.

Profile Alignment

A profile for a given group of sequences contains all features which are somehow typical for this group. Conserved residues are one example of such a feature. The idea behind profile alignment is to use this position-specific information and penalize mismatches more strongly in highly conserved positions than in variable positions.

Many progressive methods use pairwise alignment of sequences to profiles or of profiles to profiles as a subroutine which is used many times in the process. The exact definition of the scoring function used in profile-sequence or profile-profile alignment varies. Aligned residues are usually scored by some form of SP score, but the handling of gaps varies substantially between different methods.

For linear gap scoring, profile alignment is simple because the gap scores can be included in the SP score by setting s(-,a) = s(a,-) = -g and s(-,-) = 0. If we have two multiple alignments or profiles, an alignment of these two means that gaps are inserted in whole columns, so that the alignment within either of the profiles is not changed. We can then split the SP sum into two sums that each concern only one of the profiles and one sum containing all cross terms. The first two sums are unaffected by the global alignment because adding columns of gap characters to a profile adds 0 to the score, since s(-,-) = 0. Therefore, the optimal alignment of the two profiles can be obtained by optimizing only the last sum with the cross terms. This can be done exactly like a pairwise alignment where columns are scored against columns by adding the pair scores. Obviously one of the profiles can consist of a single sequence only, which corresponds to aligning a single sequence to a profile.
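The cross-term score of one profile column against another can be sketched as follows. The function plays the role that s(a,b) plays in ordinary pairwise dynamic programming. `toy_s` and the gap penalty g = 8 are made-up values for illustration, not taken from any real scoring matrix or program.

```python
def toy_s(a, b):
    """Hypothetical substitution score: NOT a real PAM/BLOSUM matrix."""
    return 2 if a == b else -1


def cross_column_score(col_a, col_b, s=toy_s, gap_penalty=8):
    """Sum of pair scores between a column of profile A and one of profile B.

    Only cross terms are counted; pairs inside the same profile are
    unaffected by how the two profiles are aligned to each other.
    Uses s(-,a) = s(a,-) = -gap_penalty and s(-,-) = 0.
    """
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue
            if a == "-" or b == "-":
                total -= gap_penalty
            else:
                total += s(a, b)
    return total


if __name__ == "__main__":
    print(cross_column_score("AA", "AC"))   # residue column vs residue column
    print(cross_column_score("A-", "--"))   # column vs an inserted all-gap column
```

Passing a one-character column for `col_b` gives the sequence-to-profile case mentioned at the end of the paragraph.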

Clustal W

One widely used implementation of profile-based progressive alignment is the CLUSTALW program. CLUSTALW works in much the same way as the Feng-Doolittle method except for its carefully tuned use of profile alignment methods.

Algorithm

1. Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment, followed by approximate conversion of similarity scores to evolutionary distances using the model of Kimura.
2. Construct a guide tree by the neighbor-joining clustering algorithm of Saitou & Nei.
3. Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile, and profile-profile alignment.

ClustalW is unabashedly ad hoc (designed for this specific purpose, not generalizable) in its alignment construction and scoring stage. In addition to the usual methods of profile construction and alignment, various heuristics of ClustalW contribute to its accuracy:

Sequences are weighted to compensate for biased representation in large subfamilies.

The substitution matrix used to score an alignment is chosen on the basis of the similarity expected of the alignment; closely related sequences are aligned with hard matrices (BLOSUM 80) and distant sequences are aligned with soft matrices (BLOSUM 50).

Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position. Penalties are obtained from gap frequencies observed in a large number of structurally based alignments.

Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of 5 or more hydrophilic residues.

Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment. This rule tries to force all the gaps to occur in the same places in an alignment.

In the progressive alignment stage, if the score of an alignment is low, the guide tree may be adjusted on the fly to defer the low-scoring alignment until later in the progressive alignment phase, when more profile information has been accumulated.

Iterative Refinement

One problem with progressive alignment algorithms is that the subalignments are 'frozen'. Once a group of sequences has been aligned, their alignment to each other cannot be changed at a later stage as more data arrives. Iterative refinement algorithms attempt to circumvent this problem.

In iterative refinement, an initial alignment is generated, then one sequence or a set of sequences is taken out and realigned to a profile of the remaining aligned sequences. If a meaningful score is being optimized, this either increases the overall score or results in the same score. Another sequence is chosen and realigned, and so on, until we arrive at a point where the alignment does not change. This procedure is guaranteed to converge to a local maximum of the score provided that all the sequences are tried and a maximum score exists, simply because the sequence space is finite. Barton-Sternberg is a good example.

Barton-Sternberg Algorithm

1. Find two sequences with highest pairwise similarity and align them using standard pairwise dynamic programming.
2. Find the sequence that is most similar to a profile of the alignment of the first two, and align it to the first two by profile-sequence alignment. Repeat until all sequences are included.
3. Remove one sequence and realign it to a profile of other sequences. Repeat for all sequences.
4. Repeat step 3 until the score converges or a fixed number of times if it doesn't.
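The refinement loop of steps 3 and 4 can be sketched generically. The function below is only a skeleton under stated assumptions: `realign_one` and `score` are hypothetical callables supplied by the caller (a real implementation would do profile-sequence alignment and SP scoring there), and a realignment is kept only if it improves the score, which is one possible acceptance rule rather than the rule of any particular program.

```python
def iterative_refinement(alignment, realign_one, score, max_rounds=10):
    """Repeatedly remove and realign each sequence until nothing improves.

    alignment   : dict mapping sequence name -> its aligned representation
    realign_one : callable(alignment, name) -> new alignment with `name`
                  realigned against the others (hypothetical helper)
    score       : callable(alignment) -> number, higher is better
    max_rounds  : upper limit on passes if the score does not converge
    """
    best = score(alignment)
    for _ in range(max_rounds):
        improved = False
        for name in list(alignment):
            candidate = realign_one(alignment, name)
            candidate_score = score(candidate)
            if candidate_score > best:          # keep only real improvements
                alignment, best = candidate, candidate_score
                improved = True
        if not improved:                        # converged to a local maximum
            break
    return alignment, best
```

Because the score never decreases and each pass either improves it or stops, the loop ends at a local maximum or after `max_rounds` passes.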

The ideas of profile alignment and iterative refinement come quite close to the formulation of probabilistic HMM approaches for multiple alignment.

Note: This post is a summary of chapter 6.4 of Durbin.

Sources

[1] Durbin
[2] http://www.cs.helsinki.fi/u/ajrantan/talks/slides_andre.pdf

Profile HMM Training

Profile HMMs could be used in place of standard profiles in progressive or iterative alignment methods. The use of profile HMM formalisms may have certain advantages, such as replacing the SP scoring scheme by the profile HMM assumption that sequences are generated independently from a single 'root' probability distribution.



Profile HMMs can also be trained from initially unaligned sequences using the Baum-Welch expectation maximization algorithm.

Multiple alignment with a known profile HMM

Before tackling the problem of estimating a model and a multiple alignment simultaneously from initially unaligned training sequences, we consider the simpler problem of obtaining a multiple alignment from a known model. To align a sequence to a profile HMM, we find the most probable path through the model, which is found by the Viterbi algorithm. Constructing a multiple alignment just requires calculating a Viterbi alignment for each individual sequence. Residues aligned to the same profile HMM match state are aligned in columns. Use fig. 456.

Suppose we align 5 sequences. We derive the Viterbi optimal path for each and realign the sequences accordingly. In the resulting alignment, residues assigned to insert states are written in lower case [a-z] and residues assigned to match states in upper case [A-Z]. A profile HMM does not attempt to align the insert-state residues. Insert state residues represent parts of the sequences which are atypical, unconserved, and not meaningfully alignable.
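Turning per-sequence state paths into alignment rows can be sketched as below. The sketch assumes the paths have already been computed (here they are written by hand, not produced by a Viterbi implementation) and encodes each path as a list of `(kind, k)` pairs, where `kind` is "M", "I", or "D" and `k` is the position in the model. This encoding, the function name, and the use of "." to pad insert regions are assumptions of the example.

```python
def paths_to_alignment(sequences, paths, model_length):
    """Build alignment rows from state paths through a profile HMM.

    Match-state residues are upper case and share a column per match state;
    insert-state residues are lower case and are padded, not aligned;
    delete states (and unvisited match states) show up as '-'.
    """
    rows = []
    for seq, path in zip(sequences, paths):
        match = ["-"] * (model_length + 1)     # positions 1..M
        inserts = [""] * (model_length + 1)    # insert slots 0..M
        residues = iter(seq)
        for kind, k in path:
            if kind == "M":
                match[k] = next(residues).upper()
            elif kind == "I":
                inserts[k] += next(residues).lower()
            # "D" emits nothing; the column keeps its '-'
        rows.append((match, inserts))

    # Each insert slot is as wide as the longest insertion seen there.
    widths = [max(len(ins[k]) for _, ins in rows)
              for k in range(model_length + 1)]

    aligned = []
    for match, inserts in rows:
        parts = [inserts[0].ljust(widths[0], ".")]
        for k in range(1, model_length + 1):
            parts.append(match[k])
            parts.append(inserts[k].ljust(widths[k], "."))
        aligned.append("".join(parts))
    return aligned


if __name__ == "__main__":
    seqs = ["ACG", "AG", "ATTCG"]
    paths = [
        [("M", 1), ("M", 2), ("M", 3)],
        [("M", 1), ("D", 2), ("M", 3)],
        [("M", 1), ("I", 1), ("I", 1), ("M", 2), ("M", 3)],
    ]
    print("\n".join(paths_to_alignment(seqs, paths, 3)))
```

The lower-case residues in the third row sit in a padded insert region, while the upper-case residues line up column by column.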

Profile HMM trained from unaligned sequences

Now we try to estimate a model and multiple alignment from initially unaligned sequences.

Initialization: Choose the length of the profile HMM and initialize parameters.

Training: Estimate the model using the Baum-Welch or Viterbi algorithm. It is necessary to use a heuristic method for avoiding local optima.

Multiple Alignment: Align all sequences to the final model using the Viterbi algorithm and build a multiple alignment.

Initialization

A profile HMM is a repeating linear structure of three states (match, delete, and insert). The only decision that must be made in choosing an initial architecture for Baum-Welch estimation is the length of the model, M. M is the number of match states in the profile HMM rather than the total number of states. It is usually set to the average length of the training sequences or chosen based on prior knowledge.

Since Baum-Welch finds local optima, it is important to choose initial models carefully. The model should be encouraged to use 'sensible' transitions; for instance, transitions into match states should be large compared to other transition probabilities. At the same time, we want to start Baum-Welch from multiple different points to see if all converge to approximately the same optimum, so we want some randomness in the choice of initial model parameters.
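One simple way to combine "sensible" transitions with randomness is to draw random weights and scale up the weight of the match transition before normalizing. This is a minimal sketch of that idea only; the function name, the bias factor of 8, and the uniform noise range are arbitrary assumptions for illustration.

```python
import random


def initial_transitions(rng, match_bias=8.0):
    """Random transition probabilities out of one state, favoring 'match'.

    rng        : a random.Random instance, so runs are reproducible
    match_bias : how strongly the transition into the match state is favored
    """
    weights = {
        "match": match_bias * rng.uniform(0.5, 1.5),
        "insert": rng.uniform(0.5, 1.5),
        "delete": rng.uniform(0.5, 1.5),
    }
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}


if __name__ == "__main__":
    # Different seeds give different starting points for Baum-Welch.
    for seed in range(3):
        print(initial_transitions(random.Random(seed)))
```

Running training from several such starting points and comparing the results is a way to check whether they converge to approximately the same optimum.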

Training

Avoiding Local Maxima

Note: This post is a summary of chapter 6.5 of Durbin.

Gene Prediction

Gene prediction refers to algorithmically identifying stretches of DNA sequences that are biologically functional. In the old days, gene prediction was a very painstaking and difficult process. Today, thanks to comprehensive genome sequencing and powerful computational resources, gene prediction is largely a computational problem.

Gene prediction is used to find a functional sequence, in other words, a region of the DNA which codes for a protein or mRNA. Regulatory regions, regions of DNA that regulate gene expression, are also considered functional.

Gene prediction does not tell us which genes code for which proteins.

There are two primary approaches for predicting genes:

Intrinsic approach: ab initio

Extrinsic approach: homology-based



Prerequisite Knowledge

A gene is the fundamental physical and functional unit of heredity. It is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (RNA or protein).

An Open Reading Frame (ORF) is a series of DNA codons which do not contain any stop codons.

A Coding Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein.

Frames are always read from 5' to 3'.
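A very small ORF scanner illustrates these definitions. The sketch makes several assumptions that go beyond the definitions above: it requires an ORF to start with ATG and end with one of the standard stop codons (TAA, TAG, TGA), it scans only the three forward reading frames of the given strand, and it reports the first ATG after the previous stop. The function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=1):
    """Find ORFs on the given strand, read 5' to 3', in frames 0, 1, 2.

    Returns a list of (frame, start, end) with `dna[start:end]` running
    from the start codon through the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                          # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons: # codons before the stop
                    orfs.append((frame, start, i + 3))
                start = None                       # look for the next ORF
    return orfs


if __name__ == "__main__":
    seq = "CCATGAAATAGGG"
    for frame, start, end in find_orfs(seq):
        print(frame, seq[start:end])
```

A real gene finder would also scan the reverse complement and apply a sensible minimum length, since short ORFs occur by chance.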

Prokaryotic gene model

Prokaryotes have small genomes with high gene density. They contain operons, which means that one transcript can carry several genes. Since there are no introns, one gene produces one protein. There is one ORF per gene. ORFs begin with a start codon and end with a stop codon. The
