La place de la phylogénie en Bio-informatique Jean-Michel CLAVERIE Structural & Genetic Information...

Post on 16-Jan-2016

217 views 0 download

Transcript of La place de la phylogénie en Bio-informatique Jean-Michel CLAVERIE Structural & Genetic Information...

La place de la phylogénie en Bio-informatique

Jean-Michel CLAVERIE

Structural & Genetic Information LaboratoryCNRS UPR2589Luminy, France

http://www.igs.cnrs-mrs.fr

Journées RNG « Phylogénie »

Bio-Informatique des Séquences:Une progression naturelle

• Une séquence: trouver où sont les motifs et les gènes

• Deux séquences: les aligner

• Trois et plus: faire une arbre

Gene finding: the general principles behind all current methods

1. Gene-encoded peptide sequences have to lead to foldable, compact structure in a water environnement,

2. Real protein are thus made of a balanced mixture of hydrophilic, hydrophobic, rigid, flexible, neutral, or charged residues

3. This constraint induces a recognizable statistical bias in coding DNA sequences, that we use for detection.

4. Around the good looking regions, we look for facultative additional signals,

5. The optimal prediction is then the best scoring one according to a « standard » gene model.

64 codons, including 3 STOPs : TGA, TAA, TAGSimple situation like E.coli: (A=T=G=C=0.25)

Between two successive STOP codons: each of the 61 codons -> 1.6 % expected relative frequency

In contrast, codon distributions within bona fide genes have to results into ‘foldable’ proteins, hence balancing hydrophylic and hydrophobic, + and - charged, rigid and flexible residues. 

Non-coding vs. Protein-coding DNA sequence:Why is there a bias?

Coding segments have to be compatible with the typical amino acid composition of

natural proteins

  Amino Acid % in N codon % expected proteins at random 

Ala 9.46 4 6.4 +Arg 5.84 6 9.6 -Asn 3.99 2 3.2 +Asp 5.10 2 3.2 +Cys 1.20 2 3.2 -Gln 4.57 2 3.2 +Glu 5.50 2 3.2 +Gly 7.41 4 6.4 +His 2.44 2 3.2 -Ile 5.95 3 4.8 +Leu 10.5 6 9.6 +Lys 4.14 2 3.2 +Met 2.72 1 1.6 +Phe 4.19 2 3.2 +Pro 4.39 4 6.4 -Ser 5.71 6 9.6 -Thr 5.50 4 6.4 -Trp 1.48 1 1.6Tyr 2.76 2 3.2Val 7.16 4 6.4

AA with

+

or

20% deviation

from random

A variety of methods take advantageof this bias

Sequence

SlidingWindow

Chi2

(obs-rand)

CodingRegion

The Caesar (Julius) Cipher

Plaintext : veni, vidi, viciCiphertext: YHQL, YLGL, YLFL

Plain alphabet: abcdefghijklmnopqrstuvwxyzCipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC

The Caesar cipher is based on a cipher alphabet that is shifted a certain number of places, relative to the plain alphabet

Frequency of letters in

English

letters % letters %a 8.2 n 6.7

b 1.5 o 7.5

c 2.8 p 1.9

d 4.3 q 0.1

e 12.7 r 6.0

f 2.2 s 6.3

g 2.0 t 9.1

h 6.1 u 2.8

i 7.0 v 1.0

j 0.2 w 2.4

k 0.8 x 0.2

l 4.0 y 2.0

m 2.4 z 0.1

From 100,362 characters

(Beker & Piper, 1950)

Frequency analysis of an

enciphered message

letters % letters %A 0.9 N 0.9

B 7.4 O 11.2

C 8.0 P 9.2

D 4.1 Q 0.6

E 1.5 R 1.8

F 0.6 S 2.1

G 0.3 T 0.0

H 0.0 U 1.8

I 3.3 V 5.3

J 5.3 W 0.3

K 7.7 X 10.1

L 7.4 Y 5.6

M 3.3 Z 1.5

XZAVOIDBYGERSPCFHJKLMNQTUWabcdefghijklmnopqrstuvwxyz

e = 12.7

The foundingFather of

« Gene-finding » bioinformatics

‘Al-kin-dee-u’

Al-Kindī

Al-Kindī (0850)Decyphering Cryptographic MessagesPMID: 0000001

Homophonic substitution cipher

a b c d e f g h i j k l m n o p q r s t u v w x y

09 48 13 01 14 10 06 23 32 15 04 26 22 18 00 38 94 29 11 17 08 34 60 28 21

12 81 41 03 16 31 25 39 70 37 27 58 05 95 35 19 20 61 89 52

33 62 45 24 50 73 51 66 54 42 76 43

47 79 44 56 83 84 71 72 77 86 49

53 46 65 88 91 90 80 96 69

67 55 68 93 99 75

78 57 85

92 64 97

74

82

87

Frequent letters, many codes

  Amino Acid % in N codon % expected proteins at random 

Leu 10.5 6 9.6 +Arg 5.84 6 9.6 -Ser 5.71 6 9.6 -Ala 9.46 4 6.4 +Gly 7.41 4 6.4 +Val 7.16 4 6.4Thr 5.50 4 6.4 -Pro 4.39 4 6.4 -Ile 5.95 3 4.8 +Glu 5.50 2 3.2 +Asp 5.10 2 3.2 +Gln 4.57 2 3.2 +Phe 4.19 2 3.2 +Lys 4.14 2 3.2 +Asn 3.99 2 3.2 +Tyr 2.76 2 3.2His 2.44 2 3.2 -Cys 1.20 2 3.2 -Met 2.72 1 1.6 +Trp 1.48 1 1.6

« Homophonic » Substitution in the Genetic Code

From old to new methods

Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile

K-tuple concepts and methods

ATGCTAGCATAGCTGCATGACATGCATGCAATGC TGCA TGCT ATGC GCTA CATG CTAG GCAT TAGC TGCA AGCA ATGC

4-tuple(tetramer)

Pre-compute: Fcoding (ATGC) , Fncoding (ATGC)

……………. , ………………

……………. , ………………

Fcoding (°°°°) , Fncoding (°°°°)

K-tuple profile

Sequence

Fc

-------

Fc

+Fnc

CodingRegion

ATGC

Fcoding (ATGC)

------------------------------------------------

Fcoding (ATGC) + Fncoding (ATGC)

0.5

k-tuple frequency analysis was probably knownof the Arabs (1000) and rediscovered

during the Renaissance

Homophonic deciphering:

looking for 2-tuples that often, never or rarely occurs: je, jx, wa

Blaise de Vigenère(1586)

invented his undecipherable cipher

to resist k-tuple analysis

The Vigenère

square

Plain abcdefghijklmnopqrstuvwxyz 1 BCDEFGHIJKLMNOPQRSTUVWXYZA 2 CDEFGHIJKLMNOPQRSTUVWXYZAB 3 DEFGHIJKLMNOPQRSTUVWXYZABC 4 EFGHIJKLMNOPQRSTUVWXYZABCD 5 FGHIJKLMNOPQRSTUVWXYZABCDE 6 GHIJKLMNOPQRSTUVWXYZABCDEF 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH 9 JKLMNOPQRSTUVWXYZABCDEFGHI10 KLMNOPQRSTUVWXYZABCDEFGHIJ11-20 ……………………………………………………………………21 VWXYZABCDEFGHIJKLMNOPQRSTU22 WXYZABCDEFGHIJKLMNOPQRSTUV23 XYZABCDEFGHIJKLMNOPQRSTUVW24 YZABCDEFGHIJKLMNOPQRSTUVWX25 ZABCDEFGHIJKLMNOPQRSTUVWXY26 ABCDEFGHIJKLMNOPQRSTUVWXYZ

The Vigenère square: example

Keyword : WHITE WHITEWHITEWHITEWHIPlaintext : diverttroopstoeastridgeCiphertext: ZPDXVPAZHSLZBHIWZBKMZNM

Plain abcdefghijklmnopqrstuvwxyz 4 EFGHIJKLMNOPQRSTUVWXYZABCD 7 HIJKLMNOPQRSTUVWXYZABCDEFG 8 IJKLMNOPQRSTUVWXYZABCDEFGH19 TUVWXYZABCDEFGHIJKLMNOPQRS22 WXYZABCDEFGHIJKLMNOPQRSTUV

WHITE

Charles Babbage:

Another pioneerIn bioinformatics

Cryptanalysis ofThe Vigenère cypher (1854)

PMID: 0000002

1. Look for repeats in the cipher text

2. Infer the keyword lenght from consistent repeat distance: L

3. Then analyze the character frequency at each position independently: 1, 2, 3 , …., L

4. Deduce (« call ») the corresponding Ceasar shift

The invention of inhomogeneous (and hidden) Markov Models

From old to new methods

Methods:1. Codon usage (Staden & McLachlan, 1982) : profile2. Differential k-tuple frequency (1986) : profile3. Differential in-phase k-tuples (1988) : profile4. Non-homogeous Markov chains (Borodvsky,1986;

Tavare & Song, 1989) : profile/call5. Neural-net (Mural & Uberbacher, 1991) : call6. Hidden Markov chain (Kulp & al., 1996) : call

Bioinformatics vs. Biology

• 1951 1st protein sequence (Insulin, Sanger)

• 1960 Sequence-structure relationship (Globins, Perutz)

• 1965 "Evolutionary divergence & convergence in Proteins" Zuckerkandl & Pauling

• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 Atlas of Protein Sequences (M. Dayhoff, Georgetown)

• 1970 "A general method applicable to the search for similaries in amino-acid sequences of two proteins" Needleman & Wunsch

• 1973 Genetic engineering (Cohen, Boyer et al.)

• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1977 ADN sequencing (Sanger, Maxam, Gilbert)

• 1977 1st bioinformatic "package" (Staden; DB/ assembly, analysis)

• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Database access via telephone lines ( PIR )• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt

• 1965 First «  industrial » computer IBM/360• "Evolutionary divergence and convergence in Proteins"

Zuckerkandl & Pauling• 1967 "Construction of Phylogenetic Trees" Fitch & Margoliash. • 1968 First « mini » computer DEC PDP-8 (floor top)• Atlas of Protein Sequences (M. Dayhoff, Georgetown)• 1970 "A general method applicable to the search for similaries "

Needleman & Wunsch.• 1971 1st work on RNA folding (Ninio)• 1972 First « micro » processor Intel 8008• 1973 Genetic engineering (Cohen et al.)• 1974 "Prediction of Protein Conformation" Chou & Fasman • 1975 Intel 8080, kit Altair• 1977 1st bioinformatic package (Staden; PDP 8/11) • DEC-VAX Mini-computers• Micro-computer (Apple, Commodore, Radioshack)• 1978 Databases: ACNUC, PIR, EMBL, GenBank• 1980 Telephone access to the PIR database• 1981 Los Alamos-GenBank: 270 seqs, 370.000 nt• IBM-PC (8088), 16-32kb• 1983 IBM-XT HARD disk (10 Mbytes)• 1984 MacIntosh : graphic/mouse interface

Bioinformatics vs. Computers

Perspective historique (suite)

• 1981 Local Alignement (Smith-Waterman , JMB)

• 1985 "Fasta" (Pearson-Lipman, PNAS)

• 1989 ARPANET --> INTERNET

• 1990 "Blast" (Altschul et al., JMB)

• 1990 Positional cloning of NF-1

• 1991 "Grail", first practical gene «finder » (Mural et al., PNAS)

• 1991 "EST" (Venter et al., Matsubara et al.)

• 1992 Complete sequence of yeast chr. 3

• 1995 Complete sequence of H. influenza

• 1996 Complete sequence of S. cerevisiae

• 1997 "Gapped Blast" (Alschul et al., NAR)/Genscan (Burge & Karlin)

• 1997 11 complete bacterial genomes available

• 1998 2 Mbase/day of new public sequence data, C. elegans• 2000 Human Chr 22, 21, 90%draft, Drosophila, 30 bacterial

genomes

Bioinformatics & Genomics: more recent past

Bon workshop!

Genes & Genomes

DNA

RNA (transcripts)

transcription

Gene 1 Gene 2 Gene 3 Gene 4

translation+ folding

Proteins (or RNAs)Function 2 Function 3

Function 4

Function 1

Bio

Info

rmat

ics