Macromolecular Sequence Analysis II

Lectures: 10 h
Tutorials (TD): 12 h
+ mini-project (individual work)

O. Lecompte, Laboratoire de Bioinformatique et Génomique Intégratives, IGBMC
http://www-bio3d-igbmc.u-strasbg.fr/~lecompte/enseignement.html

• Databases

• Text-based querying (SRS, Entrez)

• Introduction to sequence comparison

• Alignment of 2 sequences

• Similarity search (Fasta, Blast)

• Multiple alignment

• Motifs, profiles

• Molecular phylogeny

• Ab initio predictions

Multiple alignment / Pairwise alignment


Query: 177 EMGDTGPCGPCSEIHYDRIGGRDAAHLVNQDDPNVLEIWNLVFIQYNR---EADG----I 229
           G G GP E+ Y LE+ LVF+QY + AD I
Sbjct: 193 AGG--GNAGPAFEVLYKG-----------------LEVATLVFMQYKKAPANADPSQVVI 233

Query: 230 LK-----PLPKKSIDTGMGLERLVSVLQNKMSNYDTDLFVPYFEAIQKGTGARPYTGKVG 284
           +K P+ K +DTG GLERLV + Q + YD L E +++ G ++
Sbjct: 234 IKGEKYVPMETKVVDTGYGLERLVWMSQGTPTAYDAVLGY-VIEPLKRMAGVEKIDERIL 292

Query: 285 AEDA---------DGIDMAYR--------------------------VLADHARTITVAL 309
           E++ D D+ Y +ADH + +T L
Sbjct: 293 MENSRLAGMFDIEDMGDLRYLREQVAKRVGISVEELERLIRPYELIYAIADHTKALTFML 352

[Figure: annotated multiple alignment of two protein families (FAMILY 1, FAMILY 2). Annotations: additional domain, transmembrane region, error in ORF definition, phosphorylation site, NLS. Conservation classes: intra-group conservation, universal conservation, differential conservation between the two families.]

Information visible in a multiple alignment: domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern...

Applications: functional genomics, evolutionary studies, structure modeling, drug design, mutagenesis experiments.

Lecompte et al. Gene 270:17-30 (2001)

Multiple alignment

• Methods used

• Estimating the quality of an alignment

• Uses of multiple alignment

Optimal multiple alignment
Example: MSA (Lipman et al. 1989, Gupta et al. 1995)

Optimal multiple alignment

Application of the dynamic programming used to align 2 sequences, extended to as many dimensions as there are sequences.

Example: alignment of 3 sequences => a 3-dimensional matrix

Problem: computation time and memory

Time required is proportional to N^k for k sequences of length N

=> in practice, unusable for more than about 10 sequences
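To make the N^k growth concrete, here is a minimal sketch that counts the cells of the dynamic programming lattice. The function names and the example length of 250 residues are illustrative assumptions, not taken from MSA or any other program.

```python
def lattice_cells(length: int, k: int) -> int:
    """Number of cells in the DP lattice for k sequences of the given length.

    Each sequence contributes one axis with length + 1 positions
    (the extra position is the empty prefix).
    """
    return (length + 1) ** k


def predecessors_per_cell(k: int) -> int:
    """Each cell can be reached from up to 2**k - 1 neighbouring cells:
    every sequence either advances by one residue or takes a gap,
    and at least one sequence must advance."""
    return 2 ** k - 1


if __name__ == "__main__":
    # Illustrative length only: 250 residues per sequence.
    for k in (2, 3, 5, 10):
        print(k, lattice_cells(250, k), predecessors_per_cell(k))
```

Going from 2 to 3 sequences of 250 residues already multiplies the number of cells by 251, which is why exact methods stop being practical after a handful of sequences.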

Optimal multiple alignment

OMA (Reinert et al. 2000) combines optimal alignment with a recursive "divide-and-conquer" method.

[Figure: Divide -> Divide -> Align optimally -> Concatenate]

Alignment of 5 sulfate binding proteins, length 224-263 residues:

MSA          OMA          ClustalW
>12 hours    62.9 min     0.6 sec


Optimal multiple alignment
ex: MSA, OMA

Progressive multiple alignment
ClustalW (Thompson et al. Nucleic Acids Res. 1994)
ClustalX (Thompson et al. Nucleic Acids Res. 1997)

Progressive multiple alignment

Principle: align the sequences (or groups of sequences) progressively, in pairs

Problem:

Where to start? In what order to proceed?
=> align the closest sequences first

How to evaluate the distance between sequences?
=> align all sequences pairwise
=> compute the distance between sequences from these alignments

Progressive multiple alignment

1) Pairwise alignment of all sequences

Hbb_human 3 LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
            |.| :|. | | |||| . | | ||| |: . :| |. :| | |||
Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hbb_human 1 VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
            | |. |||.|| ||| ||| :|||||||||||||||||||||:||||||
Hbb_horse 1 VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hba_human 2 LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
            || :| | | | || | | ||| |: . :| |. :| | |||.
Hbb_horse 3 LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Ex: local pairwise alignments of hemoglobin sequences

The alignment can be obtained by:
- a global or a local method
- dynamic programming or heuristic methods

Example in the ClustalX program:
=> local alignments
=> choice between:
- a heuristic method (used in Fasta) => faster
- dynamic programming (Smith & Waterman) => more reliable
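As an illustration of the dynamic programming option, here is a minimal sketch of a Smith & Waterman local alignment score. The match, mismatch and gap values are illustrative assumptions (a real run would use a substitution matrix such as Blosum and affine gap penalties), and this is not ClustalX's actual code.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap cost).

    Only two rows of the DP matrix are kept, since the score of a cell
    depends on the previous row and on the cell to its left.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman_score("ACGT", "ACT"))
```

The floor at 0 is what makes the alignment local: a poorly matching region resets the score instead of dragging it down.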

2) Construction of a distance matrix

Example in ClustalX:

$$\text{distance between 2 sequences} = 1 - \frac{\text{number of identical residues}}{\text{number of residues compared}}$$

Ex: 7 globin sequences

                  1    2    3    4    5    6    7
1 Hbb_human       -
2 Hbb_horse      .17   -
3 Hba_human      .59  .60   -
4 Hba_horse      .59  .59  .13   -
5 Myg_phyca      .77  .77  .75  .75   -
6 Glb5_petma     .81  .82  .73  .74  .80   -
7 Lgb2_lupla     .87  .86  .86  .88  .93  .90   -
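The distance above can be computed directly from a pairwise alignment. The sketch below assumes that positions where either sequence has a gap are not counted as "compared"; the function name and the gap character are illustrative choices.

```python
def identity_distance(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """1 - (identical residues / compared residues) for two aligned sequences.

    Assumption: columns containing a gap in either sequence are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gapped column: not a residue-to-residue comparison
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        raise ValueError("no residue pairs to compare")
    return 1.0 - identical / compared


if __name__ == "__main__":
    print(identity_distance("ACDE", "ACDF"))
```

For the hemoglobin examples shown earlier, which use "." for gaps, the function would be called with gap=".".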


3) Determining the order of alignment

• Sequential branching

• Construction of a tree
- Neighbor-Joining (NJ)
- UPGMA
- Maximum likelihood

[Figure: progressive alignment using sequential branching. Sequences listed as Hba_human, Hba_horse, Hbb_horse, Hbb_human, Myg_phyca, Glb5_petma, Lgb2_lupla; alignment steps numbered 1 to 6.]

[Figure: progressive alignment following a guide tree. Leaves and branch lengths: Hbb_human .081, Hbb_horse .084, Hba_human .055, Hba_horse .065, Myg_phyca .398, Glb5_petma .389, Lgb2_lupla .442; internal branches .226, .219, .061, .015, .062; alignment steps numbered 1 to 6.]
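To show how a tree fixes the order of alignment, here is a minimal sketch of UPGMA clustering applied to the 7-globin distance matrix given earlier. UPGMA is chosen only because it is the simplest of the three methods listed; the function name and data layout are illustrative assumptions, and branch lengths are not computed.

```python
from itertools import combinations

NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the distance matrix (row i against columns j < i).
LOWER = [
    [],
    [.17],
    [.59, .60],
    [.59, .59, .13],
    [.77, .77, .75, .75],
    [.81, .82, .73, .74, .80],
    [.87, .86, .86, .88, .93, .90],
]


def upgma_merge_order(names, lower):
    """Return the successive merges, closest pair of clusters first."""
    leaf_dist = {frozenset((i, j)): lower[i][j]
                 for i in range(len(names)) for j in range(i)}

    def average_distance(pair):
        # UPGMA: mean of all leaf-to-leaf distances between two clusters.
        a, b = pair
        total = sum(leaf_dist[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=average_distance)
        merges.append((tuple(names[i] for i in a),
                       tuple(names[i] for i in b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


if __name__ == "__main__":
    for step, (left, right) in enumerate(upgma_merge_order(NAMES, LOWER), 1):
        print(step, left, "+", right)
```

Each merge corresponds to one alignment step: two sequences first, then sequence against profile or profile against profile.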


4) Progressive alignment

The sequences are aligned progressively (global or local algorithm):
- alignment of 2 sequences
- alignment of a sequence and a profile
- alignment of 2 profiles

[Figure: sequences, drawn as rows of x, merged step by step into larger aligned blocks]

Construction of a profile

Profile = position-specific scoring matrix (PSSM)

$$\mathrm{Profile}(p, r) = \sum_{d=1}^{20} w_d \times \mathrm{Mat}(d, r)$$

where Mat is the substitution matrix and w_d is the weight of residue d at position p.

Scores are calculated from:

• a substitution matrix (Blosum…)

• the residue frequencies at each position

Alignment:

positions  1 2 3 4
Seq 1      T X X X
Seq 2      T X X X
Seq 3      W X X X

Profile (one row per position, one column per amino acid, 20 aa):

        A     C     D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
Pos 1  -1   -1.3
Pos 2
Pos 3
Pos 4

Calculation (with residue weight = residue frequency):
Profile (pos 1, A) = 2/3 x Blo62 (T,A) + 1/3 x Blo62 (W,A) = 2/3 x 0 + 1/3 x (-3) = -1
Profile (pos 1, C) = 2/3 x Blo62 (T,C) + 1/3 x Blo62 (W,C) = 2/3 x (-1) + 1/3 x (-2) = -1.3
...
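The calculation for position 1 can be reproduced with a few lines of code. This is a minimal sketch using residue frequencies as weights; the dictionary holds only the four Blosum62 values quoted in the calculation, and the function name is an illustrative assumption.

```python
from collections import Counter

# Only the four Blosum62 entries quoted in the worked example;
# a real profile needs the full 20 x 20 matrix.
BLOSUM62_EXCERPT = {
    ("T", "A"): 0, ("W", "A"): -3,
    ("T", "C"): -1, ("W", "C"): -2,
}


def profile_score(column: str, residue: str, matrix) -> float:
    """Profile(p, r) = sum over residues d in the column of w_d * Mat(d, r),
    with w_d taken as the frequency of d in the alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(n / total * matrix[(d, residue)] for d, n in counts.items())


if __name__ == "__main__":
    column_1 = "TTW"  # residues of Seq 1, Seq 2 and Seq 3 at position 1
    print(round(profile_score(column_1, "A", BLOSUM62_EXCERPT), 1))
    print(round(profile_score(column_1, "C", BLOSUM62_EXCERPT), 1))
```

Replacing the frequencies by sequence weights only changes how w_d is computed; the sum itself stays the same.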

Aligned sequences:

SFVCQACRKAKTKCD
LFVCQACWKSKTKCD
RLVCLQCKKIKRKCD
SFVCLRCKQRKIKCD
SKACDNCRKRKIKCN
STACVNCRKRKIKCT
SHACDQCRRKRIKCR
SRACDQCRKKKIKCD
TKACDRCHRKKIKCN
TVVCTNCKKRKSKCD

Profile (PSSM): one row per position, labelled with the consensus sequence residue; one column per amino acid (20 aa)

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y
S   0  -4  -3  -3 -11  -7  -4 -10  -1  -8  -4  -1  -6   0  -1   9   5  -7 -18  -9
F  -7  -6 -10  -7   0 -15  -1  -4  -1  -2  -1  -6 -11  -3  -2  -6  -3  -4 -12  -2
A   8  -1 -11  -7 -10 -11  -9   1  -6  -5  -2  -9  -6  -5  -7  -3  -1   7 -19 -10
C   0  32 -16 -15  -3 -18  -5  -6  -9  -8  -2 -11 -11 -12  -9  -2  -5  -2 -20  -4
D  -5 -11   2  -1 -10 -10  -4  -7  -3  -5  -4  -1  -8   0  -5  -3  -1  -6 -20  -9
N   0  -9  -1   0 -11  -5   0 -11   2 -10  -4   4  -6   6   3   0  -1  -9 -19  -7
C   0  32 -16 -15  -3 -18  -5  -6  -9  -8  -2 -11 -11 -12  -9  -2  -5  -2 -20  -4
R  -6 -10  -6  -3 -11 -11   1 -12   7 -10  -6  -3  -9   1  10  -4  -4 -11  -9  -6
K  -3 -10  -3   0 -13  -9   0 -11  13  -9  -4   0  -6   5   9  -2  -1  -9 -18  -9
R  -2  -7  -5  -2 -12 -10  -3  -9   6  -8  -4  -3  -7   1   8  -1  -1  -7 -15  -9
K  -3  -9  -3   0 -14  -9   0 -11  16 -10  -5   0  -6   4   9  -2  -2  -9 -18 -10
I  -4  -6 -12 -10  -6 -17  -9   7  -6  -1   0  -8 -10  -5  -7  -6   0   3 -16  -9
K  -3  -9  -2   1 -14  -9   0 -11  17 -10  -5   0  -6   4   7  -2  -1  -9 -19 -10
C   0  32 -16 -15  -3 -18  -5  -6  -9  -8  -2 -11 -11 -12  -9  -2  -5  -2 -20  -4
D  -6 -13  12   2 -15  -4  -2 -15   0 -14  -9   7  -7   0  -2   0   0 -13 -21  -8

Weighting of residues in a profile

ClustalW down-weights over-represented sequences.

[Figure: guide tree of the 7 globin sequences with branch lengths]

Each branch length on the path from a sequence to the root is divided by the number of sequences that share the branch:

Hbb_human   0.081 + 0.226/2 + 0.061/4 + 0.015/5 + 0.062/6 = .221
Hbb_horse   0.084 + 0.226/2 + 0.061/4 + 0.015/5 + 0.062/6 = .225
Hba_human   0.055 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = .194
Hba_horse   0.065 + 0.219/2 + 0.061/4 + 0.015/5 + 0.062/6 = .203
Myg_phyca   0.398 + 0.015/5 + 0.062/6                     = .411
Glb5_petma  0.389 + 0.062/6                               = .398
Lgb2_lupla  0.442                                         = .442
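The weights can be recomputed from the branch lengths. In this minimal sketch each sequence is described by its path to the root as (branch length, number of sequences sharing the branch) pairs, copied from the sums above; the data layout and function name are illustrative assumptions. The recomputed values can differ from the quoted ones in the third decimal, presumably because the displayed branch lengths are rounded.

```python
# (branch length, number of sequences sharing the branch), leaf to root.
PATHS = {
    "Hbb_human":  [(0.081, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)],
    "Hbb_horse":  [(0.084, 1), (0.226, 2), (0.061, 4), (0.015, 5), (0.062, 6)],
    "Hba_human":  [(0.055, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)],
    "Hba_horse":  [(0.065, 1), (0.219, 2), (0.061, 4), (0.015, 5), (0.062, 6)],
    "Myg_phyca":  [(0.398, 1), (0.015, 5), (0.062, 6)],
    "Glb5_petma": [(0.389, 1), (0.062, 6)],
    "Lgb2_lupla": [(0.442, 1)],
}


def sequence_weight(path) -> float:
    """Sum of branch lengths, each divided by the number of sequences
    that share the branch."""
    return sum(length / shared for length, shared in path)


if __name__ == "__main__":
    for name, path in PATHS.items():
        print(f"{name:<11} {sequence_weight(path):.3f}")
```

Closely related sequences (the four hemoglobins) share long branches and end up with roughly half the weight of the divergent ones.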

Gap penalties

• Linear (affine) penalty: P = x + y L, where x is the gap opening penalty, y the gap extension penalty and L the length of the gap

• Position-specific and residue-specific penalties:

In ClustalW, the penalties for introducing a gap are:

- decreased at positions where a gap already exists

- increased close to a pre-existing gap (within 8 residues)

- decreased in hydrophilic regions (loops)

otherwise: gap opening penalties are modified according to a residue-specific table (Pascarella & Argos, 1992) => relative frequency of the residues adjacent to gaps
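The affine penalty P = x + y L is simple enough to write down directly. In this minimal sketch the function name, the numeric values and the convention that a gap of length 0 costs nothing are illustrative assumptions; ClustalW's position-specific and residue-specific adjustments are not modelled.

```python
def affine_gap_penalty(length: int, opening: float, extension: float) -> float:
    """P = x + y * L for a gap of L residues.

    opening is x, extension is y. Assumption: no gap (L = 0) costs 0.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0
    return opening + extension * length


if __name__ == "__main__":
    # One gap of 4 residues versus two separate gaps of 2 residues.
    print(affine_gap_penalty(4, opening=10.0, extension=0.5))
    print(2 * affine_gap_penalty(2, opening=10.0, extension=0.5))
```

With these values a single long gap costs 12.0 while two short ones cost 22.0, so the scheme favours grouping gapped positions together.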


HLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDL
QLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDL
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLS

The final alignment will take the form of extended blocks.
Some isolated residues may be misaligned.


ClustalX

[Annotation above the alignment: helices H1 to H7; the conservation line (* : .) lost its spacing in extraction and is omitted]

HBB_HUMAN  --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDN
HBB_HORSE  --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKVKAHGKKVLHSFGEGVHHLDN
HBA_HUMAN  ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDD
HBA_HORSE  ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-----HGSAQVKAHGKKVGDALTLAVGHLDD
MYG_PHYCA  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGH
GLB5_PETMA PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTTADQLKKSADVRWHAERIINAVNDAVASMDD
LGB2_LUPLU --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVP--QNNPELQAHAGKVFKLVYEAAIQLQV

HBB_HUMAN  -----LKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
HBB_HORSE  -----LKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
HBA_HUMAN  -----MPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
HBA_HORSE  -----LPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
MYG_PHYCA  -----HEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
GLB5_PETMA T--EKMSMKLRDLSGKHAKSFQVDPQYFKVLAAVIADTVAAG---------DAGFEKLMSMICILLRSAY-------
LGB2_LUPLU TGVVVTDATLKNLGSVHVSKG-VADAHFPVVKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---


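[Figure: progressive alignment programs classified as local or global and by the method used to determine the order of alignment (SB, ML, UPGMA, NJ). Programs shown: pima (SB and ML variants), multal, multalign, pileup, clustalx.]

SB - sequential branching
UPGMA - Unweighted Pair Grouping Method
ML - maximum likelihood
NJ - neighbor-joining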


Optimal multiple alignment
ex: MSA, OMA

Progressive multiple alignment
ex: ClustalW, ClustalX

Iterative multiple alignment
ex: PRRP, SAGA

Iterative refinement

PRRP (Gotoh, 1993) refines an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them.

[Flowchart: initial alignment (global progressive) -> divide sequences into 2 groups -> profile 1 and profile 2 -> pairwise profile alignment -> refined alignment -> converged? If no, divide again and repeat.]

Genetic Algorithms

SAGA (Notredame et al. 1996) evolves a population of alignments in a quasi-evolutionary manner, iteratively improving the fitness of the population.

[Flowchart: population n -> select a number of individuals to be parents -> modify the parents by shuffling gaps, merging 2 alignments, etc. -> population n+1 -> evaluation of the fitness using an objective function (OF: sum-of-pairs or COFFEE) -> repeat, or END]
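The sum-of-pairs objective function mentioned above can be sketched as follows. The scoring values (match, mismatch, gap) are illustrative assumptions standing in for a substitution matrix and gap penalties, and gap-gap pairs are assumed to score 0; this is not SAGA's implementation.

```python
from itertools import combinations


def sum_of_pairs(alignment, match: int = 1, mismatch: int = 0,
                 gap: int = -1, gap_char: str = "-") -> int:
    """Sum, over all columns and all pairs of rows, of a pairwise score."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    score = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == gap_char and y == gap_char:
                continue          # assumption: gap-gap pairs are ignored
            if x == gap_char or y == gap_char:
                score += gap      # residue aligned with a gap
            elif x == y:
                score += match
            else:
                score += mismatch
    return score


if __name__ == "__main__":
    print(sum_of_pairs(["AC", "AC", "AG"]))
```

A genetic algorithm would compute such a score for every alignment in the population and preferentially keep the higher-scoring ones as parents.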

Multiple alignment methods

[Figure: classification of programs.
Progressive (local or global; order of alignment by SB, ML, UPGMA or NJ): pima (SB and ML variants), multal, multalign, pileup, clustalx.
Iterative: dialign, prrp, saga (genetic algorithm), hmmt (HMM).]

Comparison of programs

BAliBASE (Thompson et al. Bioinformatics 1999; Bahr et al. NAR 2001; Thompson et al. Proteins 2005)

• alignments based on superpositions of three-dimensional structures

• alignments compared only over the superposable regions

• different cases:

- number of sequences

- length of the sequences

- similarity between sequences

- "orphan" sequence / family of sequences

- subfamilies

- insertions, extensions

- …

BAliBASE

[Figure: two test cases.
"Orphan" sequences: Family (>25% ID) plus Orphan Sequence (…)
Families of sequences: Family 1 (>25% ID), Family 2 (>25% ID), Family 3 (>25% ID)]

Choosing a program

Global/Local
Collinear sequences => global methods
N/C-terminal extensions or insertions => local methods

Progressive/Iterative
Iterative methods generally improve the alignment
Problems:
- Orphan sequences
- The iterative process can be very slow!

89 histone sequences (66 to 92 aa):
ClustalW   2 mins 41 secs
PRRP       3 hours 40 mins
Dialign    3 hours 48 mins

To improve the alignment, include as many sequences as possible!

Multiple alignment methods

[Timeline, 1975 to 2005]
1975  Optimal alignment
1987  Progressive alignment: Clustal, PIMA, MultAlign, PileUp
1994  McClure
1996  Iterative strategies: PRRP, SAGA, Dialign, HMMER
1999  BAliBASE
2000  Co-operative strategies: DbClustal, T-Coffee, MAFFT, MUSCLE, ProbCons

Combining approaches

• T-Coffee (Notredame et al. 2000) performs local and global alignments for all pairs of sequences, then combines them in a progressive multiple alignment, similar to ClustalW.

• DbClustal (Thompson et al. 2000) is designed to align the sequences detected by a database search. Locally conserved motifs are detected using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as anchor points.

• MAFFT (Katoh et al. 2002) detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP and a progressive algorithm.

DbClustal

Integrates similarity search
Couples local and global alignment

Starting from a "query" sequence:

1) Search for similar sequences => Blast

2) Search for LMS (Local Maximum Segments) => Ballast

3) Global alignment integrating the local anchors provided by Ballast

http://bips.u-strasbg.fr/PipeAlign/

Ballast

[Figure: LMS (local maximum segments) computed along the query, S. cerevisiae GAL4 regulatory protein, from database hits with E(N) < 0.1 and E(N) > 0.1. Annotated regions: I to VIII, Zn2 Cys6, putative inhibitory domain.]

Plewniak et al. Bioinformatics 2000

DbClustal

[Figure: Blast database search (query sequence, database hits) -> Ballast anchors -> DbClustal alignment]

Comparison ClustalW / DbClustal

[Figure: ClustalW and DbClustal alignments compared; labels: Domain A, Domain B, Domain C]

MAFFT

• Local homologous segments detected using a Fast Fourier Transform

• Pairwise alignments are performed using restricted global dynamic programming

• Multiple alignment is built up using a progressive algorithm, similar to ClustalW

• Multiple alignment is then iteratively refined by dividing alignment into 2 parts and realigning

MAFFT

Pairwise alignments

1. Fast Fourier Transform to detect local conserved segments

[Figure: correlation c(k) plotted against the lag k between two example sequences, with the sequences shown shifted at k=2 and k=-1]

2. Segment Level Dynamic Programming to select 'consistent' segments

3. Fix residues at the centre of each segment pair and realign between fixed points (white regions only)

MUSCLE
Edgar et al, NAR 2004

