Long-range correlations in genomic sequences:

Post on 13-Jan-2016


Long-range correlations in genomic sequences:

relation to the structure and dynamics of nucleosomes

Multi-scale analysis of genomes

Study of long-range correlations in genomes

Study of the "global properties" of genomic sequences (Voss, 92; Peng et al., 92):

• the identity of a nucleotide depends on that of other nucleotides at large distances (up to the kb range)

• long-range correlations observed in introns (non-coding) but not in exons (coding)

• methodological controversy (compositional heterogeneity of genomes)

Proposed biological mechanisms:

• genome dynamics:

• replication-mutation (Li, 93);

• insertion-deletion (Buldyrev et al., 93);

• tandem repeats (Dokholyan et al., 97; Li et al., 98)

Construction of a random sequence

How the new nucleotide (nt) is chosen:

- without correlations: …ACAGTACT G does not depend on other nucleotides

- with short-range correlations: …CGATTAAC A depends on a few neighbouring nucleotides (Markov chains)

- with long-range correlations: …TCCGACGG A depends on all nucleotides over large distances, with a power-law correlation function ($1/d^{\gamma}$)

In genomic sequences these correlation properties can extend over tens to thousands of bp
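The first two constructions can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `p_repeat` parameter and the "identity rate" used as a crude correlation measure are assumptions made here, not part of the slides.

```python
import random

ALPHABET = "ACGT"

def uncorrelated_sequence(n, rng):
    """Each new nucleotide is drawn independently of all previous ones."""
    return "".join(rng.choice(ALPHABET) for _ in range(n))

def markov_sequence(n, rng, p_repeat=0.7):
    """First-order Markov chain: the new nucleotide depends only on the previous one.

    p_repeat (hypothetical parameter) is the probability of repeating the
    previous nucleotide; otherwise one of the three others is drawn uniformly.
    """
    seq = [rng.choice(ALPHABET)]
    for _ in range(n - 1):
        prev = seq[-1]
        if rng.random() < p_repeat:
            seq.append(prev)
        else:
            seq.append(rng.choice([c for c in ALPHABET if c != prev]))
    return "".join(seq)

def identity_rate(seq, d):
    """Fraction of positions i where seq[i] == seq[i + d]."""
    pairs = len(seq) - d
    return sum(seq[i] == seq[i + d] for i in range(pairs)) / pairs

rng = random.Random(1)
random_seq = uncorrelated_sequence(20_000, rng)
markov_seq = markov_sequence(20_000, rng)
for d in (1, 5, 50):
    print(d, round(identity_rate(random_seq, d), 3), round(identity_rate(markov_seq, d), 3))
```

For the uncorrelated sequence the identity rate stays near 1/4 at every distance; for the Markov chain it starts high and relaxes quickly towards 1/4, which is the short-range behaviour described above.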

Long-range correlations are scale-invariant

What are long-range correlations in DNA sequences?

Nucleotide compositions are correlated with each other in the same manner whatever the scale

Scale-invariant processes in genomic sequences

[Figure: DNA viewed at three scales: L = 10 nt, 10 L = 100 nt, 100 L = 1000 nt]

Scale-invariant processes

- "zooming" on the sequence does not change the shape of the correlation function

- the correlation function is a power law: $C(L) \sim L^{-\gamma}$, so that $C(kL) \sim k^{-\gamma} L^{-\gamma}$

- no characteristic scale

Long-range correlations

Invariance by dilation - fractal structure

These particular correlation properties differ from repeated motifs or periodic patterns

Processes described by Markov chains

- a nucleotide depends only on the adjacent nucleotides

- the correlation function is: $C(L) \sim \exp(-L/L_0)$

- characteristic length $L_0$: at distances larger than $L_0$, the nucleotides have no influence

Short-range correlations

Invariance by translation

What biological mechanisms?

Long-range correlations:

- observed in introns (non-coding regions)

- absent in exons (coding regions)

Long-range correlations can be generated by genome dynamics:

- expansion-modification systems (duplication-mutation systems, Li, 1991)

- oligonucleotide repeats (Li and Kaneko, 1992)

- insertion-deletion of pseudogenes (Buldyrev et al., 1993)

- tandem repeats (Dokholyan et al., 97; Li et al., 98)

[Figure: duplications along a gene with exons and introns]

FIRST HYPOTHESIS: long-range correlations are a consequence of genome dynamics

Duplication-mutation

Processes without a characteristic length scale

If the duplication rate is high (e.g. 0.9) and the mutation rate is small (0.1), the sequence exhibits long-range correlations (1/f power spectra) as it becomes longer and longer.

[Figure: a DNA fragment undergoing duplications and mutations over time]
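A toy duplication-mutation process can be simulated as follows. The exact rule used here (each symbol is either duplicated or flipped, on a binary alphabet) is an assumption chosen for illustration, loosely in the spirit of expansion-modification systems; it is not claimed to be the precise model of the cited papers.

```python
import random

def duplication_mutation_step(seq, rng, p_dup=0.9):
    """One generation: each symbol is duplicated with probability p_dup,
    otherwise it is mutated (flipped 0 <-> 1) without being duplicated."""
    out = []
    for s in seq:
        if rng.random() < p_dup:
            out.extend((s, s))   # duplication
        else:
            out.append(1 - s)    # mutation
    return out

def grow(generations, rng, p_dup=0.9, seed_seq=(1,)):
    """Apply the duplication-mutation step repeatedly, starting from seed_seq."""
    seq = list(seed_seq)
    for _ in range(generations):
        seq = duplication_mutation_step(seq, rng, p_dup)
    return seq

rng = random.Random(0)
seq = grow(15, rng)
print(len(seq), sum(seq) / len(seq))
```

Because every block of the final sequence descends from a single ancestral symbol, similar compositions recur at all block sizes, which is the intuition behind the absence of a characteristic length scale.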

Quantification of LRC

Sequence length = 260 000 nt (50 % purines); Pu/Pyr coding

[Figure: purine content C_Pu in windows of w1 = 32 bp and w2 = 512 bp, for an uncorrelated sequence and a correlated sequence; Pu/Pyr-coded signal shown at w = 1 bp, w = 32 bp and w = 512 bp]

Ratio of the fluctuations of C_Pu between the two window sizes:

$\sigma_1 / \sigma_2 = (w_1 / w_2)^{H-1}$

- uncorrelated sequence: $(32/512)^{0.5-1} = 4$

- correlated sequence: $(32/512)^{0.9-1} = 1.3$

$\log \sigma = (H-1) \log w + \mathrm{Cte}$

H: roughness exponent

H = 0.5: no LRC

H > 0.5: LRC
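The two-window estimate can be tried on a synthetic uncorrelated sequence. This is a minimal sketch under stated assumptions: the fluctuation is taken to be the standard deviation of the purine fraction over non-overlapping windows, and the function names are hypothetical.

```python
import math
import random
import statistics

def window_std(signal, w):
    """Standard deviation of the purine fraction over non-overlapping windows of width w."""
    fractions = [sum(signal[i:i + w]) / w for i in range(0, len(signal) - w + 1, w)]
    return statistics.pstdev(fractions)

def expected_ratio(w1, w2, H):
    """sigma_1 / sigma_2 = (w1 / w2) ** (H - 1)."""
    return (w1 / w2) ** (H - 1)

def estimate_H(signal, w1=32, w2=512):
    """Invert the ratio formula: H = 1 + log(sigma_1 / sigma_2) / log(w1 / w2)."""
    s1, s2 = window_std(signal, w1), window_std(signal, w2)
    return 1 + math.log(s1 / s2) / math.log(w1 / w2)

rng = random.Random(0)
purines = [rng.randint(0, 1) for _ in range(260_000)]  # 1 = purine, 0 = pyrimidine

print(expected_ratio(32, 512, 0.5), expected_ratio(32, 512, 0.9))
print(estimate_H(purines))
```

For an uncorrelated sequence the estimate should come out close to H = 0.5, since the fluctuations of the window average then decay as the inverse square root of w.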

Properties of LRC

Sequence length = 260 000 nt (50 % purines)

[Figure: uncorrelated sequence (H = 0.5, no LRC) and correlated sequence (H > 0.5, LRC: persistence, small "roughness")]

[Figure: $\log_2 \sigma(w)$ versus $\log_2 w$, curves for H = 0.8 and H = 0.5]

H = 0.5: no LRC

H > 0.5: LRC

1 - Straight line: scale invariance properties

2 - The slope gives the roughness exponent H

A unique way to display results

[Figure: $\log_2 \sigma(w) - 0.6 \log_2 w$ versus $\log_2 w$, curves for H = 0.8 and H = 0.5]

$\log \sigma = (H-1) \log w + \mathrm{Cte}$

$\sigma_1 / \sigma_2 = (w_1 / w_2)^{H-1}$
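Reading H off a log-log plot amounts to a straight-line fit. The sketch below assumes the relation log σ = (H - 1) log w + Cte, so that H is the fitted slope plus one; the data are synthetic and noise-free, and the function names are hypothetical.

```python
import math

def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def roughness_exponent(widths, sigmas):
    """H from log sigma = (H - 1) log w + Cte, i.e. H = slope + 1."""
    xs = [math.log2(w) for w in widths]
    ys = [math.log2(s) for s in sigmas]
    return fit_slope(xs, ys) + 1

widths = [8, 16, 32, 64, 128, 256, 512]
for H in (0.5, 0.8):
    sigmas = [w ** (H - 1) for w in widths]  # synthetic, noise-free fluctuations
    print(H, round(roughness_exponent(widths, sigmas), 3))
```

On real data the points only follow a straight line over a limited range of scales, so the range of `widths` used for the fit has to be chosen with care.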

A WAY TO MEASURE: THE WAVELET TRANSFORM

A. Grossmann & J. Morlet 1984

Computation of the wavelet coefficients:

$T_\psi[f](x_0, w) = \frac{1}{w} \int_{-\infty}^{+\infty} f(x)\, \psi\left(\frac{x - x_0}{w}\right) dx$

$x_0$: space parameter

$w$: scale parameter

Advantage: the wavelet transform eliminates the composition biases

[Figure: analysing wavelets g(1) and g(2)]

Quantification of LRC

[Figure: DNA -> coding -> signal -> wavelet transform (WT) at w = 8 bp and w = 128 bp -> H; WT coefficients large or small; plot of $\log_2 \sigma(w) - 0.6 \log_2 w$ versus $\log_2 w$ with the values at w = 8 and w = 128 and the lines H = 0.8 and H = 0.5]
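A discrete version of the wavelet coefficient shows why a constant composition bias drops out. This sketch assumes the analysing wavelet is the first derivative of a Gaussian and approximates the integral by a sum truncated at five scale units; both choices, and the function names, are assumptions made for illustration.

```python
import math

def gaussian_derivative(u):
    """First derivative of a Gaussian: -u * exp(-u**2 / 2)."""
    return -u * math.exp(-0.5 * u * u)

def wavelet_coefficient(signal, x0, w, psi=gaussian_derivative, support=5.0):
    """Discrete approximation of T[f](x0, w) = (1/w) * sum_x f(x) * psi((x - x0) / w)."""
    half = int(support * w)
    total = 0.0
    for offset in range(-half, half + 1):
        x = x0 + offset
        if 0 <= x < len(signal):
            total += signal[x] * psi(offset / w)
    return total / w

step = [0.0] * 50 + [1.0] * 50      # a signal with one jump at x = 50
biased = [v + 10.0 for v in step]   # same signal with a constant offset

coeff = wavelet_coefficient(step, 50, 4)
coeff_biased = wavelet_coefficient(biased, 50, 4)
print(coeff, coeff_biased)
```

The two coefficients agree up to rounding error because the wavelet has zero mean, so any constant added to the signal is integrated away; only the local variation at scale w contributes.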

Presence of LRC in exonic sequences (human)

[Figure: curves for intron, all exons and exon (high GC)]

Two regimes of LRC

S. cerevisiae (panels labelled A): regime I, H = 0.6; regime II, H = 0.8

E. coli: regime I, H = 0.5; regime II, H = 0.8

H: regime I, H = 0.6; regime II, H = 0.8

nucleosomes?

STRUCTURAL HYPOTHESIS:

the LRC are associated with the bending of DNA in nucleosomes

Long-range correlations between DNA bending sites?

Presence of LRC in exonic sequences

necessity of a new hypothesis

Test

Existence of long-range correlations between di- and trinucleotides associated with DNA bending in nucleosomes?

- nucleosomal DNA bending table (Pnuc) -> LRC? (Andrew & Travers, 1986)

Control:

- DNase bending table (DNase) -> no LRC? (Satchwell et al., 1995)

- eubacteria (no nucleosomes) -> no LRC?

Nucleosome-based bending table (Pnuc)

chromatin fiber

-> nuclease digestion of linker DNA

-> released nucleosomes

-> dissociation of histones

-> 146-nucleotide DNA fragments

-> cloning and sequencing of nucleosomal DNA

-> sequence analysis of aligned nucleosomal fragments (Fourier transform)

-> Pnuc table

DNase I bending table (DNase)

DNase I induces bending

DNase activity is favoured by DNA flexibility

digestion of known DNA fragments by DNase I

-> measurement of cutting efficiency along the DNA molecule

-> sequence analysis of the cutting profile

-> DNase table

(Luger et al., Nature, 1997)

Analysis of nucleosomal DNA

Fourier analysis

[Figure: AAA frequency versus position (0 to 60) along nucleosomal DNA; A-tracts preferred where the minor groove faces inside]

Different ways of coding sequences

DNA sequence -> coding -> signal -> treatment -> H

- text profile (mononucleotide coding): A T G A T C -> +1 -1 -1 +1 -1 -1

- nucleosomal profile (Pnuc coding): A T G A T C -> one table value per trinucleotide

- flexibility profile (DNase coding): A T G A T C -> one table value per trinucleotide

[Figure: table values shown on the slide for the example sequence: 6.7, 5.4 and 8.7, 10]
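The codings can be sketched as simple lookups. The mononucleotide coding below reproduces the +1/-1 example from the slide; the trinucleotide table holds made-up toy values and is NOT the published Pnuc or DNase table. Function names are hypothetical.

```python
def mononucleotide_coding(seq, target="A"):
    """+1 where the nucleotide equals `target`, -1 elsewhere."""
    return [1 if nt == target else -1 for nt in seq]

def trinucleotide_coding(seq, table, default=0.0):
    """Replace each overlapping trinucleotide by its value in `table`."""
    return [table.get(seq[i:i + 3], default) for i in range(len(seq) - 2)]

# Hypothetical toy values for illustration only, not the Pnuc or DNase tables.
TOY_TABLE = {"ATG": 1.0, "TGA": 2.0, "GAT": 3.0, "ATC": 4.0}

mono = mononucleotide_coding("ATGATC")
tri = trinucleotide_coding("ATGATC", TOY_TABLE)
print(mono)
print(tri)
```

Whatever the coding, the output is a numerical signal of (almost) the same length as the sequence, which can then be passed to the same scaling analysis to obtain H.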

[Figure: Pnuc and DNase codings; labels I, II, H = 0.5, H = 0.8]

H (Pnuc) > H (DNase)

[Figure: Human (chr 21); codings Pnuc, DNase and random table; labels I, II, H = 0.6, H = 0.8]

EUKARYOTES: Human, C. elegans, D. melanogaster, A. thaliana

EUBACTERIA: B. subtilis, M. pneumoniae, H. influenzae, Synechocystis

DNA viruses

Bacteriophages: T4, Lambda, SPBc2

Animal viruses: Adenovirus, Herpesvirus, M. sanguinipes (Pox)

RNA viruses

ss RNA (-), ss RNA (+), ds RNA

Retroviruses: Spumavirus, MMTV, HIV (1,2)

New test of the structural hypothesis

- A's present LRC

- A tracts induce DNA curvature

- are these LRC specific to A tracts?

Test: A tracts (curvature) -> LRC?

Control: isolated A -> LRC?

Human (chr 21)

[Figure: codings A, Aiso (isolated A), AA (A tracts), Pnuc and DNase]

LRC are associated with A tracts, not with isolated A

Structural hypothesis: LRC are associated with DNA curvature
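The test and control codings can be sketched as two +1/-1 signals. The definition used here (an A belongs to a tract when at least one neighbour is also an A, and is isolated otherwise) is an assumption; the slides do not state the minimum tract length.

```python
def a_tract_codings(seq):
    """Return (tract, isolated): +1/-1 signals marking A's inside runs of
    two or more A's, and A's with no adjacent A, respectively."""
    n = len(seq)
    tract, isolated = [], []
    for i, nt in enumerate(seq):
        is_a = nt == "A"
        has_a_neighbour = (i > 0 and seq[i - 1] == "A") or (i + 1 < n and seq[i + 1] == "A")
        tract.append(1 if is_a and has_a_neighbour else -1)
        isolated.append(1 if is_a and not has_a_neighbour else -1)
    return tract, isolated

tract, isolated = a_tract_codings("AATCAG")
print(tract)
print(isolated)
```

Analysing the two signals separately makes it possible to ask whether the LRC seen with the plain A coding come from the tracts or from the isolated A's.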

Question

- to what extent does the DNA sequence contribute to its own packaging into nucleosomes?

Contradictory answers

- Nucleosomal DNA is "periodic" (Drew & Travers, 1985, JMB; Bina, 1994, JMB)

- Affinity of eukaryotic DNA for the histone octamer (Lowary & Widom, 1997, JMB):

5 % of genomic sequences: strong affinity

95 % of bulk genomic DNA ~ random DNA

Two types of nucleosomes:

I - strongly bound: periodic distribution of bending sites (5 % of genomic DNA)

II - weakly bound: same bending sites, "apparently random" (95 % of genomic DNA)

Model

For most nucleosomes (weakly bound) the bending sites are distributed with long-range correlations.

The persistent nature of the distribution of bending sites favours the dynamics of nucleosome formation and diffusion: displacement requires less energy, as in super-diffusive processes.

This organisation of genome sequences favours dynamical processes.

[Figure: periodic distribution: H not defined; long-range correlations (persistence): H > 0.5]

[Figure: Human globin locus (70 kb): DNA, globin genes, weakly bound nucleosomes; axis in bp]

Presence of LRC in organelles

Few bacteria present LRC in the 0-200 nt range

Hypothesis: is DNA packaging in the 0-200 nt range specific to these bacteria?

Presence of LRC (in the 0-200 nt range) in archaebacteria

[Figure: Archaeoglobus fulgidus, G coding]

The Pnuc coding does not best "extract" LRC in archaebacteria

[Figure: G coding for Aeropyrum pernix (56.3% GC) and Sulfolobus solfataricus (35.8% GC)]

Conclusion

Long-range correlations between DNA bending sites in the 10-200 nt range are a signature of nucleosomes.

Model

The persistent nature of the distribution of bending sites favours the dynamics of chromatin.

Perspectives

Find the DNA structural codings (related to DNA packaging?) that better "extract" the LRC in genomic sequences.

Samuel Nicolay

Cédric Vaillant

Alain Arnéodo

ENS-Lyon

Benjamin Audit

EMBL-EBI, Cambridge

Marie Touchon

Yves d'Aubenton-Carafa

C. Thermes

CGM, Gif sur Yvette