The course · IFT6299 A2009 · UdeM · Miklós Csűrös

IFT 6299 AUTUMN 2008

Evolutionary genomics (Special topics in bioinformatics)

Miklós Csűrös

André-Aisenstadt 3149
[email protected]

http://www.iro.umontreal.ca/~csuros/IFT6299/

Bioinformatics

Often defined as the application of computational methods to the analysis of biological data

Here, we will instead try to develop an integrative approach: biology and computer science

To that end, we will examine recent studies in genomics, discussing the biological discoveries as well as the algorithmic methods and formal models used to arrive at them.

Objective

Goal: become familiar with genomics and with the computational methods used in studying the evolution of genomes

Secondary benefits:
- algorithmic techniques (dynamic programming, graph analysis, heuristic filtering)
- introduction to computational biology without prerequisites in biology
- reading well-written articles

Course outline

Prerequisites: (none formal)
– discrete structures and algorithms: graphs, trees [IFT2010, IFT2121]
– probability: distributions, random variables, expectation, stochastic processes [MAT1978]
– very little biology
– no (or very little) overlap with IFT6291/BIN6000

Schedule: to be determined

Material

Style: interactive seminar, discussions

no textbook, 70+ recent articles

lecture notes posted before or just after class

roughly one article to read each week

4 mini-quizzes on the required reading: 10% of the grade (no nasty questions...)

project 30%

2 theoretical assignments 30%

presentation of a topic 30%

Topics

• Basic notions of molecular biology: genes, transcription, composition of a genome

• Notions of homology between genes

• Evolution of gene repertoires, inference of function from sequence similarity and genomic context

• Genome alignment, heuristic homology search, hashing techniques for molecular sequences

• Genome annotation, genome browsers

• Probabilistic models of molecular sequence evolution: statistical alignment, phylogenetic inference

• Comparative genomics for eukaryotes: footprinting, shadowing

• Biological networks: characteristics, evolution and inference

• Next-generation sequencing

Introduction: bio · IFT6299 A2009 · UdeM · Miklós Csűrös

BIOLOGICAL CONCEPTS

François Jacob

Nobel Prize in 1965, member of the Académie française since 1996

We are made of a strange mixture of nucleic acids and memories, of dreams and proteins, of cells and words.

[...]

The living world is made of combinations of a finite number of elements and resembles the products of a gigantic Meccano set, resulting from the incessant tinkering of evolution.

“Meccano”, Wikipedia

DNA?

Polynucleotide: polymer (i.e., chain) of nucleotides

b1 b2 b3 ... bN

Nucleotides

[Figure: a nucleotide, with labels: sugar, phosphate group, nitrogenous base]

4 bases: adenine (A), cytosine (C), guanine (G), thymine (T)

DNA: complementary bases

Access Excellence
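Base complementarity means that one strand determines the other. A minimal sketch, assuming the standard Watson-Crick pairing (A with T, C with G) and sequences written as plain strings; the function names are illustrative and not part of the course material:

```python
# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Base-by-base complement, read in the same direction as the input."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(reverse_complement("ATGC"))
```

Taking the reverse complement twice gives back the original strand, which is a quick sanity check for any implementation.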

Watson and Crick: double helix

1 turn of the helix: 10 base pairs, 3.4 nm (usual form)

The article

Another model: the triple helix

Pauling & Corey, PNAS 39:84 (1953)

DNA - structure

the DNA molecule can be linear (our chromosomes) or circular (bacteria, organelles)

Ross Inman

DNA

DNA storage: in the cell nucleus (in eukaryotes), organized into chromosomes

[exceptions: organelles, plasmids]

1. copying of the information into DNA: heredity

2. copying of the information into RNA (often translated into proteins)

Chromosomes

22 diploid chromosomes (autosomes, present in pairs) and 2 sex chromosomes

sizes: 250 · 10^6 to 23 · 10^6 base pairs
total size of the human genome: 3 · 10^9 bp

NCBI
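The helix geometry given earlier (10 base pairs and 3.4 nm per turn) can be combined with these sizes to estimate how long the DNA is when stretched out. A rough sketch, assuming the usual helix form throughout and ignoring all packaging; the function name is hypothetical:

```python
NM_PER_TURN = 3.4   # helix pitch, from the double-helix slide
BP_PER_TURN = 10    # base pairs per turn, from the same slide


def dna_length_m(base_pairs, nm_per_turn=NM_PER_TURN, bp_per_turn=BP_PER_TURN):
    """Stretched-out length in metres of a double helix with `base_pairs` pairs."""
    turns = base_pairs / bp_per_turn
    return turns * nm_per_turn * 1e-9


# One copy of the genome (3 * 10^9 bp) and the largest chromosome (250 * 10^6 bp).
print(f"genome: {dna_length_m(3e9):.2f} m")
print(f"largest chromosome: {dna_length_m(250e6) * 100:.1f} cm")
```

Under these assumptions one copy of the genome is about a metre of DNA, which has to fit into a cell nucleus.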

Genome

Definition 1. The genome is the whole hereditary information of an organism.

In other words, it’s the blueprint for an organism.

Physically, it is stored in DNA.

Definition 2. The genome is the DNA sequence of a complete set of chromosomes.

(Chromosome = DNA molecule “packaged” by proteins)

[My] problem with the theoretical definition: is it really all the information?

What’s in the genome

This is a part of chromosome 16

some interesting genes

classes of frequently occurring (repeated) elements

"stutter"

University of California Santa Cruz genome browser

Genome constitution

human genome: 3200 M
- genes: 48 M
- gene-related: 1152 M (introns, ..., gene fragments, pseudogenes)
- intergenic sequence: 2000 M
  - interspersed repeats: 1400 M
  - tandem repeats: 90 M
  - unique: 510 M

after Watson et al., Molecular Biology of the Gene (2004)
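The sizes on this slide are easier to compare as fractions of the whole genome. A small sketch using only the numbers above (in millions of base pairs); the dictionary keys are shorthand labels chosen here, with "unique intergenic" standing for the slide's "unique" category:

```python
# Sizes in millions of base pairs (M), copied from the slide.
GENOME_M = 3200
COMPONENTS_M = {
    "genes": 48,
    "gene-related": 1152,
    "interspersed repeats": 1400,
    "tandem repeats": 90,
    "unique intergenic": 510,
}


def fractions(components, total):
    """Share of the total taken by each component, as a number in [0, 1]."""
    return {name: size / total for name, size in components.items()}


# The leaf categories should add up to the whole genome.
assert sum(COMPONENTS_M.values()) == GENOME_M

for name, frac in fractions(COMPONENTS_M, GENOME_M).items():
    print(f"{name:22s} {100 * frac:5.1f}%")
```

Genes come out at 48 / 3200 = 1.5% of the genome.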

Quick calculation

Human genome : 3 billion letters

About a million pages (3000 characters per page)

A book that is 60 meters (200 feet) thick . . .

Only 20 thousand genes — word count for a novella

What is the rest?
- “junk”?
- unknown functionality? Maybe we have such a large genome because we are such complex organisms ...
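The book analogy can be checked in a few lines. A sketch using only the slide's figures (3 billion letters, 3000 characters per page, 60 m thick); the per-page thickness is simply what those figures imply, not a number from the slide:

```python
GENOME_LETTERS = 3_000_000_000   # from the slide
CHARS_PER_PAGE = 3_000           # from the slide
BOOK_THICKNESS_M = 60            # from the slide


def pages_needed(letters, chars_per_page):
    """Number of pages needed to print `letters` characters (rounded up)."""
    return -(-letters // chars_per_page)


def thickness_per_page_mm(book_thickness_m, pages):
    """Paper thickness per page implied by the total thickness of the book."""
    return book_thickness_m * 1000 / pages


pages = pages_needed(GENOME_LETTERS, CHARS_PER_PAGE)
print(pages, "pages")
print(thickness_per_page_mm(BOOK_THICKNESS_M, pages), "mm per page")
```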

Many genomes

National Center for Biotechnology Information (US)

Genome size and organismal complexity

Gregory, Nat Rev Genet 6:699-708, 2005

Molecular evolution

[Figure: page 7749 of Avise & Wollenberg (1997). Its Fig. 1 shows a phylogeny for five biological species (A–E) and two geographically separated populations (C1 and C2) of C; its Fig. 2 shows the same phylogeny as organismal pedigrees through 21 discrete generations leading to the present.]

generations in a mating population

population split = speciation

extant species: A, B, C, D, E (C is a paraphyletic group)

Avise & Wollenberg, PNAS 94:7748–7755, 1997

Evolutionary change

Central notion: fixation. A genetic mutation that appears within a single individual has to spread to the whole population in upcoming generations.

[Figure: number of “mutants” plotted against time. In one trajectory the mutation is ultimately fixed; in another the mutation ultimately disappears. In between: genetic diversity within the population.]
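The two possible fates of a mutation (fixation or disappearance) can be illustrated by simulation. A minimal sketch, assuming a neutral Wright-Fisher model with a haploid population of constant size; the model choice and all names are assumptions made for illustration, not taken from the slides:

```python
import random


def wright_fisher(pop_size, start_count=1, rng=None, max_generations=100_000):
    """Follow the number of mutant copies until fixation or loss.

    Each individual of the next generation picks its parent uniformly at
    random from the current one (no selection). Returns the list of mutant
    counts, one entry per generation.
    """
    if rng is None:
        rng = random.Random()
    count = start_count
    trajectory = [count]
    while 0 < count < pop_size and len(trajectory) <= max_generations:
        freq = count / pop_size
        count = sum(1 for _ in range(pop_size) if rng.random() < freq)
        trajectory.append(count)
    return trajectory


def fixation_fraction(pop_size, trials, seed=0):
    """Fraction of simulated new mutations that end up fixed."""
    rng = random.Random(seed)
    fixed = sum(
        wright_fisher(pop_size, rng=rng)[-1] == pop_size for _ in range(trials)
    )
    return fixed / trials


print(wright_fisher(10, rng=random.Random(1)))
print(fixation_fraction(pop_size=10, trials=2000))
```

In this model a single neutral mutant is fixed with probability 1 / pop_size, so most runs end at zero copies and only a few reach the whole population.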

Genetic drift

Chances of fixation are determined by
(1) population size, and
(2) fitness effects of the mutation

(1)
- mutations have a higher chance of fixation in a smaller population
- large populations will display more diversity

(2)
- even deleterious mutations have a chance to get fixed
- especially if the population is small (“population bottlenecks”)
- advantageous mutations have a higher chance to get fixed
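These qualitative statements can be made concrete with Kimura's diffusion approximation for the fixation probability of a single new mutant. The sketch below assumes a diploid population of constant size N (effective size equal to census size), an additive selection coefficient s, and an initial frequency of 1/(2N); none of these assumptions is stated on the slide, and the function name is illustrative:

```python
import math


def fixation_probability(pop_size, s):
    """Approximate fixation probability of one new mutant copy.

    pop_size: number of diploid individuals N (assumed constant).
    s: selection coefficient (s > 0 advantageous, s < 0 deleterious).
    """
    if abs(s) < 1e-12:
        return 1.0 / (2 * pop_size)        # neutral limit: 1 / (2N)
    exponent = -4.0 * pop_size * s
    if exponent > 700:                      # exp() would overflow here
        return 0.0                          # effectively never fixed
    # (1 - exp(-2s)) / (1 - exp(-4Ns)), written with expm1 for accuracy
    return math.expm1(-2.0 * s) / math.expm1(exponent)


for n in (10, 100, 1000):
    print(n,
          fixation_probability(n, 0.0),     # neutral
          fixation_probability(n, 0.01),    # advantageous
          fixation_probability(n, -0.01))   # deleterious
```

With these assumptions a deleterious mutation keeps a non-negligible chance of fixation when N is small, while in a large population that chance is close to zero.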

Neutral theory of evolution

Terminology:
- negative or purifying selection (against deleterious mutations)
- positive selection (for advantageous mutations)
- neutral evolution (mutations not affecting evolutionary fitness)

Which one is more common?

The main cause of evolutionary change at the molecular level (changes in the genetic material itself) is random fixation of selectively neutral or nearly neutral mutants.

(Motoo Kimura)
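A standard consequence of this view, written out as a short derivation. It assumes a diploid population of $N$ individuals and a neutral mutation rate $\mu$ per gene copy per generation (symbols introduced here, not on the slide). Each generation produces $2N\mu$ new neutral mutations, and each is eventually fixed with probability $1/(2N)$, so the rate $k$ at which neutral mutations are fixed is

```latex
\[
  k \;=\; \underbrace{2N\mu}_{\text{new neutral mutations per generation}}
  \;\times\;
  \underbrace{\frac{1}{2N}}_{\text{fixation probability of each}}
  \;=\; \mu .
\]
```

Under these assumptions the population size cancels: the neutral substitution rate equals the mutation rate.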

immediate use of the second sequenced mammalian genome [mouse]: discern slowly evolving regions in the human genome; about 5% of the human genome evolves slowly (Chiaromonte et al., 2003)

maybe they evolve slowly because they are “mutation coldspots”? No: slowly evolving regions really are under negative selection (Drake et al., 2006)

Slowly evolving regions

mutations tend to be selected against in functional regions of the genome

⇒ regions of common evolutionary origin are more similar between organisms if they have a function (since they have evolved more slowly)
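This observation suggests a simple way to look for candidate functional regions: slide a window along an alignment of two genomes and keep the windows with high similarity. A toy sketch, assuming the two sequences are already aligned to equal length (with "-" for gaps); the window size, threshold and function names are arbitrary illustrative choices:

```python
def percent_identity(a, b):
    """Fraction of aligned columns where both sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def conserved_windows(a, b, window, threshold):
    """Start positions of windows whose identity is at least `threshold`."""
    return [
        i
        for i in range(len(a) - window + 1)
        if percent_identity(a[i:i + window], b[i:i + window]) >= threshold
    ]


# Toy alignment: the first 8 columns are identical, the last 8 all differ.
seq_a = "ACGTACGTAAAACCCC"
seq_b = "ACGTACGTTTTTGGGG"
print(conserved_windows(seq_a, seq_b, window=8, threshold=1.0))
```

Real methods model the neutral rate and the phylogeny explicitly; this sketch only shows the basic idea of comparing similarity along the sequence.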

Functional regions

Encyclopedia of DNA Elements (ENCODE) project: 1% of the human genome analyzed thoroughly for functionality with every experimental technique available

(1) of the roughly 5% of the genome that evolves slowly, genes account for 1.5%, and the remaining 3.5% is largely uncharted territory of functional elements

[Figure: page 810 of the ENCODE Project Consortium article (Nature 447, 14 June 2007), including its Figure 10, “Relative proportion of different annotations among constrained sequences”: across all 44 ENCODE regions (29,998 kb), 4.9% of bases are constrained; of these constrained bases, 32% are coding, 8% UTRs, 20% other ENCODE experimental annotations, and 40% unannotated.]

UTRs = untranslated regions

(2) but 95% of our genome really is neutrally evolving (“junk”)

ENCODE Project Consortium, Nature 447:799–816, 2007

Mutations

(1) replication errors: alteration, deletion, or insertion of a few nucleotides when copying DNA

(2) genomic duplication: while untangling a “knot” between single DNA strands (recombination), the resulting double-stranded DNAs end up with an extra stretch of DNA or an erasure

→ genomic duplications may create duplicate genes which can then evolve separately (main mechanism for appearance of new functions)

(3) reverse transcription: a copy of an expressed element (RNA) gets incorporated into DNA

→ Long Interspersed Nuclear Elements (LINEs) encode genes with complete functionality for their own propagation through reverse transcription; they amount to 1/5 of our genome (900 thousand copies)

(...) other esoteric phenomena: e.g., mobile DNA elements (without RNA intermediate)
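The three mechanisms can be mimicked as string operations on a DNA sequence. This is a toy sketch only: the function names are invented, positions are chosen by the caller or at random, and the RNA intermediate of reverse transcription is not modelled.

```python
import random

BASES = "ACGT"


def point_mutation(seq, rng):
    """Replication error: substitute, delete, or insert one nucleotide.

    `seq` must be non-empty; `rng` is a random.Random instance.
    """
    pos = rng.randrange(len(seq))
    kind = rng.choice(["substitution", "deletion", "insertion"])
    if kind == "substitution":
        new_base = rng.choice([b for b in BASES if b != seq[pos]])
        return seq[:pos] + new_base + seq[pos + 1:]
    if kind == "deletion":
        return seq[:pos] + seq[pos + 1:]
    return seq[:pos] + rng.choice(BASES) + seq[pos:]


def duplicate_segment(seq, start, end):
    """Genomic duplication: add a second copy of seq[start:end] right after it."""
    return seq[:end] + seq[start:end] + seq[end:]


def retrotranspose(seq, start, end, target):
    """Copy seq[start:end] and paste the copy at position `target`."""
    return seq[:target] + seq[start:end] + seq[target:]


rng = random.Random(0)
genome = "AACCGGTT"
print(point_mutation(genome, rng))
print(duplicate_segment(genome, 2, 4))
print(retrotranspose(genome, 0, 2, 6))
```

After a duplication, the two copies can accumulate point mutations independently, which is the situation the slide describes for duplicate genes.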

Genome evolution

Main features: randomness and duplication

There is no genetics without “genetic drift.” The modern theory of mutations has clearly demonstrated that a code, which necessarily relates to a population, has an essential margin of decoding: not only does every code have supplements capable of free variation, but a single segment may be copied twice, the second copy left free for variation. In addition, fragments of code may be transferred from the cells of one species to another.

(Deleuze & Guattari: A Thousand Plateaus: Capitalism and Schizophrenia, tr. Brian Massumi)