
Raymond Ripp, Julie D. Thompson, Frédéric Plewniak, Jean-Claude Thierry, Olivier Poch Laboratoire de BioInformatique et Génomique Intégratives du Département de Biologie et Génomique Structurales

IGBMC (CNRS – UMR 7104), 1 rue Laurent Fries, Illkirch 67404, Strasbourg France

http://www-bio3d-igbmc.u-strasbg.fr/Spine

The Identity Card associated with each target registered at EBI points to this web page, which shows the MACSIM files. Available features can be displayed on the multiple alignment: Pfam and structural domains, conserved blocks, secondary structures, low-complexity and transmembrane regions, functional sites, sequence errors, splicing variants, etc. Additional information, such as BLAST output, homologue descriptions and the phylogenetic tree, can easily be viewed or downloaded.

An integrated software platform for target selection and characterisation

References:

Lecompte, O., Thompson, J.D., Plewniak, F., Thierry, J. and Poch, O. (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene, 270, 17–30.

Plewniak, F., Bianchetti, L., Brelivet, Y., Carles, A., Chalmel, F., Lecompte, O., Mochel, T., Moulinier, L., Muller, A., Muller, J., Prigent, V., Ripp, R., Thierry, J.C., Thompson, J.D., Wicker, N. and Poch, O. (2003) PipeAlign: a new toolkit for protein family analysis. Nucleic Acids Res., 31, 3829–32.

Plewniak, F., Thompson, J.D. and Poch, O. (2000) Ballast: blast post-processing based on locally conserved segments. Bioinformatics, 9, 750–759.

Thompson, J.D., Plewniak, F., Thierry, J. and Poch, O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 15, 2919–2926.

Thompson, J.D., Plewniak, F., Thierry, J. and Poch, O. (2003) RASCAL: rapid scanning and correction of multiple sequence alignment programs. Bioinformatics, 19, 1155–61.

Thompson, J.D., Plewniak, F., Ripp, R., Thierry, J.C. and Poch, O. (2001) Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol., 4, 937–951.

Wicker, N., Perrin, G.R., Thierry, J.C. and Poch, O. (2001) Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol. Biol. Evol., 8, 1435–1441.

Thompson, J.D., Prigent, V. and Poch, O. (2004) LEON: multiple aLignment Evaluation Of Neighbours. Nucleic Acids Res., 32, 1298–307.

Wicker, N., Dembele, D., Raffelsberger, W. and Poch, O. (2002) Density of points clustering, application to transcriptomic data analysis. Nucleic Acids Res., 30, 3992–4000.

DNA and/or Proteome

PipeAlign

Database Searches
• BlastP on Swissprot, TrEmbl, PDB
• BlastX for missing ORFs
• TBlastN on complete genomes

DNA processing
• ORF location
• External programs (Glimmer, tRNAscan, CodonW...)

Initial Gscope database creation

Common steps

Clustering schemes

General Information Values
• ORF (overlap, length, …)
• GC content
• Codon usage
• Shine-Dalgarno presence
• Start codon (M, V or L)

Homologue Counts
• overall
• structures
• paralogues
• in complete genome

Sequence validation:
• Homolog detection agreement
• Validated start codon

Phylogenetic relationships:
• Gene cluster maintenance
• Gene losses
• Distance tree analysis

Structural information:
• Domain organisation
• Production in E. coli, Yeast
• Hydrophobicity index
• Hydrophobic helices…

Target identification:
• X-HDA analysis
• Validated GO (z score)
• Integrated synteny

Spine specific annotation:
• target characterisation…

Data cross-correlation
Predictions

Specialised steps

Gscope platform

MACSIM: (Multiple Alignment, Clustering and Selected Information Management)
• Integration of mined structural/functional information
• Graphical interface to access the information
• Cross-validation analysis and propagation

MACSIM functional annotation
Target: nuclear receptor coactivator 2 (NCOA-2)

[Figure: annotated multiple alignment of NCoA-2 with Clock, HIF-1, Single-minded, BMAL and NCoA-3. Labels in the figure include SWISSPROT domains (HLH DNA-binding, PAS, PAC), Pfam domains, ODD, NTAD, CTAD, ID, the NCoA-2 receptor-interacting domain, interaction with CREBBP, acetyltransferase (AT) activity, Poly-Gln, LXXLL motifs, and hydroxylation, S-nitrosylation and acetylation (by CREBBP) sites.]

Spine Targets Website at IGBMC

PipeAlign is a five-step process, ranging from the search for sequence homologues in protein and 3D structure databases to the definition of the hierarchical relationships within and between subfamilies.

General overview of Gscope

Gscope is our high-throughput integration and analysis platform. It supports genome investigation and database searches, and automatically runs various analyses and correlations. It allows data management and visualisation through a user-friendly interface. Homology-based protein analysis relies on the validated multiple alignment of complete sequences computed by DbClustal within the PipeAlign process. Besides the analysis of isolated sequences, Gscope provides clustering schemes for sets of nucleic or peptidic sequences, focusing especially on structural genomics insights.

The first step of Gscope is to create a database containing the basic information for each sequence of the project. Database searches are then performed using each protein sequence. This defines the sets of orthologs used in further analysis.

LMS (local maximum segments)

BlastP search

Plewniak et al. (2000) Bioinformatics.

Ballast Anchors
DbClustal Alignment
Query Sequence
Anchors

Thompson et al. (2000) Nucl Acids Res.

Scanned and Corrected MACS (Multiple Alignment of Complete Sequences)

Thompson et al. (2003) Bioinformatics.

• Secator/DPC: automatic clustering algorithms
Wicker et al. (2001) Mol Biol Evol. Wicker et al. (2002) Nucl Acids Res.

PipeAlign: automatic protein family analysis

Thompson et al. (2004) Nucl Acids Res.

Homologous proteins

Thompson et al. (2001) J Mol Biol.

• OrdAli: Ordered Alignment analysis of differentially conserved residues with automatic visualization on structure

Strictly/mostly conserved (black, grey)

Conserved between groups (red + yellow = orange)

Conserved within group (red, yellow, blue)

http://bips.u-strasbg.fr/PipeAlign/

Plewniak et al. (2003) Nucleic Acids Res.
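The conservation categories in the OrdAli legend above (strictly conserved, conserved between groups, conserved within a group) can be illustrated with a small column classifier. This is only a minimal sketch under stated assumptions, not OrdAli's actual algorithm: the function name `classify_columns`, the label strings and the toy sequences are all hypothetical, and "conserved" is taken to mean strict identity with no gaps.

```python
def classify_columns(alignment, groups):
    """Label each alignment column by its conservation pattern.

    alignment: dict of sequence name -> aligned sequence ('-' marks a gap),
               all sequences having the same length.
    groups:    dict of group name -> list of sequence names (hypothetical
               subfamilies, e.g. as produced by a clustering step).
    Returns one label per column: 'strict', 'between_groups',
    'within_group' or 'variable'.
    """
    length = len(next(iter(alignment.values())))
    labels = []
    for col in range(length):
        column = {name: seq[col] for name, seq in alignment.items()}
        residues = set(column.values())
        # Same residue in every sequence, no gap: strictly conserved.
        if len(residues) == 1 and "-" not in residues:
            labels.append("strict")
            continue
        # Record the residue of each group that is internally conserved.
        conserved = {}
        for group, members in groups.items():
            group_residues = {column[m] for m in members}
            if len(group_residues) == 1 and "-" not in group_residues:
                conserved[group] = group_residues.pop()
        if not conserved:
            labels.append("variable")
        elif len(set(conserved.values())) < len(conserved):
            # At least two groups share the same conserved residue.
            labels.append("between_groups")
        else:
            labels.append("within_group")
    return labels


# Toy data (invented for illustration only).
alignment = {
    "a1": "MKLA", "a2": "MKLC",
    "b1": "MKVD", "b2": "MKVE",
    "c1": "MRVF", "c2": "MRIG",
}
groups = {"red": ["a1", "a2"], "yellow": ["b1", "b2"], "blue": ["c1", "c2"]}
print(classify_columns(alignment, groups))
```

A real analysis would normally tolerate near-identity (the "mostly conserved" case in the legend) by using a threshold or a substitution matrix instead of strict identity.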

MACSIM highlights the following features: functional residues, domain organisation, 3D structure environment, mutagenesis experiments and comparative genomics (phylogenetic distribution).

Abstract. In order to fully understand the potential biomedical role of a target protein, such diverse data as the type of organism, domain organisation, splicing variants, 2D/3D structures and mutations and their associated illnesses, must be organised into an information network for presentation to the experimentalist. The Gscope genomic annotation and analysis platform has been developed to allow automatic, high-throughput data collection, cross-validation and analysis of such heterogeneous information in a single, integrated environment. The integration of the protein in the context of the complete family is the essential first step in this process. Gscope therefore incorporates the PipeAlign protein family analysis toolkit in order to construct high quality, clustered multiple alignments of a potential target and its homologues identified by in-depth database searches. This provides the basis for the definition of the hierarchical relationships within and between subfamilies and for the reliable integration of all the structural and functional information available for the protein family. A new program, MACSIM, has been developed whose primary goal is to validate the quality of the data mined from the public databases and to propagate this information to the target of interest. The Gscope platform has been used to perform PipeAlign analyses of all targets in the SPINE target database and an “identity card” has been created for each potential target. These “identity cards” are provided in XML format and are accessible via the SPINE Web Site at http://www-bio3d-igbmc.u-strasbg.fr/Spine/.

Gscope integrates tools for the design, ordering and database management of oligos, inserts, PCR products and recombinants, and for sequence verification. These data can easily be sent to any Laboratory Information Management System.

MACSIM performs propagation of pertinent information from sets of well-annotated sequences to their homologs detected within the multiple alignment.
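The idea of propagating annotation through an alignment can be sketched as a position mapping between two aligned rows. The code below is an illustrative assumption, not MACSIM's implementation: `propagate_features`, the feature labels and the toy sequences are hypothetical, and a simple residue-identity flag stands in for MACSIM's validation of mined data.

```python
def propagate_features(annotated_row, target_row, features):
    """Transfer site annotations from an annotated sequence to a target.

    annotated_row, target_row: two rows of the same multiple alignment
                               ('-' marks a gap).
    features: dict mapping 1-based residue positions in the annotated
              sequence (gaps not counted) to a label.
    Returns a dict mapping 1-based target positions to
    (label, residue_is_identical).
    """
    if len(annotated_row) != len(target_row):
        raise ValueError("both rows must come from the same alignment")
    propagated = {}
    src_pos = tgt_pos = 0
    for src_res, tgt_res in zip(annotated_row, target_row):
        # Advance each ungapped coordinate only on a real residue.
        if src_res != "-":
            src_pos += 1
        if tgt_res != "-":
            tgt_pos += 1
        # A feature is transferred only when both rows have a residue here.
        if src_res != "-" and tgt_res != "-" and src_pos in features:
            propagated[tgt_pos] = (features[src_pos], src_res == tgt_res)
    return propagated


# Toy rows and labels (invented for illustration only).
annotated = "MK-LHRLLQ"
target = "M-ALHKLLQ"
features = {2: "site aligned to a gap", 3: "motif start", 5: "modified residue"}
print(propagate_features(annotated, target, features))
```

In this toy case the feature at annotated position 2 is dropped because it faces a gap in the target, and the feature at position 5 is transferred but flagged as not identical, which is the kind of case a validation step would need to examine.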

All Spine Targets are registered at the EBI Web Site: http://www.ebi.ac.uk/msd-srv/msdtarget

Their description and characterisation, task status and associated information are hosted and updated at EBI. The Identity Card link points to the IGBMC Spine Target Website.

The XML files of the Spine Targets were downloaded from http://www.ebi.ac.uk/ and processed by Gscope at IGBMC using their protein sequence, gene name and definition. Results are hosted at the IGBMC Website and protected by a login and password. Specific additional analyses can be carried out on request, in particular DNA sequence analysis (codon usage, GC content, chromosome localisation, ...).
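As an illustration of the kind of DNA sequence values mentioned above (GC content, codon usage, start codon), here is a minimal sketch. It is not Gscope code: the function names are hypothetical, and the mapping of the start codons M, V and L to ATG, GTG and TTG is an assumption based on the usual bacterial start codons.

```python
from collections import Counter


def gc_content(dna):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    bases = [b for b in dna.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def codon_usage(orf):
    """Count codons in an ORF read in frame from its first base.

    Trailing bases that do not form a complete codon are ignored.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))


def start_codon_class(orf):
    """Return 'M', 'V' or 'L' for ATG, GTG or TTG starts, else None.

    Assumption: these three codons are the ones meant by
    "Start codon (M, V or L)".
    """
    return {"ATG": "M", "GTG": "V", "TTG": "L"}.get(orf[:3].upper())


orf = "ATGGCGTAA"  # toy ORF, invented for illustration
print(gc_content(orf), codon_usage(orf), start_codon_class(orf))
```

Dedicated tools such as CodonW, which the poster lists among the external programs, compute richer codon usage statistics than these raw counts.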

Spine Targets at EBI

Determination of the Protein Constructs and Oligo Design

Insert Sequence Verification

Oligonucleotides ordering