Giab ashg webinar 160224
Transcript of Giab ashg webinar 160224
Genome in a Bottle Consortium February 24, 2016
Reference Materials for Human Genome Sequencing
Justin Zook, Ph.D. and Marc Salit, Ph.D., National Institute of Standards and Technology
Outline
• Genome in a Bottle (GIAB) products
• Current and future work
• Best practices for using GIAB products to benchmark variant calls
• Genome in a Bottle
– Open consortium to develop well-characterized genomes for benchmarking
– 100-150 public, private, and academic participants at workshops
GIAB Scope
• The Genome in a Bottle Consortium is developing the reference materials, reference methods, and reference data needed to assess confidence in human whole genome variant calls.
• Priority is authoritative characterization of human genomes.
GIAB steering committee, Aug 2015
Well-characterized, stable RMs
• Obtain metrics for validation, QC, QA, PT
• Determine sources and types of bias/error
• Learn to resolve difficult structural variants
• Improve reference genome assembly
• Optimization
• Enable regulated applications
Analytical Performance
• Use well-characterized genomic DNA reference materials to benchmark performance
• Tools to facilitate their use
– With the Global Alliance Data Working Group Benchmarking Team
Sample
gDNA isolation
Library Prep
Sequencing
Alignment/Mapping
Variant Calling
Confidence Estimates
Downstream Analysis
generic measurement process
High-confidence SNP/indel calls
• Methods to develop SNP/indel call set described in manuscript
• Broad and quick adoption of call set for benchmarking
– struck a nerve
Zook et al., Nature Biotechnology, 2014.
Candidate NIST Reference Materials
Genome                PGP ID    Coriell ID  NIST ID  NIST RM #
CEPH Mother/Daughter  N/A       GM12878     HG001    RM8398
AJ Son                huAA53E0  GM24385     HG002    RM8391 (son)/RM8392 (trio)
AJ Father             hu6E4515  GM24149     HG003    RM8392 (trio)
AJ Mother             hu8E87A9  GM24143     HG004    RM8392 (trio)
Asian Son             hu91BD69  GM24631     HG005    RM8393
Asian Father          huCA017E  GM24694     N/A      N/A
Asian Mother          hu38168C  GM24695     N/A      N/A
Note: RMs 8391 to 8393 are planned for release by end of Q2 2016
Dataset (columns: AJ Son, AJ Parents, Chinese son, Chinese parents, NA12878)
Illumina Paired-end: X X X X X
Illumina Long Mate pair: X X X X X
Illumina “moleculo”: X X X X X
Complete Genomics: X X X X X
Complete Genomics LFR: X X X
Ion exome: X X X X
BioNano: X X X X
10X: X X X
PacBio: X X X
SOLiD single end: X X X
Illumina exome: X X X X
Oxford Nanopore: X
Paper describing the data…
Data Release: Real-time, Open, Public Release
Individual Datasets
• Uploaded to GIAB FTP site as data are collected
• Includes raw reads, aligned reads, and variant/reference calls
• 12 datasets described in bioRxiv paper
Integrated High-confidence Calls
• Develop SNP, indel, and homozygous reference calls similar to NA12878
• Developing methods to form high-confidence calls for difficult variant types and regions
• Released calls are versioned
• Preliminary call-sets will be made available to be critiqued
SNP/Indel Integration Method Update
• Implementing refined integration methods
– Developed so others can readily reproduce results
– Consistent results for all GIAB genomes
– Simpler process taking advantage of best practices for each technology
• Validating with released NA12878 RM data
– Preliminary comparisons show minor changes
• Application to PGP trios
– Plan to analyze AJ trio by Q2 2016
– Release of NIST RMs in Q2 2016
– Develop calls for GRCh38
Proposed approach to form high-confidence SV (and non-SV) calls
• Generate Candidate Calls (Aug/Dec 2015)
• Compare/evaluate calls using Parliament/MetaSV/svclassify/others?; manual inspection (Aug 2015-Jan 2016)
• Integrate new and revised calls; manual inspection (Planning in Jan-Feb 2016)
• Combine integrated calls; manual inspection; targeted experimental validation? (Feb 2016 and beyond)
Preliminary comparisons of 17 Deletion Callsets: Sensitivity to calls in 2 technologies
NOTE: These are preliminary comparisons of data under active development and likely different from true sensitivity of callers
Preliminary comparisons of 17 Deletion Callsets: Difference between predicted size and median predicted size
NOTE: These are preliminary comparisons of data under active development and likely different from true size accuracy
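The size comparison named on this slide can be illustrated with a small Python sketch. It assumes the calls from different callsets have already been matched to the same deletion, which the slide does not describe; the function name `size_deviation_from_median` and the callset labels are hypothetical.

```python
from statistics import median


def size_deviation_from_median(predicted_sizes):
    """Return, per callset, predicted size minus the median predicted size.

    `predicted_sizes` maps a (hypothetical) callset label to the size in bp
    that the callset predicts for one deletion. Matching calls across
    callsets is assumed to have been done beforehand.
    """
    med = median(predicted_sizes.values())
    return {name: size - med for name, size in predicted_sizes.items()}


# Hypothetical sizes for a single deletion reported by three callsets.
example = {"callset_a": 1000, "callset_b": 1010, "callset_c": 950}
deviations = size_deviation_from_median(example)
```

A positive value means the callset predicts a larger deletion than the median across callsets, and a negative value a smaller one. As the note on the slide says, agreement with the median is not the same as true size accuracy.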
Preliminary comparisons of 17 Deletion Callsets: Number of unique calls
NOTE: These are preliminary comparisons of data under active development without filtering and unique calls may be correct
GeT-RM Browser from NCBI and CDC
• http://www.ncbi.nlm.nih.gov/variation/tools/get-rm/
• Allows visualization of data underlying each call
Global Alliance for Genomics and Health Benchmarking Task Team
Progress:
• Initial version of standardized definitions for performance metrics like TP, FP, and FN.
• Continued development of sophisticated benchmarking tools
– vcfeval – Len Trigg
– hap.py – Peter Krusche
– vgraph – Kevin Jacobs
• Standardized intermediate and final file formats
• Standardized bed files with difficult genome contexts for stratification
• github.com/ga4gh/benchmarking-tools
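As a minimal illustration of how TP, FP, and FN counts turn into summary metrics, the Python sketch below computes precision, recall (sensitivity), and F1. The function name `summarize_counts` is hypothetical and is not part of vcfeval, hap.py, or vgraph; the formulas are the standard ones, not the team's exact definitions.

```python
def summarize_counts(tp, fp, fn):
    """Return precision, recall, and F1 from TP/FP/FN counts.

    A metric whose denominator is zero is reported as None rather than guessed.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    if precision is None or recall is None or (precision + recall) == 0:
        f1 = None
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}
```

True negatives are left out on purpose, since they are hard to define for variant calls.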
Proposed Performance Metrics Definitions
• Define TP/FP/FN/TN in 4 ways depending on required stringency of match:
• Loose match: TP if within x-bp of a true variant
• Allele match: TP if ALT allele matches
• Genotype match: TP if genotype and ALT allele match
• Phasing match: TP if genotype, ALT allele, and phasing with nearby variants all match
• True negatives are difficult to define because an infinite number of potential alleles exist
Approaches to Benchmarking Variant Calling
• Well-characterized whole genome Reference Materials
• Many samples characterized in clinically relevant regions
• Synthetic DNA spike-ins
• Cell lines with engineered mutations
• Simulated reads
• Modified real reads
• Modified reference genomes
• Confirming results found in real samples over time
Challenges in Benchmarking Small Variant Calling
• It is difficult to do robust benchmarking of tests designed to detect many analytes (e.g., many variants)
• Easiest to benchmark only within high-confidence bed file, but…
• Benchmark calls/regions tend to be biased towards easier variants and regions
– Some clinical tests are enriched for difficult sites
• Challenges with benchmarking complex variants near boundaries of high-confidence regions
• Always manually inspect a subset of FPs/FNs
• Stratification by variant type and region is important
• Always calculate confidence intervals on performance metrics
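One way to follow the last two recommendations is to compute recall per stratum with a binomial confidence interval. The Python sketch below uses the Wilson score interval, a common choice that the slides do not prescribe. The function names and the stratum labels in the usage lines are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def recall_with_ci(strata):
    """Map each stratum label to (recall, lower bound, upper bound).

    `strata` maps a label to (tp, fn) counts; a stratum with no true
    variants raises ValueError instead of reporting a made-up value.
    """
    results = {}
    for label, (tp, fn) in strata.items():
        total = tp + fn
        low, high = wilson_interval(tp, total)
        results[label] = (tp / total, low, high)
    return results


# Hypothetical counts for two strata.
per_stratum = recall_with_ci({"SNP": (990, 10), "indel": (45, 5)})
```

Small strata give wide intervals, which is a useful reminder that a high point estimate in a difficult region may rest on very few benchmark variants.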
Benchmarking on PrecisionFDA
Acknowledgments
• FDA
• Many members of Genome in a Bottle
– New members welcome!
– Sign up on website for email newsletters
GIAB Steering Committee
– Marc Salit
– Justin Zook
– David Mittelman
– Andrew Grupe
– Michael Eberle
– Steve Sherry
– Deanna Church
– Francisco De La Vega
– Christian Olsen
– Monica Basehore
– Lisa Kalman
– Christopher Mason
– Elizabeth Mansfield
– Liz Kerrigan
– Leming Shi
– Melvin Limson
– Alexander Wait Zaranek
– Nils Homer
– Fiona Hyland
– Steve Lincoln
– Don Baldwin
– Robyn Temple-Smolkin
– Chunlin Xiao
– Kara Norman
– Luke Hickey
For More Information
www.genomeinabottle.org - sign up for general GIAB and Analysis Team google group emails
github.com/genome-in-a-bottle – Guide to GIAB data & ftp
www.slideshare.net/genomeinabottle
www.ncbi.nlm.nih.gov/variation/tools/get-rm/ - Get-RM Browser
Data: http://biorxiv.org/content/early/2015/09/15/026468
Global Alliance Benchmarking Team – https://github.com/ga4gh/benchmarking-tools
Twice yearly public workshops
– Winter at Stanford University, California, USA
– Summer at NIST, Maryland, USA
Justin Zook: [email protected]
Marc Salit: [email protected]