Genome Analysis by Filtering

Charlene Son-Rigby

Variant Miner

Omicia Score and Other Variant Scores

Evidence

Transcripts and Variant Consequences

Filtering in Variant Miner

Reset Filters

Export

Gene Summary Dialog

 

 

Variant Miner

Click on the Variant Report associated with your genome to launch the Variant Miner tool. Variant Miner is a powerful data-mining tool for evaluating a genome’s annotation using a set of filtering criteria and biological context to identify variants of interest. Variant Miner provides a list of variants processed by the Opal Annotation Engine.

 

Variant Miner Fields

Review Priority A visual prioritization of variants based on three data elements: ClinVar, Allele Frequency and Effect. See Appendix 3 for more details.

Reports generated from Opal Pipeline 4.3 and below data use the Variant Classification (previously Predicted Class) field. See Appendix 7 for more details.

Gene HGNC (Hugo Gene Nomenclature Committee) symbol of the gene.
Position Chromosome and base pair position.
dbSNP dbSNP identifier, if one exists, with an embedded link to dbSNP.
Change Reference position and alleles reported in the sample genome. In addition, the HGVS notation for the nucleotide and protein change (if any) for a representative transcript.
Effect Impact of the variant on the gene and transcripts; i.e. synonymous, non-synonymous, stop gain/loss, indel/frameshift, and splice variants. Clicking on Effect provides a list of transcripts, as well as splice site confidence analysis from NNSplice. See the Transcripts and Variant Consequences section for more information on transcripts.
Zygosity Genotype of the variant (homozygous or heterozygous).
Quality Phred-like base level quality value as reported in the variant file.
GQ Confidence in the genotype assignment as reported in the variant file.
Coverage Unique read coverage at that position (total: reference: variant). 

For heterozygous variants, coverage is color coded to highlight allelic balance:

  • Black: variant fraction 45 – 55%
  • Orange: variant fraction 30-44% or 56-70%
  • Red: variant fraction <30% or >70%

These thresholds are based on analysis of allele balance and false positives (Pirooznia et al., 2014).
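The color bands above can be sketched as a small function. This is only an illustration, not Opal's implementation: the function name and arguments are hypothetical, and rounding to a whole percent is an assumption made because the documented bands are stated in whole percentages.

```python
def coverage_color(total_reads: int, variant_reads: int) -> str:
    """Return the display color for a heterozygous variant's coverage.

    Hypothetical helper: the name, arguments and rounding rule are
    assumptions for illustration, not part of Opal.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Assumption: the variant fraction is rounded to a whole percent,
    # since the documented bands are given in whole percentages.
    pct = round(100 * variant_reads / total_reads)
    if 45 <= pct <= 55:
        return "black"
    if 30 <= pct <= 70:
        return "orange"
    return "red"
```

For example, a coverage of 100 total reads with 44 variant reads falls into the orange band.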

1KG AF
EVS AF
ExAC AF
Frequencies from 1000 Genomes Project, Exome Variant Server and ExAC. Click on the hyperlinks to access the ethnic subpopulation frequencies.
Omicia Score Proprietary impact assessment score that provides a rational aggregation of other variant scoring algorithms (Coonrod et al, 2013). Values range from 0 to 1, with higher values indicating a greater likelihood of deleteriousness (see Appendix 2). The colored squares underneath the Omicia score denote the individual evaluations of the underlying component scores. The range is Red-Yellow-Green, with red deleterious and green benign.
VVP Score The VAAST Variant Prioritization (VVP) score applies the VAAST algorithm at the variant level. VAAST takes predicted protein impact, conservation and allele frequency into consideration in its deleteriousness assessment. VVP provides normalized scores for variants in genes, enabling direct comparison of variants in conserved and polymorphic genes. VVP scores are provided for coding, non-coding, and intergenic variant categories. Scores are normalized so that they are comparable within each category.
CADD Score The CADD score combines information from 63 different annotations including PhastCons, GERP, PhyloP, SIFT and PolyPhen, using a support vector machine classifier (Kircher et al, 2013). It measures deleteriousness by using observed variant frequency as the basis for its calculation. The C score ranges from 1 to 99, with a higher score indicating greater deleteriousness. Values >= 10 are predicted to be the 10% most deleterious substitutions, >= 20 indicate the 1% most deleterious.
Evidence Literature evidence gathered from ClinVar, OMIM, COSMIC, Locus Specific Databases and GWAS. Click relevant colored button to see evidence text.
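
The CADD thresholds in the field list above follow from the PHRED-style scaling of the C score, where the scaled score is -10 * log10(rank / total). The sketch below shows why a score of 10 corresponds to the top 10% and a score of 20 to the top 1%. The function names are illustrative and are not part of Opal or CADD.

```python
import math


def phred_from_rank(rank: int, total: int) -> float:
    """PHRED-style scaling: -10 * log10(rank / total).

    rank is the variant's position when all substitutions are ordered
    from most to least deleterious (1 = most deleterious).
    """
    return -10 * math.log10(rank / total)


def top_fraction(c_score: float) -> float:
    """Fraction of substitutions scoring at least as high as c_score."""
    return 10 ** (-c_score / 10)
```

Under this scaling, each additional 10 points corresponds to a tenfold smaller fraction of substitutions.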

Beacon

When a variant has not been observed in any of the reference populations, a beacon icon will appear in the allele frequency field. Beacon is a project launched by the Global Alliance for Genomics and Health (GA4GH). Beacon allows labs to share de-identified information at the variant level to determine whether they have patients with the same variant. A number of reference sites including NCBI, Sanger, the Institute of Systems Biology and UCSC have launched public web services. Clicking on the beacon icon in Opal will send a query to each of the Beacon reference sites asking whether they have any genomes with that variant. The only information exchanged is the chromosome, position and variant.

More information on Beacon is available at the GA4GH Beacon project site.

 

Omicia Score and Other Variant Scores

The Opal Annotation Engine annotates each variant with 19 individual scores that assess deleteriousness based on protein impact, conservation or a combined approach.

Variant Miner shows the Omicia Score, a proprietary impact score that provides a rational aggregation of other variant scoring algorithms: PolyPhen-2, SIFT, PhyloP and Mutation Taster. It has been demonstrated to be more accurate across more cases than the individual scores, using a 10,000 genome HGMD test set (Coonrod et al, 2013). More information is provided in Appendix 2.

You can access the individual scores used to compute the Omicia score, as well as the complete set of annotated scores for each variant by clicking on the variant’s Omicia score in the Variant Miner table.

 

Scores Used to Compute Omicia Score

 

Score Description
PolyPhen-2 PolyPhen predicts the possible impact of an amino acid substitution on the structure and function of a human protein using physical and comparative considerations, including interference with ligand binding sites (Adzhubei et al., 2010). It produces one of three calls for non-synonymous variants: benign, possibly damaging, or probably damaging. The output is a 0 to 1 score.
Mut-Taster MutationTaster employs a Bayes classifier to predict the disease potential of an alteration (Schwarz et al., 2010). The output of three different models for synonymous/intronic changes, single amino acid changes and complex changes is entered into the Bayesian model. The prediction of disease causing or polymorphism is either from the Bayesian model or based on external data, which is noted in the classification. Pipeline 4 also provides the prediction confidence in parentheses, expressed as a 0 to 1 value with values closer to 1 indicating higher confidence.
SIFT SIFT scores assess tolerance of amino-acid changes in protein function, by aligning homologous protein sequences using PSI-BLAST. SIFT p-values below 0.05 indicate that the change is likely deleterious. Lower scores indicate damaging variants (0 to 1 scale).
PhyloP PhyloP scores measure conservation at each base (Siepel et al., 2006), detecting both acceleration (faster evolution than expected under neutral drift) and conservation (slower than expected evolution). The score is the -log(p-value) under a null hypothesis of neutral evolution; positive values indicate conservation and negative values indicate faster-than-expected evolution. Values range between -11.764 and +6.424. The PhyloP Placental, Primate and Vertebrate alignments are all used to calculate the Omicia Score.

 

Splice Site Predictions

Algorithm Description
NNSplice NNSplice is an improved splice site predictor based on a generalized Hidden Markov Model (GHMM) that describes the grammar of a legal parse of a multi-exon gene in a DNA sequence. Reese M.G., Eeckman F.H., Kulp D., Haussler D., 1997. 'Improved Splice Site Detection in Genie'. J Comp Biol 4(3), 311-23.
GeneSplicer A flexible system for detecting de novo splice sites in genomic DNA using a decision tree model, which is enhanced with a Markov model that captures additional dependencies among neighboring bases. M. Pertea, X. Lin, S. L. Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res. 2001 Mar 1;29(5):1185-90. https://ccb.jhu.edu/papers/genesplicer.pdf
MaxEntScan MaxEntScan is a tool to detect splice sites. It is based on an approach for modeling short sequence motifs that simultaneously accounts for non-adjacent as well as adjacent dependencies between positions. This method is based on the 'Maximum Entropy Principle' and generalizes most previous probabilistic models of sequence motifs such as weight matrix models and inhomogeneous Markov models. Yeo G. and Burge C.B., Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals, RECOMB 2003 (Journal Comp. Bio in press). Paper: http://cbcl.mit.edu/cbcl/publications/ps/yeo-burge.pdf

For each splicing score, NNSplice, GeneSplicer, and MaxEntScan, a delta is calculated, which is the difference between the scores for the ref and alt allele. A negative delta value indicates possible splice site loss and a positive delta value indicates possible splice site gain. A color indicating severity based on the absolute value of the delta is then assigned, as follows:

Raw NNSplice scores range from 0 to 1. The score is colored red if the absolute value of the delta is greater than 0.5, yellow if it is between 0.25 and 0.5, and green if it is less than 0.25.

Raw MaxEntScan scores range from approximately -35 to 20. The score is colored red if the absolute value of the delta is greater than 7, yellow if it is between 3.5 and 7, and green if it is less than 3.5.

Raw GeneSplicer scores range from approximately 0 to 25. The score is colored red if the absolute value of the delta is greater than 3, yellow if it is between 1.5 and 3, and green if it is less than 1.5.
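
The delta and color rules above can be summarized in a short sketch. It assumes the delta is computed as the alt score minus the ref score, which is consistent with a negative delta indicating splice site loss. The names and the handling of values that fall exactly on a threshold are assumptions for illustration, not Opal's implementation.

```python
# (yellow_min, red_min) thresholds on |delta| for each predictor,
# taken from the ranges described in the text.
THRESHOLDS = {
    "NNSplice": (0.25, 0.5),
    "MaxEntScan": (3.5, 7.0),
    "GeneSplicer": (1.5, 3.0),
}


def splice_delta(tool: str, ref_score: float, alt_score: float):
    """Return (delta, color, direction) for one splice predictor.

    Assumption: delta = alt - ref, so a negative delta means the
    variant allele scores lower than the reference (possible loss).
    """
    yellow_min, red_min = THRESHOLDS[tool]
    delta = alt_score - ref_score
    magnitude = abs(delta)
    if magnitude > red_min:
        color = "red"
    elif magnitude >= yellow_min:  # boundary handling is an assumption
        color = "yellow"
    else:
        color = "green"
    if delta < 0:
        direction = "loss"
    elif delta > 0:
        direction = "gain"
    else:
        direction = "none"
    return delta, color, direction
```

For instance, an NNSplice reference score of 0.9 and a variant score of 0.2 give a delta of about -0.7, which is colored red and suggests possible splice site loss.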

Additional Variant Scores

Score Description
phyloP – Placental Values between -12.709 and +2.941. Positive scores indicate conservation; negative scores fast-evolution.
phyloP – Primate Values between -9.065 and +0.655. Positive scores indicate conservation; negative scores fast-evolution.
phyloP – Vertebrate Values between -11.764 and +6.424. Positive scores indicate conservation; negative scores fast-evolution.
PhastCons Conservation score based on a Phylogenetic Hidden Markov Model. Values between 0 and 1. The larger the score, the more conserved the site.
VAAST Variant Prioritization (VVP) The VAAST Variant Prioritization (VVP) provides a variant-specific VAAST score that incorporates allele frequency, amino acid substitution severity, and phylogenetic conservation into a unified score that is normalized to a given feature such as a gene or genomic bin. The output is a 0 to 100 score, with higher scores indicating damaging variants.
CADD – PHRED The CADD score combines information from 63 different annotations including PhastCons, GERP, PhyloP, SIFT and PolyPhen, using a support vector machine classifier (Kircher et al, 2013). It measures deleteriousness by using observed variant frequency as the basis for its calculation. The C score ranges from 1 to 99, with a higher score indicating greater deleteriousness. Values >= 10 are predicted to be the 10% most deleterious substitutions, >= 20 indicate the 1% most deleterious.
CADD – Raw Values between -400 to ~150. Raw values have relative meaning, with higher values indicating that a variant is more likely to have deleterious effects.
GERP++ – NR A conservation score measuring the number of substitutions expected under neutral drift minus the observed number of substitutions. Values between -15 and 10. The larger the score, the more conserved the site.
GERP++ – RS A conservation score measuring the number of substitutions expected under neutral drift minus the observed number of substitutions. Values between -15 and 10. The larger the score, the more conserved the site.
SiPhy A conservation score that takes the type of mutation into account. SiPhy scores are from dbNSFP, and are on the log odds scale, with most scores ranging between 0 and 20. Higher scores indicate higher conservation.
MutationTaster See Scores Used to Compute Omicia Score table above
MutationAssessor Uses multiple sequence alignments of homologous proteins to assess the deleteriousness of a mutation. Scores range from approximately -5 to +5, with high scores indicating damaging.
Polyphen-2 – HVAR PolyPhen-2 score based on HumVar.
Polyphen-2 – HDIV PolyPhen-2 score based on HumDiv.
RadialSVM Composite score based on 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. From dbNSFP. Higher score indicates more damaging mutations.
LRT A likelihood ratio test using an alignment of 32 vertebrate species’ genomes. The score is a p-value. P-values <0.001 indicate damaging variants.
LR Composite score based on 10 component scores (SIFT, PolyPhen-2 HDIV, PolyPhen-2 HVAR, GERP++, MutationTaster, Mutation Assessor, FATHMM, LRT, SiPhy, PhyloP) and the maximum frequency observed in the 1000 genomes populations. From dbNSFP. Higher score indicates more damaging mutations.
SIFT See Scores Used to Compute Omicia Score table above
FATHMM “Functional Analysis through Hidden Markov Models”. Scores range between approximately -15 and +10, with lower scores indicating more damaging mutations.

Evidence

The Evidence column provides access to evidence gathered from ClinVar, OMIM, COSMIC, Locus Specific Databases and GWAS. When you click on a colored evidence button, the resulting window provides information from the source database which may include classification, mode of inheritance, abstracts, etc. Evidence is matched to a variant in one of the following ways:

  • Matches allele: Matches the position and change of the sample variant
  • Matches position: Matches the sample variant position
  • Matches codon: Matches the same codon (missense variants only)
  • Matches amino acid: Matches the same amino acid (missense variants only)
  • Overlap: Overlaps the sample variant position, where the sample variant or evidence variant is an indel

Transcripts and Variant Consequences

Clicking on the Effect for any variant will display a table of Variant Consequences. This table lists the potential transcripts with related protein change and effect, and denotes the canonical transcript from Ensembl. Where the mappings are provided by Ensembl, the CCDS and RefSeq accessions corresponding to the canonical transcript are listed.

NNsplice is used for prediction of disruption or creation of splice sites. The Effect modal displays NNsplice scores for the reference and variant alleles, on a scale of 0 to 1, where 1 is strongest likelihood of the existence of a splice site. The difference or delta between the reference and alt scores indicates how likely it is that a splice site is disrupted or created. 

In the Interpret Variants table Effect column, "splice site impact" will be displayed if the absolute value of the delta between the reference and variant allele scores is:

  • >0.5: "splice site impact" highlighted in red
  • >0.25: "splice site impact" highlighted in yellow

Filtering in Variant Miner

Variant Miner enables you to interactively filter the list of displayed variants. When the filters to the left of the variant table are changed, the display will dim briefly while the results are updating. Record count is displayed at the bottom of the table.

Your active filters are listed at the bottom left under the available filters.

The sections below describe the filtering options available in Variant Miner.

 

Filtering Protocols

You can dynamically apply Filtering Protocols to data in Variant Miner. Filtering Protocols are saved sets of filters that help standardize genomics analysis. For example, in the case of an early childhood, highly penetrant disease, a filtering protocol excluding common variants and requiring non-synonymous variants with high likelihood of being deleterious (e.g. SIFT <0.1 or PolyPhen probably/possibly damaging or Omicia score >0.75) could be applied. 
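
The example protocol above could be expressed as a predicate like the one below. This is a minimal sketch, not how Opal stores or evaluates protocols. The `Variant` fields, the effect label, and the 1% allele frequency cutoff for "common" are all assumptions for illustration. Variants without frequency data are kept, in line with the allele frequency filter behavior described later in this article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    effect: str
    max_allele_frequency: Optional[float]
    sift: Optional[float] = None
    polyphen_call: Optional[str] = None
    omicia: Optional[float] = None


def passes_protocol(v: Variant, max_af: float = 0.01) -> bool:
    """Rare, non-synonymous, and likely deleterious by at least one score.

    max_af=0.01 is an assumed cutoff for "common"; the article does not
    state a value.
    """
    # Variants with no frequency data are not excluded.
    is_rare = v.max_allele_frequency is None or v.max_allele_frequency <= max_af
    is_non_synonymous = v.effect == "non-synonymous"
    likely_deleterious = (
        (v.sift is not None and v.sift < 0.1)
        or v.polyphen_call in {"probably damaging", "possibly damaging"}
        or (v.omicia is not None and v.omicia > 0.75)
    )
    return is_rare and is_non_synonymous and likely_deleterious
```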

More information on creating and editing Filtering Protocols is provided in Managing Filtering Protocols.

Gene Filters

Allows you to select variants that match any single filter specified, or require that they match all of the specified filters.

Symbol

Show only variants within the specified gene or genes, using HGNC gene symbol notation. Multiple gene symbols can be entered separated by spaces. If you delete or change the gene symbol(s) in the field, press the Enter key to update the variant table.

Gene Panel

Select a Gene Panel built in Panel Builder from the dropdown menu to include variants found in genes from the gene panel.

Gene Set

Select a Gene Set from the dropdown menu to include variants found in genes from the gene set.

 

Zygosity Filters

For Solo, Family and Panel Trio tests, zygosity filtering is available for each family member. Each of the dropdown menus has the options Any, Heterozygous, or Homozygous.

RSID Filter

Filter by RSID number.

Location Filters

Filter by chromosome and by location within the chromosome. You can also filter for regions of homozygosity larger than 5 Mb, 3 Mb, or 1 Mb.

Quality Filters

This menu provides several filtering options for quality metrics, which allow you to focus on higher quality variants by raising the lower bound, or reduce highly redundant reads that may indicate a misassembly by lowering the upper bound.  The filters are presented as text entry boxes to set minimum and maximum cutoffs.

The Match option allows you to select variants that match any single filter specified, or require that they match all of the specified filters.

Coverage

Focus on higher quality variants by raising the lower bound (recommended minimum values range from 6 to 15) or reduce highly redundant reads, which may indicate a misassembly, by lowering the upper bound (e.g. 2-3 standard deviations over coverage mean).

Coverage:   min and max values encountered in VCF file
Quality:   min and max values encountered in VCF file
Genotype Quality:   min and max values encountered in VCF file
Allele Balance: For heterozygous variants, allows you to display variants with any allelic balance, or require that the balance be either between 0.45-0.55 or between 0.30-0.70.
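
The minimum/maximum cutoffs and the Match option can be pictured as follows. This is a sketch under stated assumptions: the dictionary keys are hypothetical, bounds are treated as inclusive, and a variant with a missing value is treated as failing that filter. None of this is confirmed by the article.

```python
from typing import Dict, Optional, Tuple

Bounds = Tuple[Optional[float], Optional[float]]  # (minimum, maximum)


def in_range(value: Optional[float], lo: Optional[float], hi: Optional[float]) -> bool:
    """True if value lies within the inclusive bounds; None means unbounded."""
    if value is None:
        return False  # assumption: missing metrics fail the filter
    if lo is not None and value < lo:
        return False
    if hi is not None and value > hi:
        return False
    return True


def matches(variant: Dict[str, float], filters: Dict[str, Bounds], mode: str = "all") -> bool:
    """Apply range filters with Match 'any' or Match 'all' semantics."""
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    if not filters:
        return True  # no active filters: every variant is shown
    results = [in_range(variant.get(name), lo, hi) for name, (lo, hi) in filters.items()]
    return all(results) if mode == "all" else any(results)
```

With a minimum coverage of 10 and a minimum quality of 30, a variant with coverage 20 and quality 15 passes under Match any but not under Match all.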

 

Evidence Filters

ClinVar Classification

These checkbox options filter variants based on their classification records in ClinVar. Records are considered if the record matches the variant allele or matches the variant amino acid. If none of these options are selected variants are shown regardless of their ClinVar records.

Pathogenic
Show variants with exclusively 'pathogenic' records in ClinVar (red ClinVar dot in Review Priority).
Likely Pathogenic
Show variants with a combination of different records including at least one 'likely pathogenic' or 'pathogenic' record (red ClinVar dot in Review Priority).
Uncertain Significance
Show variants with exclusively 'uncertain significance' records or variants with at least one 'conflicting interpretations of pathogenicity' but no 'likely pathogenic' or 'pathogenic' (yellow ClinVar dot in Review Priority).
Likely Benign
Show variants with a combination of records including at least one 'likely benign' or 'benign' record (green ClinVar dot in Review Priority).
Benign
Show variants with exclusively 'benign' records (green ClinVar dot in Review Priority).
Associated
Show variants that have at least one record with the clinical significance of 'association', 'confers sensitivity', 'drug response', 'protective' or 'risk factor', but no record with 'pathogenic', 'likely pathogenic', 'likely benign' or 'benign' records (yellow ClinVar dot in Review Priority).
In ClinVar, No Class
Show variants that are included in ClinVar but have only records with a clinical significance of 'not provided' (gray ClinVar dot in Review Priority).

See Appendix 3 for specific details.
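
One way to read the checkbox rules above is as a function from a variant's ClinVar records to the set of filter options it would satisfy. The sketch below is an interpretation only, and Appendix 3 remains the authoritative description. It assumes that "a combination of different records" means any record set that is not exclusively 'pathogenic' (or exclusively 'benign'). It does not impose a precedence between options, so a variant with both pathogenic and benign records satisfies two options here.

```python
from typing import Iterable, Set

PATHOGENIC = {"pathogenic", "likely pathogenic"}
BENIGN = {"benign", "likely benign"}
ASSOCIATED = {"association", "confers sensitivity", "drug response",
              "protective", "risk factor"}
CONFLICTING = "conflicting interpretations of pathogenicity"


def clinvar_filters(records: Iterable[str]) -> Set[str]:
    """Return the ClinVar Classification options a variant would satisfy.

    Illustrative interpretation of the rules in the text, not Opal's code.
    """
    sig = {r.strip().lower() for r in records}
    matched: Set[str] = set()
    if not sig:
        return matched
    if sig == {"pathogenic"}:
        matched.add("Pathogenic")
    elif sig & PATHOGENIC:
        matched.add("Likely Pathogenic")
    if sig == {"benign"}:
        matched.add("Benign")
    elif sig & BENIGN:
        matched.add("Likely Benign")
    if sig == {"uncertain significance"} or (CONFLICTING in sig and not sig & PATHOGENIC):
        matched.add("Uncertain Significance")
    if sig & ASSOCIATED and not sig & (PATHOGENIC | BENIGN):
        matched.add("Associated")
    if sig == {"not provided"}:
        matched.add("In ClinVar, No Class")
    return matched
```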

Supporting Evidence

Show only variants that have supporting evidence in OMIM and/or ClinVar, or show only variants that have Any type of supporting evidence (ClinVar, OMIM, GWAS, etc.).

Allele Frequency Filters

The Match option allows you to select variants that match any single filter specified, or require that they match all of the specified filters.

1KG, EVS, and ExAC

Frequency of the variant within a reference population. For 1K Genomes, the dbSNP global MAF table is used. This setting allows you to filter out higher frequency or “common” variants by lowering the upper bound. Note: this filter does not exclude variants when MAF data is not available.

Impact Score Filters

The Match option allows you to select variants that match any single filter specified, or require that they match all of the specified filters. 

Omicia Score, SIFT, CADD, or VVP

You can enter a value for each filter within the following ranges:
Omicia Score:   0 - 1
Sift Score:   0 - 1
VVP Score:   0 - 100
CADD Score:   0 - 99

Variant Effect Filters

Variant Type 

Select to show only Protein Changing variants and/or variants with Regulatory consequences. If both options are selected variants have to meet both requirements to pass the filter. Protein Changing variants can be further specified by selecting one or multiple variant types, e.g. Missense, Splice Regions, etc. The Regulatory consequences are generated from Ensembl annotations.

Gene Models

Show only variants overlapping CCDS and/or RefSeq gene models.

Exclude

The Exclude menu provides filtering options to exclude variants from the report.

Genome Region
Exclude variants in Introns and/or Intergenic Regions.
Variant Type
Exclude variants with dbSNP Hits to show only novel variants.
VCF Filters
Filter failures
Exclude variants that have a value other than '.' (missing) or 'PASS' in the FILTER field of the VCF file.
Filter unspecified
Exclude variants that have the value '.' (missing) in the FILTER field of the VCF file.
Polymorphic Genes
Exclude variants in highly polymorphic genes
Non-Coding Genes
Exclude variants in non-coding genes (default)
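
The two VCF Filters options in the list above depend only on the FILTER column of each VCF record. A minimal sketch of that logic follows; the function and parameter names are hypothetical.

```python
def keep_variant(filter_field: str,
                 exclude_failures: bool = False,
                 exclude_unspecified: bool = False) -> bool:
    """Decide whether a variant stays in the report based on its VCF FILTER value.

    exclude_failures: drop anything other than '.' or 'PASS'.
    exclude_unspecified: drop records whose FILTER is '.' (missing).
    """
    value = filter_field.strip()
    if exclude_unspecified and value == ".":
        return False
    if exclude_failures and value not in {".", "PASS"}:
        return False
    return True
```

With both options enabled, only records whose FILTER value is 'PASS' remain.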

Include Gene Sets

Display variants that are in a specified Gene Set. Only workspace Gene Sets are displayed. 

Exclude Gene Sets

Exclude variants in a specified Gene Set. This can be very useful, for example, to exclude highly polymorphic genes. These gene sets are in the Exclude category of Gene Sets, and can be edited in the Gene Sets module (see Managing Gene Sets).

Unique Variants

Unique Variants allows you to compare one or more genomes to identify unique variants in the current sample. This can be helpful, for example, to eliminate variants that have no significance to the phenotype under study when there is genome data from affected and/or unaffected individuals. 

The Find Unique Variants modal lists genomes available for comparison. You can select genomes from other projects by changing the project in the pull-down menu. To select a genome for comparison, click on it in the list. To select multiple genomes use the standard multiple select function of your computer (e.g. in Apple Mac OS X click on the first genome, and then click on the second while pressing the Command key).

You may also specify whether you want to Match Allele and/or Match Zygosity using the checkboxes.

When you are satisfied with your choices click the “Find Variants” button. 

Shared Variants

Shared Variants allows you to compare one or more genomes to identify shared variants with the current sample. 
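
Conceptually, Unique Variants and Shared Variants are set comparisons between the current sample and the selected genomes. The sketch below illustrates the idea and is not Opal's implementation. The `Call` record is hypothetical. Comparing by position only when Match Allele is unchecked is an assumption, as is requiring a shared variant to be present in every selected genome.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Call(NamedTuple):
    # Hypothetical variant record for illustration.
    chrom: str
    pos: int
    ref: str
    alt: str
    zygosity: str


def _key(call: Call, match_allele: bool, match_zygosity: bool) -> Tuple:
    """Build the comparison key according to the selected checkboxes."""
    key: Tuple = (call.chrom, call.pos)
    if match_allele:
        key += (call.ref, call.alt)
    if match_zygosity:
        key += (call.zygosity,)
    return key


def unique_variants(sample: Sequence[Call], others: Sequence[Sequence[Call]],
                    match_allele: bool = True, match_zygosity: bool = False) -> List[Call]:
    """Variants in the sample that appear in none of the comparison genomes."""
    seen = {_key(c, match_allele, match_zygosity) for genome in others for c in genome}
    return [c for c in sample if _key(c, match_allele, match_zygosity) not in seen]


def shared_variants(sample: Sequence[Call], others: Sequence[Sequence[Call]],
                    match_allele: bool = True, match_zygosity: bool = False) -> List[Call]:
    """Variants in the sample that appear in every comparison genome (assumption)."""
    key_sets = [{_key(c, match_allele, match_zygosity) for c in genome} for genome in others]
    if not key_sets:
        return []
    return [c for c in sample
            if all(_key(c, match_allele, match_zygosity) in ks for ks in key_sets)]
```

Enabling Match Zygosity makes the comparison stricter: a heterozygous call in the sample no longer matches a homozygous call at the same allele in another genome, so it is reported as unique.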

Reset Filters

At any point during variant mining you may want to clear the current filters. Simply click the “Reset” button at the top of the grid to return to defaults.

Export

Clicking the “Export” button displays menu items to save your mining results as CSV or VCF files. You can open CSV files in Excel.

Gene Summary Dialog

Clicking on a gene symbol in Variant Miner will display the Gene Summary window. This summary includes a graphical representation of the gene, annotated evidence, information on the gene from NCBI Entrez with links to UCSC Genome Browser, Ensembl, and if applicable, to GeneTests and the NCBI Genetics Reference.

The two additional tabs provide the following information:

  • Gene Variants: a list of all gene variants in the sample genome is presented on a transcript basis with the HGVS nucleotide and protein nomenclature. The variant selected when the Gene Summary Report was displayed is highlighted in yellow. Clicking on it will highlight its location in the gene viewer above.
  • NLP Phenotype Mapper: Provides a list of conditions, strength of association and prioritized list of PubMed abstracts generated through natural language processing.