The VariantAnnotation package has facilities for reading in all or subsets of Variant Call Format (VCF) files. These text files contain meta-information lines, a header line and data lines each containing information about a position in the genome. The format also may also contain genotype information on samples for each position. More on the file format can be found in the VCF specs.
The 'locateVariants' function in the VariantAnnotation package identifies where a variant is located with respect to the gene model (e.g., exon, intron, splice site, etc.). The 'predictCoding' function reports the amino acid change for non-synonymous coding variants. Consequences of the coding changes can be investigated with the SIFT and PolyPhen database packages. We'll use these functions to learn about variants located on the TRPV gene on chromosome
This workflow requires several different Bioconductor packages. Usage of each will be described in detail in the following sections.
## Warning: package 'VariantAnnotation' was built under R version 3.2.1
## Warning: package 'GenomeInfoDb' was built under R version 3.2.1
## Warning: package 'rtracklayer' was built under R version 3.2.1
library(VariantAnnotation)
library(cgdv17)
library(org.Hs.eg.db)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
library(BSgenome.Hsapiens.UCSC.hg19)
library(PolyPhen.Hsapiens.dbSNP131)
Use biocLite() to get the packages you don't have installed:
source("http://bioconductor.org/biocLite.R")
biocLite("mypackage")
This workflow focuses on variants located in the Transient Receptor Potential Vanilloid (TRPV) gene family on chromosome 17. Sample data are from the Bioconductor cgdv17 experimental data package which contains Complete Genomics Diversity panel data for chromosome 17 on 46 individuals. For more background on how these data were curated see the package vignette.
browseVignettes("cgdv17")
We use a VCF file from the package that is a subset of chromosome 17 for a single individual from the CEU population.
library(VariantAnnotation)
library(cgdv17)
file <- system.file("vcf", "NA06985_17.vcf.gz", package = "cgdv17")
To get an idea of what data are in the file we look at the header. scanVcfHeader() parses the file header into a VCFHeader object and the info() and geno() accessors extract field-specific data.
hdr <- scanVcfHeader(file)
info(hdr)
## DataFrame with 3 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## NS 1 Integer Number of Samples With Data
## DP 1 Integer Total Depth
## DB 0 Flag dbSNP membership, build 131
geno(hdr)
## DataFrame with 12 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## GQ 1 Integer Genotype Quality
## DP 1 Integer Read Depth
## HDP 2 Integer Haplotype Read Depth
## HQ 2 Integer Haplotype Quality
## ... ... ... ...
## mRNA . String Overlaping mRNA
## rmsk . String Overlaping Repeats
## segDup . String Overlaping segmentation duplication
## rCov 1 Float relative Coverage
## cPd 1 String called Ploidy(level)
Variants in the VCF have been aligned to NCBI genome build GRCh37:
meta(hdr)$META
## DataFrame with 5 rows and 1 column
## Value
## <character>
## fileformat VCFv4.1
## fileDate 20111102
## source masterVar2VCFv40
## reference build37.fa.bz2
## phasing partial
[ Back to top ]
Use the org.Hs.eg.db package to convert gene symbols to gene ids.
## get entrez ids from gene symbols
library(org.Hs.eg.db)
genesym <- c("TRPV1", "TRPV2", "TRPV3")
geneid <- select(org.Hs.eg.db, keys=genesym, keytype="SYMBOL",
columns="ENTREZID")
geneid
## SYMBOL ENTREZID
## 1 TRPV1 7442
## 2 TRPV2 51393
## 3 TRPV3 162514
[ Back to top ]
We use the hg19 known gene track from UCSC to identify the TRPV gene ranges. These ranges will eventually be used to extract variants from a regions in the VCF file.
Load the annotation package.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
txdb
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-03-19 13:55:51 -0700 (Thu, 19 Mar 2015)
## # GenomicFeatures version at creation time: 1.19.32
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
Our VCF file was aligned to a genome from NCBI and the known gene track was from UCSC. These institutions have different naming conventions for the chromosomes. In order to use these two pieces of data in a matching or overlap operation the chromosome names (also called sesqlevels) need to match. We will modify the txdb to match the VCF file.
txdb <- renameSeqlevels(txdb, gsub("chr", "", seqlevels(txdb)))
txdb <- keepSeqlevels(txdb, "17")
Create a list of transcripts by gene:
txbygene = transcriptsBy(txdb, "gene")
Create the gene ranges for the TRPV genes
gnrng <- unlist(range(txbygene[geneid$ENTREZID]), use.names=FALSE)
names(gnrng) <- geneid$SYMBOL
[ Back to top ]
A ScanVcfParam object is used to retrieve data subsets. This object can specify genomic coordinates (ranges) or individual VCF elements. Extractions of ranges (vs fields) requires a tabix index. See ?indexTabix for details.
param <- ScanVcfParam(which = gnrng, info = "DP", geno = c("GT", "cPd"))
param
## class: ScanVcfParam
## vcfWhich: 1 elements
## vcfFixed: character() [All]
## vcfInfo: DP
## vcfGeno: GT cPd
## vcfSamples:
## Extract the TRPV ranges from the VCF file
vcf <- readVcf(file, "hg19", param)
## Inspect the VCF object with the 'fixed', 'info' and 'geno' accessors
vcf
## class: CollapsedVCF
## dim: 405 1
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 1 column: DP
## info(header(vcf)):
## Number Type Description
## DP 1 Integer Total Depth
## geno(vcf):
## SimpleList of length 2: GT, cPd
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## cPd 1 String called Ploidy(level)
head(fixed(vcf))
## DataFrame with 6 rows and 4 columns
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## 1 A G 120 PASS
## 2 A 0 PASS
## 3 AAAAA 0 PASS
## 4 AA 0 PASS
## 5 C T 59 PASS
## 6 T C 157 PASS
geno(vcf)
## List of length 2
## names(2): GT cPd
[ Back to top ]
The locateVariants function identifies where a variant falls with respect to gene structure, e.g., exon, utr, splice site, etc. We use the gene model from the TxDb.Hsapiens.UCSC.hg19.knownGene package loaded eariler.
## Use the 'region' argument to define the region
## of interest. See ?locateVariants for details.
cds <- locateVariants(vcf, txdb, CodingVariants())
five <- locateVariants(vcf, txdb, FiveUTRVariants())
splice <- locateVariants(vcf, txdb, SpliceSiteVariants())
intron <- locateVariants(vcf, txdb, IntronVariants())
all <- locateVariants(vcf, txdb, AllVariants())
Each row in cds represents a variant-transcript match so multiple rows per variant are possible. If we are interested in gene-centric questions the data can be summarized by gene regardless of transcript.
## Did any variants match more than one gene?
table(sapply(split(mcols(all)$GENEID, mcols(all)$QUERYID),
function(x) length(unique(x)) > 1))
##
## FALSE TRUE
## 391 11
## Summarize the number of variants by gene:
idx <- sapply(split(mcols(all)$QUERYID, mcols(all)$GENEID), unique)
sapply(idx, length)
## 125144 162514 51393 7442 84690
## 1 172 62 143 35
## Summarize variant location by gene:
sapply(names(idx),
function(nm) {
d <- all[mcols(all)$GENEID %in% nm, c("QUERYID", "LOCATION")]
table(mcols(d)$LOCATION[duplicated(d) == FALSE])
})
## 125144 162514 51393 7442 84690
## spliceSite 0 2 0 1 0
## intron 0 153 58 117 19
## fiveUTR 0 0 0 0 0
## threeUTR 0 0 0 0 0
## coding 0 5 3 8 0
## intergenic 0 0 0 0 0
## promoter 1 12 1 17 16
[ Back to top ]
Amino acid coding for non-synonymous variants can be computed with the function predictCoding. The BSgenome.Hsapiens.UCSC.hg19 package is used as the source of the reference alleles. Variant alleles are provided by the user.
library(BSgenome.Hsapiens.UCSC.hg19)
seqlevelsStyle(vcf) <- "UCSC"
seqlevelsStyle(txdb) <- "UCSC"
aa <- predictCoding(vcf, txdb, Hsapiens)
## Warning in .predictCodingGRangesList(query, cache[["cdsbytx"]],
## seqSource, : records with missing 'varAllele' were ignored
## Warning in .predictCodingGRangesList(query, cache[["cdsbytx"]],
## seqSource, : varAllele values containing 'N' were not translated
predictCoding returns results for coding variants only. As with locateVariants, the output has one row per variant-transcript match so multiple rows per variant are possible.
## Did any variants match more than one gene?
table(sapply(split(mcols(aa)$GENEID, mcols(aa)$QUERYID),
function(x) length(unique(x)) > 1))
##
## FALSE
## 17
## Summarize the number of variants by gene:
idx <- sapply(split(mcols(aa)$QUERYID, mcols(aa)$GENEID, drop=TRUE), unique)
sapply(idx, length)
## 162514 51393 7442
## 6 3 8
## Summarize variant consequence by gene:
sapply(names(idx),
function(nm) {
d <- aa[mcols(aa)$GENEID %in% nm, c("QUERYID","CONSEQUENCE")]
table(mcols(d)$CONSEQUENCE[duplicated(d) == FALSE])
})
## 162514 51393 7442
## nonsynonymous 2 0 2
## not translated 1 0 5
## synonymous 3 3 1
The variants 'not translated' are explained by the warnings thrown when predictCoding was called. Variants that have a missing varAllele or have an 'N' in the varAllele are not translated. If the varAllele substitution had resulted in a frameshift the consequence would be 'frameshift'. See ?predictCoding for details.
[ Back to top ]
The ensemblVEP package provides access to the online Ensembl Variant Effect Predictor (VEP tool). The VEP tool ouputs predictions of functional consequences of known and unknown variants as reported by Sequence Ontology or Ensembl. Regulatory region consequences, HGNC, Ensembl protein identifiers, HGVS, co-located variants are optional outputs. ensemblVEP() accepts the name of a VCF file and returns a VCF on disk or GRanges in the R workspace.
Load ensemblVEP:
library(ensemblVEP)
The 'file' argument to ensemblVEP must be a vcf file on disk. We'll write out the VCF object with the TRPV variants and submit that to ensemblVEP.
Write the VCF object to a vcf file in tempfile():
dest <- tempfile()
writeVcf(vcf, dest)
Call ensemblVEP with the file containing only the TRPV variants and the custom VEPParam object:
gr <- ensemblVEP(file = dest)
2015-06-09 08:50:22 - Starting...
2015-06-09 08:50:24 - Detected format of input file as vcf
2015-06-09 08:50:24 - Read 277 variants into buffer
2015-06-09 08:50:24 - Reading transcript data from cache and/or database
[================================================================================================================================================================] [ 100% ]
2015-06-09 08:54:29 - Retrieved 55 transcripts (0 mem, 0 cached, 55 DB, 0 duplicates)
2015-06-09 08:54:29 - Analyzing chromosome 17
2015-06-09 08:54:29 - Analyzing variants
[================================================================================================================================================================] [ 100% ]
2015-06-09 08:54:29 - Calculating consequences
[================================================================================================================================================================] [ 100% ]
2015-06-09 08:54:30 - Processed 277 total variants (1 vars/sec, 1 vars/sec total)
2015-06-09 08:54:30 - Wrote stats summary to /tmp/RtmppUYah1/file339f64bf5700_summary.html
2015-06-09 08:54:30 - Finished!
head(gr, 3)
GRanges object with 3 ranges and 22 metadata columns:
seqnames ranges strand | Allele
<Rle> <IRanges> <Rle> | <factor>
rs402369 chr17 [3469401, 3469401] * | G
rs402369 chr17 [3469401, 3469401] * | G
rs402369 chr17 [3469401, 3469401] * | G
Consequence IMPACT SYMBOL
<factor> <factor> <factor>
rs402369 splice_region_variant&intron_variant LOW SPATA22
rs402369 splice_region_variant&intron_variant LOW SPATA22
rs402369 upstream_gene_variant MODIFIER ASPA
Gene Feature_type Feature BIOTYPE EXON
<factor> <factor> <factor> <factor> <factor>
rs402369 ENSG00000141255 Transcript ENST00000397168 protein_coding <NA>
rs402369 ENSG00000141255 Transcript ENST00000355380 protein_coding <NA>
rs402369 ENSG00000108381 Transcript ENST00000456349 protein_coding <NA>
INTRON HGVSc HGVSp cDNA_position CDS_position
<factor> <factor> <factor> <factor> <factor>
rs402369 1/8 <NA> <NA> <NA> <NA>
rs402369 1/7 <NA> <NA> <NA> <NA>
rs402369 <NA> <NA> <NA> <NA> <NA>
Protein_position Amino_acids Codons Existing_variation DISTANCE
<factor> <factor> <factor> <factor> <factor>
rs402369 <NA> <NA> <NA> <NA> <NA>
rs402369 <NA> <NA> <NA> <NA> <NA>
rs402369 <NA> <NA> <NA> <NA> 4652
STRAND SYMBOL_SOURCE HGNC_ID
<factor> <factor> <factor>
rs402369 -1 HGNC HGNC:30705
rs402369 -1 HGNC HGNC:30705
rs402369 1 HGNC HGNC:756
-------
seqinfo: 25 sequences from genome; no seqlengths
Data from the VEP tool is returned as metadata columns in the GRanges ('gr' object). You can further control what fields are returned by setting the runtime options in the VEPParam.
VEPParam()
## class: VEPParam78
## identifier(0):
## colocatedVariants(0):
## dataformat(0):
## basic(0):
## input(1): species
## cache(3): dir, dir_cache, dir_plugins
## output(1): terms
## filterqc(0):
## database(2): host, database
## advanced(1): buffer_size
## version: 78,80
## scriptPath:
Each of the options, e.g., 'basic', 'inut', 'cache' etc. has multiple fields that can be set.
basic(VEPParam())
## $verbose
## [1] FALSE
##
## $quiet
## [1] FALSE
##
## $no_progress
## [1] FALSE
##
## $config
## character(0)
##
## $everything
## [1] FALSE
##
## $fork
## numeric(0)
For more details on runtime options see the ?VEPParam man page and the Ensembl VEP web site.
[ Back to top ]
Packages have extensive help pages, and include vignettes highlighting common use cases. The help pages and vignettes are available from within R. After loading a package, use syntax like
help(package="VariantAnnotation")
?predictCoding
to obtain an overview of help on the VariantAnnotation
package, and
the predictCoding
function. View the package vignette with
browseVignettes(package="VariantAnnotation")
To view vignettes providing a more comprehensive introduction to package functionality use
help.start()
[ Back to top ]
sessionInfo()
## R version 3.2.0 Patched (2015-04-22 r68234)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## Running under: OS X 10.6.8 (Snow Leopard)
##
## locale:
## [1] C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ensemblVEP_1.8.1
## [2] PolyPhen.Hsapiens.dbSNP131_1.0.2
## [3] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [4] BSgenome_1.36.1
## [5] rtracklayer_1.28.5
## [6] cgdv17_0.6.0
## [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [8] GenomicFeatures_1.20.1
## [9] GGtools_5.4.0
## [10] data.table_1.9.4
## [11] GGBase_3.30.0
## [12] snpStats_1.18.0
## [13] Matrix_1.2-0
## [14] survival_2.38-1
## [15] org.Hs.eg.db_3.1.2
## [16] RSQLite_1.0.0
## [17] DBI_0.3.1
## [18] AnnotationDbi_1.30.1
## [19] Biobase_2.28.0
## [20] VariantAnnotation_1.14.3
## [21] Rsamtools_1.20.4
## [22] Biostrings_2.36.1
## [23] XVector_0.8.0
## [24] GenomicRanges_1.20.5
## [25] GenomeInfoDb_1.4.1
## [26] IRanges_2.2.4
## [27] S4Vectors_0.6.0
## [28] BiocGenerics_0.14.0
##
## loaded via a namespace (and not attached):
## [1] splines_3.2.0 gtools_3.5.0
## [3] Formula_1.2-1 latticeExtra_0.6-26
## [5] lattice_0.20-31 biovizBase_1.16.0
## [7] chron_2.3-45 digest_0.6.8
## [9] RColorBrewer_1.1-2 colorspace_1.2-6
## [11] plyr_1.8.2 XML_3.98-1.2
## [13] biglm_0.9-1 biomaRt_2.24.0
## [15] genefilter_1.50.0 zlibbioc_1.14.0
## [17] xtable_1.7-4 scales_0.2.5
## [19] gdata_2.16.1 ff_2.2-13
## [21] BiocParallel_1.2.3 annotate_1.46.0
## [23] ggplot2_1.0.1 ROCR_1.0-7
## [25] nnet_7.3-9 Gviz_1.12.0
## [27] hexbin_1.27.0 proto_0.3-10
## [29] magrittr_1.5 evaluate_0.7
## [31] MASS_7.3-40 gplots_2.17.0
## [33] foreign_0.8-63 tools_3.2.0
## [35] formatR_1.2 matrixStats_0.14.0
## [37] stringr_1.0.0 munsell_0.4.2
## [39] cluster_2.0.1 lambda.r_1.1.7
## [41] caTools_1.17.1 futile.logger_1.4.1
## [43] grid_3.2.0 RCurl_1.95-4.6
## [45] dichromat_2.0-0 iterators_1.0.7
## [47] bitops_1.0-6 gtable_0.1.2
## [49] reshape2_1.4.1 GenomicAlignments_1.4.1
## [51] gridExtra_0.9.1 knitr_1.10.5
## [53] bit_1.1-12 Hmisc_3.16-0
## [55] futile.options_1.0.0 KernSmooth_2.23-14
## [57] stringi_0.4-1 Rcpp_0.11.6
## [59] rpart_4.1-9 acepack_1.3-3.3
[ Back to top ]