Below we compare the results of annotating variants via VariantAnnotation. We compare using data from 1,000 Genomes Phase 1 Variants: * as parsed from a VCF file * retrieved from the Google Genomics API
First we read in the data from the VCF file:
suppressPackageStartupMessages(library(VariantAnnotation))
fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
vcf <- renameSeqlevels(vcf, c("22"="chr22"))
vcf
## class: CollapsedVCF
## dim: 10376 5
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 22 columns: LDAF, AVGPOST, RSQ, ERATE, THETA, CIEND, C...
## info(header(vcf)):
## Number Type Description
## LDAF 1 Float MLE Allele Frequency Accounting for LD
## AVGPOST 1 Float Average posterior probability from MaCH/Thu...
## RSQ 1 Float Genotype imputation quality from MaCH/Thunder
## ERATE 1 Float Per-marker Mutation rate from MaCH/Thunder
## THETA 1 Float Per-marker Transition rate from MaCH/Thunder
## CIEND 2 Integer Confidence interval around END for imprecis...
## CIPOS 2 Integer Confidence interval around POS for imprecis...
## END 1 Integer End position of the variant described in th...
## HOMLEN . Integer Length of base pair identical micro-homolog...
## HOMSEQ . String Sequence of base pair identical micro-homol...
## SVLEN 1 Integer Difference in length between REF and ALT al...
## SVTYPE 1 String Type of structural variant
## AC . Integer Alternate Allele Count
## AN 1 Integer Total Allele Count
## AA 1 String Ancestral Allele, ftp://ftp.1000genomes.ebi...
## AF 1 Float Global Allele Frequency based on AC/AN
## AMR_AF 1 Float Allele Frequency for samples from AMR based...
## ASN_AF 1 Float Allele Frequency for samples from ASN based...
## AFR_AF 1 Float Allele Frequency for samples from AFR based...
## EUR_AF 1 Float Allele Frequency for samples from EUR based...
## VT 1 String indicates what type of variant the line rep...
## SNPSOURCE . String indicates if a snp was called when analysin...
## geno(vcf):
## SimpleList of length 3: GT, DS, GL
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## DS 1 Float Genotype dosage from MaCH/Thunder
## GL . Float Genotype Likelihoods
The file chr22.vcf.gz
within package VariantAnnotation holds data for 5 of the 1,092 individuals in 1,000 Genomes, starting at position 50300078 and ending at position 50999964.
HG00096 HG00097 HG00099 HG00100 HG00101
Important data differences to note: * VCF data uses 1-based coordinates but data from the GA4GH APIs is 0-based. * There are two variants in the Google Genomics copy of 1,000 Genomes phase 1 variants that are not in chr22.vcf.gz
. They are the only two variants within the genomic range with ALT == <DEL>
.
# Authenticated on package load from the env variable GOOGLE_API_KEY.
suppressPackageStartupMessages(library(GoogleGenomics))
# TODO Right now we're just getting a few variants. Later update this to retrieve them all.
system.time({
granges <- getVariants(datasetId="10473108253681171589",
chromosome="22",
start=50300077,
end=50303000, # TODO end=50999964
converter=variantsToGRanges)
})
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIWK_hcQ_Nr-8ZGlk78i
## Fetching variants page.
## Continuing variant query with the nextPageToken: CIaM_hcQvM_un46SmcJ0
## Fetching variants page.
## Continuing variant query with the nextPageToken: CJGN_hcQiYna3cSwse_RAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKQ_hcQr9OUpKu4n_xz
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMKS_hcQn-nIneaph7WJAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOU_hcQ2eCCwsmBysdd
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLWV_hcQmo-n-tX08uPyAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CMqX_hcQ__3H4KKV9tUR
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKuZ_hcQ9pHXwa742bKgAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLqa_hcQ-6zjo_Dc86gy
## Fetching variants page.
## Continuing variant query with the nextPageToken: CLSb_hcQ3vrDo5XJwsB_
## Fetching variants page.
## Continuing variant query with the nextPageToken: CKOd_hcQ6IPgk-PVjskK
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGe_hcQ-8Saw6Ojg8yXAQ
## Fetching variants page.
## Continuing variant query with the nextPageToken: CPGf_hcQuKDdodiCvLUm
## Fetching variants page.
## Variants are now available.
## user system elapsed
## 5.302 0.072 17.992
Ensure that the data retrieved by each matches:
vcf <- vcf[1:length(granges)] # Truncate the VCF data
suppressPackageStartupMessages(library(testthat))
expect_equal(start(granges), start(vcf))
expect_equal(end(granges), end(vcf))
expect_equal(as.character(granges$REF), as.character(ref(vcf)))
expect_equal(as.character(unlist(granges$ALT)), as.character(unlist(alt(vcf))))
expect_equal(granges$QUAL, qual(vcf))
expect_equal(granges$FILTER, filt(vcf))
Now locate the protein coding variants in each:
suppressPackageStartupMessages(library(TxDb.Hsapiens.UCSC.hg19.knownGene))
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
rd <- rowData(vcf)
## Warning: 'rowData' is deprecated.
## Use 'rowRanges' instead.
## See help("Deprecated")
vcf_locations <- locateVariants(rd, txdb, CodingVariants())
vcf_locations
## GRanges object with 7 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## 1 chr22 [50301422, 50301422] * | coding 939 939
## 2 chr22 [50301476, 50301476] * | coding 885 885
## 3 chr22 [50301488, 50301488] * | coding 873 873
## 4 chr22 [50301494, 50301494] * | coding 867 867
## 5 chr22 [50301584, 50301584] * | coding 777 777
## 6 chr22 [50302962, 50302962] * | coding 698 698
## 7 chr22 [50302995, 50302995] * | coding 665 665
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 24 75253 218562 79087
## 2 25 75253 218562 79087
## 3 26 75253 218562 79087
## 4 27 75253 218562 79087
## 5 28 75253 218562 79087
## 6 57 75253 218563 79087
## 7 58 75253 218563 79087
## FOLLOWID
## <CharacterList>
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
granges_locations <- locateVariants(granges, txdb, CodingVariants())
granges_locations
## GRanges object with 7 ranges and 9 metadata columns:
## seqnames ranges strand | LOCATION LOCSTART LOCEND
## <Rle> <IRanges> <Rle> | <factor> <integer> <integer>
## 1 chr22 [50301422, 50301422] * | coding 939 939
## 2 chr22 [50301476, 50301476] * | coding 885 885
## 3 chr22 [50301488, 50301488] * | coding 873 873
## 4 chr22 [50301494, 50301494] * | coding 867 867
## 5 chr22 [50301584, 50301584] * | coding 777 777
## 6 chr22 [50302962, 50302962] * | coding 698 698
## 7 chr22 [50302995, 50302995] * | coding 665 665
## QUERYID TXID CDSID GENEID PRECEDEID
## <integer> <character> <IntegerList> <character> <CharacterList>
## 1 24 75253 218562 79087
## 2 25 75253 218562 79087
## 3 26 75253 218562 79087
## 4 27 75253 218562 79087
## 5 28 75253 218562 79087
## 6 57 75253 218563 79087
## 7 58 75253 218563 79087
## FOLLOWID
## <CharacterList>
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
expect_equal(granges_locations, vcf_locations)
And predict the effect of the protein coding variants:
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg19))
vcf_coding <- predictCoding(vcf, txdb, seqSource=Hsapiens)
vcf_coding
## GRanges object with 7 ranges and 17 metadata columns:
## seqnames ranges strand | paramRangeID
## <Rle> <IRanges> <Rle> | <factor>
## rs114335781 chr22 [50301422, 50301422] - | <NA>
## rs8135963 chr22 [50301476, 50301476] - | <NA>
## 22:50301488_C/T chr22 [50301488, 50301488] - | <NA>
## 22:50301494_G/A chr22 [50301494, 50301494] - | <NA>
## 22:50301584_C/T chr22 [50301584, 50301584] - | <NA>
## rs114264124 chr22 [50302962, 50302962] - | <NA>
## rs149209714 chr22 [50302995, 50302995] - | <NA>
## REF ALT QUAL FILTER
## <DNAStringSet> <DNAStringSetList> <numeric> <character>
## rs114335781 G A 100 PASS
## rs8135963 T C 100 PASS
## 22:50301488_C/T C T 100 PASS
## 22:50301494_G/A G A 100 PASS
## 22:50301584_C/T C T 100 PASS
## rs114264124 C T 100 PASS
## rs149209714 C G 100 PASS
## varAllele CDSLOC PROTEINLOC QUERYID
## <DNAStringSet> <IRanges> <IntegerList> <integer>
## rs114335781 T [939, 939] 313 24
## rs8135963 G [885, 885] 295 25
## 22:50301488_C/T A [873, 873] 291 26
## 22:50301494_G/A T [867, 867] 289 27
## 22:50301584_C/T A [777, 777] 259 28
## rs114264124 A [698, 698] 233 57
## rs149209714 C [665, 665] 222 58
## TXID CDSID GENEID CONSEQUENCE
## <character> <IntegerList> <character> <factor>
## rs114335781 75253 218562 79087 synonymous
## rs8135963 75253 218562 79087 synonymous
## 22:50301488_C/T 75253 218562 79087 synonymous
## 22:50301494_G/A 75253 218562 79087 synonymous
## 22:50301584_C/T 75253 218562 79087 synonymous
## rs114264124 75253 218563 79087 nonsynonymous
## rs149209714 75253 218563 79087 nonsynonymous
## REFCODON VARCODON REFAA
## <DNAStringSet> <DNAStringSet> <AAStringSet>
## rs114335781 ATC ATT I
## rs8135963 GCA GCG A
## 22:50301488_C/T CCG CCA P
## 22:50301494_G/A CAC CAT H
## 22:50301584_C/T CCG CCA P
## rs114264124 CGG CAG R
## rs149209714 GGA GCA G
## VARAA
## <AAStringSet>
## rs114335781 I
## rs8135963 A
## 22:50301488_C/T P
## 22:50301494_G/A H
## 22:50301584_C/T P
## rs114264124 Q
## rs149209714 A
## -------
## seqinfo: 1 sequence from hg19 genome; no seqlengths
granges_coding <- predictCoding(rep(granges, elementLengths(granges$ALT)),
txdb,
seqSource=Hsapiens,
varAllele=unlist(granges$ALT, use.names=FALSE))
granges_coding
## GRanges object with 7 ranges and 16 metadata columns:
## seqnames ranges strand | REF
## <Rle> <IRanges> <Rle> | <DNAStringSet>
## rs114335781 chr22 [50301422, 50301422] - | G
## rs8135963 chr22 [50301476, 50301476] - | T
## rs200080075 chr22 [50301488, 50301488] - | C
## rs147801200 chr22 [50301494, 50301494] - | G
## rs138060012 chr22 [50301584, 50301584] - | C
## rs114264124 chr22 [50302962, 50302962] - | C
## rs149209714 chr22 [50302995, 50302995] - | C
## ALT QUAL FILTER varAllele
## <DNAStringSetList> <numeric> <character> <DNAStringSet>
## rs114335781 A 100 PASS T
## rs8135963 C 100 PASS G
## rs200080075 T 100 PASS A
## rs147801200 A 100 PASS T
## rs138060012 T 100 PASS A
## rs114264124 T 100 PASS A
## rs149209714 G 100 PASS C
## CDSLOC PROTEINLOC QUERYID TXID CDSID
## <IRanges> <IntegerList> <integer> <character> <IntegerList>
## rs114335781 [939, 939] 313 24 75253 218562
## rs8135963 [885, 885] 295 25 75253 218562
## rs200080075 [873, 873] 291 26 75253 218562
## rs147801200 [867, 867] 289 27 75253 218562
## rs138060012 [777, 777] 259 28 75253 218562
## rs114264124 [698, 698] 233 57 75253 218563
## rs149209714 [665, 665] 222 58 75253 218563
## GENEID CONSEQUENCE REFCODON VARCODON
## <character> <factor> <DNAStringSet> <DNAStringSet>
## rs114335781 79087 synonymous ATC ATT
## rs8135963 79087 synonymous GCA GCG
## rs200080075 79087 synonymous CCG CCA
## rs147801200 79087 synonymous CAC CAT
## rs138060012 79087 synonymous CCG CCA
## rs114264124 79087 nonsynonymous CGG CAG
## rs149209714 79087 nonsynonymous GGA GCA
## REFAA VARAA
## <AAStringSet> <AAStringSet>
## rs114335781 I I
## rs8135963 A A
## rs200080075 P P
## rs147801200 H H
## rs138060012 P P
## rs114264124 R Q
## rs149209714 G A
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
expect_equal(as.matrix(granges_coding$REFCODON), as.matrix(vcf_coding$REFCODON))
expect_equal(as.matrix(granges_coding$VARCODON), as.matrix(vcf_coding$VARCODON))
expect_equal(granges_coding$GENEID, vcf_coding$GENEID)
expect_equal(granges_coding$CONSEQUENCE, vcf_coding$CONSEQUENCE)
Add gene information:
suppressPackageStartupMessages(library(org.Hs.eg.db))
annots <- select(org.Hs.eg.db,
keys=granges_coding$GENEID,
keytype="ENTREZID",
columns=c("SYMBOL", "GENENAME","ENSEMBL"))
cbind(elementMetadata(granges_coding), annots)
## DataFrame with 7 rows and 20 columns
## REF ALT QUAL FILTER varAllele
## <DNAStringSet> <DNAStringSetList> <numeric> <character> <DNAStringSet>
## 1 G A 100 PASS T
## 2 T C 100 PASS G
## 3 C T 100 PASS A
## 4 G A 100 PASS T
## 5 C T 100 PASS A
## 6 C T 100 PASS A
## 7 C G 100 PASS C
## CDSLOC PROTEINLOC QUERYID TXID CDSID GENEID
## <IRanges> <IntegerList> <integer> <character> <IntegerList> <character>
## 1 [939, 939] 313 24 75253 218562 79087
## 2 [885, 885] 295 25 75253 218562 79087
## 3 [873, 873] 291 26 75253 218562 79087
## 4 [867, 867] 289 27 75253 218562 79087
## 5 [777, 777] 259 28 75253 218562 79087
## 6 [698, 698] 233 57 75253 218563 79087
## 7 [665, 665] 222 58 75253 218563 79087
## CONSEQUENCE REFCODON VARCODON REFAA VARAA
## <factor> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
## 1 synonymous ATC ATT I I
## 2 synonymous GCA GCG A A
## 3 synonymous CCG CCA P P
## 4 synonymous CAC CAT H H
## 5 synonymous CCG CCA P P
## 6 nonsynonymous CGG CAG R Q
## 7 nonsynonymous GGA GCA G A
## ENTREZID SYMBOL GENENAME
## <character> <character> <character>
## 1 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 2 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 3 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 4 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 5 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 6 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## 7 79087 ALG12 ALG12, alpha-1,6-mannosyltransferase
## ENSEMBL
## <character>
## 1 ENSG00000182858
## 2 ENSG00000182858
## 3 ENSG00000182858
## 4 ENSG00000182858
## 5 ENSG00000182858
## 6 ENSG00000182858
## 7 ENSG00000182858
Package versions used:
sessionInfo()
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-unknown-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] org.Hs.eg.db_3.1.2
## [2] RSQLite_1.0.0
## [3] DBI_0.3.1
## [4] BSgenome.Hsapiens.UCSC.hg19_1.4.0
## [5] BSgenome_1.36.0
## [6] rtracklayer_1.28.0
## [7] TxDb.Hsapiens.UCSC.hg19.knownGene_3.1.2
## [8] GenomicFeatures_1.20.0
## [9] AnnotationDbi_1.30.0
## [10] Biobase_2.28.0
## [11] testthat_0.9.1
## [12] ggbio_1.16.0
## [13] ggplot2_1.0.1
## [14] GoogleGenomics_1.0.0
## [15] VariantAnnotation_1.14.0
## [16] GenomicAlignments_1.4.0
## [17] Rsamtools_1.20.0
## [18] Biostrings_2.36.0
## [19] XVector_0.8.0
## [20] GenomicRanges_1.20.0
## [21] GenomeInfoDb_1.4.0
## [22] IRanges_2.2.0
## [23] S4Vectors_0.6.0
## [24] BiocGenerics_0.14.0
## [25] BiocStyle_1.6.0
##
## loaded via a namespace (and not attached):
## [1] reshape2_1.4.1 splines_3.2.0 lattice_0.20-31
## [4] colorspace_1.2-6 htmltools_0.2.6 yaml_2.1.13
## [7] RBGL_1.44.0 XML_3.98-1.1 survival_2.38-1
## [10] foreign_0.8-63 BiocParallel_1.2.0 RColorBrewer_1.1-2
## [13] lambda.r_1.1.7 plyr_1.8.1 stringr_0.6.2
## [16] zlibbioc_1.14.0 munsell_0.4.2 gtable_0.1.2
## [19] futile.logger_1.4 OrganismDbi_1.10.0 evaluate_0.6
## [22] labeling_0.3 latticeExtra_0.6-26 knitr_1.9
## [25] GGally_0.5.0 biomaRt_2.24.0 proto_0.3-10
## [28] Rcpp_0.11.5 acepack_1.3-3.3 scales_0.2.4
## [31] formatR_1.1 graph_1.46.0 Hmisc_3.15-0
## [34] jsonlite_0.9.16 gridExtra_0.9.1 rjson_0.2.15
## [37] digest_0.6.8 biovizBase_1.16.0 grid_3.2.0
## [40] tools_3.2.0 bitops_1.0-6 RCurl_1.95-4.5
## [43] dichromat_2.0-0 Formula_1.2-1 cluster_2.0.1
## [46] futile.options_1.0.0 MASS_7.3-40 rmarkdown_0.5.1
## [49] reshape_0.8.5 httr_0.6.1 rpart_4.1-9
## [52] nnet_7.3-9