VariantExperiment 1.4.2
This vignette is about the conversion methods and statistical
functions that are enabled on VariantExperiment
objects, for the
analysis of genotyping or DNA-seq data sets. If you want to learn more
about the implementation of the VariantExperiment
class, and basic
methods, please refer to the other vignette.
The package of VariantExperiment
is implemented to represent VCF/GDS
files using standard SummarizedExperiment metaphor. It is a container
for high-through genetic/genomic data with GDS back-end, and is
interoperable with the statistical functions/methods that are
implemented in SeqArray
and SeqVarTools
that are designed for GDS
data. The SummarizedExperiment
metaphor also gets the benefit of
common manipulations within Bioconductor ecosystem that are more
user-friendly.
First, we load the package into R session.
library(VariantExperiment)
To take advantage of the functions and methods that are defined on
SummarizedExperiment
, from which the VariantExperiment
extends, we
have defined coercion methods from VCF
and GDS
to
VariantExperiment
.
VCF
to VariantExperiment
The coercion function of makeVariantExperimentFromVCF
could
convert the VCF
file directly into VariantExperiment
object. To
achieve the best storage efficiency, the assay data are saved in
GDSArray
format, and the annotation data are saved in
DelayedDataFrame
format (with no option of ordinary DataFrame
),
which could be retrieved by rowData()
for feature related
annotations and colData()
for sample related annotations (Only when
sample.info
argument is specified).
vcf <- SeqArray::seqExampleFileName("vcf")
ve <- makeVariantExperimentFromVCF(vcf)
ve
## class: VariantExperiment
## dim: 1348 90
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):
Internally, the VCF
file was converted into a on-disk GDS
file,
which could be retrieved by:
gdsfile(ve)
## [1] "/tmp/RtmpUh5hMV/file5dff40e7592/se.gds"
assay data is in GDSArray
format:
assay(ve, 1)
## <1348 x 90 x 2> array of class GDSArray and type "integer":
## ,,1
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 NA NA 0 NA . 0 0 0 0
## 2 NA NA 0 NA . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 NA NA 0 NA . NA NA NA NA
##
## ,,2
## NA06984 NA06985 NA06986 NA06989 ... NA12889 NA12890 NA12891 NA12892
## 1 NA NA 0 NA . 0 0 0 0
## 2 NA NA 0 NA . 0 0 0 0
## ... . . . . . . . . .
## 1347 0 0 0 0 . 0 0 0 0
## 1348 NA NA 1 NA . NA NA NA NA
feature-related annotation is in DelayedDataFrame
format:
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
## ID ALT REF QUAL FILTER info_AA
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T NaN PASS T
## 2 rs114390380 A G NaN PASS G
## 3 rs1320571 A G NaN PASS A
## ... ... ... ... ... ... ...
## 1346 rs8135982 T C NaN PASS C
## 1347 rs116581756 A G NaN PASS G
## 1348 rs5771206 G A NaN PASS G
## info_AC info_AN info_DP info_HM2 info_HM3 info_OR
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 4 114 3251 FALSE FALSE
## 2 1 106 2676 FALSE FALSE
## 3 6 154 7610 TRUE TRUE
## ... ... ... ... ... ... ...
## 1346 11 142 823 FALSE FALSE
## 1347 1 152 1257 FALSE FALSE
## 1348 1 6 48 FALSE FALSE
## info_GP info_BN
## <GDSArray> <GDSArray>
## 1 1:1115503 132
## 2 1:1115548 132
## 3 1:1120431 88
## ... ... ...
## 1346 22:45312345 116
## 1347 22:45312409 132
## 1348 22:50616806 114
User could also have the opportunity to save the sample related
annotation info directly into the VariantExperiment
object, by
providing the file path to the sample.info
argument, and then
retrieve by colData()
.
sampleInfo <- system.file("extdata", "Example_sampleInfo.txt",
package="VariantExperiment")
ve <- makeVariantExperimentFromVCF(vcf, sample.info = sampleInfo)
## Warning in (function (node, name, val = NULL, storage = storage.mode(val), :
## Missing characters are converted to "".
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
## family
## <GDSArray>
## NA06984 1328
## NA06985
## NA06986 13291
## ... ...
## NA12890 1463
## NA12891
## NA12892
Arguments could be specified to take only certain info columns or format columns from the vcf file.
ve1 <- makeVariantExperimentFromVCF(vcf, info.import=c("OR", "GP"))
rowData(ve1)
## DelayedDataFrame with 1348 rows and 7 columns
## ID ALT REF QUAL FILTER info_OR
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T NaN PASS
## 2 rs114390380 A G NaN PASS
## 3 rs1320571 A G NaN PASS
## ... ... ... ... ... ... ...
## 1346 rs8135982 T C NaN PASS
## 1347 rs116581756 A G NaN PASS
## 1348 rs5771206 G A NaN PASS
## info_GP
## <GDSArray>
## 1 1:1115503
## 2 1:1115548
## 3 1:1120431
## ... ...
## 1346 22:45312345
## 1347 22:45312409
## 1348 22:50616806
In the above example, only 2 info entries (“OR” and “GP”) are read
into the VariantExperiment
object.
The start
and count
arguments could be used to specify the start
position and number of variants to read into Variantexperiment
object.
ve2 <- makeVariantExperimentFromVCF(vcf, start=101, count=1000)
ve2
## class: VariantExperiment
## dim: 1000 90
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1000): 101 102 ... 1099 1100
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(0):
For the above example, only 1000 variants are read into the
VariantExperiment
object, starting from the position of 101.
GDS
to VariantExperiment
The coercion function of makeVariantExperimentFromGDS
coerces
GDS
files into VariantExperiment
objects directly, with the assay
data saved as GDSArray
, and the rowData()/colData()
in
DelayedDataFrame
by default (with the option of ordinary DataFrame
object).
gds <- SeqArray::seqExampleFileName("gds")
ve <- makeVariantExperimentFromGDS(gds)
ve
## class: VariantExperiment
## dim: 1348 90
## metadata(0):
## assays(2): genotype annotation/format/DP
## rownames(1348): 1 2 ... 1347 1348
## rowData names(14): ID ALT ... info_GP info_BN
## colnames(90): NA06984 NA06985 ... NA12891 NA12892
## colData names(1): family
rowData(ve)
## DelayedDataFrame with 1348 rows and 14 columns
## ID ALT REF QUAL FILTER info_AA
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T NaN PASS T
## 2 rs114390380 A G NaN PASS G
## 3 rs1320571 A G NaN PASS A
## ... ... ... ... ... ... ...
## 1346 rs8135982 T C NaN PASS C
## 1347 rs116581756 A G NaN PASS G
## 1348 rs5771206 G A NaN PASS G
## info_AC info_AN info_DP info_HM2 info_HM3 info_OR
## <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray> <GDSArray>
## 1 4 114 3251 FALSE FALSE
## 2 1 106 2676 FALSE FALSE
## 3 6 154 7610 TRUE TRUE
## ... ... ... ... ... ... ...
## 1346 11 142 823 FALSE FALSE
## 1347 1 152 1257 FALSE FALSE
## 1348 1 6 48 FALSE FALSE
## info_GP info_BN
## <GDSArray> <GDSArray>
## 1 1:1115503 132
## 2 1:1115548 132
## 3 1:1120431 88
## ... ... ...
## 1346 22:45312345 116
## 1347 22:45312409 132
## 1348 22:50616806 114
colData(ve)
## DelayedDataFrame with 90 rows and 1 column
## family
## <GDSArray>
## NA06984 1328
## NA06985
## NA06986 13291
## ... ...
## NA12890 1463
## NA12891
## NA12892
Arguments could be specified to take only certain annotation columns
for features and samples. All available data entries for
makeVariantExperimentFromGDS
arguments could be retrieved by the
showAvailable()
function with the gds file name as input.
showAvailable(gds)
## CharacterList of length 4
## [["name"]] genotype annotation/format/DP
## [["rowDataColumns"]] ID ALT REF QUAL FILTER
## [["colDataColumns"]] family
## [["infoColumns"]] AA AC AN DP HM2 HM3 OR GP BN
Note that the infoColumns
from gds file will be saved as columns
inside the rowData()
, with the prefix of
“info_”. rowDataOnDisk/colDataOnDisk
could be set as FALSE
to
save all annotation data in ordinary DataFrame
format.
ve3 <- makeVariantExperimentFromGDS(gds,
rowDataColumns = c("ID", "ALT", "REF"),
infoColumns = c("AC", "AN", "DP"),
rowDataOnDisk = TRUE,
colDataOnDisk = FALSE)
rowData(ve3) ## DelayedDataFrame object
## DelayedDataFrame with 1348 rows and 6 columns
## ID ALT REF info_AC info_AN info_DP
## <GDSArray> <DelayedArray> <DelayedArray> <GDSArray> <GDSArray> <GDSArray>
## 1 rs111751804 C T 4 114 3251
## 2 rs114390380 A G 1 106 2676
## 3 rs1320571 A G 6 154 7610
## ... ... ... ... ... ... ...
## 1346 rs8135982 T C 11 142 823
## 1347 rs116581756 A G 1 152 1257
## 1348 rs5771206 G A 1 6 48
colData(ve3) ## DataFrame object
## DataFrame with 90 rows and 1 column
## family
## <character>
## NA06984 1328
## NA06985
## NA06986 13291
## ... ...
## NA12890 1463
## NA12891
## NA12892
VariantExperiment
supports basic subsetting operations using [
,
[[
, $
, and ranged-based subsetting operations using
subsetByOverlap
.
NOTE that after a set of lazy operations, you need to call
saveVariantExperiment
function to synchronize the on-disk file
associated with the in-memory representation by providing a file
path. Statistical functions could only work on synchronized
VariantExperiment
object, or error will return.
Refer to the “VariantExperiment-class” vignette for more details.
Many statistical functions and methods are defined on
VariantExperiment
objects, most of which has their generic defined
in Bioconductor package of SeqArray
and SeqVarTools
. These
functions could be called directly on VariantExperiment
object as
input, with additional arguments to specify based on user’s need. More
details please refer to the vignettes of SeqArray and
SeqVarTools.
Here is a list of the statistical functions with brief description:
statistical functions | Description |
---|---|
seqAlleleFreq | Calculates the allele frequencies |
seqAlleleCount | Calculates the allele counts |
seqMissing | Calculates the missing rate for variant/sample |
seqNumAllele | Calculates the number of alleles (for ref/alt allele) |
hwe | Exact test for Hardy-Weinberg equilibrium on Single-Nucleotide Variants |
inbreedCoeff | Calculates the inbreeding coefficient by variant/sample |
pca | Calculates the eigenvalues and eignevectors with Principal Component Analysis |
titv | Calculate transition/transversion ratio overall or by sample |
refDosage | Calculate the dosage of reference allele (matrix with integers of 0/1/2) |
altDosage | Calculate the dosage of alternative allele (matrix with integers of 0/1/2) |
countSingletons | Count singleton variants for each sample |
heterozygosity | Calculate heterozygosity rate by sample or by variants |
homozygosity | Calculate homozygosity rate by sample or by variants |
meanBySample | Calculate the mean value of a variable by sample over all variants |
isSNV | Flag a single nucleotide variant |
isVariant | Locate which samples are variant for each site |
Here are some examples in calculating the sample missing rate, hwe, titv ratio and the count of singletons for each sample.
## sample missing rate
mr.samp <- seqMissing(ve, per.variant = FALSE)
head(mr.samp)
## [1] 0.083086053 0.096439169 0.009643917 0.094213650 0.129821958 0.022997033
## hwe
hwe <- hwe(ve)
head(hwe)
## variant.id nAA nAa naa afreq p f
## 1 1 53 4 0 0.9649123 1 -0.036363636
## 2 2 52 1 0 0.9905660 1 -0.009523810
## 3 3 71 6 0 0.9610390 1 -0.040540541
## 4 4 1 16 56 0.1232877 1 -0.013888889
## 5 5 76 13 0 0.9269663 1 -0.078787879
## 6 6 88 1 0 0.9943820 1 -0.005649718
## titv ratio by sample / overall
titv <- titv(ve, by.sample=TRUE)
head(titv)
## [1] 4.352941 3.791667 3.439394 3.568966 3.750000 3.646154
titv(ve, by.sample=FALSE)
## [1] 3.562712
## countSingletons
countSingletons(ve)
## [1] 2 2 7 5 1 5 3 5 2 2 8 2 4 3 6 5 9 5 3 5 5 5 7 0 5
## [26] 8 13 9 1 6 2 8 2 6 0 5 4 7 7 3 1 9 0 0 5 3 4 8 6 9
## [51] 6 4 7 8 5 7 3 5 2 2 6 8 2 6 4 3 7 4 3 3 8 2 8 6 1
## [76] 5 1 9 8 5 0 1 2 2 0 1 0 10 2 4
As we have noted in the other vignette, after the subsetting by
[
, $
or Ranged-based operations
, we need to save the new
VariantExperiment
object to synchronize the gds file (on-disk)
associated with the subset of data (in-memory representation) before
any statistical analysis. Otherwise, an error will be returned.
As a feature addition, we want to add the option of VCFArray
in
saving the assay
data in the step of
makeVariantExperimentFromVCF
. We also seek to implement the
SQLDataFrame in representation of the annotation data. We also
plan to connect Bioconductor package VariantAnnotation to
implement the variant filtering and annotation functions based on
VariantExperiment
format, and with that, to develop a pipeline for
using VariantExperiment
object as the basic data structure for
DNA-sequencing data analysis.
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] VariantExperiment_1.4.2 DelayedDataFrame_1.6.0
## [3] GDSArray_1.10.1 DelayedArray_0.16.3
## [5] Matrix_1.3-2 gdsfmt_1.26.1
## [7] SummarizedExperiment_1.20.0 Biobase_2.50.0
## [9] GenomicRanges_1.42.0 GenomeInfoDb_1.26.7
## [11] IRanges_2.24.1 MatrixGenerics_1.2.1
## [13] matrixStats_0.58.0 S4Vectors_0.28.1
## [15] BiocGenerics_0.36.0 BiocStyle_2.18.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.6 lattice_0.20-41 tidyr_1.1.3
## [4] formula.tools_1.7.1 Biostrings_2.58.0 assertthat_0.2.1
## [7] digest_0.6.27 utf8_1.2.1 R6_2.5.0
## [10] backports_1.2.1 evaluate_0.14 pillar_1.5.1
## [13] zlibbioc_1.36.0 rlang_0.4.10 data.table_1.14.0
## [16] jquerylib_0.1.3 rmarkdown_2.7 splines_4.0.5
## [19] stringr_1.4.0 GWASExactHW_1.01 RCurl_1.98-1.3
## [22] broom_0.7.6 compiler_4.0.5 xfun_0.22
## [25] pkgconfig_2.0.3 mgcv_1.8-34 htmltools_0.5.1.1
## [28] tidyselect_1.1.0 tibble_3.1.0 GenomeInfoDbData_1.2.4
## [31] bookdown_0.21 fansi_0.4.2 crayon_1.4.1
## [34] dplyr_1.0.5 bitops_1.0-6 grid_4.0.5
## [37] DBI_1.1.1 nlme_3.1-152 jsonlite_1.7.2
## [40] lifecycle_1.0.0 magrittr_2.0.1 SeqVarTools_1.28.1
## [43] stringi_1.5.3 debugme_1.1.0 XVector_0.30.0
## [46] SNPRelate_1.24.0 mice_3.13.0 bslib_0.2.4
## [49] ellipsis_0.3.1 vctrs_0.3.7 generics_0.1.0
## [52] tools_4.0.5 glue_1.4.2 purrr_0.3.4
## [55] yaml_2.2.1 SeqArray_1.30.0 BiocManager_1.30.12
## [58] operator.tools_1.6.3 logistf_1.24 knitr_1.31
## [61] sass_0.3.1