First of all we need to install NewWave:
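NewWave is distributed through Bioconductor, so installation goes through BiocManager (a standard one-off setup sketch, not meant to be run in every session):

```r
# Install NewWave from Bioconductor (run once)
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("NewWave")
```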
NewWave is a new package that assumes a Negative Binomial distribution for dimensionality reduction and batch effect removal. To reduce memory consumption it uses a PSOCK cluster combined with the R package SharedObject, which allows a matrix to be shared between different cores without memory duplication. Thanks to this, the estimation process can be massively parallelized, with a large saving in running time. Running time can be reduced even further by using minibatch approaches in the different steps of the optimization.
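The parallel setup can be illustrated with base R's parallel package. This is a simplified sketch: a plain PSOCK cluster copies the matrix to every worker via clusterExport, which is exactly the duplication that SharedObject lets NewWave avoid.

```r
library(parallel)

mat <- matrix(1:40, nrow = 4)   # toy data matrix (4 genes x 10 cells)
cl <- makePSOCKcluster(2)       # two worker processes

# With a plain PSOCK cluster the matrix is serialized and copied to each
# worker; SharedObject would instead share a single in-memory copy.
clusterExport(cl, "mat")
col_sums <- parSapply(cl, seq_len(ncol(mat)), function(j) sum(mat[, j]))

stopCluster(cl)
col_sums
```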
I am going to show how to use NewWave with example data generated with Splatter.
library(splatter)
library(SingleCellExperiment)

params <- newSplatParams()
N <- 500
set.seed(1234)
data <- splatSimulateGroups(params, batchCells = c(N/2, N/2),
                            group.prob = rep(0.1, 10),
                            de.prob = 0.2,
                            verbose = FALSE)
Now we have a dataset with 500 cells and 10,000 genes; I will use only the 500 most variable genes. NewWave takes raw counts as input, not normalized data.
set.seed(12359)
hvg <- rowVars(counts(data))          # per-gene variance (matrixStats)
names(hvg) <- rownames(counts(data))
data <- data[names(sort(hvg, decreasing = TRUE))[1:500], ]
As you can see, there is a variable called Batch in the colData of the object.
colData(data)
#> DataFrame with 500 rows and 4 columns
#> Cell Batch Group ExpLibSize
#> <character> <character> <factor> <numeric>
#> Cell1 Cell1 Batch1 Group9 52739.2
#> Cell2 Cell2 Batch1 Group10 74088.5
#> Cell3 Cell3 Batch1 Group9 58679.5
#> Cell4 Cell4 Batch1 Group6 67414.3
#> Cell5 Cell5 Batch1 Group5 74912.0
#> ... ... ... ... ...
#> Cell496 Cell496 Batch2 Group5 39001.5
#> Cell497 Cell497 Batch2 Group6 50961.8
#> Cell498 Cell498 Batch2 Group4 65811.3
#> Cell499 Cell499 Batch2 Group2 66538.6
#> Cell500 Cell500 Batch2 Group9 56559.3
IMPORTANT: for batch effect removal, the batch variable must be a factor.
We also have a variable called Group that represents the cell-type labels.
We can see how the cells are distributed across groups and batches.
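For instance, a cross-tabulation of the two colData variables shows the design. In the vignette `data` is the SingleCellExperiment simulated above; a stand-in data frame with the same two columns keeps the snippet self-contained:

```r
# Stand-in for colData(data): 10 groups, 2 batches of 250 cells each
cells <- data.frame(Group = sample(paste0("Group", 1:10), 500, replace = TRUE),
                    Batch = rep(c("Batch1", "Batch2"), each = 250))
table(cells$Group, cells$Batch)
# With the real object: table(data$Group, data$Batch)
```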
There is a clear batch effect between the cells.
Let’s try to correct it.
I am going to show different implementations and the suggested way to use them depending on the available hardware.
Some advice:
This is the way to insert the batch variable; other cell-level variables can be inserted in the same manner, and gene-level variables, if needed, can be inserted through V.
res <- newWave(data,X = "~Batch", K=10, verbose = TRUE)
#> Time of setup
#> user system elapsed
#> 0.028 0.000 0.489
#> Time of initialization
#> user system elapsed
#> 0.048 0.004 0.885
#> Iteration 1
#> penalized log-likelihood = -1297331.19391633
#> Time of dispersion optimization
#> user system elapsed
#> 1.112 0.008 1.122
#> after optimize dispersion = -1059260.31267161
#> Time of right optimization
#> user system elapsed
#> 0.000 0.000 10.164
#> after right optimization= -1058530.03757114
#> after orthogonalization = -1058530.01070189
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 9.194
#> after left optimization= -1058253.43266873
#> after orthogonalization = -1058253.43105307
#> Iteration 2
#> penalized log-likelihood = -1058253.43105307
#> Time of dispersion optimization
#> user system elapsed
#> 0.848 0.000 0.850
#> after optimize dispersion = -1058247.60542535
#> Time of right optimization
#> user system elapsed
#> 0.000 0.000 10.206
#> after right optimization= -1058221.81481695
#> after orthogonalization = -1058221.8137679
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 7.251
#> after left optimization= -1058212.63549661
#> after orthogonalization = -1058212.63543312
To make it faster, you can increase the number of cores used with the children parameter:
res2 <- newWave(data,X = "~Batch", K=10, verbose = TRUE, children=2)
#> Time of setup
#> user system elapsed
#> 0.012 0.016 0.624
#> Time of initialization
#> user system elapsed
#> 0.056 0.000 0.556
#> Iteration 1
#> penalized log-likelihood = -1297331.19401854
#> Time of dispersion optimization
#> user system elapsed
#> 1.124 0.020 1.152
#> after optimize dispersion = -1059260.31294836
#> Time of right optimization
#> user system elapsed
#> 0.004 0.000 5.234
#> after right optimization= -1058530.04460675
#> after orthogonalization = -1058530.0177271
#> Time of left optimization
#> user system elapsed
#> 0.004 0.000 5.202
#> after left optimization= -1058253.4382544
#> after orthogonalization = -1058253.43663678
#> Iteration 2
#> penalized log-likelihood = -1058253.43663678
#> Time of dispersion optimization
#> user system elapsed
#> 0.868 0.000 0.867
#> after optimize dispersion = -1058247.61123741
#> Time of right optimization
#> user system elapsed
#> 0.00 0.00 4.69
#> after right optimization= -1058221.81316142
#> after orthogonalization = -1058221.81210967
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 3.738
#> after left optimization= -1058212.63084014
#> after orthogonalization = -1058212.63077955
If you do not have a high number of cores available, this is the fastest way to run newWave. The optimization is carried out by three steps iterated until convergence.
Each of these three steps can be accelerated using minibatches; the number of observations used is set with these parameters: n_gene_disp for the dispersion optimization, n_gene_par for the cell-related parameters, and n_cell_par for the gene-related parameters.
res3 <- newWave(data, X = "~Batch", verbose = TRUE, K = 10, children = 2,
                n_gene_disp = 100, n_gene_par = 100, n_cell_par = 100)
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 37),
#> the unshared data will be exported instead
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 38),
#> the unshared data will be exported instead
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 37),
#> the unshared data will be exported instead
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 38),
#> the unshared data will be exported instead
#> Time of setup
#> user system elapsed
#> 0.032 0.004 0.488
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 38),
#> the unshared data will be exported instead
#> Warning in serialize(data, node$con): The shared memory does not exist(id: 38),
#> the unshared data will be exported instead
#> Time of initialization
#> user system elapsed
#> 0.060 0.004 0.555
#> Iteration 1
#> penalized log-likelihood = -1297331.19400289
#> Time of dispersion optimization
#> user system elapsed
#> 0.864 0.020 0.885
#> after optimize dispersion = -1059260.31235371
#> Time of right optimization
#> user system elapsed
#> 0.000 0.000 5.304
#> after right optimization= -1058530.04662611
#> after orthogonalization = -1058530.01975665
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 4.963
#> after left optimization= -1058253.44134748
#> after orthogonalization = -1058253.43973378
#> Iteration 2
#> penalized log-likelihood = -1058253.43973378
#> Time of dispersion optimization
#> user system elapsed
#> 0.396 0.000 0.399
#> after optimize dispersion = -1058253.43973378
#> Time of right optimization
#> user system elapsed
#> 0.000 0.000 0.921
#> after right optimization= -1058248.50027572
#> after orthogonalization = -1058248.4998783
#> Time of left optimization
#> user system elapsed
#> 0.004 0.000 0.650
#> after left optimization= -1058248.3455107
#> after orthogonalization = -1058248.34549991
If you have many cores available, or you want to estimate a genewise dispersion parameter, this is the fastest configuration:
res4 <- newWave(data, X = "~Batch", verbose = TRUE, K = 10, children = 2,
                n_gene_par = 100, n_cell_par = 100, commondispersion = FALSE)
#> Time of setup
#> user system elapsed
#> 0.024 0.004 0.556
#> Time of initialization
#> user system elapsed
#> 0.064 0.000 0.575
#> Iteration 1
#> penalized log-likelihood = -1297331.19439467
#> Time of dispersion optimization
#> user system elapsed
#> 0.828 0.008 0.837
#> after optimize dispersion = -1059260.31603035
#> Time of right optimization
#> user system elapsed
#> 0.000 0.000 4.807
#> after right optimization= -1058530.04593073
#> after orthogonalization = -1058530.01907852
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 5.021
#> after left optimization= -1058253.44816635
#> after orthogonalization = -1058253.44655823
#> Iteration 2
#> penalized log-likelihood = -1058253.44655823
#> Time of dispersion optimization
#> user system elapsed
#> 0.060 0.000 1.051
#> after optimize dispersion = -1054574.32106304
#> Time of right optimization
#> user system elapsed
#> 0.004 0.000 1.305
#> after right optimization= -1054569.65870332
#> after orthogonalization = -1054569.65831265
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 1.351
#> after left optimization= -1054539.15555716
#> after orthogonalization = -1054539.15501452
#> Iteration 3
#> penalized log-likelihood = -1054539.15501452
#> Time of dispersion optimization
#> user system elapsed
#> 0.092 0.000 0.474
#> after optimize dispersion = -1054539.17601074
#> Time of right optimization
#> user system elapsed
#> 0.00 0.00 1.24
#> after right optimization= -1054534.4295738
#> after orthogonalization = -1054534.42910108
#> Time of left optimization
#> user system elapsed
#> 0.000 0.000 0.974
#> after left optimization= -1054513.83167299
#> after orthogonalization = -1054513.83117253
NB: do not use n_gene_disp in this case; it would slow down the computation.
Now I can use the latent-dimension representation for visualization purposes:
latent <- reducedDim(res)              # cells x K matrix of latent factors
tsne_latent <- data.frame(Rtsne(latent)$Y)
tsne_latent$batch <- data$Batch
tsne_latent$group <- data$Group
or for clustering:
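A minimal sketch: cluster the cells in the latent space with k-means and compare the result with the simulated labels (adjustedRandIndex from the attached mclust package is one option for the comparison). A stand-in latent matrix keeps the snippet self-contained:

```r
# Stand-in for the cells x K latent matrix returned by reducedDim(res)
set.seed(42)
latent <- matrix(rnorm(500 * 10), nrow = 500)

# k-means in the latent space, one cluster per simulated group
cluster <- kmeans(latent, centers = 10, nstart = 5, iter.max = 25)$cluster
table(cluster)

# With the real objects, agreement with the true labels can be scored as:
# mclust::adjustedRandIndex(cluster, data$Group)
```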
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.5 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
#> LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] parallel stats4 stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] NewWave_1.0.2 SharedObject_1.4.0
#> [3] mclust_5.4.7 ggplot2_3.3.2
#> [5] Rtsne_0.15 irlba_2.3.3
#> [7] Matrix_1.3-0 splatter_1.14.1
#> [9] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
#> [11] Biobase_2.50.0 GenomicRanges_1.42.0
#> [13] GenomeInfoDb_1.26.2 IRanges_2.24.1
#> [15] S4Vectors_0.28.1 BiocGenerics_0.36.0
#> [17] MatrixGenerics_1.2.0 matrixStats_0.57.0
#>
#> loaded via a namespace (and not attached):
#> [1] tidyselect_1.1.0 locfit_1.5-9.4 xfun_0.19
#> [4] beachmat_2.6.4 BiocSingular_1.6.0 purrr_0.3.4
#> [7] lattice_0.20-41 colorspace_2.0-0 vctrs_0.3.6
#> [10] generics_0.1.0 htmltools_0.5.0 yaml_2.2.1
#> [13] rlang_0.4.9 pillar_1.4.7 glue_1.4.2
#> [16] withr_2.3.0 BiocParallel_1.24.1 GenomeInfoDbData_1.2.4
#> [19] lifecycle_0.2.0 stringr_1.4.0 zlibbioc_1.36.0
#> [22] munsell_0.5.0 gtable_0.3.0 rsvd_1.0.3
#> [25] evaluate_0.14 labeling_0.4.2 knitr_1.30
#> [28] Rcpp_1.0.5 scales_1.1.1 backports_1.2.1
#> [31] checkmate_2.0.0 DelayedArray_0.16.0 XVector_0.30.0
#> [34] farver_2.0.3 digest_0.6.27 stringi_1.5.3
#> [37] dplyr_1.0.2 grid_4.0.3 tools_4.0.3
#> [40] bitops_1.0-6 magrittr_2.0.1 RCurl_1.98-1.2
#> [43] tibble_3.0.4 crayon_1.3.4 pkgconfig_2.0.3
#> [46] ellipsis_0.3.1 rmarkdown_2.6 R6_2.5.0
#> [49] compiler_4.0.3