bluster 1.0.0
The bluster package provides a flexible and extensible framework for clustering in Bioconductor packages/workflows.
At its core is the clusterRows()
generic that controls dispatch to different clustering algorithms.
We will demonstrate on some single-cell RNA sequencing data from the scRNAseq package;
our aim is to cluster cells into cell populations based on their PC coordinates.
library(scRNAseq)
sce <- ZeiselBrainData()
# Trusting the authors' quality control, and going straight to normalization.
library(scuttle)
sce <- logNormCounts(sce)
# Feature selection based on highly variable genes.
library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)
# Dimensionality reduction for work (PCA) and pleasure (t-SNE).
set.seed(1000)
library(scater)
sce <- runPCA(sce, ncomponents=20, subset_row=hvgs)
sce <- runUMAP(sce, dimred="PCA")
mat <- reducedDim(sce, "PCA")
dim(mat)
## [1] 3005 20
Our first algorithm is good old hierarchical clustering, as implemented using hclust()
from the stats package.
This automatically sets the cut height to half the dendrogram height.
library(bluster)
hclust.out <- clusterRows(mat, HclustParam())
plotUMAP(sce, colour_by=I(hclust.out))
Advanced users can achieve greater control of the procedure by passing more parameters to the HclustParam()
constructor.
Here, we use Ward’s criterion for the agglomeration with a dynamic tree cut from the dynamicTreeCut package.
hp2 <- HclustParam(method="ward.D2", cut.dynamic=TRUE)
hp2
## class: HclustParam
## metric: euclidean
## method: ward.D2
## cut.fun: cutreeDynamic
## cut.params(0):
hclust.out <- clusterRows(mat, hp2)
plotUMAP(sce, colour_by=I(hclust.out))
Our next algorithm is \(k\)-means clustering, as implemented using the kmeans()
function.
This requires us to pass in the number of clusters, either as a number:
set.seed(100)
kmeans.out <- clusterRows(mat, KmeansParam(10))
plotUMAP(sce, colour_by=I(kmeans.out))
Or as a function of the number of observations, which is useful for vector quantization purposes:
kp <- KmeansParam(sqrt)
kp
## class: KmeansParam
## centers: variable
## extra.args(0):
set.seed(100)
kmeans.out <- clusterRows(mat, kp)
plotUMAP(sce, colour_by=I(kmeans.out))
We can build shared or direct nearest neighbor graphs and perform community detection with igraph.
Here, the number of neighbors k
controls the resolution of the clusters.
set.seed(101) # just in case there are ties.
graph.out <- clusterRows(mat, NNGraphParam(k=10))
plotUMAP(sce, colour_by=I(graph.out))
It is again straightforward to tune the procedure by passing more arguments such as the community detection algorithm to use.
set.seed(101) # just in case there are ties.
np <- NNGraphParam(k=20, cluster.fun="louvain")
np
## class: NNGraphParam
## shared: TRUE
## graph.args(1): k
## cluster.fun: louvain
## cluster.args(0):
graph.out <- clusterRows(mat, np)
plotUMAP(sce, colour_by=I(graph.out))
We also provide a wrapper for a hybrid “two-step” approach for handling larger datasets. Here, a fast agglomeration is performed with \(k\)-means to compact the data, followed by a slower graph-based clustering step to obtain interpretable meta-clusters. (This dataset is not, by and large, big enough for this approach to work particularly well.)
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))
Each step is itself parametrized by BlusterParam
objects, so it is possible to tune them individually:
twop <- TwoStepParam(second=NNGraphParam(k=5))
twop
## class: TwoStepParam
## first:
## class: KmeansParam
## centers: variable
## extra.args(0):
## second:
## class: NNGraphParam
## shared: TRUE
## graph.args(1): k
## cluster.fun: walktrap
## cluster.args(0):
set.seed(100) # for the k-means
two.out <- clusterRows(mat, TwoStepParam())
plotUMAP(sce, colour_by=I(two.out))
Sometimes the vector of cluster assignments is not enough.
We can obtain more information about the clustering procedure by setting full=TRUE
in clusterRows()
.
For example, we could obtain the actual graph generated by NNGraphParam()
:
nn.out <- clusterRows(mat, NNGraphParam(), full=TRUE)
nn.out$objects$graph
## IGRAPH d701c4f U-W- 3005 87028 --
## + attr: weight (e/n)
## + edges from d701c4f:
## [1] 1-- 2 1-- 3 1-- 4 1-- 5 1-- 6 1-- 7 1-- 8 1-- 9 1-- 10 1-- 11
## [11] 1-- 46 1-- 48 1-- 49 1-- 54 1-- 55 1-- 56 1-- 59 1-- 61 1-- 63 1-- 64
## [21] 1-- 65 1-- 67 1-- 68 1-- 69 1-- 71 1-- 72 1-- 73 1--157 1--161 1--164
## [31] 1--201 1--202 1--203 1--206 1--212 1--213 1--228 1--229 1--230 1--231
## [41] 1--233 1--265 1--269 1--273 1--276 1--702 1--859 2-- 3 2-- 4 2-- 5
## [51] 2-- 6 2-- 7 2-- 8 2-- 9 2-- 13 2-- 15 2-- 46 2-- 47 2-- 49 2-- 56
## [61] 2-- 59 2-- 69 2-- 71 2-- 73 2--161 2--201 2--202 2--206 2--207 2--208
## [71] 2--210 2--212 2--226 2--228 2--229 2--230 2--231 2--233 2--265 2--273
## + ... omitted several edges
The assignment vector is still reported in the clusters
entry:
table(nn.out$clusters)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 166 216 256 543 218 89 583 92 123 186 91 42 94 64 58 58 60 35 31
clusterRows()
enables users or developers to easily switch between clustering algorithms by changing a single argument.
Indeed, by passing the BlusterParam
object across functions, we can ensure that the same algorithm is used through a workflow.
It is also helpful for package functions as it provides diverse functionality without compromising a clean function signature.
However, the true power of this framework lies in its extensibility.
Anyone can write a clusterRows()
method for a new clustering algorithm with an associated BlusterParam
subclass,
and that procedure is immediately compatible with any workflow or function that was already using clusterRows()
.
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] bluster_1.0.0 scater_1.18.0
## [3] ggplot2_3.3.2 scran_1.18.0
## [5] scuttle_1.0.0 scRNAseq_2.3.17
## [7] SingleCellExperiment_1.12.0 SummarizedExperiment_1.20.0
## [9] Biobase_2.50.0 GenomicRanges_1.42.0
## [11] GenomeInfoDb_1.26.0 IRanges_2.24.0
## [13] S4Vectors_0.28.0 BiocGenerics_0.36.0
## [15] MatrixGenerics_1.2.0 matrixStats_0.57.0
## [17] BiocStyle_2.18.0
##
## loaded via a namespace (and not attached):
## [1] ggbeeswarm_0.6.0 colorspace_1.4-1
## [3] ellipsis_0.3.1 dynamicTreeCut_1.63-1
## [5] XVector_0.30.0 BiocNeighbors_1.8.0
## [7] farver_2.0.3 bit64_4.0.5
## [9] RSpectra_0.16-0 interactiveDisplayBase_1.28.0
## [11] AnnotationDbi_1.52.0 xml2_1.3.2
## [13] sparseMatrixStats_1.2.0 knitr_1.30
## [15] Rsamtools_2.6.0 dbplyr_1.4.4
## [17] uwot_0.1.8 shiny_1.5.0
## [19] BiocManager_1.30.10 compiler_4.0.3
## [21] httr_1.4.2 dqrng_0.2.1
## [23] assertthat_0.2.1 Matrix_1.2-18
## [25] fastmap_1.0.1 lazyeval_0.2.2
## [27] limma_3.46.0 later_1.1.0.1
## [29] BiocSingular_1.6.0 htmltools_0.5.0
## [31] prettyunits_1.1.1 tools_4.0.3
## [33] rsvd_1.0.3 igraph_1.2.6
## [35] gtable_0.3.0 glue_1.4.2
## [37] GenomeInfoDbData_1.2.4 dplyr_1.0.2
## [39] rappdirs_0.3.1 Rcpp_1.0.5
## [41] vctrs_0.3.4 Biostrings_2.58.0
## [43] ExperimentHub_1.16.0 rtracklayer_1.50.0
## [45] DelayedMatrixStats_1.12.0 xfun_0.18
## [47] stringr_1.4.0 beachmat_2.6.0
## [49] mime_0.9 lifecycle_0.2.0
## [51] irlba_2.3.3 ensembldb_2.14.0
## [53] statmod_1.4.35 XML_3.99-0.5
## [55] AnnotationHub_2.22.0 edgeR_3.32.0
## [57] zlibbioc_1.36.0 scales_1.1.1
## [59] hms_0.5.3 promises_1.1.1
## [61] ProtGenerics_1.22.0 AnnotationFilter_1.14.0
## [63] yaml_2.2.1 curl_4.3
## [65] gridExtra_2.3 memoise_1.1.0
## [67] biomaRt_2.46.0 stringi_1.5.3
## [69] RSQLite_2.2.1 BiocVersion_3.12.0
## [71] GenomicFeatures_1.42.0 BiocParallel_1.24.0
## [73] rlang_0.4.8 pkgconfig_2.0.3
## [75] bitops_1.0-6 evaluate_0.14
## [77] lattice_0.20-41 purrr_0.3.4
## [79] labeling_0.4.2 GenomicAlignments_1.26.0
## [81] cowplot_1.1.0 bit_4.0.4
## [83] tidyselect_1.1.0 magrittr_1.5
## [85] bookdown_0.21 R6_2.4.1
## [87] magick_2.5.0 generics_0.0.2
## [89] DelayedArray_0.16.0 DBI_1.1.0
## [91] pillar_1.4.6 withr_2.3.0
## [93] RCurl_1.98-1.2 tibble_3.0.4
## [95] crayon_1.3.4 BiocFileCache_1.14.0
## [97] rmarkdown_2.5 viridis_0.5.1
## [99] progress_1.2.2 locfit_1.5-9.4
## [101] grid_4.0.3 FNN_1.1.3
## [103] blob_1.2.1 digest_0.6.27
## [105] xtable_1.8-4 httpuv_1.5.4
## [107] openssl_1.4.3 munsell_0.5.0
## [109] viridisLite_0.3.0 beeswarm_0.2.3
## [111] vipor_0.4.5 askpass_1.1