SpotSweeper
is an R package for spatial transcriptomics data quality control
(QC). It provides functions for detecting and visualizing spot-level local
outliers and artifacts using spatially-aware methods. The package is designed
to work with SpatialExperiment
objects, and is compatible with data from 10X Genomics Visium and other spatial
transcriptomics platforms.
Currently, the only way to install SpotSweeper
is by downloading the
development version which can be installed from
GitHub using the following:
if (!require("devtools")) install.packages("devtools")
remotes::install_github("MicTott/SpotSweeper")
Once accepted in Bioconductor, SpotSweeper
will be
installable using:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("SpotSweeper")
Here we’ll walk you through the standard workflow for using ‘SpotSweeper’ to
detect and visualize local outliers in spatial transcriptomics data. We’ll use
the Visium_humanDLPFC
dataset from the STexampleData
package, which is a
SpatialExperiment
object.
Because local outliers will be saved in the colData
of the SpatialExperiment
object, we’ll first view the colData
and drop out-of-tissue spots before
calculating quality control (QC) metrics and running SpotSweeper
.
library(SpotSweeper)
# load Maynard et al DLPFC daatset
spe <- STexampleData::Visium_humanDLPFC()
## see ?STexampleData and browseVignettes('STexampleData') for documentation
## loading from cache
# show column data before SpotSweeper
colnames(colData(spe))
## [1] "barcode_id" "sample_id" "in_tissue" "array_row" "array_col"
## [6] "ground_truth" "reference" "cell_count"
# drop out-of-tissue spots
spe <- spe[, spe$in_tissue == 1]
scuttle
We’ll use the scuttle
package to calculate QC metrics. To do this, we’ll need
to first change the rownames
from gene id to gene names. We’ll then get the
mitochondrial transcripts and calculate QC metrics for each spot using
scuttle::addPerCellQCMetrics
.
# change from gene id to gene names
rownames(spe) <- rowData(spe)$gene_name
# identifying the mitochondrial transcripts
is.mito <- rownames(spe)[grepl("^MT-", rownames(spe))]
# calculating QC metrics for each spot using scuttle
spe <- scuttle::addPerCellQCMetrics(spe, subsets = list(Mito = is.mito))
colnames(colData(spe))
## [1] "barcode_id" "sample_id" "in_tissue"
## [4] "array_row" "array_col" "ground_truth"
## [7] "reference" "cell_count" "sum"
## [10] "detected" "subsets_Mito_sum" "subsets_Mito_detected"
## [13] "subsets_Mito_percent" "total"
SpotSweeper
We can now use SpotSweeper
to identify local outliers in the spatial
transcriptomics data. We’ll use the localOutliers
function to detect local
outliers based on the unique detected genes, total library size, and percent of
the total reads that are mitochondrial. These methods assume a normal
distribution, so we’ll use the log-transformed sum of the counts and the
log-transformed number of detected genes. For mitochondrial percent, we’ll use
the raw mitochondrial percentage.
# library size
spe <- localOutliers(spe,
metric = "sum",
direction = "lower",
log = TRUE
)
# unique genes
spe <- localOutliers(spe,
metric = "detected",
direction = "lower",
log = TRUE
)
# mitochondrial percent
spe <- localOutliers(spe,
metric = "subsets_Mito_percent",
direction = "higher",
log = FALSE
)
The localOutlier
function automatically outputs the results to the colData
with the naming convention X_outliers
, where X
is the name of the input
colData
. We can then combine all outliers into a single column called
local_outliers
in the colData
of the SpatialExperiment
object.
# combine all outliers into "local_outliers" column
spe$local_outliers <- as.logical(spe$sum_outliers) |
as.logical(spe$detected_outliers) |
as.logical(spe$subsets_Mito_percent_outliers)
We can visualize the local outliers using the plotQC
function. This
function creates a scatter plot of the specified metric and highlights the
local outliers in red using the escheR
package. Here, we’ll visualize local
outliers of library size, unique genes, mitochondrial percent, and finally, all
local outliers. We’ll then arrange these plots in a grid using
ggpubr::arrange
.
library(escheR)
## Loading required package: ggplot2
# all local outliers
plotQC(spe, metric = "sum_log", outliers = "local_outliers", point_size = 1.1,
stroke = 0.75) +
ggtitle("All Local Outliers")
SpotSweeper
# load in DLPFC sample with hangnail artifact
data(DLPFC_artifact)
spe <- DLPFC_artifact
# inspect colData before artifact detection
colnames(colData(spe))
## [1] "sample_id" "in_tissue" "array_row"
## [4] "array_col" "key" "sum_umi"
## [7] "sum_gene" "expr_chrM" "expr_chrM_ratio"
## [10] "ManualAnnotation" "subject" "region"
## [13] "sex" "age" "diagnosis"
## [16] "sample_id_complete" "count" "sizeFactor"
Technical artifacts can commonly be visualized by standard QC metrics, including
library size, unique genes, or mitochondrial percentage. We can first visualize
the technical artifacts using the plotQC
function. In this sample, we can
clearly see a hangnail artifact on the right side of the tissue section
in the mitochondrial ratio plot.
plotQC(spe,
metric = "expr_chrM_ratio",
outliers = NULL, point_size = 1.1
) +
ggtitle("Mitochondrial Percent")
SpotSweeper
We can then use the findArtifacts
function to identify artifacts in the
spatial transcriptomics (data. This function identifies technical artifacts
based on the first principle component of the local variance of the specified QC
metric (mito_percent
) at numerous neighorhood sizes (n_rings=5
). Currently,
kmeans
clustering is used to cluster the technical artifact vs high-quality
Visium spots. Similar to localOutliers
, the findArtifacts
function then
outputs the results to the colData
.
# find artifacts using SpotSweeper
spe <- findArtifacts(spe,
mito_percent = "expr_chrM_ratio",
mito_sum = "expr_chrM",
n_rings = 5,
name = "artifact"
)
# check that "artifact" is now in colData
colnames(colData(spe))
## [1] "sample_id" "in_tissue" "array_row"
## [4] "array_col" "key" "sum_umi"
## [7] "sum_gene" "expr_chrM" "expr_chrM_ratio"
## [10] "ManualAnnotation" "subject" "region"
## [13] "sex" "age" "diagnosis"
## [16] "sample_id_complete" "count" "sizeFactor"
## [19] "expr_chrM_ratio_log" "coords" "k6"
## [22] "k18" "k36" "k60"
## [25] "k90" "artifact"
We can visualize the artifacts using the escheR
package. Here, we’ll visualize
the artifacts using the plotQC
function and arrange these plots using
ggpubr::arrange
.
plotQC(spe,
metric = "expr_chrM_ratio",
outliers = "artifact", point_size = 1.1
) +
ggtitle("Hangnail artifact")
# Session information
utils::sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] escheR_1.4.0 ggplot2_3.5.1
## [3] STexampleData_1.12.3 SpatialExperiment_1.14.0
## [5] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
## [7] Biobase_2.64.0 GenomicRanges_1.56.1
## [9] GenomeInfoDb_1.40.1 IRanges_2.38.1
## [11] S4Vectors_0.42.1 MatrixGenerics_1.16.0
## [13] matrixStats_1.3.0 ExperimentHub_2.12.0
## [15] AnnotationHub_3.12.0 BiocFileCache_2.12.0
## [17] dbplyr_2.5.0 BiocGenerics_0.50.0
## [19] SpotSweeper_1.0.2 BiocStyle_2.32.1
##
## loaded via a namespace (and not attached):
## [1] DBI_1.2.3 rlang_1.1.4
## [3] magrittr_2.0.3 spatialEco_2.0-2
## [5] compiler_4.4.1 RSQLite_2.3.7
## [7] DelayedMatrixStats_1.26.0 png_0.1-8
## [9] vctrs_0.6.5 pkgconfig_2.0.3
## [11] crayon_1.5.3 fastmap_1.2.0
## [13] magick_2.8.4 XVector_0.44.0
## [15] labeling_0.4.3 scuttle_1.14.0
## [17] utf8_1.2.4 rmarkdown_2.27
## [19] UCSC.utils_1.0.0 tinytex_0.51
## [21] purrr_1.0.2 bit_4.0.5
## [23] xfun_0.45 zlibbioc_1.50.0
## [25] cachem_1.1.0 beachmat_2.20.0
## [27] jsonlite_1.8.8 blob_1.2.4
## [29] highr_0.11 DelayedArray_0.30.1
## [31] BiocParallel_1.38.0 terra_1.7-78
## [33] parallel_4.4.1 R6_2.5.1
## [35] bslib_0.7.0 jquerylib_0.1.4
## [37] Rcpp_1.0.12 bookdown_0.40
## [39] knitr_1.48 Matrix_1.7-0
## [41] tidyselect_1.2.1 abind_1.4-5
## [43] yaml_2.3.9 codetools_0.2-20
## [45] curl_5.2.1 lattice_0.22-6
## [47] tibble_3.2.1 withr_3.0.0
## [49] KEGGREST_1.44.1 evaluate_0.24.0
## [51] Biostrings_2.72.1 pillar_1.9.0
## [53] BiocManager_1.30.23 filelock_1.0.3
## [55] generics_0.1.3 BiocVersion_3.19.1
## [57] sparseMatrixStats_1.16.0 munsell_0.5.1
## [59] scales_1.3.0 glue_1.7.0
## [61] tools_4.4.1 BiocNeighbors_1.22.0
## [63] grid_4.4.1 AnnotationDbi_1.66.0
## [65] colorspace_2.1-0 GenomeInfoDbData_1.2.12
## [67] cli_3.6.3 rappdirs_0.3.3
## [69] fansi_1.0.6 S4Arrays_1.4.1
## [71] viridisLite_0.4.2 dplyr_1.1.4
## [73] gtable_0.3.5 sass_0.4.9
## [75] digest_0.6.36 SparseArray_1.4.8
## [77] farver_2.1.2 rjson_0.2.21
## [79] memoise_2.0.1 htmltools_0.5.8.1
## [81] lifecycle_1.0.4 httr_1.4.7
## [83] mime_0.12 bit64_4.0.5
## [85] MASS_7.3-61