1 Introduction

1.1 Motivation

The primary functionality of spatialHeatmap package is to visualize cell-, tissue- and organ-specific data of biological assays by coloring the corresponding spatial features defined in anatomical images according to a numeric color key. The color scheme used to represent the assay values can be customized by the user. This core functionality of the package is called a spatial heatmap (SHM) plot. It also has extended functionalities of spatial enrichment (SE) and clustering. SE is specialized in detecting genes that are specifically expressed in a particular spatial feature, while clustering is designed to detect biological molecule groups sharing related abundance profiles (e.g. gene modules) and visualize them in matrix heatmaps combined with hierarchical clustering dendrograms and network representations. Moreover, an advanced functionality of integrated co-visualization of bulk and single-cell data (co-visualization) is also developed. Sinlge cells in embedding plots (PCA, UMAP, TSNE) are matched with corresponding bulk tissues in SHMs manually or automatically. These functionalities form an integrated methodology for spatial biological assay data visualization and analysis.

The functionalities of spatialHeatmap can be used either in a command-driven mode from within R or a graphical user interface (GUI) provided by a Shiny App that is also part of this package. While the R-based mode provides flexibility to customize and automate analysis routines, the Shiny App includes a variety of convenience features that will appeal to experimentalists and other users less familiar with R. Moreover, the Shiny App can be used on both local computers as well as centralized server-based deployments (e.g. cloud-based or custom servers) that can be accessed remotely as a public web service for using spatialHeatmap’s functionalities with community and/or private data. The functionalities of the spatialHeatmap package are illustrated in Figure 1.

Overview of spatialHeatmap. (A) The _saptialHeatmap_ package plots numeric assay data onto spatially annotated images. A wide range of omics technologies is supported including genomic, transcriptomic, proteomic and metabolomic profiling data. The assay data can be provided as numeric vectors, tabular data, or _SummarizedExperiment_ objects. The latter is a widely used data container for organizing both assay data as well as associated annotation and experimental design data. (B) Anatomical and other spatial images need to be provided as annotated SVG (aSVG) files where the spatial features and the corresponding data components of the assay data have matching labels (_e.g._ tissue labels). The assay data are used to color the matching spatial features in aSVG images according to a color key. The result is called a spatial heatmap (SHM). In the regular SHM (C), the feature profiles may or may not be contrasting, while in the enriched SHM (D) there are clear contrasting profiles across features. (E) Data mining graphics, such as matrix heatmaps and network graphs, are integrated to facilitate the identification of factors with similar assay profiles. The functionalities of _spatialHeatmap_ can be accessed from local computers via the R console or a graphical user interface based on Shiny. In addition, the latter can be deployed as a web service on custom servers or cloud-based systems.

Figure 1: Overview of spatialHeatmap
(A) The saptialHeatmap package plots numeric assay data onto spatially annotated images. A wide range of omics technologies is supported including genomic, transcriptomic, proteomic and metabolomic profiling data. The assay data can be provided as numeric vectors, tabular data, or SummarizedExperiment objects. The latter is a widely used data container for organizing both assay data as well as associated annotation and experimental design data. (B) Anatomical and other spatial images need to be provided as annotated SVG (aSVG) files where the spatial features and the corresponding data components of the assay data have matching labels (e.g. tissue labels). The assay data are used to color the matching spatial features in aSVG images according to a color key. The result is called a spatial heatmap (SHM). In the regular SHM (C), the feature profiles may or may not be contrasting, while in the enriched SHM (D) there are clear contrasting profiles across features. (E) Data mining graphics, such as matrix heatmaps and network graphs, are integrated to facilitate the identification of factors with similar assay profiles. The functionalities of spatialHeatmap can be accessed from local computers via the R console or a graphical user interface based on Shiny. In addition, the latter can be deployed as a web service on custom servers or cloud-based systems.

As anatomical images the package supports both tissue maps from public repositories and custom images provided by the user. In general any type of image can be used as long as it can be provided in SVG (Scalable Vector Graphics) format, where the corresponding spatial features have been defined (see aSVG below). The numeric values plotted onto an SHM are usually quantitative measurements from a wide range of profiling technologies, such as microarrays, next generation sequencing (e.g. RNA-Seq and scRNA-Seq), proteomics, metabolomics, or many other small- or large-scale experiments. For convenience, several preprocessing and normalization methods for the most common use cases are included that support raw and/or preprocessed data. Currently, the main application domains of the spatialHeatmap package are numeric data sets and spatially mapped images from biological, agricultural and biomedical areas. Moreover, the package has been designed to also work with many other spatial data types, such a population data plotted onto geographic maps. This high level of flexibility is one of the unique features of spatialHeatmap. Related software tools for biological applications in this field are largely based on pure web applications (Maag 2018; Lekschas et al. 2015; Papatheodorou et al. 2018; Winter et al. 2007; Waese et al. 2017) or local tools (Muschelli, Sweeney, and Crainiceanu 2014) that typically lack customization functionalities. These restrictions limit users to utilizing pre-existing expression data and/or fixed sets of anatomical image collections. Additionally, these existing tools are only able to visualize data, but not analyze data to identify feature-specific information. To close this gap for biological use cases, we have developed spatialHeatmap as a generic R/Bioconductor package for plotting quantitative values onto any type of spatially mapped images in a programmable environment and/or in an intuitive to use GUI application.

1.2 Design

The core feature of spatialHeatmap is to map assay values (e.g. gene expression data) of one or many items (e.g. genes) measured under different conditions in form of numerically graded colors onto the corresponding cell types or tissues represented in a chosen SVG image. In the gene profiling field, this feature supports comparisons of the expression values among multiple genes by plotting their SHMs next to each other. Similarly, one can display the expression values of a single or multiple genes across multiple conditions in the same plot (Figure 4). This level of flexibility is very efficient for visualizing complicated expression patterns across genes, cell types and conditions. In case of more complex anatomical images with overlapping multiple layer tissues, it is important to visually expose the tissue layer of interest in the plots. To address this, several default and customizable layer viewing options are provided. They allow to hide features in the top layers by making them transparent in order to expose features below them. This transparency viewing feature is highlighted below in the mouse example (Figure 5). Except for spatial data, this package also works on spatiotemporal data and generates spatiotemporal heatmaps (STHMs, Figure 9). Moreover, one can plot multiple distinct aSVGs in a single SHM plot as shown in Figure 11. This is particularly useful for displaying abundance trends across multiple development stages, where each is represented by its own aSVG image. In addition to static SHM representations, one can visualize them in form of interactive HTML files or generate videos for them. In spatial enrichment, the target feature is compared with reference features in a pairwise manner. Genes are specifically-expressed in the target feature across all pairwise comparisons are deemed target-specific.
To maximize reusability and extensibility, the package organizes large-scale omics assay data along with the associated experimental design information in a SummarizedExperiment object (Figure 1A). The latter is one of the core S4 classes within the Bioconductor ecosystem that has been widely adapted by many other software packages dealing with gene-, protein- and metabolite-level profiling data (Morgan et al. 2018). In case of gene expression data, the assays slot of the SummarizedExperiment container is populated with a gene expression matrix, where the rows and columns represent the genes and tissue/conditions, respectively, while the colData slot contains sample data including replicate information. The tissues and/or cell type information in the object maps via colData to the corresponding features in the SVG images using unique identifiers for the spatial features (e.g. tissues or cell types). This allows to color the features of interest in an SVG image according to the numeric data stored in a SummarizedExperiment object. For simplicity the numeric data can also be provided as numeric vectors or data.frames. This can be useful for testing purposes and/or the usage of simple data sets that may not require the more advanced features of the SummarizedExperiment class, such as measurements with only one or a few data points. The details about how to access the SVG images and properly format the associated expression data are provided in the Supplementary Section of this vignette.

1.3 Image Format: SVG

SHMs are images where colors encode numeric values in features of any shape. For plotting SHMs, Scalable Vector Graphics (SVG) has been chosen as image format since it is a flexible and widely adapted vector graphics format that provides many advantages for computationally embedding numerical and other information in images. SVG is based on XML formatted text describing all components present in images, including lines, shapes and colors. In case of biological images suitable for SHMs, the shapes often represent anatomical or cell structures. To assign colors to specific features in SHMs, annotated SVG (aSVG) files are used where the shapes of interest are labeled according to certain conventions so that they can be addressed and colored programmatically. SVGs and aSVGs of anatomical structures can be downloaded from many sources including the repositories described below. Alternatively, users can generate them themselves with vector graphics software such as Inkscape. Typically, in aSVGs one or more shapes of a feature of interest, such as the cell shapes of an organ, are grouped together by a common feature identifier. Via these group identifiers one or many feature types can be colored simultaneously in an aSVG according to biological experiments assaying the corresponding feature types with the required spatial resolution. Correct assignment of image features and assay results is assured by using for both the same feature identifiers. The color gradient used to visually represent the numeric assay values is controlled by a color gradient parameter. To visually interpret the meaning of the colors, the corresponding color key is included in the SHM plots. Additional details for properly formatting and annotating both aSVG images and assay data are provided in the Supplementary Section section of this vignette.

1.4 Data Repositories

If not generated by the user, SHMs can be generated with data downloaded from various public repositories. This includes gene, protein and metabolic profiling data from databases, such as GEO, BAR and Expression Atlas from EMBL-EBI (Papatheodorou et al. 2018). A particularly useful resource, when working with spatialHeatmap, is the EBI Expression Atlas. This online service contains both assay and anatomical images. Its assay data include mRNA and protein profiling experiments for different species, tissues and conditions. The corresponding anatomical image collections are also provided for a wide range of species including animals and plants. In spatialHeatmap several import functions are provided to work with the expression and aSVG repository from the Expression Atlas directly. The aSVG images developed by the spatialHeatmap project are available in its own repository called spatialHeatmap aSVG Repository, where users can contribute their aSVG images that are formatted according to our guidlines.

1.5 Tutorial Overview

The following sections of this vignette showcase the most important functionalities of the spatialHeatmap package using as initial example a simple to understand toy data set, and then more complex mRNA profiling data from the Expression Atlas and GEO databases. First, SHM plots are generated for both the toy and mRNA expression data. The latter include gene expression data sets from RNA-Seq and microarray experiments of Human Brain, Mouse Organs, Chicken Organs, and Arabidopsis Shoots. The first three are RNA-Seq data from the Expression Atlas, while the last one is a microarray data set from GEO. Second, gene context analysis tools are introduced, which facilitate the visualization of gene modules sharing similar expression patterns. This includes the visualization of hierarchical clustering results with traditional matrix heatmaps (Matrix Heatmap) as well co-expression network plots (Network). Third, the spatial enrichemnt functionality is illustrated on the mouse RNA-seq data. Lastly, an overview of the corresponding Shiny App is presented that provides access to the same functionalities as the R functions, but executes them in an interactive GUI environment (Chang et al. 2021; Chang and Borges Ribeiro 2018). Fourth, more advanced features for plotting customized SHMs are covered using the Human Brain data set as an example.

2 Getting Started

2.1 Installation

The spatialHeatmap package should be installed from an R (version \(\ge\) 3.6) session with the BiocManager::install command.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("spatialHeatmap")

2.2 Packages and Documentation

Next, the packages required for running the sample code in this vignette need to be loaded.

library(spatialHeatmap); library(SummarizedExperiment); library(ExpressionAtlas); library(GEOquery); library(scran)
library(scater); library(igraph); library(SingleCellExperiment)
library(BiocParallel)

The following lists the vignette(s) of this package in an HTML browser. Clicking the corresponding name will open this vignette.

browseVignettes('spatialHeatmap')

3 Spatial Heatmaps (SHMs)

3.1 Toy Example

SHMs are plotted with the spatial_hm function. To provide a quick and intuitive overview how these plots are generated, the following uses a generalized toy example where a small vector of random numeric values is generated that are used to color features in an aSVG image. The image chosen for this example is an aSVG depicting the human brain. The corresponding image file ‘homo_sapiens.brain.svg’ is included in this package for testing purposes. The path to this image on a user's system, where spatialHeatmap is installed, can be obtained with the system.file function.

3.1.1 aSVG Image

The following commands obtain the directory of the aSVG collection and the full path to the chosen target aSVG image on a user’s system, respectively.

svg.dir <- system.file("extdata/shinyApp/example", package="spatialHeatmap")
svg.hum <- system.file("extdata/shinyApp/example", 'homo_sapiens.brain.svg', package="spatialHeatmap")

To identify feature labels of interest in annotated aSVG images, the return_feature function can be used. The following searches the aSVG images stored in dir for the query terms ‘lobe’ and ‘homo sapiens’ under the feature and species fields, respectively. The identified matches are returned as a data.frame.

feature.df <- return_feature(feature=c('lobe'), species=c('homo sapiens'), remote=NULL, dir=svg.dir)
## Accessing features... 
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0000451 LAYER_EFO 
## 
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## homo_sapiens.brain.svg,
feature.df
##          feature     stroke color             id element    parent order
## 1 occipital.lobe 0.08000000  none UBERON_0002021    path LAYER_EFO     3
## 2  parietal.lobe 0.08000000  none UBERON_0001872       g LAYER_EFO     4
## 3  temporal.lobe 0.08000000  none UBERON_0001871    path LAYER_EFO     8
## 4 occipital.lobe 0.01600000  none UBERON_0002021    path LAYER_EFO     7
## 5  parietal.lobe 0.07060588  none UBERON_0001872       g LAYER_EFO     8
## 6  temporal.lobe 0.01600000  none UBERON_0001871    path LAYER_EFO    24
##                                SVG
## 1 homo.sapiens_brain.shiny_shm.svg
## 2 homo.sapiens_brain.shiny_shm.svg
## 3 homo.sapiens_brain.shiny_shm.svg
## 4           homo_sapiens.brain.svg
## 5           homo_sapiens.brain.svg
## 6           homo_sapiens.brain.svg
fnames <- feature.df[, 1]

3.1.2 Numeric Data

The following example generates a small numeric toy vector, where the data slot contains four numbers and its name slot is populated with the three feature names obtained from the above aSVG image. In addition, a non-matching entry (here ‘notMapped’) is included for demonstration purposes. Note, the numbers are mapped to features via matching names among the numeric vector and the aSVG, respectively. Accordingly, only numbers and features with matching name counterparts can be colored in the aSVG image. Entries without name matches are indicated by a message printed to the R console, here “notMapped”. This behavior can be turned off with verbose=FALSE in the corresponding function call. In addition, a summary of the numeric assay to feature mappings is stored in the result data.frame returned by the spatial_hm function (see below).

my_vec <- sample(1:100, length(unique(fnames))+1)
names(my_vec) <- c(unique(fnames), 'notMapped')
my_vec
## occipital.lobe  parietal.lobe  temporal.lobe      notMapped 
##             77              3             58             45

3.1.3 Plot SHM

Next, the SHM is plotted with the spatial_hm function (Figure 2). Internally, the numbers in my_vec are translated into colors based on the color key assigned to the col.com argument, and then painted onto the corresponding features in the aSVG, where the path to the image file is defined by svg.path=svg.hum. The remaining arguments used here include: ID for defining the title of the plot; ncol for setting the column-wise layout of the plot excluding the feature legend plot on the right; and height for defining the height of the SHM relative to its width. In addition, the outline feature g4320 covers all tissue features due to its default color, so it is set transparent through ft.trans. More details of the transparency function is explained in the mouse example (Figure 5). In the given example (Figure 2) only three features in my_vec (‘occipital lobe’, ‘parietal lobe’, and ‘temporal lobe’) have matching entries in the corresponding aSVG.

shm.lis <- spatial_hm(svg.path=svg.hum, data=my_vec, ID='toy', ncol=1, height=0.9, width=0.8, sub.title.size=20, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## Features in data not mapped: notMapped 
## ggplots/grobs: homo_sapiens.brain.svg ... 
## ggplot: toy, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## toy_con_1  
## Converting "ggplot" to "grob" ... 
## 
SHM of human brain with toy data. The plots from left to right represent: color key, SHM and legend. The colors in the first two plots depict the user provided numeric values, whereas in the legend plot they are used to map the feature labels to the corresponding spatial regions in the image.

Figure 2: SHM of human brain with toy data
The plots from left to right represent: color key, SHM and legend. The colors in the first two plots depict the user provided numeric values, whereas in the legend plot they are used to map the feature labels to the corresponding spatial regions in the image.

The named numeric values in my_vec, that have name matches with the features in the chosen aSVG, are stored in the mapped_feature slot. The attributes of features are stored in feature_attribute slot.

# The SHM, mapped features, and feature attributes are stored in a list
names(shm.lis)
## [1] "spatial_heatmap"   "mapped_feature"    "feature_attribute"
# Mapped features
shm.lis[['mapped_feature']]
##   rowID     featureSVG value                    SVG
## 1   toy occipital.lobe    77 homo_sapiens.brain.svg
## 2   toy  parietal.lobe     3 homo_sapiens.brain.svg
## 3   toy  temporal.lobe    58 homo_sapiens.brain.svg
# Feature attributes
shm.lis[['feature_attribute']][1:3, ]
##          feature stroke color             id element        parent order
## 1          g4320  0.080  none          g4320       g LAYER_OUTLINE     1
## 2 locus.ceruleus  0.016  none UBERON_0002148    path     LAYER_EFO     1
## 3   diencephalon  0.016  none UBERON_0001894    path     LAYER_EFO     2
##                      SVG
## 1 homo_sapiens.brain.svg
## 2 homo_sapiens.brain.svg
## 3 homo_sapiens.brain.svg

3.2 Human Brain

This subsection introduces how to find cell- and tissue-specific assay data in the Expression Atlas database. After choosing a gene expression experiment, the data is downloaded directly into a user's R session. Subsequently, the expression values for selected genes can be plotted onto a chosen aSVG image with or without prior preprocessing steps (e.g. normalization). For querying and downloading expression data from the Expression Atlas database, functions from the ExpressionAtlas package are used (Keays 2019).

3.2.1 Gene Expression Data

The following example searches the Expression Atlas for expression data derived from specific tissues and species of interest, here ‘cerebellum’ and ‘Homo sapiens’, respectively.

To avoid repetitive downloading, the downloaded data sets are cached in ~/.cache/shm in all the following examples.

cache.pa <- '~/.cache/shm' # The path of cache.
all.hum <- read_cache(cache.pa, 'all.hum') # Retrieve data from cache.
if (is.null(all.hum)) { # Save downloaded data to cache if it is not cached.
  all.hum <- searchAtlasExperiments(properties="cerebellum", species="Homo sapiens")
  save_cache(dir=cache.pa, overwrite=TRUE, all.hum)
}

The search result is stored in a DFrame containing 13 accessions matching the above query. For the following sample code, the accession ‘E-GEOD-67196’ from Prudencio et al. (2015) has been chosen, which corresponds to an RNA-Seq profiling experiment of ‘cerebellum’ and ‘frontal cortex’ brain tissue from patients with amyotrophic lateral sclerosis (ALS). Details about the corresponding record can be returned as follows.

all.hum[2, ]
## DataFrame with 1 row and 4 columns
##     Accession      Species                  Type                  Title
##   <character>  <character>           <character>            <character>
## 1 E-MTAB-3358 Homo sapiens RNA-seq of coding RNA RNA-Seq CAGE (Cap An..

The getAtlasData function allows to download the chosen RNA-Seq experiment from the Expression Atlas and import it into a RangedSummarizedExperiment object of a user's R session.

rse.hum <- read_cache(cache.pa, 'rse.hum') # Read data from cache.
if (is.null(rse.hum)) { # Save downloaded data to cache if it is not cached.
  rse.hum <- getAtlasData('E-GEOD-67196')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.hum)
}

The design of the downloaded RNA-Seq experiment is described in the colData slot of rse.hum. The following returns only its first five rows and columns.

colData(rse.hum)[1:5, 1:5]
## DataFrame with 5 rows and 5 columns
##            AtlasAssayGroup     organism   individual  organism_part
##                <character>  <character>  <character>    <character>
## SRR1927019              g1 Homo sapiens  individual1     cerebellum
## SRR1927020              g2 Homo sapiens  individual1 frontal cortex
## SRR1927021              g1 Homo sapiens  individual2     cerebellum
## SRR1927022              g2 Homo sapiens  individual2 frontal cortex
## SRR1927023              g1 Homo sapiens individual34     cerebellum
##                           disease
##                       <character>
## SRR1927019 amyotrophic lateral ..
## SRR1927020 amyotrophic lateral ..
## SRR1927021 amyotrophic lateral ..
## SRR1927022 amyotrophic lateral ..
## SRR1927023 amyotrophic lateral ..

3.2.2 aSVG Image

The following example shows how to download from the above described SVG repositories an aSVG image that matches the tissues and species assayed in the gene expression data set downloaded in the previous subsection. The return_feature function queries the repository for feature- and species-related keywords, here c('frontal cortex', 'cerebellum') and c('homo sapiens', 'brain'), respectively. To return matching aSVGs, the argument keywords.any is set to TRUE by default. When return.all=FALSE, only aSVGs matching the query keywords are returned and saved under dir. Otherwise, all aSVGs are returned regardless of the keywords. To avoid overwriting of existing SVG files, it is recommended to start with an empty target directory, here tmp.dir.shm. To search a local directory for matching aSVG images, the argument setting remote=NULL needs to be used, while specifying the path of the corresponding directory under dir. All or only matching features are returned if match.only is set to FALSE or TRUE, respectively.

According to Bioconductor’s requirements, downloadings are not allowed inside functions, so the remote repos are downloaded before calling return_feature.

# Remote aSVG repos.
data(aSVG.remote.repo)
tmp.dir <- normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE)
tmp.dir.ebi <- paste0(tmp.dir, '/ebi.zip')
tmp.dir.shm <- paste0(tmp.dir, '/shm.zip')
# Download the remote aSVG repos as zip files.
download.file(aSVG.remote.repo$ebi, tmp.dir.ebi)
download.file(aSVG.remote.repo$shm, tmp.dir.shm)
remote <- list(tmp.dir.ebi, tmp.dir.shm)

Query the downloaded remote aSVG repos.

tmp.dir.shm <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm')  # Create empty directory
feature.df <- return_feature(feature=c('frontal cortex', 'cerebellum'), species=c('homo sapiens', 'brain'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir.shm, remote=remote, match.only=TRUE, desc=FALSE) # Query aSVGs
feature.df[1:8, ] # Return first 8 rows for checking
unique(feature.df$SVG) # Return all matching aSVGs

To build this vignettes according to the R/Bioconductor package requirements, the following code section uses the aSVG file instance included in the spatialHeatmap package rather than the downloaded instance from the previous example.

feature.df <- return_feature(feature=c('frontal cortex', 'cerebellum'), species=c('homo sapiens', 'brain'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=NULL)
## Accessing features... 
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0000451 LAYER_EFO 
## 
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## homo_sapiens.brain.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## mus_musculus.brain.svg,

Note, the target tissues frontal cortex and cerebellum are included in both the experimental design slot of the downloaded expression data as well as the annotations of the aSVG. This way these features can be colored in the downstream SHM plots. If necessary users can also change from within R the feature identifiers and names in an aSVG. Details on this utility are provided in the Supplementary Section.

feature.df
##                 feature     stroke color             id element    parent order
## 1  middle.frontal.gyrus 0.08000000  none UBERON_0002702    path LAYER_EFO     2
## 2     prefrontal.cortex 0.08230311  none UBERON_0000451       g LAYER_EFO     5
## 3        frontal.cortex 0.08000000  none UBERON_0001870    path LAYER_EFO     6
## 4       cerebral.cortex 0.08000000  none UBERON_0000956       g LAYER_EFO     7
## 5            cerebellum 0.08000000  none UBERON_0002037       g LAYER_EFO     9
## 6  middle.frontal.gyrus 0.01600000  none UBERON_0002702    path LAYER_EFO     6
## 7      cingulate.cortex 0.01600000  none UBERON_0003027    path LAYER_EFO    19
## 8     prefrontal.cortex 0.01600000  none UBERON_0000451       g LAYER_EFO    21
## 9        frontal.cortex 0.01600000  none UBERON_0001870    path LAYER_EFO    22
## 10      cerebral.cortex 0.01600000  none UBERON_0000956       g LAYER_EFO    23
## 11           cerebellum 0.01600000  none UBERON_0002037       g LAYER_EFO    25
## 12      cerebral.cortex 0.05000000  none UBERON_0000956    path LAYER_EFO     2
## 13           cerebellum 0.05000000  none UBERON_0002037    path LAYER_EFO     7
##                                 SVG
## 1  homo.sapiens_brain.shiny_shm.svg
## 2  homo.sapiens_brain.shiny_shm.svg
## 3  homo.sapiens_brain.shiny_shm.svg
## 4  homo.sapiens_brain.shiny_shm.svg
## 5  homo.sapiens_brain.shiny_shm.svg
## 6            homo_sapiens.brain.svg
## 7            homo_sapiens.brain.svg
## 8            homo_sapiens.brain.svg
## 9            homo_sapiens.brain.svg
## 10           homo_sapiens.brain.svg
## 11           homo_sapiens.brain.svg
## 12           mus_musculus.brain.svg
## 13           mus_musculus.brain.svg

Since the Expression Atlas supports the cross-species anatomy ontology, the corresponding UBERON identifiers are included in the id column of the data.frame returned by the above function call of return_feature (Mungall et al. 2012). This ontology is also supported by the rols Bioconductor package (Gatto 2019).

3.2.3 Experimental Design

For organizing experimental designs and downstream plotting purposes, it can be desirable to customize the text in certain columns of colData. This way one can use the source data for displaying ‘pretty’ sample names in columns and legends of all downstream tables and plots, respectively, in a consistent and automated manner. To achieve this, the following example imports a ‘targets’ file that can be generated and edited by the user in a text or spreadsheet program. In the following example the target file content is used to replace the text in the colData slot of the RangedSummarizedExperiment object with a version containing shorter sample names that are more suitable for plotting purposes.

The following imports a custom target file containing simplified sample labels and experimental design information.

hum.tar <- system.file('extdata/shinyApp/example/target_human.txt', package='spatialHeatmap')
target.hum <- read.table(hum.tar, header=TRUE, row.names=1, sep='\t')

Load custom target data into colData slot.

colData(rse.hum) <- DataFrame(target.hum)

A slice of the simplified colData object is shown below, where the disease column contains now shorter labels than in the original data set. Additional details for generating and using target files in spatialHeatmap are provided in the Supplementary Section of this vignette.

colData(rse.hum)[c(1:3, 41:42), 4:5]
## DataFrame with 5 rows and 2 columns
##             organism_part     disease
##               <character> <character>
## SRR1927019     cerebellum         ALS
## SRR1927020 frontal cortex         ALS
## SRR1927021     cerebellum         ALS
## SRR1927059     cerebellum      normal
## SRR1927060 frontal cortex      normal

3.2.4 Preprocess Assay Data

The actual gene expression data of the downloaded RNA-Seq experiment is stored in the assay slot of rse.hum. Since it contains raw count data, it can be desirable to apply basic preprocessing routines prior to plotting spatial heatmaps. The following shows how to normalize the count data, aggregate replicates and then remove genes with unreliable expression responses. These preprocessing steps are optional and can be skipped if needed. For this, the expression data can be provided to the spatial_hm function directly, where it is important to assign to the sam.factor and con.factor arguments the corresponding sample and condition column names (Table 2).

For normalizing raw count data from RNA-Seq experiments, the norm_data function can be used. It supports the following pre-existing functions from widely used packages for analyzing count data in the next generation sequencing (NGS) field: calcNormFactors (CNF) from edgeR (Robinson, McCarthy, and Smyth 2010); as well as estimateSizeFactors (ESF), varianceStabilizingTransformation (VST), and rlog from DESeq2 (Love, Huber, and Anders 2014). The argument norm.fun specifies one of the four internal normalizing methods: CNF, ESF, VST, and rlog. If norm.fun='none', no normalization is applied. The arguments for each normalizing function are provided via a parameter.list, which is a list with named slots. For example, norm.fun='ESF' and parameter.list=list(type='ratio') is equivalent to estimateSizeFactors(object, type='ratio'). If paramter.list=NULL, the default arguments are used by the normalizing function assigned to norm.fun. For additional details, users want to consult the help file of the norm_data function by typing ?norm_data in the R console.

The following example uses the ESF normalization option. This method has been chosen mainly due to its good time performance.

se.nor.hum <- norm_data(data=rse.hum, norm.fun='ESF', log2.trans=TRUE)
## Normalising: ESF 
##    type 
## "ratio"

Replicates are aggregated with the aggr_rep function, where the summary statistics can be chosen under the aggr argument (e.g. aggr='mean'). The columns specifying replicates can be assigned to the sam.factor and con.factor arguments corresponding to samples and conditions, respectively. For tracking, the corresponding sample/condition labels are used as column titles in the aggregated assay instance, where they are concatenated with a double underscore as separator. In addition, the corresponding rows in the colData slot are collapsed accordingly.

se.aggr.hum <- aggr_rep(data=se.nor.hum, sam.factor='organism_part', con.factor='disease', aggr='mean')
## Syntactically valid column names are made!
assay(se.aggr.hum)[1:3, ]
##                 cerebellum__ALS frontal.cortex__ALS cerebellum__normal
## ENSG00000000003        7.024054            7.091484           6.406157
## ENSG00000000005        0.000000            1.540214           0.000000
## ENSG00000000419        7.866582            8.002549           8.073264
##                 frontal.cortex__normal
## ENSG00000000003               7.004446
## ENSG00000000005               1.403110
## ENSG00000000419               7.955709

To remove unreliable expression measures, filtering can be applied. The following example retains genes with expression values larger than 5 (log2 space) in at least 1% of all samples (pOA=c(0.01, 5)), and a coefficient of variance (CV) between 0.30 and 100 (CV=c(0.30, 100)).

se.fil.hum <- filter_data(data=se.aggr.hum, sam.factor='organism_part', con.factor='disease', pOA=c(0.01, 5), CV=c(0.3, 100), dir=NULL)
## Syntactically valid column names are made! 
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.287   2.442   4.268  19.991 
## All coefficient of variances (CVs) before filtering:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0007742 0.0767696 0.4019655 0.6217813 0.9956157 2.0000000 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.654   4.976   4.779   6.451  14.695 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3001  0.3648  0.4637  0.5651  0.7392  1.1548

To inspect the results, the following returns three selected rows of the fully preprocessed data matrix (Table 1).

assay(se.fil.hum)[c(5, 733:734), ]

Table 1: Slice of fully preprocessed expression matrix.
cerebellum__ALS frontal.cortex__ALS cerebellum__normal frontal.cortex__normal
ENSG00000006047 1.134172 5.2629629 0.5377534 5.3588310
ENSG00000268433 5.324064 0.3419665 3.4780744 0.1340332
ENSG00000268555 5.954572 2.6148548 4.9349736 2.0351776

3.2.5 SHM: Single Gene

The preprocessed expression values for any gene in the assay slot of se.fil.hum can be plotted as an SHM. The following uses gene ENSG00000268433 as an example. The chosen aSVG is a depiction of the human brain where the assayed featured are colored by the corresponding expression values in se.fil.hum.

shm.lis <- spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## ggplots/grobs: homo_sapiens.brain.svg ... 
## ggplot: ENSG00000268433, ALS  normal  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSG00000268433_ALS_1  ENSG00000268433_normal_1  
## Converting "ggplot" to "grob" ... 
## 
SHM of human brain. Only cerebellum and frontal cortex are colored, because they are present in both the aSVG and the expression data. The legend plot on the right maps the feature labels to the corresponding spatial regions in the image.

Figure 3: SHM of human brain
Only cerebellum and frontal cortex are colored, because they are present in both the aSVG and the expression data. The legend plot on the right maps the feature labels to the corresponding spatial regions in the image.

The plotting instructions of the SHM along with the corresponding mapped features and feature attributes are stored as a list, here named shm.lis. Its components can be accessed as follows.

names(shm.lis) # All slots.
## [1] "spatial_heatmap"   "mapped_feature"    "feature_attribute"
shm.lis[['mapped_feature']] # Mapped features.
##             rowID     featureSVG condition     value                    SVG
## 1 ENSG00000268433     cerebellum       ALS 5.3240638 homo_sapiens.brain.svg
## 2 ENSG00000268433 frontal.cortex       ALS 0.3419665 homo_sapiens.brain.svg
## 3 ENSG00000268433     cerebellum    normal 3.4780744 homo_sapiens.brain.svg
## 4 ENSG00000268433 frontal.cortex    normal 0.1340332 homo_sapiens.brain.svg
shm.lis[['feature_attribute']][1:3, ] # Feature attributes.
##          feature stroke color             id element        parent order
## 1          g4320  0.080  none          g4320       g LAYER_OUTLINE     1
## 2 locus.ceruleus  0.016  none UBERON_0002148    path     LAYER_EFO     1
## 3   diencephalon  0.016  none UBERON_0001894    path     LAYER_EFO     2
##                      SVG
## 1 homo_sapiens.brain.svg
## 2 homo_sapiens.brain.svg
## 3 homo_sapiens.brain.svg

In the above example, the normalized expression values of gene ENSG00000268433 are colored in the frontal cortex and cerebellum, where the different conditions, here normal and ALS, are given in separate SHMs plotted next to each other. The color and feature mappings are defined by the corresponding color key and legend plot on the left and right, respectively.

3.2.6 SHM: Multiple Genes

SHMs for multiple genes can be plotted by providing the corresponding gene IDs under the ID argument as a character vector. The spatial_hm function will then sequentially arrange the SHMs for each gene in a single composite plot. To facilitate comparisons among expression values across genes and/or conditions, the lay.shm parameter can be assigned 'gene' or 'con', respectively. For instance, in Figure 4 the SHMs of the genes ENSG00000268433 and ENSG00000006047 are organized by condition in a horizontal view. This functionality is particularly useful when comparing gene families. Users can also customize the order of the SHM subplots, by assigning lay.shm='none'. With this setting the SHM subplots are organized according to the gene and condition ordering under ID and data, respectively.

spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433', 'ENSG00000006047'), lay.shm='con', width=0.8, height=1, legend.r=1.5, legend.nrow=2, ft.trans=c('g4320'))
## Coordinates: homo_sapiens.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## ggplots/grobs: homo_sapiens.brain.svg ... 
## ggplot: ENSG00000268433, ALS  normal  
## ggplot: ENSG00000006047, ALS  normal  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSG00000268433_ALS_1  ENSG00000268433_normal_1  ENSG00000006047_ALS_1  ENSG00000006047_normal_1  
## Converting "ggplot" to "grob" ... 
## 
SHMs of two genes. The subplots are organized by "condition" with the `lay.shm='con'` setting.

Figure 4: SHMs of two genes
The subplots are organized by “condition” with the lay.shm='con' setting.

3.2.7 SHM: HTML and Video

SHMs can be saved to interactive HTML files as well as video files. To trigger this export behavior, the argument out.dir needs to be assinged a directory path where the HTML and video files will be stored. Each HTML file contains an interactive SHM with zoom in and out functionality. Hovering over graphics features will display data, gene, condition and other information. The video will play the SHM subplots in the order specified under the lay.shm argument.

The following example saves the interactive HTML and video files under
a directory named tmp.dir.shm.

tmp.dir.shm <- paste0(normalizePath(tempdir(check=TRUE), winslash="/"), '/shm')
spatial_hm(svg.path=svg.hum, data=se.fil.hum, ID=c('ENSG00000268433', 'ENSG00000006047'), lay.shm='con', width=0.8, height=1, legend.r=1.5, legend.nrow=2, out.dir=tmp.dir.shm, ft.trans=c('g4320'))

3.2.8 SHM: Customization

To provide a high level of flexibility, the spatial_hm contains many arguments. An overview of important arguments and their utility is provided in Table 2.


Table 2: List of important argumnets of ‘spatial_hm’.
argument description
svg.path Path of aSVG
data Input data of SummarizedExperiment (SE), data frame, or vector
sam.factor Applies to SE. Column name of sample replicates in colData slot. Default is NULL
con.factor Applies to SE. Column name of condition replicates in colData slot. Default is NULL
ID A character vector of row items for plotting spatial heatmaps
col.com A character vector of color components for building colour scale. Default is c(‘yellow’, ‘orange’,‘red’)
col.bar ‘selected’ or ‘all’, the former means use values of ID to build the colour scale while the latter use all values in data. Default is ‘selected’.
bar.width A numeric of colour bar width. Default is 0.7
trans.scale One of ‘log2’, ‘exp2’, ‘row’, ‘column’, or NULL, which means transform the data by ‘log2’ or ‘2-base expoent’, scale by ‘row’ or ‘column’, or no manipuation respectively.
ft.trans A vector of aSVG features to be transparent. Default is NULL.
legend.r A numeric to adjust the dimension of the legend plot. Default is 1. The larger, the higher ratio of width to height.
sub.title.size The title size of each spatial heatmap subplot. Default is 11.
lay.shm ‘gen’ or ‘con’, applies to multiple genes or conditions respectively. ‘gen’ means spatial heatmaps are organised by genes while ‘con’ organised by conditions. Default is ‘gen’
ncol The total column number of spatial heatmaps, not including legend plot. Default is 2.
ft.legend ‘identical’, ‘all’, or a vector of samples/features in aSVG to show in legend plot. ‘identical’ only shows matching features while ‘all’ shows all features.
legend.ncol, legend.nrow Two numbers of columns and rows of legend keys respectively. Default is NULL, NULL, since they are automatically set.
legend.position the position of legend keys (‘none’, ‘left’, ‘right’,‘bottom’, ‘top’), or two-element numeric vector. Default is ‘bottom’.
legend.key.size, legend.text.size The size of legend keys and labels respectively. Default is 0.5 and 8 respectively.
line.size, line.color The size and colour of all plogyon outlines respectively. Default is 0.2 and ‘grey70’ respectively.
verbose TRUE or FALSE. Default is TRUE and the aSVG features not mapped are printed to R console.
out.dir The directory to save HTML and video files of spatial heatmaps. Default is NULL.

3.3 Mouse Organs

This section generates an SHM plot for mouse data from the Expression Atlas. The code components are very similar to the previous Human Brain example. For brevity, the corresponding text explaining the code has been reduced to a minimum.

3.3.1 Gene Expression Data

The chosen mouse RNA-Seq data compares tissue level gene expression across mammalian species (Merkin et al. 2012). The following searches the Expression Atlas for expression data from ‘heart’ and ‘Mus musculus’.

all.mus <- read_cache(cache.pa, 'all.mus') # Retrieve data from cache.
if (is.null(all.mus)) { # Save downloaded data to cache if it is not cached.
  all.mus <- searchAtlasExperiments(properties="heart", species="Mus musculus")
  save_cache(dir=cache.pa, overwrite=TRUE, all.mus)
}

Among the many matching entries, accession ‘E-MTAB-2801’ will be downloaded.

all.mus[7, ]
## DataFrame with 1 row and 4 columns
##     Accession      Species                  Type                  Title
##   <character>  <character>           <character>            <character>
## 1 E-MTAB-2801 Mus musculus RNA-seq of coding RNA Strand-specific RNA-..
rse.mus <- read_cache(cache.pa, 'rse.mus') # Read data from cache.
if (is.null(rse.mus)) { # Save downloaded data to cache if it is not cached.
  rse.mus <- getAtlasData('E-MTAB-2801')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.mus)
}

The design of the downloaded RNA-Seq experiment is described in the colData slot of rse.mus. The following returns only its first three rows.

colData(rse.mus)[1:3, ]
## DataFrame with 3 rows and 4 columns
##           AtlasAssayGroup     organism organism_part      strain
##               <character>  <character>   <character> <character>
## SRR594393              g7 Mus musculus         brain      DBA/2J
## SRR594394             g21 Mus musculus         colon      DBA/2J
## SRR594395             g13 Mus musculus         heart      DBA/2J

3.3.2 aSVG Image

The following example shows how to retrieve from the above described remote SVG repositories an aSVG image that matches the tissues and species assayed in the gene expression data set downloaded in the previous subsection. The remote repos are downloaded in the Human Brain example (remote) and are used below. As before the image is saved to a directory named tmp.dir.shm.

tmp.dir.shm <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm')
feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('Mus musculus'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir.shm, remote=remote, match.only=FALSE)

To build this vignettes according to the R/Bioconductor package requirements, the following code section uses the aSVG file instance included in the spatialHeatmap package rather than the downloaded instance from the example in the previous step.

feature.df <- return_feature(feature=c('heart', 'kidney'), species=NULL, keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=NULL, match.only=FALSE) 
## Accessing features... 
## arabidopsis.thaliana_organ_shm.svg, arabidopsis.thaliana_organ_shm1.svg, arabidopsis.thaliana_organ_shm2.svg, arabidopsis.thaliana_root.cross_shm.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## LAYER_OUTLINE 
## 
## arabidopsis.thaliana_root.ebi_shm.svg, Extracted tiny path from path2027 is removed! 
## Extracted tiny path from path2027-3 is removed! 
## arabidopsis.thaliana_root.roottip_shm.svg, Extracted tiny path from path1146-6 is removed! 
## Extracted tiny path from path1146 is removed! 
## arabidopsis.thaliana_shoot.root_shm.svg, Extracted tiny path from path1146 is removed! 
## arabidopsis.thaliana_shoot_shm.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0014892 
## 
## gallus_gallus.svg, 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0000451 LAYER_EFO 
## 
## homo.sapiens_brain.shiny_shm.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0001873 UBERON_0001872 UBERON_0002038 UBERON_0001874 UBERON_0001875 UBERON_0002360 
## 
## Duplicated title text detected: hippocampus 
## homo_sapiens.brain.svg, maize_leaf_shm1.svg, maize_leaf_shm2.svg, Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## mus_musculus.brain.svg, Element "a" is removed: a4174 !
## mus_musculus.male.svg, oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg, us_map_shm.svg,

Return the names of the matching aSVG files.

unique(feature.df$SVG)
## [1] "gallus_gallus.svg"     "mus_musculus.male.svg"

The following first selects mus_musculus.male.svg as target aSVG, then returns the first three rows of the resulting feature.df, and finally prints the unique set of all aSVG features.

feature.df <- subset(feature.df, SVG=='mus_musculus.male.svg')
feature.df[1:3, ]
##     feature stroke color             id element        parent order
## 10   kidney   0.05  none UBERON_0002113       g     LAYER_EFO     3
## 11    heart   0.05  none UBERON_0000948    path     LAYER_EFO    13
## 12 path4204   0.05  none       path4204       g LAYER_OUTLINE     1
##                      SVG
## 10 mus_musculus.male.svg
## 11 mus_musculus.male.svg
## 12 mus_musculus.male.svg
unique(feature.df[, 1])
##  [1] "kidney"          "heart"           "path4204"        "spleen"         
##  [5] "adrenal.gland"   "colon"           "caecum"          "esophagus"      
##  [9] "tongue"          "testis"          "penis"           "lung"           
## [13] "diaphragm"       "liver"           "brain"           "skeletal.muscle"

Obtain path of target aSVG on user system.

svg.mus <- system.file("extdata/shinyApp/example", "mus_musculus.male.svg", package="spatialHeatmap")

3.3.3 Experimental Design

The following imports a sample target file that is included in this package. To inspect its content, the first three rows of the target file are printed to the screen.

mus.tar <- system.file('extdata/shinyApp/example/target_mouse.txt', package='spatialHeatmap')
target.mus <- read.table(mus.tar, header=TRUE, row.names=1, sep='\t')
target.mus[1:3, ]
##           AtlasAssayGroup     organism organism_part strain
## SRR594393              g7 Mus musculus         brain DBA.2J
## SRR594394             g21 Mus musculus         colon DBA.2J
## SRR594395             g13 Mus musculus         heart DBA.2J
unique(target.mus[, 3])
## [1] "brain"           "colon"           "heart"           "kidney"         
## [5] "liver"           "lung"            "skeletal muscle" "spleen"         
## [9] "testis"

Load custom target data into colData slot.

colData(rse.mus) <- DataFrame(target.mus)

3.3.4 Preprocess Assay Data

The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of these steps are explained in the sub-section above using data from human.

se.nor.mus <- norm_data(data=rse.mus, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF 
##    type 
## "ratio"
se.aggr.mus <- aggr_rep(data=se.nor.mus, sam.factor='organism_part', con.factor='strain', aggr='mean') # Aggregation of replicates
## Syntactically valid column names are made!
se.fil.mus <- filter_data(data=se.aggr.mus, sam.factor='organism_part', con.factor='strain', pOA=c(0.01, 5), CV=c(0.6, 100), dir=NULL) # Filtering of genes with low counts and variance 
## Syntactically valid column names are made! 
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   2.838   5.282  21.716 
## All coefficient of variances (CVs) before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01741 0.29033 1.09806 1.51730 2.34078 5.09902 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.9781  2.1806  3.4151 21.7158 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6001  0.8869  1.3158  1.4953  1.9883  5.0990

3.3.5 SHM: Transparency

The pre-processed expression data for gene ‘ENSMUSG00000000263’ is plotted in form of an SHM. In this case the plot includes expression data for 8 tissues across 3 mouse strains.

shm.lis <- spatial_hm(svg.path=svg.mus, data=se.fil.mus, ID=c('ENSMUSG00000000263'), height=0.7, legend.width=0.7, legend.text.size=10, sub.title.size=9, ncol=3, ft.trans=c('skeletal muscle', 'path4204'), legend.nrow=4, line.size=0.2, line.color='grey70')
## Coordinates: mus_musculus.male.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## ggplots/grobs: mus_musculus.male.svg ... 
## ggplot: ENSMUSG00000000263, DBA.2J  C57BL.6  CD1  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSMUSG00000000263_DBA.2J_1  ENSMUSG00000000263_C57BL.6_1  ENSMUSG00000000263_CD1_1  
## Converting "ggplot" to "grob" ... 
## 
SHM of mouse organs. This is a multiple-layer image where the shapes of the 'skeletal muscle' is set transparent to expose 'lung' and 'heart'.

Figure 5: SHM of mouse organs
This is a multiple-layer image where the shapes of the ‘skeletal muscle’ is set transparent to expose ‘lung’ and ‘heart’.

The SHM plots in Figures 5 and below demonstrate the usage of the transparency feature via the ft.trans parameter. Except for the outline layer path4204 interfering with other tissues, the corresponding mouse organ aSVG image includes overlapping tissue layers. In this case the skelectal muscle layer partially overlaps with lung and heart tissues. To view lung and heart in Figure 5, the skelectal muscle tissue and outline are set transparent with ft.trans=c('skeletal muscle', 'path4204'). To view in the same aSVG the skeletal muscle tissue instead, ft.trans is assigned only path4204 as shown below.

To fine control the visual effects in feature rich aSVGs, the line.size and line.color parameters are useful. This way one can adjust the thickness and color of complex structures.

spatial_hm(svg.path=svg.mus, data=se.fil.mus, ID=c('ENSMUSG00000000263'), height=0.6, legend.text.size=10, sub.title.size=9, ncol=3, legend.ncol=2, line.size=0.1, line.color='grey70', ft.trans='path4204')
## Coordinates: mus_musculus.male.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## ggplots/grobs: mus_musculus.male.svg ... 
## ggplot: ENSMUSG00000000263, DBA.2J  C57BL.6  CD1  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSMUSG00000000263_DBA.2J_1  ENSMUSG00000000263_C57BL.6_1  ENSMUSG00000000263_CD1_1  
## Converting "ggplot" to "grob" ... 
## 
SHM of mouse organs. This is a multiple-layer image where the view onto 'lung' and 'heart' is obstructed by displaying the 'skeletal muscle' tissue.

Figure 6: SHM of mouse organs
This is a multiple-layer image where the view onto ‘lung’ and ‘heart’ is obstructed by displaying the ‘skeletal muscle’ tissue.

3.4 Chicken Organs

This section generates an SHM plot for chicken data from the Expression Atlas. The code components are very similar to the Human Brain example. For brevity, the corresponding text explaining the code has been reduced to a minimum.

3.4.1 Gene Expression Data

The chosen chicken RNA-Seq experiment compares the developmental changes across nine time points of seven organs (Cardoso-Moreira et al. 2019).

The following searches the Expression Atlas for expression data from ‘heart’ and ‘gallus’.

all.chk <- read_cache(cache.pa, 'all.chk') # Retrieve data from cache.
if (is.null(all.chk)) { # Save downloaded data to cache if it is not cached.
  all.chk <- searchAtlasExperiments(properties="heart", species="gallus")
  save_cache(dir=cache.pa, overwrite=TRUE, all.chk)
}

Among the matching entries, accession ‘E-MTAB-6769’ will be downloaded.

all.chk[3, ]
## DataFrame with 1 row and 4 columns
##     Accession       Species                  Type                  Title
##   <character>   <character>           <character>            <character>
## 1 E-MTAB-6769 Gallus gallus RNA-seq of coding RNA Chicken RNA-seq time..
rse.chk <- read_cache(cache.pa, 'rse.chk') # Read data from cache.
if (is.null(rse.chk)) { # Save downloaded data to cache if it is not cached.
  rse.chk <- getAtlasData('E-MTAB-6769')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.chk)
}

The design of the downloaded RNA-Seq experiment is described in the colData slot of rse.chk. The following returns only its first three rows.

colData(rse.chk)[1:3, ]
## DataFrame with 3 rows and 8 columns
##            AtlasAssayGroup      organism         strain           genotype
##                <character>   <character>    <character>        <character>
## ERR2576379              g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576380              g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576381              g2 Gallus gallus Red Junglefowl wild type genotype
##            developmental_stage         age         sex organism_part
##                    <character> <character> <character>   <character>
## ERR2576379              embryo      10 day      female         brain
## ERR2576380              embryo      10 day      female         brain
## ERR2576381              embryo      10 day      female    cerebellum

3.4.2 aSVG Image

The following example shows how to download from the above introduced SVG repositories an aSVG image that matches the tissues and species assayed in the gene expression data set downloaded in the previous subsection. The remote repos are downloaded in the Human Brain example (remote) and are used below. As before the image is saved to a directory named tmp.dir.shm.

tmp.dir.shm <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm')
# Query aSVGs.
feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('gallus'), keywords.any=TRUE, return.all=FALSE, dir=tmp.dir.shm, remote=remote, match.only=FALSE)

To build this vignettes according to the R/Bioconductor package requirements, the following code section uses the aSVG file instance included in the spatialHeatmap package rather than the downloaded instance from the previous step.

feature.df <- return_feature(feature=c('heart', 'kidney'), species=c('gallus'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=NULL, match.only=FALSE)
## Accessing features... 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0014892 
## 
## gallus_gallus.svg,
feature.df[1:3, ] # A slice of the features.
##           feature stroke   color              id element        parent order
## 1           heart      0    none  UBERON_0000948    path     LAYER_EFO     2
## 2          kidney      0    none  UBERON_0002113    path     LAYER_EFO     3
## 3 chicken_outline      0 #a0a1a2 chicken_outline       g LAYER_OUTLINE     1
##                 SVG
## 1 gallus_gallus.svg
## 2 gallus_gallus.svg
## 3 gallus_gallus.svg

Obtain path of target aSVG on user system.

svg.chk <- system.file("extdata/shinyApp/example", "gallus_gallus.svg", package="spatialHeatmap")

3.4.3 Experimental Design

The following imports a sample target file that is included in this package. To inspect its content, the first three rows of the target file are printed to the screen.

chk.tar <- system.file('extdata/shinyApp/example/target_chicken.txt', package='spatialHeatmap')
target.chk <- read.table(chk.tar, header=TRUE, row.names=1, sep='\t')
target.chk[1:3, ]
##            AtlasAssayGroup      organism         strain           genotype
## ERR2576379              g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576380              g1 Gallus gallus Red Junglefowl wild type genotype
## ERR2576381              g2 Gallus gallus Red Junglefowl wild type genotype
##            developmental_stage   age    sex organism_part
## ERR2576379              embryo day10 female         brain
## ERR2576380              embryo day10 female         brain
## ERR2576381              embryo day10 female    cerebellum

Load custom target data into colData slot.

colData(rse.chk) <- DataFrame(target.chk)

Return samples used for plotting SHMs.

unique(colData(rse.chk)[, 'organism_part'])
## [1] "brain"      "cerebellum" "heart"      "kidney"     "ovary"     
## [6] "testis"     "liver"

Return conditions considered for plotting downstream SHM.

unique(colData(rse.chk)[, 'age'])
## [1] "day10"  "day12"  "day14"  "day17"  "day0"   "day155" "day35"  "day7"  
## [9] "day70"

3.4.4 Preprocess Assay Data

The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of these steps are explained in the above sub-section on human data.

se.nor.chk <- norm_data(data=rse.chk, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF 
##    type 
## "ratio"
se.aggr.chk <- aggr_rep(data=se.nor.chk, sam.factor='organism_part', con.factor='age', aggr='mean') # Replicate agggregation using mean 
se.fil.chk <- filter_data(data=se.aggr.chk, sam.factor='organism_part', con.factor='age', pOA=c(0.01, 5), CV=c(0.6, 100), dir=NULL) # Filtering of genes with low counts and varince
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.6718  5.4389  5.2246  9.0067 23.0323 
## All coefficient of variances (CVs) before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01497 0.08457 0.29614 0.79232 1.02089 7.87401 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2118  1.0459  2.0432  2.9370 23.0323 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6001  0.7793  1.0556  1.3299  1.5950  5.4224

3.4.5 SHM: Time Series

The expression profile for gene ENSGALG00000006346 is plotted across nine time points in four organs in form of a composite SHM with 9 panels. Their layout in three columns is controlled with the argument setting ncol=3. The target organs are labeled by text in legend plot via label=TRUE.

spatial_hm(svg.path=svg.chk, data=se.fil.chk, ID='ENSGALG00000006346', width=0.9, legend.width=0.9, legend.r=1.5, sub.title.size=9, ncol=3, legend.nrow=2, label=TRUE)
## Coordinates: gallus_gallus.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## UBERON_0014892 
## 
## Features in data not mapped: cerebellum, ovary, testis 
## ggplots/grobs: gallus_gallus.svg ... 
## ggplot: ENSGALG00000006346, day10  day12  day14  day17  day0  day155  day35  day7  day70  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSGALG00000006346_day10_1  ENSGALG00000006346_day12_1  ENSGALG00000006346_day14_1  ENSGALG00000006346_day17_1  ENSGALG00000006346_day0_1  ENSGALG00000006346_day155_1  ENSGALG00000006346_day35_1  ENSGALG00000006346_day7_1  ENSGALG00000006346_day70_1  
## Converting "ggplot" to "grob" ... 
## 
Time course of chicken organs. The SHM shows the expression profile of a single gene across nine time points and four organs.

Figure 7: Time course of chicken organs
The SHM shows the expression profile of a single gene across nine time points and four organs.

3.5 Arabidopsis Shoot

This section generates an SHM for Arabidopsis thaliana tissues with gene expression data from the Affymetrix microarray technology. The chosen experiment used ribosome-associated mRNAs from several cell populations of shoots and roots that were exposed to hypoxia stress (Mustroph et al. 2009). In this case the expression data will be downloaded from GEO with utilites from the GEOquery package (Davis and Meltzer 2007). The data preprocessing routines are specific to the Affymetrix technology. The remaining code components for generating SHMs are very similar to the previous examples. For brevity, the text in this section explains mainly the steps that are specific to this data set.

3.5.1 Gene Expression Data

The GSE14502 data set will be downloaded with the getGEO function from the GEOquery package. Intermediately, the expression data is stored in an ExpressionSet container (Huber et al. 2015), and then converted to a SummarizedExperiment object.

gset <- read_cache(cache.pa, 'gset') # Retrieve data from cache.
if (is.null(gset)) { # Save downloaded data to cache if it is not cached.
  gset <- getGEO("GSE14502", GSEMatrix=TRUE, getGPL=TRUE)[[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, gset)
}
se.sh <- as(gset, "SummarizedExperiment")

The gene symbol identifiers are extracted from the rowData component to be used as row names. Similarly, one can work with AGI identifiers by providing below AGI under Gene.Symbol.

rownames(se.sh) <- make.names(rowData(se.sh)[, 'Gene.Symbol'])

The following returns a slice of the experimental design stored in the colData slot. Both the samples and conditions are contained in the title column. The samples include promoters (pGL2, pCO2, pSCR, pWOL, p35S), tissues and organs (root atrichoblast epidermis, root cortex meristematic zone, root endodermis, root vasculature, root_total and shoot_total); and the conditions are control and hypoxia.

colData(se.sh)[60:63, 1:4]
## DataFrame with 4 rows and 4 columns
##                            title geo_accession                status
##                      <character>   <character>           <character>
## GSM362227 shoot_hypoxia_pGL2_r..     GSM362227 Public on Oct 12 2009
## GSM362228 shoot_hypoxia_pGL2_r..     GSM362228 Public on Oct 12 2009
## GSM362229 shoot_control_pRBCS_..     GSM362229 Public on Oct 12 2009
## GSM362230 shoot_control_pRBCS_..     GSM362230 Public on Oct 12 2009
##           submission_date
##               <character>
## GSM362227     Jan 21 2009
## GSM362228     Jan 21 2009
## GSM362229     Jan 21 2009
## GSM362230     Jan 21 2009

3.5.2 aSVG Image

In this example, the aSVG image has been generated in Inkscape from the corresponding figure in Mustroph et al. (2009). The resulting custom figure has been included as a sample aSVG file in the spatialHeatmap package. Detailed instructions for generating custom aSVG images in Inkscape are provided in the SVG tutorial.

The annotations in the corresponding aSVG file located under svg.dir can be queried with the return_features function.

feature.df <- return_feature(feature=c('pGL2', 'pRBCS'), species=c('shoot'), keywords.any=TRUE, return.all=FALSE, dir=svg.dir, remote=NULL, match.only=FALSE)
## Accessing features... 
## Extracted tiny path from path1146-6 is removed! 
## Extracted tiny path from path1146 is removed! 
## arabidopsis.thaliana_shoot.root_shm.svg, Extracted tiny path from path1146 is removed! 
## arabidopsis.thaliana_shoot_shm.svg,

The unique set of the matching aSVG files can be returned as follows.

unique(feature.df$SVG)
## [1] "arabidopsis.thaliana_shoot.root_shm.svg"
## [2] "arabidopsis.thaliana_shoot_shm.svg"

The aSVG file arabidopsis.thaliana_shoot_shm.svg is chosen to generate the SHM in this section.

feature.df <- subset(feature.df, SVG=='arabidopsis.thaliana_shoot_shm.svg')
feature.df[1:3, ]
##        feature    stroke   color          id element    parent order
## 17  shoot_pGL2 0.1500001 #10ddeb  shoot_pGL2       g container     2
## 18 shoot_pRBCS 0.1500001 #7227ab shoot_pRBCS       g container     3
## 19        g258 0.1500001 #f7fcf5        g258       g container     1
##                                   SVG
## 17 arabidopsis.thaliana_shoot_shm.svg
## 18 arabidopsis.thaliana_shoot_shm.svg
## 19 arabidopsis.thaliana_shoot_shm.svg

Obtain full path of target aSVG on user system.

svg.sh <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_shoot_shm.svg", package="spatialHeatmap")

3.5.3 Experimental Design

The following imports a sample target file that is included in this package. To inspect its content, four selected rows of this target file are printed to the screen.

sh.tar <- system.file('extdata/shinyApp/example/target_arab.txt', package='spatialHeatmap')
target.sh <- read.table(sh.tar, header=TRUE, row.names=1, sep='\t')
target.sh[60:63, ]
##                           col.name     samples conditions
## shoot_hypoxia_pGL2_rep1  GSM362227  shoot_pGL2    hypoxia
## shoot_hypoxia_pGL2_rep2  GSM362228  shoot_pGL2    hypoxia
## shoot_control_pRBCS_rep1 GSM362229 shoot_pRBCS    control
## shoot_control_pRBCS_rep2 GSM362230 shoot_pRBCS    control

Return all samples present in target file.

unique(target.sh[, 'samples'])
##  [1] "root_total"      "root_p35S"       "root_pSCR"       "root_pSHR"      
##  [5] "root_pWOL"       "root_pGL2"       "root_pSUC2"      "root_pSultr2.2" 
##  [9] "root_pCO2"       "root_pPEP"       "root_pRPL11C"    "shoot_total"    
## [13] "shoot_p35S"      "shoot_pGL2"      "shoot_pRBCS"     "shoot_pSUC2"    
## [17] "shoot_pSultr2.2" "shoot_pCER5"     "shoot_pKAT1"

Return all conditions present in target file.

unique(target.sh[, 'conditions'])
## [1] "control" "hypoxia"

Load custom target data into colData slot.

colData(se.sh) <- DataFrame(target.sh)

3.5.4 Preprocess Assay Data

The downloaded GSE14502 data set has already been normalized with the RMA algorithm (Gautier et al. 2004). Thus, the pre-processing steps can be restricted to the aggregation of replicates and filtering of reliably expressed genes. For the latter, the following code will retain genes with expression values larger than 6 (log2 space) in at least 3% of all samples (pOA=c(0.03, 6)), and with a coefficient of variance (CV) between 0.30 and 100 (CV=c(0.30, 100)).

se.aggr.sh <- aggr_rep(data=se.sh, sam.factor='samples', con.factor='conditions', aggr='mean') # Replicate agggregation using mean
se.fil.arab <- filter_data(data=se.aggr.sh, sam.factor='samples', con.factor='conditions', pOA=c(0.03, 6), CV=c(0.30, 100), dir=NULL) # Filtering of genes with low intensities and variance
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.345   4.879   6.481   6.763   8.263  15.107 
## All coefficient of variances (CVs) before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01047 0.03424 0.05347 0.07706 0.09526 0.54344 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.644   4.838   6.249   7.364   9.756  15.004 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3008  0.3203  0.3385  0.3531  0.3735  0.5434

3.5.5 SHM: Microarray

The expression profile for the HRE2 gene is plotted for the control and the hypoxia treatment across six cell types (Figure 8).

spatial_hm(svg.path=svg.sh, data=se.fil.arab, ID=c("HRE2"), height=0.7, legend.nrow=3, legend.text.size=11)
## Coordinates: arabidopsis.thaliana_shoot_shm.svg ... 
## CPU cores: 1 
## Extracted tiny path from path1146 is removed! 
## Features in data not mapped: root_total, root_p35S, root_pSCR, root_pSHR, root_pWOL, root_pGL2, root_pSUC2, root_pSultr2.2, root_pCO2, root_pPEP, root_pRPL11C, shoot_total, shoot_p35S 
## ggplots/grobs: arabidopsis.thaliana_shoot_shm.svg ... 
## ggplot: HRE2, control  hypoxia  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## HRE2_control_1  HRE2_hypoxia_1  
## Converting "ggplot" to "grob" ... 
## 
SHM of Arabidopsis shoots. The expression profile of the HRE2 gene is plotted for control and hypoxia treatment across six cell types.

Figure 8: SHM of Arabidopsis shoots
The expression profile of the HRE2 gene is plotted for control and hypoxia treatment across six cell types.

4 Spatiotemporal Heatmaps (STHMs)

The above examples are SHMs plotted at the single spatial dimension. This section showcases the application of SHMs at spatial and temporal dimesions, i.e. data assayed in spatial feature(s) across different development stages.

The data at single spatial dimension contains only two factors: samples and conditions. By contrast, the spatiotemporal data contains three factors: samples, conditions, and times (development stages). There are three alternatives to organize the three factors: 1) combine samples and conditions; 2) combine samples and times; or 3) combine samples, conditions, and times. More details are provided in the Supplementary Section.

Which option to choose depends on the specific data and aSVGs, and the chosen one should achieve optimal visualization. In this example, the third is the best choice and will be showcased in the first part. Meanwhile, for demonstration purpose the second choice will also be illustrated in the second part. In the spatiotemporal application, different development stages can be represented in different aSVG images, and this feature will be presented in the third part.

4.1 Sample-Time-Condition Factor

4.1.1 Gene Expression Data

The data is from the transcriptome analysis on rice coleoptile during germinating and developing stages under anoxia and re-oxygenation (Narsai et al. 2017), which is also downloaded with ExpressionAtlas.

rse.clp <- read_cache(cache.pa, 'rse.clp') # Retrieve data from cache.
if (is.null(rse.clp)) { # Save downloaded data to cache if it is not cached.
  rse.clp <- getAtlasData('E-GEOD-115371')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.clp)
}

4.1.2 Experimental Design

The targets file was prepared according to the experiment design stored in colData slot of res.clp by using the convenient function edit_tar, and pre-packaged in spatialHeatmap.

clp.tar <- system.file('extdata/shinyApp/example/target_coleoptile.txt', package='spatialHeatmap')
target.clp <- read_fr(clp.tar)

The helper function edit_tar is designed to edit entries in the targets file. Below is an example of editing the tissue entries.

cdat <- colData(rse.clp) # Original targets file.
unique(cdat$organism_part) # Original tissues.
## [1] "plant embryo"            "plant embryo coleoptile"
## [3] "coleoptile"
cdat <- edit_tar(cdat, column='organism_part', old=c('plant embryo', 'plant embryo coleoptile'), new=c('embryo', 'embryoColeoptile')) # Replace old entries with desired ones.
unique(cdat$organism_part) # New tissue entries. 
## [1] "embryo"           "embryoColeoptile" "coleoptile"

Inspect the tissues, conditions, and times, where “A” and “N” denote “aerobic” and “anaerobic” respectively.

target.clp[1:3, c(6, 7, 9, 10)] # A slice of the targets file.
##            age organism_part stimulus con
## SRR7265373  0h        embryo  aerobic   A
## SRR7265374  0h        embryo  aerobic   A
## SRR7265375  0h        embryo  aerobic   A
unique(target.clp[, 'age']) # All development stages.
## [1] "0h"     "1h"     "3h"     "12h"    "24h"    "48h"    "72h"    "96h"   
## [9] "72N24A"
unique(target.clp[, 'organism_part']) # All tissues.
## [1] "embryo"           "embryoColeoptile" "coleoptile"
unique(target.clp[, 'stimulus']) # All conditions.
## [1] "aerobic"   "anaerobic" "NA"

Combine sample, time, condition factors using the helper function com_factor. The targets file including the new composite factors (samTimeCon) is loaded to the colData slot in rse.clp internally.

rse.clp <- com_factor(rse.clp, target.clp, factors2com=c('organism_part', 'age', 'con'), sep='.', factor.new='samTimeCon')
## New combined factors: embryo.0h.A embryoColeoptile.1h.A embryoColeoptile.1h.N embryoColeoptile.3h.A embryoColeoptile.3h.N embryoColeoptile.12h.A embryoColeoptile.12h.N embryoColeoptile.24h.A embryoColeoptile.24h.N coleoptile.48h.A coleoptile.48h.N coleoptile.72h.A coleoptile.72h.N coleoptile.96h.A coleoptile.96h.N coleoptile.72N24A.NA
colData(rse.clp)[1:3, c(6, 7, 9:11)]
## DataFrame with 3 rows and 5 columns
##                    age organism_part    stimulus         con  samTimeCon
##            <character>   <character> <character> <character> <character>
## SRR7265373          0h        embryo     aerobic           A embryo.0h.A
## SRR7265374          0h        embryo     aerobic           A embryo.0h.A
## SRR7265375          0h        embryo     aerobic           A embryo.0h.A

Inspect the sample-time-condition composite factors. At least one of the composite factors should have a matching feature counterpart in the aSVG file, otherwise no aSVG file will be returned in the next section.

target.clp <- colData(rse.clp)
unique(target.clp$samTimeCon)
##  [1] "embryo.0h.A"            "embryoColeoptile.1h.A"  "embryoColeoptile.1h.N" 
##  [4] "embryoColeoptile.3h.A"  "embryoColeoptile.3h.N"  "embryoColeoptile.12h.A"
##  [7] "embryoColeoptile.12h.N" "embryoColeoptile.24h.A" "embryoColeoptile.24h.N"
## [10] "coleoptile.48h.A"       "coleoptile.48h.N"       "coleoptile.72h.A"      
## [13] "coleoptile.72h.N"       "coleoptile.96h.A"       "coleoptile.96h.N"      
## [16] "coleoptile.72N24A.NA"

4.1.3 aSVG Image

Similar with the Arabidopsis Shoot example, the aSVG image has been generated in Inkscape from the corresponding figure in Narsai et al. (2017) according to the SVG tutorial, and the resulting custom figure has been included in spatialHeatmap.

Query the aSVG files with one composite factor embryo.0h.A.

feature.df <- return_feature(feature=c('embryo.0h.A', 'embryoColeoptile.1h.A'), species=c('oryza', 'sativa'), keywords.any=FALSE, return.all=FALSE, dir=svg.dir, remote=NULL, match.only=FALSE)
## Accessing features... 
## oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg,
feature.df[1:2, ] # The first two rows of the query results.
##                 feature stroke color                    id element    parent
## 1           embryo.0h.A    0.2  none           embryo.0h.A       g container
## 2 embryoColeoptile.1h.A    0.2  none embryoColeoptile.1h.A    path container
##   order                                 SVG
## 1    25 oryza.sativa_coleoptile.ANT_shm.svg
## 2    27 oryza.sativa_coleoptile.ANT_shm.svg

Only one aSVG file oryza.sativa_coleoptile.ANT_shm.svg is retrieved.

unique(feature.df$SVG)
## [1] "oryza.sativa_coleoptile.ANT_shm.svg"

Note no matter how the factors are combined, the composite factors of interest should always have matching counterparts in the aSVG file. In this example, all composite factors are matched to the aSVG.

unique(target.clp$samTimeCon) %in% unique(feature.df$feature)
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE

Obtain full path of the retrieved aSVG on user system.

svg.clp1 <- system.file("extdata/shinyApp/example", "oryza.sativa_coleoptile.ANT_shm.svg", package="spatialHeatmap")

4.1.4 Preprocess Assay Data

The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The details of these steps are explained in the above sub-section on human data. The normalization step is same for combined factors sample-time and sample-time-condition, while the aggregation and filtering are different. The difference is reflected by sam.factor and con.factor, and subsequently the column names in resulting assay slot of the SummarizedExperiment object.

se.nor.clp <- norm_data(data=rse.clp, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF 
##    type 
## "ratio"

Aggregation and filtering by sample-time-condition factor.

se.aggr.clp1 <- aggr_rep(data=se.nor.clp, sam.factor='samTimeCon', con.factor=NULL, aggr='mean') # Replicate agggregation using mean
se.fil.clp1 <- filter_data(data=se.aggr.clp1, sam.factor='samTimeCon', con.factor=NULL, pOA=c(0.07, 7), CV=c(0.7, 100), dir=NULL) # Filtering of genes with low counts and varince.
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2995  4.0283  4.3614  7.7436 18.0561 
## All coefficient of variances (CVs) before filtering:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.008408 0.087498 0.257247 0.654965 0.847843 4.000000 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4401  2.6058  3.8745  7.1382 15.6921 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7005  0.7846  0.9030  0.9768  1.0853  1.9504

Since all three factors (conditions, times, tissues) are combined, the resulting data table loses the double underscore string.

assay(se.fil.clp1)[1:3, 1:3] # A slice of the resulting data table.
##              embryo.0h.A embryoColeoptile.1h.A embryoColeoptile.1h.N
## Os01g0106300    2.549855             0.2403387              1.902315
## Os01g0111600   12.116707            12.9343197             12.708776
## Os01g0127600    6.495876             7.3024594              7.443524

4.1.5 STHM: combine sample, time, and condition

The expression profile of gene Os12g0630200 and Os01g0106300 in coleoptile is plotted across eight time points under anoxia and re-oxygenation in form of a composite STHM.

shm.lis <- spatial_hm(svg.path=svg.clp1, data=se.fil.clp1, ID=c('Os12g0630200', 'Os01g0106300'), legend.r=0.7, legend.key.size=0.01, legend.text.size=8, legend.nrow=8, ncol=1, width=0.8, line.size=0)
## Coordinates: oryza.sativa_coleoptile.ANT_shm.svg ... 
## CPU cores: 1 
## ggplots/grobs: oryza.sativa_coleoptile.ANT_shm.svg ... 
## ggplot: Os12g0630200, con  
## ggplot: Os01g0106300, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## Os12g0630200_con_1  Os01g0106300_con_1  
## Converting "ggplot" to "grob" ... 
## 
Spatiotemporal heatmap at sample-time-condition factor. Gene expression profile of two genes in coleoptile across eight time points under anoxia and re-oxygenation is visualized in a composite image.

Figure 9: Spatiotemporal heatmap at sample-time-condition factor
Gene expression profile of two genes in coleoptile across eight time points under anoxia and re-oxygenation is visualized in a composite image.

This STHM example is also deployed as an interacive Shiny app, where STHMs are provided in form of static images, interactive HTML files, and video files. Click here to see this app.

4.2 Sample-Time Factor

This part showcases the usage of sample-time factor. The sample-condition factor could be applied similarly. To obtain optimal result, the data under aerobic is excluded. Since most steps are similar with the sample-time-condition factor, the following process is reduced to minimum.

4.2.1 Gene Expression Data

The same RNA-seq count data from Narsai et al. (2017) is downloaded.

rse.clp <- read_cache(cache.pa, 'rse.clp') # Retrieve data from cache.
if (is.null(rse.clp)) { # Save downloaded data to cache if it is not cached.
  rse.clp <- getAtlasData('E-GEOD-115371')[[1]][[1]]
  save_cache(dir=cache.pa, overwrite=TRUE, rse.clp)
}

4.2.2 Experimental Design

The same targets file with sample-time factor is imported.

clp.tar <- system.file('extdata/shinyApp/example/target_coleoptile.txt', package='spatialHeatmap')
target.clp <- read_fr(clp.tar)

Inspect the samples, conditions, and times.

target.clp[1:3, c(6, 7, 9, 10)] # A slice of the targets file.
##            age organism_part stimulus con
## SRR7265373  0h        embryo  aerobic   A
## SRR7265374  0h        embryo  aerobic   A
## SRR7265375  0h        embryo  aerobic   A
unique(target.clp[, 'age']) # All development stages.
## [1] "0h"     "1h"     "3h"     "12h"    "24h"    "48h"    "72h"    "96h"   
## [9] "72N24A"
unique(target.clp[, 'organism_part']) # All tissues.
## [1] "embryo"           "embryoColeoptile" "coleoptile"
unique(target.clp[, 'stimulus']) # All conditions.
## [1] "aerobic"   "anaerobic" "NA"

Combine sample and time factors, which is the essential difference with sample-time-condition example.

rse.clp <- com_factor(rse.clp, target.clp, factors2com=c('organism_part', 'age'), factor.new='samTime')
## New combined factors: embryo.0h embryoColeoptile.1h embryoColeoptile.3h embryoColeoptile.12h embryoColeoptile.24h coleoptile.48h coleoptile.72h coleoptile.96h coleoptile.72N24A
target.clp <- colData(rse.clp)
target.clp[1:3, ]
## DataFrame with 3 rows and 11 columns
##            AtlasAssayGroup           genotype     organism    cultivar
##                <character>        <character>  <character> <character>
## SRR7265373              g1 wild type genotype Oryza sativa      Amaroo
## SRR7265374              g1 wild type genotype Oryza sativa      Amaroo
## SRR7265375              g1 wild type genotype Oryza sativa      Amaroo
##            developmental_stage         age organism_part environmental_history
##                    <character> <character>   <character>           <character>
## SRR7265373                seed          0h        embryo            etiolation
## SRR7265374                seed          0h        embryo            etiolation
## SRR7265375                seed          0h        embryo            etiolation
##               stimulus         con     samTime
##            <character> <character> <character>
## SRR7265373     aerobic           A   embryo.0h
## SRR7265374     aerobic           A   embryo.0h
## SRR7265375     aerobic           A   embryo.0h

4.2.3 aSVG Image

Similarly the custom aSVG image was generated in Inkscape from the corresponding figure in Narsai et al. (2017) according to the SVG tutorial and included in spatialHeatmap.

Query the aSVG files.

feature.df <- return_feature(feature=c('embryo.0h', 'embryoColeoptile1h'), species=c('oryza', 'sativa'), keywords.any=FALSE, return.all=FALSE, dir=svg.dir, remote=NULL, match.only=FALSE)
## Accessing features... 
## oryza.sativa_coleoptile.ANT_shm.svg, oryza.sativa_coleoptile.NT_shm.svg,
feature.df[1:2, ] # The first two rows of the query results.
##              feature stroke   color                 id element    parent order
## 1          embryo.0h    0.2    none          embryo.0h       g container    25
## 2 rect1033_globalLGD    0.2 #3f1d38 rect1033_globalLGD    path container     1
##                                  SVG
## 1 oryza.sativa_coleoptile.NT_shm.svg
## 2 oryza.sativa_coleoptile.NT_shm.svg

Only one aSVG file oryza.sativa_coleoptile.NT_shm.svg is retrieved.

unique(feature.df$SVG)
## [1] "oryza.sativa_coleoptile.NT_shm.svg"

Note no matter how the factors are combined, the composite factor of interest should always have matching counterparts in the aSVG file.

unique(target.clp$samTime) %in% unique(feature.df$feature)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Obtain full path of the retrieved aSVG on user system.

svg.clp2 <- system.file("extdata/shinyApp/example", "oryza.sativa_coleoptile.NT_shm.svg", package="spatialHeatmap")

4.2.4 Preprocess Assay Data

The raw RNA-Seq count are preprocessed with the following steps: (1) normalization, (2) aggregation of replicates, and (3) filtering of reliable expression data. The normalization step is the same for composite factors sample-time and sample-time-condition, while the aggregation and filtering are different. The difference is reflected by sam.factor and con.factor, and subsequently the column names in the assay slot of the resulting SummarizedExperiment object.

se.nor.clp <- norm_data(data=rse.clp, norm.fun='ESF', log2.trans=TRUE) # Normalization
## Normalising: ESF 
##    type 
## "ratio"

Aggregation and filtering by sample-time factor.

se.aggr.clp2 <- aggr_rep(data=se.nor.clp, sam.factor='samTime', con.factor='stimulus', aggr='mean') # Replicate agggregation using mean. 
se.fil.clp2 <- filter_data(data=se.aggr.clp2, sam.factor='samTime', con.factor='stimulus', pOA=c(0.07, 7), CV=c(0.7, 100), dir=NULL) # Filtering of genes with low counts and varince.
## All values before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2995  4.0283  4.3614  7.7436 18.0561 
## All coefficient of variances (CVs) before filtering:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.008408 0.087498 0.257247 0.654965 0.847843 4.000000 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.4401  2.6058  3.8745  7.1382 15.6921 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7005  0.7846  0.9030  0.9768  1.0853  1.9504

Since only sample and time factors are combined, the resulting data table preserves the double underscore string, which is different from the sample-time-condition example.

df.fil.clp <- assay(se.fil.clp2) 
df.fil.clp[1:3, 1:3] # A slice of the resulting data table.
##              embryo.0h__aerobic embryoColeoptile.1h__aerobic
## Os01g0106300           2.549855                    0.2403387
## Os01g0111600          12.116707                   12.9343197
## Os01g0127600           6.495876                    7.3024594
##              embryoColeoptile.1h__anaerobic
## Os01g0106300                       1.902315
## Os01g0111600                      12.708776
## Os01g0127600                       7.443524

The optimal viusalization on complete data is achieved on sample-time-condition factor. To also obtain the best result on sample-time factor, the data under aerobic is excluded.

df.fil.clp1 <- df.fil.clp[, !grepl('__aerobic', colnames(df.fil.clp))] # Exclude aerobic data.  
df.fil.clp1[1:3, 1:3] # A slice of the data table without aerobic data.
##              embryoColeoptile.1h__anaerobic embryoColeoptile.3h__anaerobic
## Os01g0106300                       1.902315                       1.357282
## Os01g0111600                      12.708776                      12.531359
## Os01g0127600                       7.443524                       6.919786
##              embryoColeoptile.12h__anaerobic
## Os01g0106300                       2.9825250
## Os01g0111600                       9.7997716
## Os01g0127600                       0.9402776

4.2.5 STHM: combine sample and time

The expression profile of gene Os12g0630200 in coleoptile is plotted across eight time points under anoxia and re-oxygenation respectively.

shm.lis <- spatial_hm(svg.path=svg.clp2, data=df.fil.clp1, ID=c('Os12g0630200'), legend.r=0.9, legend.key.size=0.02, legend.text.size=9, legend.nrow=8, ncol=1, line.size=0)
## Coordinates: oryza.sativa_coleoptile.NT_shm.svg ... 
## CPU cores: 1 
## ggplots/grobs: oryza.sativa_coleoptile.NT_shm.svg ... 
## ggplot: Os12g0630200, anaerobic  NA  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## Os12g0630200_anaerobic_1  Os12g0630200_NA_1  
## Converting "ggplot" to "grob" ... 
## 
Spatiotemporal heatmap at sample-time factor. Gene expression profile of one gene in coleoptile across eight time points under anoxia and re-oxygenation is visualized in two images.

Figure 10: Spatiotemporal heatmap at sample-time factor
Gene expression profile of one gene in coleoptile across eight time points under anoxia and re-oxygenation is visualized in two images.

4.3 Multiple aSVGs

In the spatiotemporal application, different development stages may need to be represented in separate aSVG images. In that case, the spatial_hm function is able to arrange multiple aSVGs in a single SHM plot. To organize the subplots, the names of the separate aSVG files are expected to include the following suffixes: *_shm1.svg, *_shm2.svg, etc. The paths to the aSVG files are provided under the svg.path argument. By default, every aSVG image will have a legend plot on the right. The legend argument provides fine control over which legend plots to display.

As a simple toy example, the following stores random numbers in a data.frame.

df.random <- data.frame(matrix(sample(x=1:100, size=50, replace=TRUE), nrow=10))
colnames(df.random) <- c('shoot_totalA__condition1', 'shoot_totalA__condition2', 'shoot_totalB__condition1', 'shoot_totalB__condition2', 'notMapped') # Assign column names
rownames(df.random) <- paste0('gene', 1:10) # Assign row names 
df.random[1:3, ]
##       shoot_totalA__condition1 shoot_totalA__condition2
## gene1                       57                        6
## gene2                       50                       77
## gene3                       29                       41
##       shoot_totalB__condition1 shoot_totalB__condition2 notMapped
## gene1                        4                       53        49
## gene2                       71                       37        41
## gene3                       96                       68         2

Next the paths to the aSVG files are obtained, here for younger and older plants using *_shm1 and *_shm1, respectively, which are made from Mustroph et al. (2009).

svg.sh1 <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_organ_shm1.svg", package="spatialHeatmap")
svg.sh2 <- system.file("extdata/shinyApp/example", "arabidopsis.thaliana_organ_shm2.svg", package="spatialHeatmap")

The following generates the corresponding SHMs plot for gene1. The orginal image dimensions can be preserved by assigning TRUE to the preserve.scale argument.

spatial_hm(svg.path=c(svg.sh1, svg.sh2), data=df.random, ID=c('gene1'), width=0.7, legend.r=0.2, legend.width=1, preserve.scale=TRUE, bar.width=0.09) 
## Coordinates: arabidopsis.thaliana_organ_shm1.svg ... 
## CPU cores: 1 
## Coordinates: arabidopsis.thaliana_organ_shm2.svg ... 
## CPU cores: 1 
## Features in data not mapped: shoot_totalB, notMapped 
## Features in data not mapped: shoot_totalA, notMapped 
## ggplots/grobs: arabidopsis.thaliana_organ_shm1.svg ... 
## ggplot: gene1, condition1  condition2  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_condition1_1  gene1_condition2_1  
## Converting "ggplot" to "grob" ... 
##   
## ggplots/grobs: arabidopsis.thaliana_organ_shm2.svg ... 
## ggplot: gene1, condition1  condition2  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_condition1_2  gene1_condition2_2  
## Converting "ggplot" to "grob" ... 
## 
Spatial heatmap of Arabidopsis at two growth stages. The expression profile of gene1 under condition1 and condition2 is plotted for two growth stages (top and bottom row).

Figure 11: Spatial heatmap of Arabidopsis at two growth stages
The expression profile of gene1 under condition1 and condition2 is plotted for two growth stages (top and bottom row).

Note in Figure 11 shoots have thicker outlines than roots. This is another function of spatial_hm, i.e. preserves the outline thicknesses defined in aSVG files. This feature is particularly useful in cellular SHMs where different cell types have different cell-wall thicknesses. The outline widths can be updated with update_feature programatically or with Inkscape manually. The former is illustrated in the Supplementary Section.

5 Multi-Factor Spatial Heatmaps

In principle, the spatialHeatmap is extendable to as many factors (e.g. samples, conditions, times) as possible. The most common scenario involves only two factors of samples and conditions (Section 3). If more factors are relevant such as development stages, geographical locations, genotypes, etc, all or selected factors should be combined to generate composite factors. The spatiotemporal section is an illustration of three factors. Similar combining strategies should be appied in cases of four or more factors. A rule of thumb is the composite factors of interest must have a matching counterpart in the aSVG file, otherwise target spatial features are not painted in spatial heatmaps.

6 Overlaying

spatialHeatmap supports overlaying real images with SHMs, which provides more intuitive background. There are certain requirements to utilize this feature: 1) The real images are templates for creating aSVGs and the formats include .jpg, .JPG, .png, .PNG; 2) Except for file extensions, the file names of templates and aSVGs should be same such as real_shm.png, real_shm.svg. In other words, templates and aSVGs should be paired; and 3) If multiple templates/aSVGs are used such as developmental stages, names of separate template/aSVG files need to include the following suffixes: *_shm1.png, *_shm1.svg, *_shm2.png, *_shm2.svg, etc. Otherwise, the order of these images is not recognized.

In the following example, template images are modified from a study on abaxial bundle sheath (ABS) cells in maize leaves (Bezrutczyk et al. 2021). They are labeled 1 and 2 to mimic two developmental stages. The data are randomly generated.

The first template and paired aSVG.

tmp.pa1 <- system.file('extdata/shinyApp/example/maize_leaf_shm1.png', package='spatialHeatmap')
svg.pa1 <- system.file('extdata/shinyApp/example/maize_leaf_shm1.svg', package='spatialHeatmap')

The second template and paired aSVG.

tmp.pa2 <- system.file('extdata/shinyApp/example/maize_leaf_shm2.png', package='spatialHeatmap')
svg.pa2 <- system.file('extdata/shinyApp/example/maize_leaf_shm2.svg', package='spatialHeatmap')

Random data.

dat.overlay <- read_fr(system.file('extdata/shinyApp/example/dat_overlay.txt', package='spatialHeatmap'))
dat.overlay[1:2, ]
##       cell1 cell2
## gene1     7    65
## gene2    70    27

The real images may contain colors that interfere with SHMs. To address such issues, the argument alpha.overlay is developed to adjust transparency of real images. In Figure 12, expression profiles of gene1 are plotted on SHMs and SHMs are placed on top of template images.

spatial_hm(svg.path=c(svg.pa1, svg.pa2), data=dat.overlay, tmp.path=c(tmp.pa1, tmp.pa2), charcoal=FALSE, ID=c('gene1'), alpha.overlay=0.5, bar.width=0.09, sub.title.vjust=4)
## Coordinates: maize_leaf_shm1.svg ... 
## CPU cores: 1 
## Coordinates: maize_leaf_shm2.svg ... 
## CPU cores: 1 
## Features in data not mapped: cell2 
## Features in data not mapped: cell1 
## ggplots/grobs: maize_leaf_shm1.svg ... 
## ggplot: gene1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_con_1  
## Converting "ggplot" to "grob" ... 
##   
## ggplots/grobs: maize_leaf_shm2.svg ... 
## ggplot: gene1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_con_2  
## Converting "ggplot" to "grob" ... 
## 
Overlaying real images with SHMs (colorful backaground). The expression profiles of gene1 are plotted on ABS cells.

Figure 12: Overlaying real images with SHMs (colorful backaground)
The expression profiles of gene1 are plotted on ABS cells.

To reduce template colors to minimum, templates can be set black-white by charcoal=TRUE (Figure 13).

spatial_hm(svg.path=c(svg.pa1, svg.pa2), data=dat.overlay, tmp.path=c(tmp.pa1, tmp.pa2), charcoal=TRUE, ID=c('gene1'), alpha.overlay=0.5, bar.width=0.09, sub.title.vjust=4)
## Coordinates: maize_leaf_shm1.svg ... 
## CPU cores: 1 
## Coordinates: maize_leaf_shm2.svg ... 
## CPU cores: 1 
## Features in data not mapped: cell2 
## Features in data not mapped: cell1 
## ggplots/grobs: maize_leaf_shm1.svg ... 
## ggplot: gene1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_con_1  
## Converting "ggplot" to "grob" ... 
##   
## ggplots/grobs: maize_leaf_shm2.svg ... 
## ggplot: gene1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## gene1_con_2  
## Converting "ggplot" to "grob" ... 
## 
Overlaying real images with SHMs (black-white background). The expression profiles of gene1 are plotted on ABS cells.

Figure 13: Overlaying real images with SHMs (black-white background)
The expression profiles of gene1 are plotted on ABS cells.

7 Matrix Heatmaps

SHMs are suitable for comparing assay profiles among small number of items (e.g. few genes or proteins) across cell types and conditions. To also support analysis routines of larger number of items, spatialHeatmap integrates functionalities for identifying groups of items with similar and/or dissimilar assay profiles, and subsequently analyzing the results with data mining methods that scale to larger numbers of items than SHMs, such as hierarchical clustering and network analysis methods. The following introduces both approaches using as sample data the gene expression data from Arabidopsis shoots from the previous section.

To identify similar items, the submatrix function can be used. The latter identifies items sharing similar profiles with one or more query items of interest. The given example uses correlation coefficients as similarity metric. Three filtering parameters are provided to control the similarity and number of items in the result. First, the p argument allows to restrict the number of similar items to return based on a percentage cutoff relative to the number of items in the assay data set (e.g. 1% of the top most similar items). Second, a fixed number n of the most similar items can be returned. Third, all items above a user definable correlation coefficient cutoff value can be returned, such as a Pearson correlation coefficient (PCC) of at least 0.6. If several query items are provided then the function returns the similar genes for each query, while assuring uniqueness among items in the result.

The following example uses as query the two genes: ‘RCA’ and ‘HRE2’. The ann argument corresponds to the column name in the rowData slot containing gene annotation information. The latter is only relevant if users want to see these annotations when mousing over a node in the interactive network plot generated in the next section.

sub.mat <- submatrix(data=se.fil.arab, ann='Target.Description', ID=c('RCA', 'HRE2'), p=0.1)

The result from the previous step is the assay matrix subsetted to the genes sharing similar assay profiles with the query genes ‘RCA’ and ‘HRE2’.

sub.mat[c('RCA', 'HRE2'), c(1:3, 37)] # Subsetted assay matrix
##      root_total__control root_total__hypoxia root_p35S__control
## RCA             6.569305            6.416811           7.443822
## HRE2            5.486920           11.370161           5.578123
##                                                    Target.Description
## RCA  hypothetical protein ;supported by full-length cDNA: Ceres:7114.
## HRE2                         putative AP2 domain transcription factor

Subsequently, hierarchical clustering is applied to the subsetted assay matrix containing only the genes that share profile similarities with the query genes ‘RCA’ and ‘HRE2’. The clustering result is displayed as a matrix heatmap where the rows and columns are sorted by the corresponding hierarchical clustering dendrograms (Figure 14). The position of the query genes (‘RCA’ and ‘HRE2’) is indicated in the heatmap by black lines. Setting static=FALSE will launch the interactive mode, where users can zoom into the heatmap by selecting subsections in the image or zoom out by double clicking.

matrix_hm(ID=c('RCA', 'HRE2'), data=sub.mat, angleCol=80, angleRow=35, cexRow=0.8, cexCol=0.8, margin=c(10, 6), static=TRUE, arg.lis1=list(offsetRow=0.01, offsetCol=0.01))
Matrix Heatmap. Rows are genes and columns are samples. The input genes are tagged by black lines.

Figure 14: Matrix Heatmap
Rows are genes and columns are samples. The input genes are tagged by black lines.

8 Network Graphs

8.1 Module Identification

Network analysis is performed with the WGCNA algorithm (Langfelder and Horvath 2008; Ravasz et al. 2002) using as input the subsetted assay matrix generated in the previous section. The objective is to identify network modules that can be visualized in the following step in form of network graphs. Applied to the gene expression sample data used here, these network modules represent groups of genes sharing highly similar expression profiles. Internally, the network module identification includes five major steps. First, a correlation matrix is computed using distance or correlation-based similarity metrics. Second, the obtained matrix is transformed into an adjacency matrix defining the connections among items. Third, the adjacency matrix is used to calculate a topological overlap matrix (TOM) where shared neighborhood information among items is used to preserve robust connections, while removing spurious connections. Fourth, the distance transformed TOM is used for hierarchical clustering. To maximize time performance, the hierarchical clustering is performed with the flashClust package (Langfelder and Horvath 2012). Fifth, network modules are identified with the dynamicTreeCut package (Langfelder, Zhang, and Steve Horvath 2016). Its ds (deepSplit) argument can be assigned integer values from 0 to 3, where higher values increase the stringency of the module identification process. To simplify the network module identification process with WGCNA, the individual steps can be executed with a single function called adj_mod. The result is a list containing the adjacency matrix and the final module assignments stored in a data.frame. Since the interactive network feature used in the visualization step below performs best on smaller modules, only modules are returned that were obtained with stringent ds settings (here ds=2 and ds=3).

adj.mod <- adj_mod(data=sub.mat)

A slice of the adjacency matrix is shown below.

adj.mod[['adj']][1:3, 1:3]
##                 CA1    PSAH.1 AT2G26500
## CA1       1.0000000 0.9514016 0.9636366
## PSAH.1    0.9514016 1.0000000 0.9611725
## AT2G26500 0.9636366 0.9611725 1.0000000

The module assignments are stored in a data frame. Its columns contain the results for the ds=2 and ds=3 settings. Integer values \(>0\) are the module labels, while \(0\) indicates unassigned items. The following returns the first three rows of the module assignment result.

adj.mod[['mod']][1:3, ] 
##           2 3
## CA1       1 1
## PSAH.1    1 1
## AT2G26500 1 1

8.2 Module Visualization

Network modules can be visualized with the network function. To plot a module containing an item (gene) of interest, its ID (e.g. ‘HRE2’) needs to be provided under the corresponding argument. Typically, this could be one of the items chosen for the above SHM plots. There are two modes to visualize the selected module: static or interactive. Figure 15 was generated with static=TRUE. Setting static=FALSE will generate the interactive version. In the network plot shown below the nodes and edges represent genes and their interactions, respectively. The thickness of the edges denotes the adjacency levels, while the size of the nodes indicates the degree of connectivity of each item in the network. The adjacency and connectivity levels are also indicated by colors. For example, in Figure 15 connectivity increases from “turquoise” to “violet”, while the adjacency increases from “yellow” to “blue”. If a gene of interest has been assigned under ID, then its ID is labeled in the plot with the suffix tag: *_target.

network(ID="HRE2", data=sub.mat, adj.mod=adj.mod, adj.min=0.90, vertex.label.cex=1.2, vertex.cex=2, static=TRUE)
Static network. Node size denotes gene connectivity while edge thickness stands for co-expression similarity.

Figure 15: Static network
Node size denotes gene connectivity while edge thickness stands for co-expression similarity.

Setting static=FALSE launches the interactive network. In this mode there is an interactive color bar that denotes the gene connectivity. To modify it, the color labels need to be provided in a comma separated format, e.g.: yellow, orange, red. The latter would indicate that the gene connectivity increases from yellow to red.

If the subsetted expression matrix contains gene/protein annotation information in the last column, then it will be shown when moving the cursor over a network node.

network(ID="HRE2", data=sub.mat, adj.mod=adj.mod, static=FALSE)

9 Spatial Enrichment (SE)

This functionality is an extension of the SHMs. It identifies spatial feature-specifically expressed genes and thus enables the SHMs to visualize feature-specific profiles. It compares the target feature with all other selected features in a pairwise manner. The genes significantly up- or down-regulated in the target feature across all pairwise comparisons are denoted final target feature-specifically expressed genes. The base methods include edgeR (McCarthy et al. 2012), limma (Ritchie et al. 2015), DESeq2 (Love, Huber, and Anders 2014), distinct (Tiberi and Robinson. 2020). The feature-specific genes are first detected with each method independently then summarized across methods.

In addition to feature-specific genes, the SE is also able to identify genes specifically expressed in a certain condition or certain composite factor. The latter is a combination of multiple expermental factors. E.g. the spatiotemporal factor is a combination of feature and time points. See section 5 for details.

The application of SE is illustrated on the mouse brain in the following example, which could be extended to other examples in a similar way.

9.1 Mouse Brain

In this example, brain is selected as the target feature, liver and kidney as the reference features, and all the three strains DBA.2J, C57BL.6, CD1 as the factors.

All features.

unique(colData(rse.mus)$organism_part)
## [1] "brain"           "colon"           "heart"           "kidney"         
## [5] "liver"           "lung"            "skeletal muscle" "spleen"         
## [9] "testis"

All factors.

unique(colData(rse.mus)$strain)
## [1] "DBA.2J"  "C57BL.6" "CD1"

Subset the data according to the selected features and factors.

data.sub.mus <- sub_data(data=rse.mus, feature='organism_part', features=c('brain', 'liver', 'kidney'), factor='strain', factors=c('DBA.2J', 'C57BL.6', 'CD1'), com.by='feature', target='brain')

The SE requires replicates in the count data and has build-in normalizing utilities, thus the pre-processing only involves filtering.

data.sub.mus.fil <- filter_data(data=data.sub.mus, sam.factor='organism_part', con.factor='strain', pOA=c(0.2, 15), CV=c(0.8, 100))
## All values before filtering:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##       0.0       0.0       0.0     519.7      31.0 2920174.0 
## All coefficient of variances (CVs) before filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2026  0.8285  1.2990  1.4345  1.8179  3.0000 
## All values after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       4      42    1506     414 2920174 
## All coefficient of variances (CVs) after filtering:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8000  0.9902  1.2767  1.3160  1.6195  2.9769

The base methods in SE include four opitions: edgeR, limma, DESeq2, distinct, and the default are the first two. With each of the selected methods, the strains (factors) in each feature are treated as replicates and the target feature brain is compared with liver and kidney in a pairwise manner. The brain-specific genes are selected by log2 fold change (log2.fc) and FDR (fdr).

deg.lis.mus <- spatial_enrich(data.sub.mus.fil, methods=c('edgeR', 'limma'), log2.fc=1, fdr=0.05, aggr='mean', log2.trans.aggr=TRUE)
## edgeR ... 
## Normalizing ... 
## Computing DEGs ... 
## brain_VS_kidney 
## brain_VS_liver 
## kidney_VS_liver 
## brain up: 3681 ; down: 3043 
## kidney up: 1817 ; down: 631 
## liver up: 1795 ; down: 2011 
## Done! 
## limma ... 
## Normalising: TMM 
## brain up: 3712 ; down: 2981 
## kidney up: 1771 ; down: 542 
## liver up: 1747 ; down: 1639 
## Done! 
## DEG table ... 
## Normalising: CNF 
## method 
##  "TMM" 
## Computing CPM ... 
## Done!

The up- and down-regulated genes in brain can be accessed by method. The genes in edgeR can be accessed as following.

deg.lis.mus$lis.up.down$up.lis$edgeR.up[1:3] # Up-regulated.
## [1] "ENSMUSG00000026764" "ENSMUSG00000027350" "ENSMUSG00000053025"
deg.lis.mus$lis.up.down$down.lis$edgeR.down[1:3] # Down-regulated. 
## [1] "ENSMUSG00000025479" "ENSMUSG00000017950" "ENSMUSG00000025347"

Overlap of up-regulated genes in brain across methods in UpSet plot.

deg_ovl(deg.lis.mus$lis.up.down, type='up', plot='upset')
UpSet plot of up-regulated genes in mouse brain. The overlap of up-regulated genes detected by edgeR and limma is indicated by bars.

Figure 16: UpSet plot of up-regulated genes in mouse brain
The overlap of up-regulated genes detected by edgeR and limma is indicated by bars.

Overlap of up-regulated genes in brain across methods in matrix plot.

deg_ovl(deg.lis.mus$lis.up.down, type='up', plot='matrix')
Matrix plot of up-regulated genes in mouse brain. The matrix plot displays any overlap between up-regulated genes detected by edgeR and limma.

Figure 17: Matrix plot of up-regulated genes in mouse brain
The matrix plot displays any overlap between up-regulated genes detected by edgeR and limma.

The brain-specific genes are also summarized in a table. The type column indicates up- or down-regulated, the total column shows how many methods identify a certain gene as up or down, and the edgeR and limma columns have entry 1 if the method identifies a certain gene as up or down otherwise the entry will be 0. The data provided to spatial_enrich is normalized and replicates are aggregated internally, and appended to the right of the table.

deg.table.mus <- deg.lis.mus$deg.table; deg.table.mus[1:2, ] 
## DataFrame with 2 rows and 14 columns
##                 gene        type     total     edgeR     limma brain__DBA.2J
##          <character> <character> <numeric> <numeric> <numeric>     <numeric>
## 1 ENSMUSG00000026764          up         2         1         1      10.03888
## 2 ENSMUSG00000027350          up         2         1         1       8.69843
##   kidney__DBA.2J liver__DBA.2J brain__C57BL.6 kidney__C57BL.6 liver__C57BL.6
##        <numeric>     <numeric>      <numeric>       <numeric>      <numeric>
## 1        1.70799      -2.29798        9.97080         1.77276      -0.408155
## 2       -2.30229      -2.29798        8.92193        -2.32841      -2.526978
##   brain__CD1 kidney__CD1 liver__CD1
##    <numeric>   <numeric>  <numeric>
## 1    9.74457     1.57857   0.126663
## 2    8.84029    -1.04229  -3.463828

The numbers in total column are stringency measures of gene specificity, where the larger, the more strigent. For example, the up- and down-regulated genes in brain can be subsetted with the highest stringency total==2.

df.up.mus <- subset(deg.table.mus, type=='up' & total==2)
df.up.mus[1:2, ]
## DataFrame with 2 rows and 14 columns
##                 gene        type     total     edgeR     limma brain__DBA.2J
##          <character> <character> <numeric> <numeric> <numeric>     <numeric>
## 1 ENSMUSG00000026764          up         2         1         1      10.03888
## 2 ENSMUSG00000027350          up         2         1         1       8.69843
##   kidney__DBA.2J liver__DBA.2J brain__C57BL.6 kidney__C57BL.6 liver__C57BL.6
##        <numeric>     <numeric>      <numeric>       <numeric>      <numeric>
## 1        1.70799      -2.29798        9.97080         1.77276      -0.408155
## 2       -2.30229      -2.29798        8.92193        -2.32841      -2.526978
##   brain__CD1 kidney__CD1 liver__CD1
##    <numeric>   <numeric>  <numeric>
## 1    9.74457     1.57857   0.126663
## 2    8.84029    -1.04229  -3.463828
df.down.mus <- subset(deg.table.mus, type=='down' & total==2)
df.down.mus[1:2, ]
## DataFrame with 2 rows and 14 columns
##                 gene        type     total     edgeR     limma brain__DBA.2J
##          <character> <character> <numeric> <numeric> <numeric>     <numeric>
## 1 ENSMUSG00000025479        down         2         1         1      -1.91059
## 2 ENSMUSG00000017950        down         2         1         1      -3.46383
##   kidney__DBA.2J liver__DBA.2J brain__C57BL.6 kidney__C57BL.6 liver__C57BL.6
##        <numeric>     <numeric>      <numeric>       <numeric>      <numeric>
## 1        12.0379       14.8190       -2.27321         11.8227        14.8985
## 2        10.5755       11.1949       -3.46383         10.5404        11.5768
##   brain__CD1 kidney__CD1 liver__CD1
##    <numeric>   <numeric>  <numeric>
## 1   -2.83300     12.0990    14.6375
## 2   -3.46383     10.9252    11.5412

Create SHMs on one up-regulated (ENSMUSG00000026764) and one down-regulated (ENSMUSG00000025479) gene.

spatial_hm(svg.path=svg.mus, data=deg.lis.mus$deg.table, ID=c('ENSMUSG00000026764', 'ENSMUSG00000025479'), legend.r=1, legend.nrow=3, sub.title.size=6.1, ncol=3, bar.width=0.11)
## Coordinates: mus_musculus.male.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## ggplots/grobs: mus_musculus.male.svg ... 
## ggplot: ENSMUSG00000026764, DBA.2J  C57BL.6  CD1  
## ggplot: ENSMUSG00000025479, DBA.2J  C57BL.6  CD1  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## ENSMUSG00000026764_DBA.2J_1  ENSMUSG00000026764_C57BL.6_1  ENSMUSG00000026764_CD1_1  ENSMUSG00000025479_DBA.2J_1  ENSMUSG00000025479_C57BL.6_1  ENSMUSG00000025479_CD1_1  
## Converting "ggplot" to "grob" ... 
## 
Spatially-enriched mouse genes. The two genes in the image are enriched in the mouse brain relative to kidney and liver. Top: down-regulated in brain. Bottom: up-regulated in brain.

Figure 18: Spatially-enriched mouse genes
The two genes in the image are enriched in the mouse brain relative to kidney and liver. Top: down-regulated in brain. Bottom: up-regulated in brain.

Line graph of expression profiles of the two genes in (Figure 18).

profile_gene(rbind(df.up.mus[1, ], df.down.mus[1, ]))
Line graph of gene expression profiles. The count data is normalized and replicates are aggregated.

Figure 19: Line graph of gene expression profiles
The count data is normalized and replicates are aggregated.

10 Integrated Co-visualization

The co-visuaization is a novel functionality to co-visualize single cell and bulk data in forms of embedding plots (PCA, UMAP, TSNE) and SHMs respectively. The key feature is single cells in a cluster are matched to source bulk tissue. Two matching approaches are provided: manual and automatic.

10.1 Manual matching

In manual matching, single cell clusters can be automatically detected or pre-defined by users. When matching with source bulk tissues, the matching relationship needs to be manualy defined. In R-command line, the matching is defined in a name list while in Shiny App it is simply dragging and dropping.

The example single cell data of mouse brain are from a oligodendrocyte heterogeneity study in mouse central nervous system (Marques et al. 2016). Before co-visualizing, single cell data are pre-processed and clustered, which is learned from Bioconductor OCSA. Since these steps are not the focus, details are not explained.

10.1.1 Clustering single cells

To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.

set.seed(10)

Read the example single cell data.

  sce.manual.pa <- system.file("extdata/shinyApp/example", "sce_manual_mouse.rds", package="spatialHeatmap")
  sce.manual <- readRDS(sce.manual.pa)

Quality control through mitochondria and spike-in genes.

  stats <- perCellQCMetrics(sce.manual, subsets=list(Mt=rowData(sce.manual)$featureType=='mito'), threshold=1)
  sub.fields <- 'subsets_Mt_percent'
  ercc <- 'ERCC' %in% altExpNames(sce.manual)
  if (ercc) sub.fields <- c('altexps_ERCC_percent', sub.fields)
  qc <- perCellQCFilters(stats, sub.fields=sub.fields, nmads=3)
  # Discard unreliable cells.
  colSums(as.matrix(qc))
##            low_lib_size          low_n_features high_subsets_Mt_percent 
##                       0                       0                       0 
##                 discard 
##                       0
  sce.manual <- sce.manual[, !qc$discard]

Normalization.

  clusters <- quickCluster(sce.manual)
  sce.manual <- computeSumFactors(sce.manual, cluster=clusters) 
  sce.manual <- logNormCounts(sce.manual)

Dimensionality reduction through PCA, UMAP, and TSNE.

  # Variance modelling.
  df.var <- modelGeneVar(sce.manual)
  top.hvgs <- getTopHVGs(df.var, prop = 0.1, n = 3000)
  # Dimensionality reduction. 
  sce.manual <- denoisePCA(sce.manual, technical=df.var, subset.row=top.hvgs)
  sce.manual <- runTSNE(sce.manual, dimred="PCA")
  sce.manual <- runUMAP(sce.manual, dimred = "PCA")

Clustering. Cell clusters are detected by first building a graph object then partitioning the graph, where cells are nodes in the graph.

  # Build graph.
  snn.gr <- buildSNNGraph(sce.manual, use.dimred="PCA") 
  # Partition graph to detect cell clusters.
  cluster <- paste0('clus', cluster_walktrap(snn.gr)$membership)
  table(cluster)
## cluster
## clus1 clus2 clus3 clus4 clus5 clus6 
##   101    83    50    59    29    20

Cell cluster assignments need to be store in the colData slot of SingleCellExperiment. Cell clusters/groups pre-defined by users need to be stored in the label column while cell clusters automatically detected by clustering algorithm are stored in the cluster column. If there are experimental variables such as treatments or time points, they should be stored in the expVar column.

  cdat <- colData(sce.manual) 
  lab.lgc <- 'label' %in% make.names(colnames(cdat))
  if (lab.lgc) {
    cdat <- cbind(cluster=cluster, colData(sce.manual))
    idx <- colnames(cdat) %in% c('cluster', 'label')
    cdat <- cdat[, c(which(idx), which(!idx))]
  } else cdat <- cbind(cluster=cluster, colData(sce.manual))
  colnames(cdat) <- make.names(colnames(cdat))
  colData(sce.manual) <- cdat; cdat[1:3, ]
## DataFrame with 3 rows and 8 columns
##       cluster        label         age inferred.cell.type
##   <character>  <character> <character>        <character>
## 1       clus3 hypothalamus         p22              NFOL1
## 2       clus3       SN.VTA         p22              NFOL2
## 3       clus4  dorsal.horn         p20              NFOL2
##                      Sex      strain      expVar sizeFactor
##              <character> <character> <character>  <numeric>
## 1                      F      C57BL6     control    1.22941
## 2 pooled male and female         CD1     control    1.16755
## 3                      ?      C57BL6     control    1.08944

Embedding plot of single cells with clusters labeled by the label column in colData.

 plotUMAP(sce.manual, colour_by="label")
Embedding plot single cells. The cell clusters are defined by the `label` column in `colData`.

Figure 20: Embedding plot single cells
The cell clusters are defined by the label column in colData.

10.1.2 aSVG of Mouse Brain

The spatial features in mouse brain aSVG are extracted. They are the bulk tissues to be matched with cell clusters.

  svg.mus <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
  # Spatial features to match with single cell clusters.
  feature.df <- return_feature(svg.path=svg.mus)
## Accessing features... 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## mus_musculus.brain.svg,
  feature.df$feature
##  [1] "path16294"                    "path16312"                   
##  [3] "path16316"                    "path5402"                    
##  [5] "medulla.oblongata"            "cerebral.cortex"             
##  [7] "corpus.striatum"              "diencephalon"                
##  [9] "pituitary.gland"              "hippocampus"                 
## [11] "cerebellum"                   "brainstem"                   
## [13] "midbrain"                     "dorsal.plus.ventral.thalamus"
## [15] "hypothalamus"                 "nose"                        
## [17] "corpora.quadrigemina"

The single cell clusters can be matched to bulk tissues according to cluster assignments in the label or cluster column in colData.

10.1.3 Matching by Custom Clusters

The custom cell clusters are defined in the label column of colData, which are cell sources provided in the original study.

  unique(colData(sce.manual)$label) 
##  [1] "hypothalamus"    "SN.VTA"          "dorsal.horn"     "cortex.S1"      
##  [5] "Amygdala"        "corpus.callosum" "zona.incerta"    "striatum"       
##  [9] "hippocampus.CA1" "dentate.gyrus"

Aggregate cells by clusters defined in the label column.

 sce.manual.aggr <- aggr_rep(sce.manual, assay.na='logcounts', sam.factor='label', con.factor='expVar', aggr='mean')
## Syntactically valid column names are made!

Manually create the matching list.

lis.match <- list(hypothalamus=c('hypothalamus'), cortex.S1=c('cerebral.cortex'))

Co-visualization on gene St18.

  shm.lis <- spatial_hm(svg.path=svg.mus, data=sce.manual.aggr, ID=c('St18'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=2, sce.dimred=sce.manual, dimred='PCA', cell.group='label', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.1, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## Features in data not mapped: SN.VTA, dorsal.horn, cortex.S1, Amygdala, corpus.callosum, zona.incerta, striatum, hippocampus.CA1, dentate.gyrus 
## ggplots/grobs: mus_musculus.brain.svg ... 
## ggplot: St18, control  6h.post.stress  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## St18_control_1  St18_6h.post.stress_1  
## Converting "ggplot" to "grob" ... 
##   
## Converting "ggplot" to "grob" ... 
## dim_St18_control_1  dim_St18_6h.post.stress_1
Co-visualizing single cells and bulk tissues by custom cell clusters. The expression profiles of gene `St18` are used.

Figure 21: Co-visualizing single cells and bulk tissues by custom cell clusters
The expression profiles of gene St18 are used.

10.1.4 Matching by Auto-Clusters

Automatically detected cell clusters are stored in the cluster column in colData.

 unique(colData(sce.manual)$cluster) 
## [1] "clus3" "clus4" "clus2" "clus1" "clus5" "clus6"

Aggregate cells by clusters defined in the label column.

  sce.manual.aggr <- aggr_rep(sce.manual, assay.na='logcounts', sam.factor='cluster', con.factor=NULL, aggr='mean')
## Syntactically valid column names are made!

Manually create the matching list.

  lis.match <- list(clus1=c('hypothalamus'), clus3=c('cerebral.cortex', 'midbrain'))

Co-visualization on gene St18.

shm.lis <- spatial_hm(svg.path=svg.mus, data=sce.manual.aggr, ID=c('St18'), height=0.7, legend.r=1.5, legend.key.size=0.02, legend.text.size=12, legend.nrow=3, sce.dimred=sce.manual, dimred='PCA', cell.group='cluster', assay.na='logcounts', tar.cell=c('matched'), lis.rematch=lis.match, bar.width=0.11, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## Features in data not mapped: clus3, clus4, clus2, clus1, clus5, clus6 
## ggplots/grobs: mus_musculus.brain.svg ... 
## ggplot: St18, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## St18_con_1  
## Converting "ggplot" to "grob" ... 
##   
## Converting "ggplot" to "grob" ... 
## dim_St18_con_1
Co-visualizing single cells and bulk tissues by automatically-detected cell clusters. The expression profiles of gene `St18` are used.

Figure 22: Co-visualizing single cells and bulk tissues by automatically-detected cell clusters
The expression profiles of gene St18 are used.

10.2 Auto-Matching

Except for being manually defined, the matching between cells and bulk tissues in co-visualization can also be automatic. The automatic process is carried out by combining and co-clustering bulk and single cell data. The bulk and single cell data need to be derived from the same organ or same large tissue. If same organ, the single cell data are assayed on the whole organ. By contrast, if single cell data are from a whole tissue, the bulk tissues should be sub-tissues.

The potential applications of auto-matching include: 1) Reduce single cell RNA-seq (scRNA-seq) complexities. In conventional scRNA-seq, there is ususally a complex and laborious stage of isolating single cells. Auto-matching has the potential to avoid such processes since it only requires scRNA-seq on a whole organ; 2) Discover novel cell types. The cells with bulk tissue assignments are assumed to be major populations in the bulk, while cells without bulk assignemnts are likely to be novel cell types or cells at the bulk tissue boundaries; and 3) Estimate cellular compositions. If bulk tissues are representative of a whole organ, cellular compositions of the organ could be estimated according to their bulk tissue assignments. This application is useful in disease diagnose and treatment, since it helps to analyze each cell type’s contribution to the disease.

The auto-matching includes two main functionalities. One is the auto-matching optimization and another is the co-visualization. The former is developed to optimize the workflow on trainning data sets, while the latter uses the optimized parameter settings for co-visualization.

10.2.1 Workflow Overview

Figure 23 is the co-clustering workflow overview. The inputs are RNA-seq raw count data of bulk tissues and single cells of the same organ (Figure 23.1). The single cells should come from the whole organ or at least covers the bulk tissues. The identities of each bulk tissue and each cell need to be labeled so as to evaluate the co-clustering performance. Bulk and single cell data are initially filtered at low strigency. The main difference between bulk and single cells is the sparsity in the latter. To reduce such difference, the bulk and single-cell data are combined, normalized, and then separated (Figure 23.2). After separation, the normalized bulk data are filtered to remove genes of low and constant expressions (Figure 23.3). The normalized single-cell data are also filtered to remove genes and cells having high zero-count rates. After filtered, the gene dimensionalities of single-cell data are reduced using PCA or UMAP method, and the top dimensionalities are clustered (Figure 23.4). In each cell cluster, cells having low similarities with other cells in the same cluster are filtered (Figure 23.5), and therefore the remaining clusters are more homogeneous (Figure 23.6). The filtered bulk and filtered single cells are combined and co-clustered (Figure 23.7).

The results include three types of co-clusters: 1) Two bulk tissues are clustered with cells. The source bulk is assigned to each cell according to Spearman’s correlation coefficient. For example, in Figure 23.8a bulk A is assigned to cell a1 because a1 has higher similarity with A than B. Since the true source bulk of a1 is A, this assignment is TRUE. By contrast, cell b1 also has higher similarity with A than B, and A is assigned to b1, but this assignment is labeled FALSE since the true source bulk of b1 is B; 2). Only one bulk tissue is clustered with cells. This bulk is assigned to all the cells in the same co-cluster (23.8b); and 3) No bulk is included. All these cells are discarded (Figure 23.8c). Lastly, the Spearman’s correlation coefficient and TRUE or FALSE assignments are used to create ROC plots and evaluate the performance (Figure 23.9).

Overview of coclustering. The coclustering are illustrated in 9 steps. Basically, bulk and single data are initially filtered, then combined, normalized, and separated. The normalized bulk and single cell data are filtered again. Single cells are clustered and resulting clusters are refined. Subsequently bulk and single cells are combined and co-clustered. In each co-cluster, bulk tissues are assigned to cells. The assignments are evaluated by ROC curves.

Figure 23: Overview of coclustering
The coclustering are illustrated in 9 steps. Basically, bulk and single data are initially filtered, then combined, normalized, and separated. The normalized bulk and single cell data are filtered again. Single cells are clustered and resulting clusters are refined. Subsequently bulk and single cells are combined and co-clustered. In each co-cluster, bulk tissues are assigned to cells. The assignments are evaluated by ROC curves.

10.2.2 Optimization Demonstration

Since real optimizations have high demand on computing power and take a long time, it is demonstrated on toy data. Thus the result parameter settings may not be really optimal. The example bulk and single cell RNA-seq data are from Arabidopsis thaliana (Arabidopsis) root. Bulk tissue data comprise all the major root tissues such as epidermis, cortex, endodermis, xylem, columella, which are generated in a research on alternative splicing and lincRNA regulation (Li et al. 2016). The two single cell data sets are derived from the whole root, which are produced in a study of single cell Arabidopsis root atlas (Shahan et al. 2020). The identities of bulk and single cells are all labeled.

The optimization focuses on parameters of normalization methods, filtering, dimensionality reduction methods, refining homogeneous cell clusters, number of top dimensionalities in co-clustering, graph-building methods in co-clustering. The optimization is performed by running the co-clustering workflow (Figure 23) on each of the single cell data sets. The parameter settings being optimized is fixed and all settings of other parameters are varied across all possible combinations.

Each running of the workflow yields an AUC value, thus after running the workflow on all possible settings combinations one parameter settings has a set of AUC values. The AUCs are filtered according to some criteria and the remaining AUCs are averaged. A settings with a higher mean AUC than its counterparts are taken as optimal in a parameter. For example, when optimizing dimensionality reduction methods, the settings are PCA and UMAP. If the mean of remaining AUCs of PCA is 0.6 while UMAP is 0.55, PCA is regarded as the optimal.

Since optimzation on example data also takes a relatively long time, most of the following steps are not evaluated. A common computer with 4G memory and 4 CPUs is enough to run the following optimization process.

To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.

set.seed(10)

Read bulk and two single cell data.

  blk <- readRDS(system.file("extdata/cocluster/data", "bulk_cocluster.rds", package="spatialHeatmap")) # Bulk.
  sc10 <- readRDS(system.file("extdata/cocluster/data", "sc10_cocluster.rds", package="spatialHeatmap")) # Single cell.
  sc11 <- readRDS(system.file("extdata/cocluster/data", "sc11_cocluster.rds", package="spatialHeatmap")) # Single cell.
  blk; sc10; sc11
## class: SummarizedExperiment 
## dim: 2805 45 
## metadata(0):
## assays(1): counts
## rownames(2805): AT1G01070 AT1G01120 ... ATCG00650 ATCG00770
## rowData names(0):
## colnames(45): PHLM_COMP PHLM_COMP ... HAIR CORT
## colData names(0):
## class: SingleCellExperiment 
## dim: 577 1893 
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(1893): atricho endo ... atricho cortex
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
## class: SingleCellExperiment 
## dim: 577 2364 
## metadata(0):
## assays(1): counts
## rownames(577): AT1G01470 AT1G02400 ... AT5G66870 AT5G67080
## rowData names(0):
## colnames(2364): atricho tricho ... endo atricho
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

These example data are already pre-processed. To demonstrate the optimization process the pre-processing steps are perfomed again with few genes or cells removed.

Inital filtering with low strigency before normalization.

  blk <- filter_data(data=blk, pOA=c(0.2, 15), CV=c(1.5, 100))
  fil.init <- filter_cell(lis=list(sc10=sc10, sc11=sc11), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.3, p.in.gen=0.1); fil.init

Combine and normalize bulk and single cell data, then separate them. By default computeSumFactors (fct) in scran package is used (Lun, McCarthy, and Marioni 2016). If cpm=TRUE, additional normalization of counts per million is applied.

 norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
 norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm

Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created. In bulk data, genes with expression values over A across samples of over proportion p and with coefficinet of variance (CV) between cv1 and cv2 are retained. In cell data, genes with expression values over min.cnt of at least proportion p.in.gen are retained, and cells with with expression values over min.cnt of at least proportion p.in.cell are retained.

 df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
 df.par.fil
##     p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100       1      0.10     0.01
## 2 0.2 1 0.2 100       1      0.25     0.05
## 3 0.3 1 0.3 100       1      0.30     0.10
## 4 0.4 1 0.4 100       1      0.35     0.15

Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.

  if (!dir.exists('opt_res')) dir.create('opt_res')
  fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='fct')
  
  cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_res', norm.meth='cpm')

To evaluate the downstream auto-matching performance, a ground-truth matching relationship is required in form of data.frame. The cell and trueBulk refer to bulk tissue identifiers in aSVG files, single cell identifiers and bulk tissue identifiers in the data for co-clustering, respectively. If a cell matches multiple bulk tissues, bulk identifiers are separated by comma, semicolon, or single space such as NONHAIR,LRC_NONHAIR. The SVGBulk is the bulk identifiers in aSVG files, which are recognized in co-visualization.

  match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
  df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
  df.match.arab[1:3, ]
##   SVGBulk           cell            trueBulk
## 1 NONHAIR        atricho NONHAIR,LRC_NONHAIR
## 2    COLU colu.dist.colu                COLU
## 3    COLU  colu.dist.lrc                COLU

In real application, the whole optimization takes a long time and requires a lot of computation power. For example, combined bulk and cell data with 6945 genes and 7747 samples requires about 20G memory for coclustering. To speed up computation, the optimization function coclus_opt provides two levels of parallel computing through BiocParallel (Morgan et al. 2021). The first one is BatchtoolsParam and relies on the slurm scheduler and the second one utilizes MulticoreParam.

Before optimzation, users could check the parallelization guide by setting parallel.info=TRUE, then it returns the max possible parallelizations for each level respectively.

  coclus_opt(wk.dir='opt_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1))

A slurm template is provided for the first level parallelization. Users are advised to make a new copy and set slurm parameters in the new copy.

  file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')

If slurm is installed, users could take advantage of two levels of parallelizations. For instance, the first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering.

sim and sim.p are parameters in refining cell clusters (Figure 23.5). Specifically, in a cell cluster, cells having similarities over sim with other cells in the same cluster of at least proportion sim.p would remain. sim is Spearman’ or Pearson’s correlation coefficient. dim is the number of top dimensionalities (equivalent to genes) in co-clustering. Since the three parameters are related to each other, they are treated as a set spd.set.

  opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)

If slurm is not available, optimization can be parallelized only at the second-level by setting batch.par=NULL.

opt <- coclus_opt(wk.dir='opt_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.4, by=0.1), sim.p=seq(0.2, 0.4, by=0.1), dim=seq(5, 7, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))

The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.

Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.fil <- auc_stat(wk.dir='opt_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each filtering settings and AUC cutoff.

df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings')
Mean AUCs of filtering settings. One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.

Figure 24: Mean AUCs of filtering settings
One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.

All AUCs by each filtering settings and AUC cutoff.

 auc_violin(df.lis=df.lis.fil, xlab='Filtering settings')
All AUCs of filtering settings. A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.

Figure 25: All AUCs of filtering settings
A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.

According to the mean AUCs, optimal filtering settings are fil1, fil2, fil3.

df.par.fil[c(1, 2, 3), ]
##     p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100       1      0.10     0.01
## 2 0.2 1 0.2 100       1      0.25     0.05
## 3 0.3 1 0.3 100       1      0.30     0.10

Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

  df.lis.norm <- auc_stat(wk.dir='opt_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each normalization method and AUC cutoff.

df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods')

All AUCs by each normalization method and AUC cutoff.

auc_violin(df.lis=df.lis.norm, xlab='Normalization methods')

Optimal normalization method: fct (computeSumFactors).

Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.graph <- auc_stat(wk.dir='opt_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each graph-building method and AUC cutoff.

df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')

All AUCs by each graph-building method and AUC cutoff.

auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')

Optimal graph-building methods: knn (buildKNNGraph).

Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.dimred <- auc_stat(wk.dir='opt_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each dimensionality reduction method and AUC cutoff.

df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')

All AUCs by each dimensionality reduction method and AUC cutoff.

auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')

Optimal dimensionality reduction method: pca (denoisePCA).

Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.spd <- auc_stat(wk.dir='opt_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]

All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)

Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets.

n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
  df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}
df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)
df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]

In real application, the optimized settings need to be validated on data sets from other organs of different species, which is presented below.

10.2.3 Optimization in Real Case

Ideally, the co-clustering should be optimized on different organs from different organisms as many possible. The single cell data need to be generated on whole organs and each cell’s identity need to be labeled. Such data are less common and not easy to obtain in public databases, since most single cell RNA-seq (scRNA-seq) assays only focus on specific cell populations rather than whole organs, which are isolated by microdissection or fluorescent assisted cell sorting (FACS). As a result, the co-clustering optimization is performed only on five single cell data sets of Arabidopsis thaliana (Arabidopsis) root. The optimized parameter settings are validated on mouse brain and kidney.

The optimization in real case has high demand on computing power and takes a long time, so most of the following steps are not evaluated. The following steps are not explained in details since they are the same as last section.

The bulk (Li et al. 2016) and five single cell (Shahan et al. 2020) data sets of Arabidopsis root are accessed from the same studies as last section. Details about how to access and format them are described here. In the following, blk.arb.rt refers to bulk data and sc.arab.rt10, sc.arab.rt11, sc.arab.rt12, sc.arab.rt30, sc.arab.rt31 refers to the five single cell data sets respectively.

To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.

set.seed(10)

Inital filtering with low strigency before normalization.

blk.arab.rt <- filter_data(data=blk.arab.rt, pOA=c(0.05, 5), CV=c(0.05, 100))
fil.init <- filter_cell(lis=list(sc10=sc.arab.rt10, sc11=sc.arab.rt11, sc12=sc.arab.rt12, sc30=sc.arab.rt30, sc31=sc.arab.rt31), bulk=blk, gen.rm='^ATCG|^ATCG', min.cnt=1, p.in.cell=0.01, p.in.gen=0.05); fil.init

Combine and normalize bulk and single cell data, then separate them.

norm.fct <- norm_multi(dat.lis=fil.init, cpm=FALSE) # fct.
norm.cpm <- norm_multi(dat.lis=fil.init, cpm=TRUE) # fct + cpm

Secondary filtering with higher strigency after normalization. Four sets of filtering parameter settings are created.

df.par.fil <- data.frame(p=c(0.1, 0.2, 0.3, 0.4), A=rep(1, 4), cv1=c(0.1, 0.2, 0.3, 0.4), cv2=rep(100, 4), min.cnt=rep(1, 4), p.in.cell=c(0.1, 0.25, 0.3, 0.35), p.in.gen=c(0.01, 0.05, 0.1, 0.15))
df.par.fil
##     p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100       1      0.10     0.01
## 2 0.2 1 0.2 100       1      0.25     0.05
## 3 0.3 1 0.3 100       1      0.30     0.10
## 4 0.4 1 0.4 100       1      0.35     0.15

Filter bulk and cell data using the four filtering settings. The results are automatically saved in the working directory wk.dir and are recognized in the downstream. Thus the working directory should be the same across the entire workflow.

if (!dir.exists('opt_real_res')) dir.create('opt_real_res')

fct.fil.all <- filter_iter(bulk=norm.fct$bulk, cell.lis=list(sc10=norm.fct$sc10, sc11=norm.fct$sc11, sc12=norm.fct$sc12, sc30=norm.fct$sc30, sc31=norm.fct$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='fct')
  
cpm.fil.all <- filter_iter(bulk=norm.cpm$bulk, cell.lis=list(sc10=norm.cpm$sc10, sc11=norm.cpm$sc11, sc12=norm.cpm$sc12, sc30=norm.cpm$sc30, sc31=norm.cpm$sc31), df.par.fil=df.par.fil, gen.rm='^ATCG|^ATCG', wk.dir='opt_real_res', norm.meth='cpm')

Ground-truth matching relationship across cell, trueBulk, and SVGBulk.

match.pa <- system.file("extdata/cocluster/data", "match_arab_root_cocluster.txt", package="spatialHeatmap")
df.match.arab <- read.table(match.pa, header=TRUE, row.names=1, sep='\t')
df.match.arab[1:3, ]
##   SVGBulk           cell            trueBulk
## 1 NONHAIR        atricho NONHAIR,LRC_NONHAIR
## 2    COLU colu.dist.colu                COLU
## 3    COLU  colu.dist.lrc                COLU

The max possible parallelizations for each level respectively.

coclus_opt(wk.dir='opt_real_res', parallel.info=TRUE, dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1))

Make a new copy of the default slurm template and set parameters in the new copy.

file.copy(system.file("extdata/cocluster", "slurm.tmpl", package="spatialHeatmap"), './slurm.tmpl')

If slurm is installed, users could take advantage of two levels of parallelizations. For instance, the first- and second-level parallelizations are set 3 and 2 cpu cores respectively. The wk.dir is the same in secondary filtering. Note the settings of spd.set (sim/sim.p/dim) has wider ranges than in last section. The parallel computation is performed at High-Performance Computing Center (HPCC) at University of California, Riverside.

opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=BatchtoolsParam(workers=3, cluster="slurm", template='slurm.tmpl', RNGseed=100), multi.core.par=MulticoreParam(workers=2), verbose=FALSE)

If slurm is not available, optimization can be parallelized only at the second-level by setting batch.par=NULL.

opt <- coclus_opt(wk.dir='opt_real_res', dimred=c('PCA', 'UMAP'), graph.meth=c('knn', 'snn'), sim=seq(0.2, 0.8, by=0.1), sim.p=seq(0.2, 0.8, by=0.1), dim=seq(5, 40, by=1), df.match=df.match.arab, batch.par=NULL, multi.core.par=MulticoreParam(workers=2))

The performace of each combination of parameter settings on each single cell data set is measured by an AUC value in ROC curve. These AUCs are filtered according to a cutoff (aucs over 0.5) and corresponding total bulk assignments (total.min) and total true assignments (true.min). The following demonstrates how to visualize the AUCs and select optimal parameter settings.

Extract AUCs for each filtering settings across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.fil <- auc_stat(wk.dir='opt_real_res', tar.par='filter', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each filtering settings and AUC cutoff.

df.lis.fil$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.fil[[1]], bar.width=0.07, title='Mean AUCs by filtering settings')
Mean AUCs of filtering settings in real optimization. One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.

Figure 26: Mean AUCs of filtering settings in real optimization
One bar refers to mean AUCs of a filtering settings at a certain AUC cutoff.

All AUCs by each filtering settings and AUC cutoff.

 auc_violin(df.lis=df.lis.fil, xlab='Filtering settings')
All AUCs of filtering settings in real optimization. A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.

Figure 27: All AUCs of filtering settings in real optimization
A violin plot refers to all AUCs of a filtering settings at a certain AUC cutoff.

According to the mean AUCs, fil1 and fil2 are selected as optimal filtering settings.

df.par.fil[c(1, 2), ]
##     p A cv1 cv2 min.cnt p.in.cell p.in.gen
## 1 0.1 1 0.1 100       1      0.10     0.01
## 2 0.2 1 0.2 100       1      0.25     0.05

Extract AUCs for normalization methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.norm <- auc_stat(wk.dir='opt_real_res', tar.par='norm', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each normalization method and AUC cutoff.

df.lis.norm$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.norm[[1]], bar.width=0.07, title='Mean AUCs by normalization methods')

All AUCs by each normalization method and AUC cutoff.

auc_violin(df.lis=df.lis.norm, xlab='Normalization methods')

Optimal normalization method: fct (computeSumFactors).

Extract AUCs for graph-building methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.graph <- auc_stat(wk.dir='opt_real_res', tar.par='graph', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each graph-building method and AUC cutoff.

df.lis.graph$df.auc.mean[1:3, ]
mean_auc_bar(df.lis.graph[[1]], bar.width=0.07, title='Mean AUCs by graph-building methods')

All AUCs by each graph-building method and AUC cutoff.

auc_violin(df.lis=df.lis.graph, xlab='Graph-building methods')

Since knn (buildKNNGraph) and snn (buildSNNGraph) have similar mean AUCs, both are selected as optimal graph-building methods.

Extract AUCs for dimensionality reduction methods across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.dimred <- auc_stat(wk.dir='opt_real_res', tar.par='dimred', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))

Mean AUCs by each dimensionality reduction method and AUC cutoff.

df.lis.dimred$df.auc.mean[1:3, ]
# Mean AUCs by each dimensionality reduction method and AUC cutoff.
mean_auc_bar(df.lis.dimred[[1]], bar.width=0.07, title='Mean AUCs by dimensionality reduction methods')

All AUCs by each dimensionality reduction method and AUC cutoff.

auc_violin(df.lis=df.lis.dimred, xlab='Dimensionality reduction')

Optimal dimensionality reduction method: pca (denoisePCA).

Extract AUCs for spd.set across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

df.lis.spd <- auc_stat(wk.dir='opt_real_res', tar.par='spd.set', total.min=500, true.min=300, aucs=round(seq(0.5, 0.9, 0.1), 1))
df.lis.spd$auc0.5$df.frq[1:3, ]

All AUCs of top five spd.sets ranked by frequency across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively.

spd_auc_violin(df.lis=df.lis.spd, n=5, xlab='spd.sets', x.vjust=0.6)

Top five spd.sets across aucs at 0.5, 0.6, 0.7, 0.8, 0.9 respectively are taken as optimal spd.sets. s, p, d stands for sim, sim.p, dim respectively. E.g. s0.2p0.5d12 means sim = 0.2, sim.p = 0.5, dim = 12.

n <- 5; df.spd.opt <- NULL
for (i in df.lis.spd) {
  df.spd.opt <- rbind(df.spd.opt, i$df.frq[seq_len(n), c('sim', 'sim.p', 'dim')])
}

df.spd.opt$spd.set <- paste0('s', df.spd.opt$sim, 'p', df.spd.opt$sim.p, 'd', df.spd.opt$dim)

df.spd.opt <- subset(df.spd.opt, !duplicated(spd.set))
df.spd.opt[1:3, ]
##     sim sim.p dim    spd.set
## 74  0.2   0.4   6 s0.2p0.4d6
## 183 0.2   0.7   7 s0.2p0.7d7
## 182 0.2   0.7   6 s0.2p0.7d6

The optimal parameter settings at this stage are listed in the table below.


Table 3: Optimized parameter settings on Arabidopsis thaliana root data sets
normalization filtering.set dimensionality.reduction graph.building spd.set
fct fil1, fil2 pca knn, snn s0.2p0.4d6
fct fil1, fil2 pca knn, snn s0.2p0.7d7
fct fil1, fil2 pca knn, snn s0.2p0.7d6
fct fil1, fil2 pca knn, snn s0.2p0.4d7
fct fil1, fil2 pca knn, snn s0.2p0.6d6
fct fil1, fil2 pca knn, snn s0.3p0.4d7
fct fil1, fil2 pca knn, snn s0.2p0.2d7
fct fil1, fil2 pca knn, snn s0.2p0.2d12
fct fil1, fil2 pca knn, snn s0.3p0.5d12
fct fil1, fil2 pca knn, snn s0.3p0.5d11
fct fil1, fil2 pca knn, snn s0.2p0.8d12
fct fil1, fil2 pca knn, snn s0.2p0.5d12
fct fil1, fil2 pca knn, snn s0.3p0.5d13
fct fil1, fil2 pca knn, snn s0.3p0.2d13
fct fil1, fil2 pca knn, snn s0.4p0.2d13
fct fil1, fil2 pca knn, snn s0.3p0.3d13
fct fil1, fil2 pca knn, snn s0.4p0.2d12
fct fil1, fil2 pca knn, snn s0.2p0.2d21
fct fil1, fil2 pca knn, snn s0.3p0.6d36
fct fil1, fil2 pca knn, snn s0.3p0.4d12
fct fil1, fil2 pca knn, snn s0.4p0.3d14

Next, these optimal settings are validated on mouse brain, mouse kindney, and Arabidopsis root data sets. In mouse brain, the bulk RNA-seq data are generated in a research on the impact of placental endocrine on mouse cerebellar development (Vacher et al. 2021) and the scRNA-seq data are from a study of mouse brain molecular atlas (Ortiz et al. 2020). The bulk count data are produced using systemPipeR (2.1.12) (Backman and Girke 2016). Details about how to access and format bulk and single data are described here. In the following, blk.mus.brain and sc.mus.brain refers to bulk and single cell data respectively. The validation is performed by applying these optimal settings on the same coclustering workflow, so the following procedures are not detailed.

Initial filtering.

blk.mus.brain <- filter_data(data=blk.mus.brain, pOA=c(0.05, 5), CV=c(0.05, 100)) 
mus.brain.lis <- filter_cell(lis=list(sc.mus=sc.mus.brain), bulk=blk.mus.brain, gen.rm=NULL, min.cnt=1, p.in.cell=0.01, p.in.gen=0.05, verbose=FALSE) 

Bulk and single cell are combined and normalized, then separated.

mus.brain.lis.nor <- norm_multi(dat.lis=mus.brain.lis, cpm=FALSE)

Secondary filtering. Since fil1 and fil2 exhibit similar performaces, only fil1 is used.

blk.mus.brain.fil <- filter_data(data=mus.brain.lis.nor$bulk, pOA=c(0.1, 1), CV=c(0.1, 100), verbose=FALSE) 
mus.brain.lis.fil <- filter_cell(lis=list(sc.mus=mus.brain.lis.nor$sc.mus), bulk=blk.mus.brain.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.1, p.in.gen=0.01, verbose=FALSE)

Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).

match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
##           SVGBulk      cell    trueBulk
## 1      cerebellum      cere        CERE
## 2     hippocampus      hipp        HIPP
## 3 cerebral.cortex   isocort CERE.CORTEX
## 4     hippocampus retrohipp        HIPP
## 5    hypothalamus   hypotha     HYPOTHA

Since knn and snn display similar performances, only knn is used. All optimal spd.set settings in Table 3 are tested, and results are shown in Figure 28a.

mus.brain.df.para <- cocluster(bulk=mus.brain.lis.fil$bulk, cell=mus.brain.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.spd.opt[, c('sim', 'sim.p', 'dim')], graph.meth='knn', dimred='PCA', return.all=FALSE, multi.core.par=MulticoreParam(workers=2))

In mouse kidney, four bulk tissues are selected: proximal straight tubule in cortical medullary rays (PTS2), cortical collecting duct (CCD), and cortical thick ascending limb of the loop of Henle (cTAL), glomerulus. PTS2 data are from a research on cell-type selective markers in mouse kidney (Clark et al. 2019), CCD and cTAL are from transcriptome analysis of major renal collecting duct cell types in mouse kidney (Chen et al. 2017), and glomerulus is from a transcriptome atlas study of mouse glomerulus (Karaiskos et al. 2018). The FASTQ files of the four tissues are downloaded from original studies and raw count data are generated with systemPipeR (2.1.12) (Backman and Girke 2016). The single cell data are accessed from an investigation in cellular targets of mouse kidney metabolic acidosis (Park et al. 2018). Details about how to access and format bulk and single data are described here.

The validating procedures on mouse kindey are same with mouse brain except that after initial filtering replicates in each bulk are reduced to 3 by using function reduce_rep due to two many replicates. The results are shown in Figure 28b.

In Arabidopsis root, the same bulk tissues (Li et al. 2016) and two additional single cell data sets (sc9, sc51, (Shahan et al. 2020)) from the same studies as real optimization are used. Details about how to access and format them are described here. The procedures of validating optimized settings are the same with mouse brain except that in normalization two single cell data sets are used instead of one. The results are shown in Figure 28c and d. 

As comparisons, random combinations of non-optimal settings are generated and tested. The graph-building methods have two settings knn and snn, and both are taken as optimal, thus they are all used for generating random combinations.

df.par.rdn <- random_para(fil.set=c('fil3', 'fil4'), norm='cpm', dimred='umap', graph.meth=c('knn', 'snn'), sim=round(seq(0.2, 0.8, by=0.1), 1), sim.p=round(seq(0.2, 0.8,by=0.1), 1), dim=seq(5, 40, by=1), df.spd.opt=df.spd.opt)
df.par.rnd[1:3,  ]

These random settings are tested on each of the four validating data sets, where other settings such as initial filtering are not changed. The results are shown in Figure 29c and d.

The AUCs of optimal and random settings are presented in Figure 28 and Figure 29 respectively. In both figures, if total bulk assignments < 500 or total true assignments < 300, AUCs are set 0. It is clear that the optimal settings exhibit better performance than random settings, so the optimization workflow is reliable at least to some extent. In Figure 28, asterisks indicate optimal settings have AUCs >= 0.5, total bulk assignments >= 500, total true assignments >= 300 across all four data sets. These settings are regarded as final optimal settings (Table 4).

Validating optimal settings. AUCs of optimal settings on each validating data sets. One bar refers to one AUC of one optimal settings. Asterisks indicate optimal settings have AUC >= 0.5, total bulk assignments >= 500, total true assignments >= 300 across all four data sets. If total bulk assignments < 500 or total true assignments < 300, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.

Figure 28: Validating optimal settings
AUCs of optimal settings on each validating data sets. One bar refers to one AUC of one optimal settings. Asterisks indicate optimal settings have AUC >= 0.5, total bulk assignments >= 500, total true assignments >= 300 across all four data sets. If total bulk assignments < 500 or total true assignments < 300, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.

Random settings. AUCs of random settings on each validating data sets. One bar refers to one AUC of one optimal settings. If total bulk assignments < 500 or total true assignments < 300, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.

Figure 29: Random settings
AUCs of random settings on each validating data sets. One bar refers to one AUC of one optimal settings. If total bulk assignments < 500 or total true assignments < 300, AUCs are set 0. (a) Mouse brain. (b) Mouse kidndy. (c) Arabidopsis root of single cell 9. (d) Arabidopsis root of single cell 51.


Table 4: Final optimal parameter settings after validation.
normalization filtering.set dimensionality.reduction graph.building spd.set
fct fil1, fil2 pca knn, snn s0.2p0.2d12
fct fil1, fil2 pca knn, snn s0.3p0.5d11
fct fil1, fil2 pca knn, snn s0.2p0.8d12
fct fil1, fil2 pca knn, snn s0.2p0.5d12
fct fil1, fil2 pca knn, snn s0.3p0.5d13

10.2.4 Application on Mouse Brain

This section demonstrates co-visulization of bulk and single cells using the auto-matching functionality on mouse brain data. The bulk (Vacher et al. 2021) and single cell (Ortiz et al. 2020) RNA-seq data are from the same studies as in validating optimal settings. Both data sets are reduced for demonstration purpose. The optimal settings of fct, fil1, pca, knn, s0.2p0.8d12 in Table 4 are used.

To obtain reproducible results, always start a new R session and set a fixed seed for Random Number Generator at the beginning, which is required only once in each R session.

set.seed(10)

Read bulk and single cell data.

# Example bulk data.
blk.mus.pa <- system.file("extdata/shinyApp/example", "bulk_mouse_cocluster.txt", package="spatialHeatmap") 
blk.mus <- as.matrix(read.table(blk.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
blk.mus[1:3, 1:5]
##          CERE.CORTEX HIPP HYPOTHA CERE CERE.CORTEX
## AI593442         177  256      50   24         285
## Actr3b           513 1465     228  244         666
## Adcy1            701 1243      57 1910         836
# Example single cell data.
sc.mus.pa <- system.file("extdata/shinyApp/example", "cell_mouse_cocluster.txt", package="spatialHeatmap") 
sc.mus <- as.matrix(read.table(sc.mus.pa, header=TRUE, row.names=1, sep='\t', check.names=FALSE))
sc.mus[1:3, 1:5]
##          isocort isocort isocort isocort olfa
## AI593442       2       2       2       5    0
## Actr3b         3       5       4       4    1
## Adcy1          3       6       6       6    0

Initial filtering.

blk.mus <- filter_data(data=blk.mus, sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 5), CV=c(0.2, 100), verbose=FALSE) 
mus.lis <- filter_cell(lis=list(sc.mus=sc.mus), bulk=blk.mus, gen.rm=NULL, min.cnt=1, p.in.cell=0.5, p.in.gen=0.1, verbose=FALSE) 

Normalization. Bulk and single cell are combined and normalized, then separated.

mus.lis.nor <- read_cache(cache.pa, 'mus.lis.nor') # Retrieve data from cache.
if (is.null(mus.lis.nor)) { # Save normalized data to cache if not cached.
  mus.lis.nor <- norm_multi(dat.lis=mus.lis, cpm=FALSE)
  save_cache(dir=cache.pa, overwrite=TRUE, mus.lis.nor)
}

Secondary filtering.

blk.mus.fil <- filter_data(data=logcounts(mus.lis.nor$bulk), sam.factor=NULL, con.factor=NULL, pOA=c(0.1, 0.5), CV=c(0.2, 100), verbose=FALSE) 
mus.lis.fil <- filter_cell(lis=list(sc.mus=logcounts(mus.lis.nor$sc.mus)), bulk=blk.mus.fil, gen.rm=NULL, min.cnt=1, p.in.cell=0.05, p.in.gen=0.02, verbose=FALSE)

The aSVG file of mouse brain.

svg.mus <- system.file("extdata/shinyApp/example", "mus_musculus.brain.svg", package="spatialHeatmap")
# Spatial features.  
feature.df <- return_feature(svg.path=svg.mus)
## Accessing features... 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## mus_musculus.brain.svg,

Matching table indicating true bulk tissues of each cell type and corresponding SVG bulk (spatial feature).

match.mus.brain.pa <- system.file("extdata/shinyApp/example", "match_mouse_brain_cocluster.txt", package="spatialHeatmap")
df.match.mus.brain <- read.table(match.mus.brain.pa, header=TRUE, row.names=1, sep='\t')
df.match.mus.brain
##           SVGBulk      cell    trueBulk
## 1      cerebellum      cere        CERE
## 2     hippocampus      hipp        HIPP
## 3 cerebral.cortex   isocort CERE.CORTEX
## 4     hippocampus retrohipp        HIPP
## 5    hypothalamus   hypotha     HYPOTHA

The SVG bulk tissues are in the aSVG file.

df.match.mus.brain$SVGBulk %in% feature.df$feature
## [1] TRUE TRUE TRUE TRUE TRUE

Cluster single cells. Cluster labels are stored in label column in colData.

clus.sc <- read_cache(cache.pa, 'clus.sc') # Retrieve data from cache.
if (is.null(clus.sc)) {
  clus.sc <- cluster_cell(data=mus.lis.fil$sc.mus, min.dim=10, max.dim=50, graph.meth='knn', dimred='PCA')
  save_cache(dir=cache.pa, overwrite=TRUE, clus.sc)
}
colData(clus.sc)[1:3, ]
## DataFrame with 3 rows and 2 columns
##               label        cell
##         <character> <character>
## isocort          17     isocort
## isocort          12     isocort
## isocort          12     isocort

Refine cell clusters.

cell.refined <- refine_cluster(clus.sc, sim=0.2, sim.p=0.8, sim.meth='spearman', verbose=FALSE)

Include matching information in colData.

cell.refined <- true_bulk(cell.refined, df.match.mus.brain)
colData(cell.refined)[1:3, ]
## DataFrame with 3 rows and 5 columns
##                 label        cell index.all    trueBulk         SVGBulk
##           <character> <character> <integer> <character>     <character>
## isocort             1     isocort       354 CERE.CORTEX cerebral.cortex
## corti.sub           1   corti.sub      1491        none            none
## hipp                1        hipp      2372        HIPP     hippocampus

Cocluster bulk and single cells.

roc.lis <- read_cache(cache.pa, 'roc.lis') # Retrieve data from cache.
if (is.null(roc.lis)) {
  roc.lis <- coclus_roc(bulk=mus.lis.fil$bulk, cell.refined=cell.refined, df.match=df.match.mus.brain, min.dim=12, graph.meth='knn', dimred='PCA') 
  save_cache(dir=cache.pa, overwrite=TRUE, roc.lis)
}

The auto-matching results are listed in roc.lis$df.roc. predictor is the similarity (Spearman’s or Pearson’s correlation coefficient) between bulk and cells within a co-cluster, which is used to assign bulk tissues to cells (Figure 23.8). response indicates whether the bulk assignment is correct according to the matching table. index is the cell index in the SingleCellExperiment after cell clusters are refined.

roc.lis$df.roc[1:3, ]
##       assignedBulk    cell response predictor index trueBulk      SVGBulk
## CERE          CERE hypotha    FALSE 0.6433566  3176  HYPOTHA hypothalamus
## CERE1         CERE hypotha    FALSE 0.4685315  3179  HYPOTHA hypothalamus
## CERE2         CERE hypotha    FALSE 0.4685315  3205  HYPOTHA hypothalamus
table(roc.lis$df.roc$response)
## 
## FALSE  TRUE 
##    23   551

ROC curve is created according to roc.lis$df.roc and the AUC value indicates the auto-matching performance.

plot(roc.lis$roc.obj, print.auc=TRUE)
ROC curve of auto-matching on mouse brain data. The AUC value is a measure of accuracy on bulk assignments.

Figure 30: ROC curve of auto-matching on mouse brain data
The AUC value is a measure of accuracy on bulk assignments.

Incorporate cell.refined in roc.lis for downstream co-visualization.

res.lis <- c(list(cell.refined=cell.refined), roc.lis)

The processes of clustering single cells, refining cell clusters, and coclustering bulk and single cells can be performed in a single function call. Setting return.all=TRUE returns a list of refined cell clusters, ROC object, and a data.frame of auto-matching results. If return.all=FALSE, a data.frame of parameter settings and resulting AUC is returned.

res.lis <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=NULL, sim=0.2, sim.p=0.8, dim=12, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE)
res.lis <- res.lis[[1]]

The function cocluster accepts multiple combinations of parameter settings provided in a data.frame, and coclustering on these combinations can be performed in parallel on multiple cpu cores through multi.core.par.

Multiple combinations of parameter settings. If some parameters are not specified in this table such as graph.meth, their default settings will be used.

df.par <- data.frame(sim=c(0.2, 0.3), sim.p=c(0.8, 0.7), dim=c(12, 13))
df.par
##   sim sim.p dim
## 1 0.2   0.8  12
## 2 0.3   0.7  13

The coclustering is run on 2 cpu cores (workers=2).

res.multi <- cocluster(bulk=mus.lis.fil$bulk, cell=mus.lis.fil$sc.mus, df.match=df.match.mus.brain, df.para=df.par, sc.dim.min=10, max.dim=50, sim=0.2, sim.p=0.8, dim=12, graph.meth='knn', dimred='PCA', sim.meth='spearman', return.all=TRUE, multi.core.par=MulticoreParam(workers=2))

The auto-matching results through coclustering can be tailored through “Lasso Select” on the convenience Shiny app (desired_bulk_shiny) or manually defining desired bulk. If the former, save cell.refined in an .rds file by saveRDS(cell.refined, file='cell.refined.rds') and upload cell.refined.rds to the Shiny app.

Example of desired bulk downloaded from the convenience Shiny app.

desired.blk.pa <- system.file("extdata/shinyApp/example", "selected_cells_with_desired_bulk.txt", package="spatialHeatmap")
df.desired.bulk <- read.table(desired.blk.pa, header=TRUE, row.names=1, sep='\t')
df.desired.bulk[1:3, ]
##          x        y key desiredSVGBulk dimred
## 1 4.389422 6.917510  62     cerebellum    PCA
## 2 4.586877 8.077207  75     cerebellum    PCA
## 3 6.366188 7.695782  76     cerebellum    PCA

Before manually defining desired bulk, check cells in the embedding plot.

plot_dim(res.lis$cell.refined, dim='PCA', color.by='cell', x.break=seq(-10, 10, 2), y.break=seq(-10, 10, 2))
PCA embedding plot of mouse brain single cell data. Single cell data after cluster refining are plotted.

Figure 31: PCA embedding plot of mouse brain single cell data
Single cell data after cluster refining are plotted.

Manually define desired bulk for certain cells by x-y coordinates ranges in the embedding plot. The dimred reveals where the coordinates come from and are required.

df.desired.bulk <- data.frame(x.min=c(2, 6), x.max=c(4, 10), y.min=c(6, 8), y.max=c(8, 10), desiredSVGBulk=c('cerebral.cortex', 'cerebral.cortex'), dimred='PCA')
df.desired.bulk
##   x.min x.max y.min y.max  desiredSVGBulk dimred
## 1     2     4     6     8 cerebral.cortex    PCA
## 2     6    10     8    10 cerebral.cortex    PCA

Extract cells with bulk assignments. If df.desired.bulk is provided a value, the corrresponding assignments are incorporated in res.lis$df.roc, and response and predictor is set TRUE and 1 respectively. thr is a cutoff for the predictor in res.lis$df.roc, so thr=0 denotes predictor is not filtered. true.only=TRUE indicates only true assignments are extracted.

sce.lis <- sub_asg(res.lis=res.lis, thr=0, df.desired.bulk=df.desired.bulk, df.match=df.match.mus.brain, true.only=TRUE)
## No cells selected for row: 2!

Aggregate extracted cells by SVGBulk. The aggregated cells are equivalent to bulk tissues (spatial features) in the aSVG. The aggregated abundance profiles are used to color matching bulk tissues in the aSVG image.

sce.aggr <- aggr_rep(data=sce.lis$cell.sub, assay.na='logcounts', sam.factor='SVGBulk', con.factor=NULL, aggr='mean')

Co-visualize bulk and single cells with aggregated abundance profiles of gene Adcy1. tar.bulk refers to the target bulk in aSVG and all corresponding cells are highlighted. Cells with true assignments of tar.bulk are colored according to the color key, while other corresponding cells with false or without assignments are colored black. All other cells not corresponding to tar.bulk are in gray.

shm.lis1 <- spatial_hm(svg.path=svg.mus, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', assay.na='logcounts', tar.bulk=c('hippocampus'), profile=TRUE, dim.lgd.text.size=10, dim.lgd.nrow=1, bar.width=0.1)
## Coordinates: mus_musculus.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## ggplots/grobs: mus_musculus.brain.svg ... 
## ggplot: Adcy1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## Adcy1_con_1  
## Converting "ggplot" to "grob" ... 
##   
## The reduced dimensionality in "data" is used: PCA .
## Converting "ggplot" to "grob" ... 
## dim_Adcy1_con_1
Co-visualizing bulk and single cells of mouse brain with abundance profiles. The aggregated expression profiles of gene `Adcy1` are plotted.

Figure 32: Co-visualizing bulk and single cells of mouse brain with abundance profiles
The aggregated expression profiles of gene Adcy1 are plotted.

Co-visualize bulk and single cells without abundance profiles.

shm.lis2 <- spatial_hm(svg.path=svg.mus, data=sce.aggr, ID=c('Adcy1'), legend.nrow=4, sce.dimred=sce.lis$cell.refined, dimred='PCA', tar.bulk=c('hippocampus'), profile=FALSE, dim.lgd.text.size=10, dim.lgd.nrow=1)
## Coordinates: mus_musculus.brain.svg ... 
## CPU cores: 1 
## Element "a" is removed: a4174 !
## 
## Recommendation: please remove the 'transform' attribute with a 'matrix' value in the following groups by ungrouping and regrouping the respective groups in Inkscape. Otherwise, colors in spatial heatmap might be shifted! 
## EFO_0000530 
## 
## ggplots/grobs: mus_musculus.brain.svg ... 
## ggplot: Adcy1, con  
## Legend plot ... 
## CPU cores: 1 
## Converting "ggplot" to "grob" ... 
## Adcy1_con_1  
## Converting "ggplot" to "grob" ... 
##   
## The reduced dimensionality in "data" is used: PCA .
## Converting "ggplot" to "grob" ... 
## dim_mus_musculus.brain.svg
Co-visualizing bulk and single cells of mouse brain without abundance profiles. The matching between cells and SVG bulk is denoted by color in-between.

Figure 33: Co-visualizing bulk and single cells of mouse brain without abundance profiles
The matching between cells and SVG bulk is denoted by color in-between.

10.2.5 Using Shiny App

The auto-matching utility is included in the Shiny app. To use it, the bulk, single cell data, and matching table need to be stored in a SingleCellExperiment object. Bulk and single cell raw count data are combined and stored in the assay slot and are labeled by bulk and cell respectively by the column bulkCell in colData slot. The matching table is stored in the metadata list with the name df.match. The example below illustrates these rules.

sce.auto <- readRDS(system.file("extdata/shinyApp/example", 'sce_auto_bulk_cell_mouse_brain.rds', package="spatialHeatmap"))
colData(sce.auto)
## DataFrame with 4466 rows and 1 column
##                bulkCell
##             <character>
## CERE.CORTEX        bulk
## HIPP               bulk
## HYPOTHA            bulk
## CERE               bulk
## CERE.CORTEX        bulk
## ...                 ...
## retrohipp          cell
## retrohipp          cell
## retrohipp          cell
## retrohipp          cell
## retrohipp          cell
metadata(sce.auto)$df.match
##           SVGBulk      cell    trueBulk
## 1      cerebellum      cere        CERE
## 2     hippocampus      hipp        HIPP
## 3 cerebral.cortex   isocort CERE.CORTEX
## 4     hippocampus retrohipp        HIPP
## 5    hypothalamus   hypotha     HYPOTHA

11 Shiny App

In additon to running spatialHeatmap from R, the package includes a Shiny App that provides access to the same functionalities from an intuitive-to-use web browser interface. Apart from being very user-friendly, this App conveniently organizes the results of the entire visualization workflow in a single browser window with options to adjust the parameters of the individual components interactively. For instance, genes can be selected and replotted in the SHM simply by clicking the corresponding rows in the expression table included in the same window. This representation is very efficient in guiding the interpretation of the results in a visual and user-friendly manner. For testing purposes, the spatialHeatmap Shiny App also includes ready-to-use sample expression data and aSVG images along with embedded user instructions.

11.1 Local System

The Shiny App of spatialHeatmap can be launched from an R session with the following function call.

shiny_shm()

The dashboard panels of the Shiny App are organized as follows:

  1. Landing Page: gallery of pre-upload examples, menu for uploading custom files, selecting default files, or downloading example files.
  2. Spatial Heatmap: spatially colored images based on aSVG images selected and/or uploaded by user, matrix heatmap, network graph, data table (replicates aggregated).
  3. Spatial Enrichment: data table (with replicates), menu for enrichment, enrichment results.
  4. Parameter Control Menu: postioned on top of each panel.

A screenshot is shown below depicting an SHM plot generated with the Shiny App of spatialHeatmap (Figure 34).

Screenshot of spatialHeatmap's Shiny App.

Figure 34: Screenshot of spatialHeatmap’s Shiny App

After launching, the Shiny App displays by default one of the included data sets.

The assay data and aSVG images are uploaded to the Shiny App as tabular files (e.g. in CSV or TSV format) and SVG files, respectively. To also allow users to upload gene expression data stored in SummarizedExperiment objects, one can export it from R to a tabular file with the filter_data function, where the user specifies the directory path under the dir argument. This will create in the target directory a tablular file named customData.txt in TSV format. The column names in this file preserve the experimental design information from the colData slot by concatenating the corresponding sample and condition information separated by double underscores. An example of this format is shown in Table 1.

se.fil.arab <- filter_data(data=se.aggr.sh, ann="Target.Description", sam.factor='sample', con.factor='condition', pOA=c(0.03, 6), CV=c(0.30, 100), dir='./')

To interactively access gene-, transcript- or protein-level annotations in the plots and tables of the Shiny App, such as viewing functional descriptions by moving the cursor over network nodes, the corresponding annotation column needs to be present in the rowData slot and its column name assigned to the metadata argument. In the exported tabular file, the extra annotation column is appended to the expression matrix.

Alternatively, once can export the three slots (assay, colData, rowData) of SummarizedExperiment in three tabular files and upload them separately.

11.2 Server Deployment

As most Shiny Apps, spatialHeatmap can be deployed as a centralized web service. A major advantage of a web server deployment is that the functionalities can be accessed remotely by anyone on the internet without the need to use R on the user system. For deployment one can use custom web servers or cloud services, such as AWS, GCP or shinysapps.io. An example web instance for testing spatialHeatmap online is available here.

11.3 Custom Shiny App

The spatialHeatmap package also allows users to create customized Shiny App instances using the custom_shiny function. This function provides options to include custom example data and aSVGs, and define default values within most visualization panels (e.g. color schemes, image dimensions). For details users want to consult the help file of the custom_shiny function. To maximize flexibility, the default parameters are stored in a yaml file on the Shiny App. This makes it easy to refine and optimize default parameters simply by changing this yaml file.

11.4 Database Backend

To maintain scalability, the customized Shiny app is designed to work with HDF5-based database (Fischer, Smith, and Pau 2020; Pagès 2020), which enables users to incorporate data and aSVGs in a batch. The database is constructed with the function write_hdf5. Basically, the accepted data formats are data.frame or SummarizedExperiment. Each data set is saved in an independent HDF5 database. Meanwhile, a pairing table describing matchings between data and aSVGs is required. All individual databases and the pairing table are then compressed in the file “data_shm.tar”. Accordingly, all aSVG files should be compressed in another tar file such as “aSVGs.tar”. Finally, these two tar files are included in the Shiny app by feeding their paths in a list to custom_shiny.

Except for user-provided data, the database is able to store data sets downloaded from GEO and Expression Atlas. The detailed examples of the database utility are documented in the help file of write_hdf5.

12 Supplementary Section

12.1 Numeric Data

The numceric data used to color the features in aSVG images can be provided as three different object types including vector, data.frame and SummerizedExperiment. When working with complex omics-based assay data then the latter provides the most flexibility, and thus should be the preferred container class for managing numeric data in spatialHeatmap. Both data.frame and SummarizedExperiment can hold data from many measured items, such as many genes or proteins. In contrast to this, the vector class is only suitable for data from single items. Due to its simplicity this less complex container is often useful for testing or when dealing with simple data sets.

In data assayed only at spatial dimension, there are two factors samples and conditions, while data assayed at spatial and temporal dimension contains an additional factor time points or development stages. The spatialHeatmap is able to work with both data types. In this section, the application of SHMs on spatial data is explained first in terms of the three object types, which is more popular. Later, the spatiotemporal usage of SHMs is presented in a specific subsection.

12.1.1 Object Types

This subsection refers to data assayed only at spatial dimension, where two factors samples and conditions are involved.

12.1.1.1 vector

When using numeric vectors as input to spatial_hm, then their name slot needs to be populated with strings matching the feature names in the corresponding aSVG. To also specify conditions, their labels need to be appended to the feature names with double underscores as separator, i.e. ’feature__condition’.

The following example replots the toy example for two spatial features (‘occipital lobe’ and ‘parietal lobe’) and two conditions (‘1’ and ‘2’).

vec <- sample(x=1:100, size=5) # Random numeric values
names(vec) <- c('occipital lobe__condition1', 'occipital lobe__condition2', 'parietal lobe__condition1', 'parietal lobe__condition2', 'notMapped') # Assign unique names to random values
vec
## occipital lobe__condition1 occipital lobe__condition2 
##                          9                         74 
##  parietal lobe__condition1  parietal lobe__condition2 
##                         76                         55 
##                  notMapped 
##                         72

With this configuration the resulting plot contains two spatial heatmap plots for the human brain, one for ‘condition 1’ and another one for ‘contition 2’. To keep the build time and storage size of this package to a minimum, the spatial_hm function call in the code block below is not evaluated. Thus, the corresponding SHM is not shown in this vignette.

spatial_hm(svg.path=svg.hum, data=vec, ID='toy', ncol=1, legend.r=1.2, sub.title.size=14, ft.trans='g4320')

12.1.1.2 data.frame

Compared to the above vector input, data.frames are structured here like row-wise appended vectors, where the name slot information in the vectors is stored in the column names. Each row also contains a name that corresponds to the corresponding item name, such as a gene ID. The naming of spatial features and conditions in the column names follows the same conventions as the naming used for the name slots in the above vector example.

The following illustrates this with an example where a numeric data.frame with random numbers is generated containing 20 rows and 5 columns. To avoid name clashes, the values in the rows and columns should be unique.

df.test <- data.frame(matrix(sample(x=1:1000, size=100), nrow=20)) # Create numeric data.frame
colnames(df.test) <- names(vec) # Assign column names
rownames(df.test) <- paste0('gene', 1:20) # Assign row names
df.test[1:3, ]
##       occipital lobe__condition1 occipital lobe__condition2
## gene1                        438                        338
## gene2                        423                        797
## gene3                        511                        285
##       parietal lobe__condition1 parietal lobe__condition2 notMapped
## gene1                       694                       857       906
## gene2                       955                       209       972
## gene3                       754                        48       435

With the resulting data.frame one can generate the same SHM as in the previous example, that used a named vector as input to the spatial_hm function. Additionally, one can now select each row by its name (here gene ID) under the ID argument.

spatial_hm(svg.path=svg.hum, data=df.test, ID=c('gene1'), ncol=1, legend.r=1.2, sub.title.size=14)

Additional information can be appended to the data.frame column-wise, such as annotation data including gene descriptions. This information can then be displayed interactively in the network plots of the Shiny App by placing the curser over network nodes.

df.test$ann <- paste0('ann', 1:20)
df.test[1:3, ]
##       occipital lobe__condition1 occipital lobe__condition2
## gene1                        438                        338
## gene2                        423                        797
## gene3                        511                        285
##       parietal lobe__condition1 parietal lobe__condition2 notMapped  ann
## gene1                       694                       857       906 ann1
## gene2                       955                       209       972 ann2
## gene3                       754                        48       435 ann3

12.1.1.3 SummarizedExperiment

The SummarizedExperiment class is a much more extensible and flexible container for providing metadata for both rows and columns of numeric data stored in tabular format.

To import experimental design information from tabular files, users can provide a target file that will be stored in the colData slot of the SummarizedExperiment (SE, Morgan et al. (2018)) object. In other words, the target file provides the metadata for the columns of the numeric assay data. Usually, the target file contains at least two columns: one for the features/samples and one for the conditions. Replicates are indicated by identical entries in these columns. The actual numeric matrix representing the assay data is stored in the assay slot, where the rows correspond to items, such as gene IDs. Optionally, additional annotation information for the rows (e.g. gene descriptions) can be stored in the rowData slot.

For constructing a valid SummarizedExperiment object, that can be used by the spatial_hm function, the target file should meet the following requirements.

  1. It should be imported with read.table or read.delim into a data.frame or the data.frame can be constructed in R on the fly (as shown below).

  2. It should contain at least two columns. One column represents the features/samples and the other one the conditions. The rows in the target file correspond to the columns of the numeric data stored in the assay slot. If the condition column is empty, then the same condition is assumed under the corresponding features/samples entry.

  3. The feature/sample names must have matching entries in corresponding aSVG to be considered in the final SHM. Note, the double underscore is a special string that is reserved for specific purposes in spatialHeatmap, and thus should be avoided for naming feature/samples and conditions.

The following example illustrates the design of a valid SummarizedExperiment object for generating SHMs. In this example, the ‘occipital lobe’ tissue has 2 conditions and each condition has 2 replicates. Thus, there are 4 assays for occipital lobe, and the same design applies to the parietal lobe tissue.

sample <- c(rep('occipital lobe', 4), rep('parietal lobe', 4))
condition <- rep(c('condition1', 'condition1', 'condition2', 'condition2'), 2)
target.test <- data.frame(sample=sample, condition=condition, row.names=paste0('assay', 1:8))
target.test
##                sample  condition
## assay1 occipital lobe condition1
## assay2 occipital lobe condition1
## assay3 occipital lobe condition2
## assay4 occipital lobe condition2
## assay5  parietal lobe condition1
## assay6  parietal lobe condition1
## assay7  parietal lobe condition2
## assay8  parietal lobe condition2

The assay slot is populated with a 8 x 20 data.frame containing random numbers. Each column corresponds to an assay in the target file (here imported into colData), while each row corresponds to a gene.

df.se <- data.frame(matrix(sample(x=1:1000, size=160), nrow=20))
rownames(df.se) <- paste0('gene', 1:20)
colnames(df.se) <- row.names(target.test)
df.se[1:3, ]
##       assay1 assay2 assay3 assay4 assay5 assay6 assay7 assay8
## gene1    334    227    571    562    412    454      8    703
## gene2    860    508    302    685    284    645    294    134
## gene3    930     27     51    151    561    873    220    679

Next, the final SummarizedExperiment object is constructed by providing the numeric and target data under the assays and colData arguments, respectively.

se <- SummarizedExperiment(assays=df.se, colData=target.test)
se
## class: SummarizedExperiment 
## dim: 20 8 
## metadata(0):
## assays(1): ''
## rownames(20): gene1 gene2 ... gene19 gene20
## rowData names(0):
## colnames(8): assay1 assay2 ... assay7 assay8
## colData names(2): sample condition

If needed row-wise annotation information (e.g. for genes) can be included in the SummarizedExperiment object as well. This can be done during the construction of the initial object, or as below by injecting the information into an existing SummarizedExperiment object.

rowData(se) <- df.test['ann']

In this simple example, possible normalization and filtering steps are skipped. Yet, the aggregation of replicates is performed as shown below.

se.aggr <- aggr_rep(data=se, sam.factor='sample', con.factor='condition', aggr='mean')
## Syntactically valid column names are made!
assay(se.aggr)[1:3, ]
##       occipital.lobe__condition1 occipital.lobe__condition2
## gene1                      280.5                      566.5
## gene2                      684.0                      493.5
## gene3                      478.5                      101.0
##       parietal.lobe__condition1 parietal.lobe__condition2
## gene1                     433.0                     355.5
## gene2                     464.5                     214.0
## gene3                     717.0                     449.5

With the fully configured SummarizedExperiment object, a similar SHM is plotted as in the previous examples.

spatial_hm(svg.path=svg.hum, data=se.aggr, ID=c('gene1'), ncol=1, legend.r=1.2, sub.title.size=14, ft.trans=c('g4320'))

12.1.2 Spatiotemporal Data

The above explanations on the three object types are applicable to data at a single spatial dimension directly. If the data is measured at spatial and temporal dimension, there are three factors: samples, conditions, and time points such as development stages. The three object types are still applicable, but the formatting convention is slightly different.

Specifically, there are three options to format the spatiotemporal data: 1) Combine samples and conditions. In vector names and data.frame column names, the composite sample-condition factor and time factor should be concatenated by double underscore, while in SummarizedExperiment the former and latter should be regarded as the “sample” and “condition” columns respectively; 2) Combine samples and times. In vector names and data.frame column names, the composite sample-time factor and condition factor should be concatenated by double underscore (see here), while in SummarizedExperiment the former and latter should be regarded as the “sample” and “condition” columns respectively; or 3) Combine samples, conditions, and times. The composite sample-time-condition factor will be the full names in vector and full column names in data.frame without the double underscore (see here), while in SummarizedExperiment they will be the “sample” column and the “condition” column will be ignored (see here).

Which option to choose depends on the specific data and aSVGs, and the chosen option is expected to achieve optimal visualization. Regardless of the options, the pivotal requirement that target spatial features in aSVG must have matching counterparts in data should always be fulfilled (see here).

12.2 aSVG File

12.2.1 aSVG repository

A public aSVG repository, that can be used by spatialHeatmap directly, is the one maintained by the EBI Gene Expression Group. It contains annatomical aSVG images from different species. The same aSVG images are also used by the web service of the Expression Atlas project. In addition, the spatialHeatmap has its own repository called spatialHeatmap aSVG Repository, that stores custom aSVG files developed for this project (e.g. Figure 8).

If users cannot find a suitable aSVG image in these two repositories, we have developed a step-by-step SVG tutorial for creating custom aSVG images. For example, the BAR eFP browser at University of Toronto contains many anatomical images. These images can be used as templates in the SVG tutorial to construct custom aSVGs.

Moreover, in the future we will add more aSVGs to our repository, and users are welcome to deposit their own aSVGs to this repository to share them with the community.

12.2.2 Update aSVG features

To create and edit existing feature identifiers in aSVGs, the update_feature function can be used. The demonstration below, creates an empty folder tmp.dir1 and copies into it the homo_sapiens.brain.svg image provided by the spatialHeatmap package. Subsequently, selected feature annotations are updated in this file.

tmp.dir1 <- paste0(normalizePath(tempdir(check=TRUE), winslash="/", mustWork=FALSE), '/shm1')
if (!dir.exists(tmp.dir1)) dir.create(tmp.dir1)
svg.hum <- system.file("extdata/shinyApp/example", 'homo_sapiens.brain.svg', package="spatialHeatmap") 
file.copy(from=svg.hum, to=tmp.dir1, overwrite=TRUE) # Copy "homo_sapiens.brain.svg" file into 'tmp.dir1'

Query the above aSVG with feature and species keywords, and return the resulting matches in a data.frame.

feature.df <- return_feature(feature=c('frontal cortex', 'prefrontal cortex'), species=c('homo sapiens', 'brain'), dir=tmp.dir1, remote=NULL, keywords.any=FALSE)
feature.df

Subsequently, create a character vector of new feature identifiers corresponding to each of the returned features.

Sample code that creates new feature names and stores them in a character vector.

f.new <- c('prefrontalCortex', 'frontalCortex')

Similarly, if the stroke (outline thickness) or color need to update, make vectors respectively and make sure each entry corresponds to each returned feature.

s.new <- c('0.05', '0.1') # New strokes.
c.new <- c('red', 'green') # New colors.

Next, new features, strokes, and colors are added to the feature data.frame as three columns with the names featureNew, strokeNew, and colorNew respectively. The three column names are used by the update_feature function to look up new entries.

feature.df.new <- cbind(featureNew=f.new, strokeNew=s.new, colorNew=c.new, feature.df)
feature.df.new

Finally, the features, strokes, and colors are updated in the aSVG file(s) located in the tmp.dir1 directory.

update_feature(df.new=feature.df.new, dir=tmp.dir1)


13 Version Informaion

sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocParallel_1.30.0         igraph_1.3.1               
##  [3] scater_1.24.0               ggplot2_3.3.5              
##  [5] scran_1.24.0                scuttle_1.6.0              
##  [7] SingleCellExperiment_1.18.0 GEOquery_2.64.0            
##  [9] ExpressionAtlas_1.24.0      xml2_1.3.3                 
## [11] limma_3.52.0                SummarizedExperiment_1.26.0
## [13] Biobase_2.56.0              GenomicRanges_1.48.0       
## [15] GenomeInfoDb_1.32.0         IRanges_2.30.0             
## [17] S4Vectors_0.34.0            BiocGenerics_0.42.0        
## [19] MatrixGenerics_1.8.0        matrixStats_0.62.0         
## [21] spatialHeatmap_2.2.0        knitr_1.38                 
## [23] BiocStyle_2.24.0           
## 
## loaded via a namespace (and not attached):
##   [1] shinydashboard_0.7.2      utf8_1.2.2               
##   [3] tidyselect_1.1.2          RSQLite_2.2.12           
##   [5] AnnotationDbi_1.58.0      htmlwidgets_1.5.4        
##   [7] grid_4.2.0                Rtsne_0.16               
##   [9] pROC_1.18.0               munsell_0.5.0            
##  [11] ScaledMatrix_1.4.0        codetools_0.2-18         
##  [13] preprocessCore_1.58.0     statmod_1.4.36           
##  [15] av_0.7.0                  withr_2.5.0              
##  [17] colorspace_2.0-3          filelock_1.0.2           
##  [19] highr_0.9                 rstudioapi_0.13          
##  [21] labeling_0.4.2            GenomeInfoDbData_1.2.8   
##  [23] farver_2.1.0              bit64_4.0.5              
##  [25] distinct_1.8.0            rhdf5_2.40.0             
##  [27] vctrs_0.4.1               generics_0.1.2           
##  [29] rols_2.24.0               xfun_0.30                
##  [31] BiocFileCache_2.4.0       fastcluster_1.2.3        
##  [33] R6_2.5.1                  doParallel_1.0.17        
##  [35] ggbeeswarm_0.6.0          rsvd_1.0.5               
##  [37] locfit_1.5-9.5            rsvg_2.3.1               
##  [39] bitops_1.0-7              rhdf5filters_1.8.0       
##  [41] cachem_1.0.6              gridGraphics_0.5-1       
##  [43] DelayedArray_0.22.0       assertthat_0.2.1         
##  [45] promises_1.2.0.1          scales_1.2.0             
##  [47] nnet_7.3-17               beeswarm_0.4.0           
##  [49] gtable_0.3.0              beachmat_2.12.0          
##  [51] WGCNA_1.71                rlang_1.0.2              
##  [53] genefilter_1.78.0         splines_4.2.0            
##  [55] lazyeval_0.2.2            impute_1.70.0            
##  [57] checkmate_2.1.0           BiocManager_1.30.17      
##  [59] yaml_2.3.5                reshape2_1.4.4           
##  [61] backports_1.4.1           httpuv_1.6.5             
##  [63] Hmisc_4.7-0               tools_4.2.0              
##  [65] bookdown_0.26             ggplotify_0.1.0          
##  [67] ellipsis_0.3.2            gplots_3.1.3             
##  [69] jquerylib_0.1.4           RColorBrewer_1.1-3       
##  [71] ggdendro_0.1.23           dynamicTreeCut_1.63-1    
##  [73] Rcpp_1.0.8.3              plyr_1.8.7               
##  [75] visNetwork_2.1.0          base64enc_0.1-3          
##  [77] sparseMatrixStats_1.8.0   progress_1.2.2           
##  [79] zlibbioc_1.42.0           purrr_0.3.4              
##  [81] RCurl_1.98-1.6            prettyunits_1.1.1        
##  [83] rpart_4.1.16              viridis_0.6.2            
##  [85] cowplot_1.1.1             ggrepel_0.9.1            
##  [87] cluster_2.1.3             magrittr_2.0.3           
##  [89] RSpectra_0.16-1           data.table_1.14.2        
##  [91] magick_2.7.3              grImport_0.9-5           
##  [93] mime_0.12                 hms_1.1.1                
##  [95] evaluate_0.15             xtable_1.8-4             
##  [97] XML_3.99-0.9              jpeg_0.1-9               
##  [99] gridExtra_2.3             compiler_4.2.0           
## [101] tibble_3.1.6              KernSmooth_2.23-20       
## [103] crayon_1.5.1              htmltools_0.5.2          
## [105] tzdb_0.3.0                later_1.3.0              
## [107] Formula_1.2-4             tidyr_1.2.0              
## [109] geneplotter_1.74.0        DBI_1.1.2                
## [111] dbplyr_2.1.1              MASS_7.3-57              
## [113] rappdirs_0.3.3            readr_2.1.2              
## [115] Matrix_1.4-1              cli_3.3.0                
## [117] metapod_1.4.0             parallel_4.2.0           
## [119] pkgconfig_2.0.3           flashClust_1.01-2        
## [121] foreign_0.8-82            plotly_4.10.0            
## [123] foreach_1.5.2             annotate_1.74.0          
## [125] vipor_0.4.5               bslib_0.3.1              
## [127] dqrng_0.3.0               rngtools_1.5.2           
## [129] XVector_0.36.0            doRNG_1.8.2              
## [131] yulab.utils_0.0.4         stringr_1.4.0            
## [133] digest_0.6.29             Biostrings_2.64.0        
## [135] rmarkdown_2.14            htmlTable_2.4.0          
## [137] uwot_0.1.11               edgeR_3.38.0             
## [139] DelayedMatrixStats_1.18.0 curl_4.3.2               
## [141] shiny_1.7.1               gtools_3.9.2             
## [143] lifecycle_1.0.1           jsonlite_1.8.0           
## [145] Rhdf5lib_1.18.0           BiocNeighbors_1.14.0     
## [147] viridisLite_0.4.0         fansi_1.0.3              
## [149] pillar_1.7.0              lattice_0.20-45          
## [151] KEGGREST_1.36.0           fastmap_1.1.0            
## [153] httr_1.4.2                survival_3.3-1           
## [155] GO.db_3.15.0              glue_1.6.2               
## [157] FNN_1.1.3                 UpSetR_1.4.0             
## [159] png_0.1-7                 iterators_1.0.14         
## [161] bluster_1.6.0             bit_4.0.4                
## [163] stringi_1.7.6             sass_0.4.1               
## [165] HDF5Array_1.24.0          blob_1.2.3               
## [167] DESeq2_1.36.0             BiocSingular_1.12.0      
## [169] latticeExtra_0.6-29       caTools_1.18.2           
## [171] memoise_2.0.1             dplyr_1.0.8              
## [173] irlba_2.3.5

14 Funding

This project has been funded by NSF awards: PGRP-1546879, PGRP-1810468, PGRP-1936492.

15 References

Backman, Tyler W. H., and Thomas Girke. 2016. “SystemPipeR: NGS Workflow and Report Generation Environment.” BMC Bioinformatics 17 (1). https://doi.org/10.1186/s12859-016-1241-0.

Bezrutczyk, Margaret, Nora R Zöllner, Colin P S Kruse, Thomas Hartwig, Tobias Lautwein, Karl Köhrer, Wolf B Frommer, and Ji-Yun Kim. 2021. “Evidence for Phloem Loading via the Abaxial Bundle Sheath Cells in Maize Leaves.” Plant Cell 33 (3): 531–47.

Cardoso-Moreira, Margarida, Jean Halbert, Delphine Valloton, Britta Velten, Chunyan Chen, Yi Shao, Angélica Liechti, et al. 2019. “Gene Expression Across Mammalian Organ Development.” Nature 571 (7766): 505–9.

Chang, Winston, and Barbara Borges Ribeiro. 2018. Shinydashboard: Create Dashboards with ’Shiny’. https://CRAN.R-project.org/package=shinydashboard.

Chang, Winston, Joe Cheng, JJ Allaire, Cars on Sievert, Barret Schloerke, Yihui Xie, Jeff Allen, Jonathan McPherson, Alan Dipert, and Barbara Borges. 2021. Shiny: Web Application Framework for R. https://CRAN.R-project.org/package=shiny.

Chen, Lihe, Jae Wook Lee, Chung-Lin Chou, Anil V Nair, Maria A Battistone, Teodor G Păunescu, Maria Merkulova, et al. 2017. “Transcriptomes of Major Renal Collecting Duct Cell Types in Mouse Identified by Single-Cell RNA-seq.” Proc. Natl. Acad. Sci. U. S. A. 114 (46): E9989–E9998.

Clark, Jevin Z, Lihe Chen, Chung-Lin Chou, Hyun Jun Jung, Jae Wook Lee, and Mark A Knepper. 2019. “Representation and Relative Abundance of Cell-Type Selective Markers in Whole-Kidney RNA-Seq Data.” Kidney Int. 95 (4): 787–96.

Davis, Sean, and Paul Meltzer. 2007. “GEOquery: A Bridge Between the Gene Expression Omnibus (Geo) and Bioconductor.” Bioinformatics 14: 1846–7.

Fischer, Bernd, Mike Smith, and Gregoire Pau. 2020. Rhdf5: R Interface to Hdf5. https://github.com/grimbough/rhdf5.

Gatto, Laurent. 2019. Rols: An R Interface to the Ontology Lookup Service. http://lgatto.github.com/rols/.

Gautier, Laurent, Leslie Cope, Benjamin M. Bolstad, and Rafael A. Irizarry. 2004. “Affy—Analysis of Affymetrix Genechip Data at the Probe Level.” Bioinformatics 20 (3): 307–15. https://doi.org/10.1093/bioinformatics/btg405.

Huber, W., V. J. Carey, R. Gentleman, S. An ders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis Wit H Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.

Karaiskos, Nikos, Mahdieh Rahmatollahi, Anastasiya Boltengagen, Haiyue Liu, Martin Hoehne, Markus Rinschen, Bernhard Schermer, et al. 2018. “A Single-Cell Transcriptome Atlas of the Mouse Glomerulus.” J. Am. Soc. Nephrol. 29 (8): 2060–8.

Keays, Maria. 2019. ExpressionAtlas: Download Datasets from Embl-Ebi Expression Atlas.

Langfelder, Peter, and Steve Horvath. 2008. “WGCNA: An R Package for Weighted Correlation Network Analysis.” BMC Bioinformatics 9 (December): 559.

Langfelder, Peter, Bin Zhang, and with contributions from Steve Horvath. 2016. DynamicTreeCut: Methods for Detection of Clusters in Hierarchical Clustering Dendrograms. https://CRAN.R-project.org/package=dynamicTreeCut.

Langfelder, P., and S. Horvath. 2012. “Fast R Functions for Robust Correlations and Hierarchical Clustering.” J. Stat. Softw. 46 (11). https://www.ncbi.nlm.nih.gov/pubmed/23050260.

Lekschas, Fritz, Harald Stachelscheid, Stefanie Seltmann, and Andreas Kurtz. 2015. “Semantic Body Browser: Graphical Exploration of an Organism and Spatially Resolved Expression Data Visualization.” Bioinformatics 31 (5): 794–96.

Li, Song, Masashi Yamada, Xinwei Han, Uwe Ohler, and Philip N Benfey. 2016. “High-Resolution Expression Map of the Arabidopsis Root Reveals Alternative Splicing and lincRNA Regulation.” Dev. Cell 39 (4): 508–22.

Love, Michael I., Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biology 15 (12): 550. https://doi.org/10.1186/s13059-014-0550-8.

Lun, Aaron T. L., Davis J. McCarthy, and John C. Marioni. 2016. “A Step-by-Step Workflow for Low-Level Analysis of Single-Cell Rna-Seq Data with Bioconductor.” F1000Res. 5: 2122. https://doi.org/10.12688/f1000research.9501.2.

Maag, Jesper L V. 2018. “Gganatogram: An R Package for Modular Visualisation of Anatograms and Tissues Based on Ggplot2.” F1000Res. 7 (September): 1576.

Marques, Sueli, Amit Zeisel, Simone Codeluppi, David van Bruggen, Ana Mendanha Falcão, Lin Xiao, Huiliang Li, et al. 2016. “Oligodendrocyte Heterogeneity in the Mouse Juvenile and Adult Central Nervous System.” Science 352 (6291): 1326–9.

McCarthy, Davis J., Chen, Yunshun, Smyth, and Gordon K. 2012. “Differential Expression Analysis of Multifactor Rna-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97.

Merkin, Jason, Caitlin Russell, Ping Chen, and Christopher B Burge. 2012. “Evolutionary Dynamics of Gene and Isoform Regulation in Mammalian Tissues.” Science 338 (6114): 1593–9.

Morgan, Martin, Valerie Obenchain, Jim Hester, and Hervé Pagès. 2018. SummarizedExperiment: SummarizedExperiment Container.

Morgan, Martin, Jiefei Wang, Valerie Obenchain, Michel Lang, Ryan Thompson, and Nitesh Turaga. 2021. BiocParallel: Bioconductor Facilities for Parallel Evaluation. https://github.com/Bioconductor/BiocParallel.

Mungall, Christopher J, Carlo Torniai, Georgios V Gkoutos, Suzanna E Lewis, and Melissa A Haendel. 2012. “Uberon, an integrative multi-species anatomy ontology.” Genome Biol. 13 (1): R5. https://doi.org/10.1186/gb-2012-13-1-r5.

Muschelli, John, Elizabeth Sweeney, and Ciprian Crainiceanu. 2014. “BrainR: Interactive 3 and 4D Images of High Resolution Neuroimage Data.” R J. 6 (1): 41–48.

Mustroph, Angelika, M Eugenia Zanetti, Charles J H Jang, Hans E Holtan, Peter P Repetti, David W Galbraith, Thomas Girke, and Julia Bailey-Serres. 2009. “Profiling Translatomes of Discrete Cell Populations Resolves Altered Cellular Priorities During Hypoxia in Arabidopsis.” Proc Natl Acad Sci U S A 106 (44): 18843–8.

Narsai, Reena, David Secco, Matthew D Schultz, Joseph R Ecker, Ryan Lister, and James Whelan. 2017. “Dynamic and Rapid Changes in the Transcriptome and Epigenome During Germination and in Developing Rice ( Oryza Sativa ) Coleoptiles Under Anoxia and Re-Oxygenation.” Plant J. 89 (4): 805–24.

Ortiz, Cantin, Jose Fernandez Navarro, Aleksandra Jurek, Antje Märtin, Joakim Lundeberg, and Konstantinos Meletis. 2020. “Molecular Atlas of the Adult Mouse Brain.” Science Advances 6 (26): eabb3446.

Pagès, Hervé. 2020. HDF5Array: HDF5 Backend for Delayedarray Objects.

Papatheodorou, Irene, Nuno A Fonseca, Maria Keays, Y Amy Tang, Elisabet Barrera, Wojciech Bazant, Melissa Burke, et al. 2018. “Expression Atlas: gene and protein expression across multiple studies and organisms.” Nucleic Acids Res. 46 (D1): D246–D251. https://doi.org/10.1093/nar/gkx1158.

Park, Jihwan, Rojesh Shrestha, Chengxiang Qiu, Ayano Kondo, Shizheng Huang, Max Werth, Mingyao Li, Jonathan Barasch, and Katalin Suszták. 2018. “Single-Cell Transcriptomics of the Mouse Kidney Reveals Potential Cellular Targets of Kidney Disease.” Science 360 (6390): 758–63.

Prudencio, Mercedes, Veronique V Belzil, Ranjan Batra, Christian A Ross, Tania F Gendron, Luc J Pregent, Melissa E Murray, et al. 2015. “Distinct Brain Transcriptome Profiles in C9orf72-Associated and Sporadic ALS.” Nat. Neurosci. 18 (8): 1175–82.

Ravasz, E, A L Somera, D A Mongru, Z N Oltvai, and A L Barabási. 2002. “Hierarchical Organization of Modularity in Metabolic Networks.” Science 297 (5586): 1551–5.

Ritchie, Matthew E, Belinda Phipson, Di Wu, Yifang Hu, Charity W Law, Wei Shi, and Gordon K Smyth. 2015. “limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47. https://doi.org/10.1093/nar/gkv007.

Robinson, Mark D, Davis J McCarthy, and Gordon K Smyth. 2010. “EdgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40. https://doi.org/10.1093/bioinformatics/btp616.

Shahan, Rachel, Che-Wei Hsu, Trevor M Nolan, Benjamin J Cole, Isaiah W Taylor, Anna Hendrika Cornelia Vlot, Philip N Benfey, and Uwe Ohler. 2020. “A Single Cell Arabidopsis Root Atlas Reveals Developmental Trajectories in Wild Type and Cell Identity Mutants.” bioRxiv.

Tiberi, Simone, and Mark D. Robinson. 2020. Distinct: Distinct: A Method for Differential Analyses via Hierarchical Permutation Tests. https://github.com/SimoneTiberi/distinct.

Vacher, Claire-Marie, Helene Lacaille, Jiaqi J O’Reilly, Jacquelyn Salzbank, Dana Bakalar, Sonia Sebaoui, Philippe Liere, et al. 2021. “Placental Endocrine Function Shapes Cerebellar Development and Social Behavior.” Nat. Neurosci. 24 (10): 1392–1401.

Waese, Jamie, Jim Fan, Asher Pasha, Hans Yu, Geoffrey Fucile, Ruian Shi, Matthew Cumming, et al. 2017. “EPlant: Visualizing and Exploring Multiple Levels of Data for Hypothesis Generation in Plant Biology.” Plant Cell 29 (8): 1806–21.

Winter, Debbie, Ben Vinegar, Hardeep Nahal, Ron Ammar, Greg V Wilson, and Nicholas J Provart. 2007. “An ‘Electronic Fluorescent Pictograph’ Browser for Exploring and Analyzing Large-Scale Biological Data Sets.” PLoS One 2 (8): e718.