1 Introduction

This package includes two methods for detecting differentially expressed genes (DEGs), one for RNA-seq datasets and one for scRNA-seq datasets. The first method, SFMEB, identifies DEGs in RNA-seq datasets from the same or different species. Because non-DE genes share similar features, SFMEB encloses them in a ball in feature space; DE genes, which differ greatly from non-DE genes, are then treated as outliers and fall outside the ball. The method is described in the article A scaling-free minimum enclosing ball method to detect differentially expressed genes for RNA-seq data by Zhou, Y., Yang, B., Wang, J. et al. BMC Genomics, 22, 479 (2021). The second method, scMEB, is an extension of SFMEB. It is a novel and fast method for detecting single-cell DEGs without prior cell clustering results. For details, refer to the article scMEB: A fast and clustering-independent method for detecting differentially expressed genes in single-cell RNA-seq data by Zhu, J.D and Yang, Y.L. (2023, pending publication).

2 The steps of the SFMEB method

The SFMEB method detects differentially expressed genes within the same species or between different species. Compared with existing methods, it does not require the data to be normalized in advance, and it can be applied to same-species or different-species data with few changes. The SFMEB method is implemented in the R function NIMEB(). The method consists of three steps.

Step 1: Data Pre-processing;

Step 2: Training a model for the training genes;

Step 3: Determining whether a gene is a DE gene.

We use simulated and real datasets for the same and different species to illustrate the usage of the SFMEB method.
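
The three steps can be illustrated with a deliberately simplified toy in base R. This is not the SFMEB algorithm: it replaces the minimum enclosing ball in feature space with a plain Euclidean ball around the mean of the training genes, and the simulated data and all names (toy_counts, train_id, centre, radius) are hypothetical.

```r
set.seed(1)

# Step 1: pre-processing. Simulate counts for 200 genes in two samples and
# move to the log scale (hypothetical data, not from the MEB package).
n_genes <- 200
base_level <- rpois(n_genes, lambda = 100)
toy_counts <- cbind(sample1 = base_level + rpois(n_genes, lambda = 5),
                    sample2 = base_level + rpois(n_genes, lambda = 5))
toy_counts[181:200, "sample2"] <- toy_counts[181:200, "sample2"] * 8  # DE genes
features <- log2(toy_counts + 1)

# Step 2: training. Fit a ball to genes assumed to be non-DE (rows 1..50):
# centre = mean of the training genes, radius = 95% quantile of their distances.
train_id <- 1:50
centre <- colMeans(features[train_id, ])
dist_to_centre <- sqrt(rowSums(sweep(features, 2, centre)^2))
radius <- unname(quantile(dist_to_centre[train_id], probs = 0.95))

# Step 3: discrimination. TRUE = inside the ball (non-DE), FALSE = outside (DE).
inside <- dist_to_centre <= radius
table(inside)
```

In this toy, the 20 genes whose counts were inflated in the second sample lie far from the centre and are rejected, while most training genes stay inside the ball.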

2.1 Preparations

To install the MEB package into your R environment, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MEB")

Then, the MEB package is ready to load.

library(MEB)

2.2 Data format

To show the usage of the SFMEB method, we introduce the example datasets, which include simulated and real data for the same and different species. The datasets in the package are described next.

There are six datasets in the data subdirectory of the MEB package, four of which are linked to the SFMEB method. To be consistent with standard Bioconductor representations, the datasets are stored as SummarizedExperiment objects; please refer to the R package SummarizedExperiment for details. The four datasets are sim_data_sp, sim_data_dsp, real_data_sp and real_data_dsp.
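
For readers preparing their own data, the sketch below wraps a count matrix in a SummarizedExperiment and extracts it again with assay(), which is the form passed to NIMEB() later in this vignette. The matrix my_counts is made up for illustration, and the Bioconductor part only runs if the SummarizedExperiment package is installed.

```r
# Hypothetical count matrix: 5 genes (rows) x 2 samples (columns).
my_counts <- matrix(c(10L, 0L, 250L, 31L, 7L,
                      12L, 3L, 240L, 90L, 5L),
                    ncol = 2,
                    dimnames = list(paste0("gene", 1:5),
                                    c("sample1", "sample2")))

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  # Store the matrix as the single assay of a SummarizedExperiment.
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = my_counts))
  # assay() returns the first assay as a plain matrix.
  counts_back <- SummarizedExperiment::assay(se)
  print(dim(counts_back))
}
```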

real_data_sp is a real dataset for the same species, which comes from RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays by Marioni J.C., Mason C.E., et al. (2008). Genome Res. 18(9), 1509–1517.

real_data_dsp is a real dataset for different species, which comes from The evolution of gene expression levels in mammalian organs by Brawand, D., Soumillon, M., Necsulea, A. and Julien, P. et al. (2011). Nature, 478, 343-348.

sim_data_sp and sim_data_dsp are two simulated datasets for the same and different species, respectively. For the generation procedure, refer to A scaling-free minimum enclosing ball method to detect differentially expressed genes for RNA-seq data by Zhou, Y., Yang, B., Wang, J. et al. BMC Genomics, 22, 479 (2021).

data(sim_data_sp)
sim_data_sp
## class: SummarizedExperiment 
## dim: 10943 2 
## metadata(0):
## assays(1): ''
## rownames(10943): 1 2 ... 10942 10943
## rowData names(0):
## colnames(2): sample1 sample2
## colData names(0):

sim_data_sp.RData includes 2 columns,

  • the first column is the RNA-seq short read counts for the first sample;

  • the second column is the RNA-seq short read counts for the second sample;

  • each row represents a gene, and the first 1000 genes are housekeeping genes.

data(real_data_sp)
real_data_sp
## class: SummarizedExperiment 
## dim: 16519 10 
## metadata(0):
## assays(1): ''
## rownames(16519): ENSG00000149925 ENSG00000102144 ... ENSG00000012817
##   ENSG00000198692
## rowData names(0):
## colnames(10): R1L1Kidney R1L2Liver ... R2L3Liver R2L6Kidney
## colData names(0):

real_data_sp includes 10 columns,

  • there are two tissues, kidney and liver, each with five replicates;

  • each row represents a gene, and the first 530 genes are housekeeping genes.

data(sim_data_dsp)
sim_data_dsp
## class: SummarizedExperiment 
## dim: 18472 4 
## metadata(0):
## assays(1): ''
## rownames(18472): 1 2 ... 18471 18472
## rowData names(0):
## colnames(4): genelength1 count1 genelength2 count2
## colData names(0):

sim_data_dsp.RData includes 4 columns,

  • the first and the third columns are the gene lengths for the two species;

  • the second and the fourth columns are the RNA-seq short read counts for the two species;

  • each row represents an orthologous gene, and the first 1000 genes are the conserved genes.

data(real_data_dsp)
real_data_dsp
## class: SummarizedExperiment 
## dim: 19330 4 
## metadata(0):
## assays(1): ''
## rownames(19330): 85 190 ... 20928 20929
## rowData names(2): Ensembl.Gene.ID Mouse.Ensembl.Gene.ID
## colnames(4): ExonicLength Human_Brain_Male1 ExonicLength.1
##   Mouse_Brain_Male1
## colData names(0):

real_data_dsp.RData includes 4 columns,

  • the first and the third columns are the gene lengths for human and mouse;

  • the second and the fourth columns are the RNA-seq short read counts for human and mouse;

  • each row represents an orthologous gene, and the first 143 genes are the conserved genes.

2.3 Training a model for the training genes

Based on a set of known housekeeping or conserved genes, we can train a model for each of the four datasets above. The following examples show how to use the NIMEB() function to train a model.

  1. Simulation data for the same species

library(SummarizedExperiment)
data(sim_data_sp)
gamma <- seq(1e-06, 5e-05, 1e-06)
sim_model_sp <- NIMEB(countsTable = assay(sim_data_sp), train_id = 1:1000,
                      gamma, nu = 0.01, reject_rate = 0.05, ds = FALSE)

  2. Real data for the same species

data(real_data_sp)
gamma <- seq(1e-06, 5e-05, 1e-06)
real_model_sp <- NIMEB(countsTable = assay(real_data_sp), train_id = 1:530,
                       gamma, nu = 0.01, reject_rate = 0.1, ds = FALSE)

  3. Simulation data for the different species

data(sim_data_dsp)
gamma <- seq(1e-07, 2e-05, 1e-06)
sim_model_dsp <- NIMEB(countsTable = assay(sim_data_dsp), train_id = 1:1000,
                       gamma, nu = 0.01, reject_rate = 0.1, ds = TRUE)

  4. Real data for the different species

data(real_data_dsp)
gamma <- seq(5e-08, 5e-07, 1e-08)
real_model_dsp <- NIMEB(countsTable = assay(real_data_dsp), train_id = 1:143,
                        gamma, nu = 0.01, reject_rate = 0.1, ds = TRUE)

NIMEB() returns a list with three components: model, gamma and train_error. model is the trained model used to classify new genes, gamma is the gamma value selected for that model, and train_error contains the training error for each candidate value of gamma.
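
To make these components concrete, the following base-R sketch mimics the shape of the output with made-up numbers: a gamma grid like the one used above and a hypothetical train_error vector. Choosing the gamma whose training error is closest to reject_rate is an assumption made for illustration; the text above does not spell out the selection rule, so consult ?NIMEB for the actual behaviour.

```r
gamma_grid <- seq(1e-06, 5e-05, 1e-06)   # same grid as in the examples above
reject_rate <- 0.05

# Hypothetical training errors, one per gamma value (not real NIMEB output).
train_error <- seq(0.01, 0.12, length.out = length(gamma_grid))

# Assumed rule: pick the gamma whose training error is closest to reject_rate.
best <- which.min(abs(train_error - reject_rate))
data.frame(gamma = gamma_grid[best], train_error = train_error[best])
```

Inspecting train_error over the whole grid in this way is also a quick check that the grid is wide enough: if the closest error sits at either end of the grid, a wider range of gamma values may be worth trying.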

2.4 Determining whether a gene is a DE gene

Given the model, we can predict whether a gene is a DE gene. For example, for the sim_data_sp data, the predictions are obtained as follows:

sim_model_sp_pred <- predict(sim_model_sp$model, assay(sim_data_sp))
summary(sim_model_sp_pred)
##    Mode   FALSE    TRUE 
## logical    4008    6935

Based on the trained model, each gene is classified as DE or non-DE: a prediction of TRUE means the gene is non-DE, and a prediction of FALSE means the gene is DE.
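
Because TRUE marks non-DE genes, the DE calls are the negation of the prediction vector. The sketch below uses a small made-up logical vector pred in place of sim_model_sp_pred, so the names and values are hypothetical.

```r
# Stand-in for the output of predict(); TRUE = non-DE, FALSE = DE.
pred <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
train_id <- 1:4                      # hypothetical training (housekeeping) genes

de_idx <- which(!pred)               # row indices of genes called DE
n_de <- sum(!pred)                   # number of DE calls
# Share of training genes that fall outside the ball, as a quick sanity check.
train_reject <- mean(!pred[train_id])

de_idx
n_de
train_reject
```

With the real objects, which(!sim_model_sp_pred) would give the row indices of the genes called DE in assay(sim_data_sp).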

3 The usage of the scMEB method

The scMEB() function detects differentially expressed genes in scRNA-seq data without prior clustering results. The following example shows how to use it:

  1. Load the package and example scRNA-seq data
library(SingleCellExperiment)

The simulated data were generated with the splatter package (Zappia L, et al. 2017) and include 5,000 genes and 100 cells.

data(sim_scRNA_data)
sim_scRNA_data
## class: SingleCellExperiment 
## dim: 5000 100 
## metadata(0):
## assays(1): counts
## rownames(5000): Gene1 Gene2 ... Gene4999 Gene5000
## rowData names(6): Gene BaseGeneMean ... DEFacGroup1 DEFacGroup2
## colnames(100): Cell1 Cell2 ... Cell99 Cell100
## colData names(0):
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

We randomly sample 1,000 stable genes from the simulation data.

data(stable_gene)
head(stable_gene)
## [1] "Gene2635" "Gene4243" "Gene1318" "Gene2218" "Gene4753" "Gene4661"
length(stable_gene)
## [1] 1000
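
The text does not say how the stable genes were defined beyond random sampling. One way this could be done, sketched here on a made-up table, is to keep genes whose splatter DE factors equal 1 in both groups and sample from them. Treating DEFacGroup1 == 1 and DEFacGroup2 == 1 as "stable" is an assumption, and gene_info is a hypothetical stand-in for the gene annotation stored in rowData(sim_scRNA_data).

```r
set.seed(42)

# Hypothetical stand-in for the gene annotation table.
gene_info <- data.frame(
  Gene = paste0("Gene", 1:20),
  DEFacGroup1 = c(rep(1, 15), 2.0, 0.5, 1, 1, 3.1),
  DEFacGroup2 = c(rep(1, 17), 1.8, 1, 1)
)

# Assumed definition: a gene is stable if its DE factor is 1 in both groups.
is_stable <- gene_info$DEFacGroup1 == 1 & gene_info$DEFacGroup2 == 1
candidates <- as.character(gene_info$Gene[is_stable])

# Randomly sample a subset of the candidates (5 here; 1,000 in the vignette).
stable_toy <- sample(candidates, size = 5)
stable_toy
```
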
  2. Training a model for the simulation scRNA-seq data

sim_scRNA <- scMEB(sce = sim_scRNA_data, stable_idx = stable_gene,
                   filtered = TRUE, gamma = seq(1e-04, 0.001, 1e-05),
                   nu = 0.01, reject_rate = 0.1)
## Warning in (function (A, nv = 5, nu = nv, maxit = 1000, work = nv + 7, reorth =
## TRUE, : You're computing too large a percentage of total singular values, use a
## standard svd instead.

Next, we predict whether each gene is a DE gene. For the sim_scRNA_data data, the predictions are obtained as follows:

sim_scRNA_pred <- predict(sim_scRNA$model, sim_scRNA$dat_pca)
summary(sim_scRNA_pred)
##    Mode   FALSE    TRUE 
## logical     358    4642

A prediction of TRUE means the gene is non-DE, and a prediction of FALSE means the gene is DE.

scMEB also provides a metric for ranking genes: the distance between a gene and the surface of the ball in feature space. The larger the distance, the more likely the gene is a DEG.

table(sim_scRNA$dist>0)
## 
## FALSE  TRUE 
##   358  4642
sim_scRNA_dist <- data.frame(Gene=rownames(sim_scRNA_data),
                             Distance=sim_scRNA$dist)
head(sim_scRNA_dist)
##    Gene   Distance
## 1 Gene1 0.08121673
## 2 Gene2 0.08843538
## 3 Gene3 0.11847235
## 4 Gene4 0.12806588
## 5 Gene5 0.01926240
## 6 Gene6 0.09841178
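
A ranked gene list can be obtained by ordering this table. The sketch below uses made-up values, so toy_dist and toy_pred are hypothetical stand-ins for sim_scRNA$dist and the predict() output. In the output above, the number of genes with a positive distance (4642) equals the number of genes predicted TRUE, so it may be worth cross-tabulating the sign of the distance against the predictions, as done here, before deciding which end of the ranking to inspect.

```r
# Hypothetical stand-ins for sim_scRNA$dist and the predict() output.
toy_dist <- c(Gene1 = 0.08, Gene2 = -0.21, Gene3 = 0.12,
              Gene4 = -0.03, Gene5 = 0.02)
toy_pred <- c(TRUE, FALSE, TRUE, FALSE, TRUE)

ranking <- data.frame(Gene = names(toy_dist), Distance = unname(toy_dist))
# Order from smallest to largest distance; use decreasing = TRUE to reverse.
ranking <- ranking[order(ranking$Distance), ]
rownames(ranking) <- NULL
ranking

# Check how the sign of the distance lines up with the TRUE/FALSE calls.
sign_vs_pred <- table(positive_distance = toy_dist > 0, predicted = toy_pred)
sign_vs_pred
```
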
sessionInfo()
## R version 4.4.0 beta (2024-04-15 r86425)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
##  [3] Biobase_2.64.0              GenomicRanges_1.56.0       
##  [5] GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [7] S4Vectors_0.42.0            BiocGenerics_0.50.0        
##  [9] MatrixGenerics_1.16.0       matrixStats_1.3.0          
## [11] MEB_1.18.0                  BiocStyle_2.32.0           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.2.1          viridisLite_0.4.2        
##  [3] vipor_0.4.7               dplyr_1.1.4              
##  [5] viridis_0.6.5             fastmap_1.1.1            
##  [7] digest_0.6.35             rsvd_1.0.5               
##  [9] lifecycle_1.0.4           statmod_1.5.0            
## [11] magrittr_2.0.3            compiler_4.4.0           
## [13] rlang_1.1.3               sass_0.4.9               
## [15] tools_4.4.0               utf8_1.2.4               
## [17] yaml_2.3.8                knitr_1.46               
## [19] S4Arrays_1.4.0            DelayedArray_0.30.0      
## [21] wrswoR_1.1.1              abind_1.4-5              
## [23] BiocParallel_1.38.0       grid_4.4.0               
## [25] fansi_1.0.6               beachmat_2.20.0          
## [27] e1071_1.7-14              colorspace_2.1-0         
## [29] edgeR_4.2.0               ggplot2_3.5.1            
## [31] logging_0.10-108          scales_1.3.0             
## [33] cli_3.6.2                 rmarkdown_2.26           
## [35] crayon_1.5.2              generics_0.1.3           
## [37] httr_1.4.7                DelayedMatrixStats_1.26.0
## [39] scuttle_1.14.0            ggbeeswarm_0.7.2         
## [41] cachem_1.0.8              proxy_0.4-27             
## [43] zlibbioc_1.50.0           parallel_4.4.0           
## [45] BiocManager_1.30.22       XVector_0.44.0           
## [47] vctrs_0.6.5               Matrix_1.7-0             
## [49] jsonlite_1.8.8            bookdown_0.39            
## [51] BiocSingular_1.20.0       BiocNeighbors_1.22.0     
## [53] ggrepel_0.9.5             irlba_2.3.5.1            
## [55] beeswarm_0.4.0            scater_1.32.0            
## [57] locfit_1.5-9.9            limma_3.60.0             
## [59] jquerylib_0.1.4           glue_1.7.0               
## [61] codetools_0.2-20          gtable_0.3.5             
## [63] UCSC.utils_1.0.0          ScaledMatrix_1.12.0      
## [65] munsell_0.5.1             tibble_3.2.1             
## [67] pillar_1.9.0              htmltools_0.5.8.1        
## [69] GenomeInfoDbData_1.2.12   R6_2.5.1                 
## [71] sparseMatrixStats_1.16.0  evaluate_0.23            
## [73] lattice_0.22-6            bslib_0.7.0              
## [75] class_7.3-22              Rcpp_1.0.12              
## [77] gridExtra_2.3             SparseArray_1.4.0        
## [79] xfun_0.43                 pkgconfig_2.0.3