1. Introduction

MicrobiotaProcess is an R package for analysis, visualization and biomarker discovery of microbial datasets. It supports the import of microbiome census data, calculating alpha index and provides functions to visualize rarefaction curves. Moreover, it also supports visualizing the abundance of taxonomy of samples. And It also provides functions to perform the PCA, PCoA and hierarchical cluster analysis. In addition, MicrobiotaProcess also provides a method for the biomarker discovery of metagenome or other datasets.

2. MicrobiotaProcess profiling

2.1 import function

MicrobiotaProcess has an feature to import phylogenetic sequencing data from dada2(Callahan et al. 2016) and qiime2(Bolyen et al. 2019) taxonomic clustering pipelines. The resulting object after import is phyloseq object(McMurdie and Holmes 2013)

# import data from dada2 pipeline.
seqtabfile <- system.file("extdata", "seqtab.nochim.rds", package="MicrobiotaProcess")
seqtab <- readRDS(seqtabfile)
taxafile <- system.file("extdata", "taxa_tab.rds",package="MicrobiotaProcess")
seqtab <- readRDS(seqtabfile)
taxa <- readRDS(taxafile)
# the seqtab and taxa are output of dada2
sampleda <- system.file("extdata", "mouse.time.dada2.txt", package="MicrobiotaProcess")
ps_dada2 <- import_dada2(seqtab=seqtab, taxatab=taxa, sampleda=sampleda)
ps_dada2
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 232 taxa and 19 samples ]
## sample_data() Sample Data:       [ 19 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 232 taxa by 7 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 232 reference sequences ]
# import data from qiime2 pipeline
otuqzafile <- system.file("extdata", "table.qza", package="MicrobiotaProcess")
taxaqzafile <- system.file("extdata", "taxa.qza", package="MicrobiotaProcess")
mapfile <- system.file("extdata", "metadata_qza.txt", package="MicrobiotaProcess")
ps_qiime2 <- import_qiime2(otuqza=otuqzafile, taxaqza=taxaqzafile, mapfilename=mapfile)
ps_qiime2
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 138 taxa and 87 samples ]
## sample_data() Sample Data:       [ 87 samples by 23 sample variables ]
## tax_table()   Taxonomy Table:    [ 138 taxa by 8 taxonomic ranks ]
## refseq()      DNAStringSet:      [ 138 reference sequences ]

2.2 Rarefaction visualization

Rarefaction, based on sampling technique, was used to compensate for the effect of sample size on the number of units observed in a sample(Siegel 2004). MicrobiotaProcess provided ggrarecurve to plot the curves, based on rrarefy of vegan(Oksanen et al. 2019).

# for reproducibly random number 
set.seed(1024)
p_rare <- ggrarecurve(obj=ps_dada2, 
                      indexNames=c("Observe","Chao1","ACE"), 
                      chunks=300) +
          theme(legend.spacing.y=unit(0.02,"cm"),
                legend.text=element_text(size=6))
p_rare

2.3 calculate alpha index and visualization

MicrobiotaProcess provides get_alphaindex to calculate alpha index and the ggbox to visualize the result

alphaobj <- get_alphaindex(ps_dada2)
head(as.data.frame(alphaobj))
##        Observe     Chao1       ACE  Shannon   Simpson         J sample  time
## F3D0       102 102.10000 102.57080 3.822501 0.9633082 0.8264916   F3D0 Early
## F3D1        99  99.14286  99.48498 3.991029 0.9709703 0.8685363   F3D1 Early
## F3D141      74  74.00000  74.00000 3.463816 0.9517886 0.8047779 F3D141  Late
## F3D142      48  48.00000  48.00000 3.116764 0.9386549 0.8051155 F3D142  Late
## F3D143      56  56.00000  56.00000 3.292717 0.9464422 0.8179949 F3D143  Late
## F3D144      47  47.00000  47.00000 2.989304 0.9306528 0.7764129 F3D144  Late
p_alpha <- ggbox(alphaobj, geom="violin", factorNames="time") + 
    scale_fill_manual(values=c("#2874C5", "#EABF00"))+
    theme(strip.background = element_rect(colour=NA, fill="grey"))
p_alpha

2.4 The visualization of taxonomy abundance

MicrobiotaProcess presents the ggbartax for the visualization of composition of microbial communities.

# relative abundance
otubar <- ggbartax(obj=ps_dada2) +
      xlab(NULL) +
      ylab("relative abundance (%)")
otubar

If you want to get the abundance of specific levels of class, You can use get_taxadf then use ggbartax to visualize.

phytax <- get_taxadf(obj=ps_dada2, taxlevel=2)
phybar <- ggbartax(obj=phytax) +
          xlab(NULL) + ylab("relative abundance (%)")
phybar

Moreover, the abundance (count) of taxonomy also can be visualized by setting count to TRUE, and the facet of plot can be showed by setting the facetNames.

phybar2 <- ggbartax(obj=phytax, facetNames="time", count=TRUE) + xlab(NULL) + ylab("abundance")
phybar2

classtax <- get_taxadf(obj=ps_dada2, taxlevel=3)
classbar <- ggbartax(obj=classtax, facetNames="time", count=FALSE) +
                xlab(NULL) + ylab("relative abundance (%)")
classbar

2.5 PCA and PCoA analysis

PCA (Principal component analysis) and PCoA (Principal Coordinate Analysis) are general statistical procedures to compare groups of samples. And PCoA can based on the phylogenetic or count-based distance metrics, such as Bray-Curtis, Jaccard, Unweighted-UniFrac and weighted-UniFrac. MicrobiotaProcess presents the get_pca, get_pcoa and ggordpoint for the analysis.

pcares <- get_pca(obj=ps_dada2, method="hellinger")
# Visulizing the result
pcaplot <- ggordpoint(obj=pcares, biplot=TRUE, speciesannot=TRUE,
                      factorNames=c("time"), ellipse=TRUE) +
           scale_colour_manual(values=c("#2874C5", "#EABF00")) +
           scale_fill_manual(values=c("#2874C5", "#EABF00"))
pcaplot

pcoares <- get_pcoa(obj=ps_dada2, distmethod="euclidean", method="hellinger")
# Visualizing the result
pcoaplot <- ggordpoint(obj=pcoares, biplot=TRUE, speciesannot=TRUE,
                       factorNames=c("time"), ellipse=TRUE) +
            scale_colour_manual(values=c("#2874C5", "#EABF00")) +
            scale_fill_manual(values=c("#2874C5", "#EABF00"))
pcoaplot

2.6 Hierarchical cluster analysis

beta diversity metrics can assess the differences between microbial communities. It can be visualized with PCA or PCoA, this can also be visualized with hierarchical clustering. MicrobiotaProcess also implements the analysis based on ggtree(Yu et al. 2017).

hcsample <- get_clust(obj=ps_dada2, distmethod="euclidean",
              method="hellinger", hclustmethod="average")
# rectangular layout
clustplot1 <- ggclust(obj=hcsample,
                      layout = "rectangular",
                      pointsize=1,
                      fontsize=0,
                      factorNames=c("time")) +
              scale_color_manual(values=c("#2874C5", "#EABF00")) +
              theme_tree2(legend.position="right",
                          plot.title = element_text(face="bold", lineheight=25,hjust=0.5))
clustplot1

# circular layout
clustplot2 <- ggclust(obj=hcsample,
                      layout = "circular",
                      pointsize=1,
                      fontsize=2,
                      factorNames=c("time")) +
              scale_color_manual(values=c("#2874C5", "#EABF00")) +
              theme(legend.position="right")
clustplot2

3. Need helps?

If you have questions/issues, please visit github issue tracker.

4. Session information

Here is the output of sessionInfo() on the system on which this document was compiled:

## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] MicrobiotaProcess_1.2.2 tidytree_0.3.3          treeio_1.14.3          
##  [4] ggtree_2.4.1            phyloseq_1.34.0         forcats_0.5.1          
##  [7] stringr_1.4.0           dplyr_1.0.5             purrr_0.3.4            
## [10] readr_1.4.0             tidyr_1.1.3             tibble_3.1.0           
## [13] tidyverse_1.3.1         DT_0.18                 ggplot2_3.3.3          
## 
## loaded via a namespace (and not attached):
##   [1] TH.data_1.0-10      colorspace_2.0-0    ggsignif_0.6.1     
##   [4] modeltools_0.2-23   ellipsis_0.3.1      XVector_0.30.0     
##   [7] fs_1.5.0            aplot_0.0.6         rstudioapi_0.13    
##  [10] farver_2.1.0        ggrepel_0.9.1       mvtnorm_1.1-1      
##  [13] fansi_0.4.2         lubridate_1.7.10    coin_1.4-1         
##  [16] xml2_1.3.2          codetools_0.2-18    splines_4.0.5      
##  [19] libcoin_1.0-8       knitr_1.32          ade4_1.7-16        
##  [22] jsonlite_1.7.2      broom_0.7.6         cluster_2.1.2      
##  [25] dbplyr_2.1.1        BiocManager_1.30.12 compiler_4.0.5     
##  [28] httr_1.4.2          rvcheck_0.1.8       backports_1.2.1    
##  [31] assertthat_0.2.1    Matrix_1.3-2        lazyeval_0.2.2     
##  [34] cli_2.4.0           htmltools_0.5.1.1   prettyunits_1.1.1  
##  [37] tools_4.0.5         igraph_1.2.6        gtable_0.3.0       
##  [40] glue_1.4.2          reshape2_1.4.4      Rcpp_1.0.6         
##  [43] Biobase_2.50.0      cellranger_1.1.0    jquerylib_0.1.3    
##  [46] vctrs_0.3.7         Biostrings_2.58.0   rhdf5filters_1.2.0 
##  [49] multtest_2.46.0     ape_5.4-1           nlme_3.1-152       
##  [52] iterators_1.0.13    xfun_0.22           ps_1.6.0           
##  [55] Rmisc_1.5           rvest_1.0.0         lifecycle_1.0.0    
##  [58] gtools_3.8.2        zoo_1.8-9           zlibbioc_1.36.0    
##  [61] MASS_7.3-53.1       scales_1.1.1        hms_1.0.0          
##  [64] sandwich_3.0-0      parallel_4.0.5      biomformat_1.18.0  
##  [67] rhdf5_2.34.0        yaml_2.2.1          gridExtra_2.3      
##  [70] sass_0.3.1          reshape_0.8.8       stringi_1.5.3      
##  [73] highr_0.9           S4Vectors_0.28.1    foreach_1.5.1      
##  [76] permute_0.9-5       ggstar_1.0.2        BiocGenerics_0.36.1
##  [79] matrixStats_0.58.0  rlang_0.4.10        pkgconfig_2.0.3    
##  [82] evaluate_0.14       lattice_0.20-41     Rhdf5lib_1.12.1    
##  [85] labeling_0.4.2      patchwork_1.1.1     htmlwidgets_1.5.3  
##  [88] tidyselect_1.1.0    plyr_1.8.6          magrittr_2.0.1     
##  [91] R6_2.5.0            IRanges_2.24.1      generics_0.1.0     
##  [94] multcomp_1.4-16     DBI_1.1.1           pillar_1.6.0       
##  [97] haven_2.4.0         withr_2.4.1         mgcv_1.8-34        
## [100] prettydoc_0.4.1     survival_3.2-10     modelr_0.1.8       
## [103] crayon_1.4.1        utf8_1.2.1          rmarkdown_2.7      
## [106] progress_1.2.2      grid_4.0.5          readxl_1.3.1       
## [109] data.table_1.14.0   vegan_2.5-7         reprex_2.0.0       
## [112] digest_0.6.27       stats4_4.0.5        munsell_0.5.0      
## [115] bslib_0.2.4

5. References

Bolyen, Evan, Jai Ram Rideout, Matthew R Dillon, Nicholas A Bokulich, Christian C Abnet, Gabriel A Al-Ghalith, Harriet Alexander, et al. 2019. “Reproducible, Interactive, Scalable and Extensible Microbiome Data Science Using Qiime 2.” Nature Biotechnology 37 (8): 852–57. https://doi.org/10.1038/s41587-019-0209-9.

Callahan, Benjamin J, Paul J McMurdie, Michael J Rosen, Andrew W Han, Amy Jo A Johnson, and Susan P Holmes. 2016. “DADA2: High-Resolution Sample Inference from Illumina Amplicon Data.” Nature Methods 13 (7): 581. https://doi.org/10.1038/nmeth.3869.

McMurdie, Paul J., and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS ONE 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.

Oksanen, Jari, F. Guillaume Blanchet, Michael Friendly, Roeland Kindt, Pierre Legendre, Dan McGlinn, Peter R. Minchin, et al. 2019. “Vegan: Community Ecology Package.” https://CRAN.R-project.org/package=vegan.

Siegel, Andrew F. 2004. “Rarefaction Curves.” Encyclopedia of Statistical Sciences 10. https://doi.org/10.1002/0471667196.ess2195.pub2.

Yu, Guangchuang, David Smith, Huachen Zhu, Yi Guan, and Tommy Tsan-Yuk Lam. 2017. “Ggtree: An R Package for Visualization and Annotation of Phylogenetic Trees with Their Covariates and Other Associated Data.” Methods in Ecology and Evolution 8 (1): 28–36. https://doi.org/10.1111/2041-210X.12628.