Geneplast is designed for evolutionary and plasticity analysis based on the distribution of orthologous groups in a given species tree. It uses Shannon information theory and ortholog abundance to estimate the Evolutionary Plasticity Index. Additionally, it implements the Bridge algorithm to determine the evolutionary root of a given gene based on the distribution of its orthologs.
geneplast 1.16.0
Geneplast is designed for evolutionary and plasticity analysis based on the distribution of orthologous groups in a given species tree. It uses Shannon information theory to estimate the Evolutionary Plasticity Index (EPI) (Dalmolin et al. 2011; Castro et al. 2008).
Figure 1 shows a toy example to illustrate the analysis. The observed items in Figure 1a are distributed evenly among the different species (i.e. high diversity), while Figure 1b shows the opposite case. The diversity is given by the normalized Shannon’s diversity and represents the distribution of orthologous and paralogous genes in a set of species. High diversity represents a homogeneous distribution among the evaluated species, while low diversity indicates that a few species concentrate most of the observed orthologous genes.
The EPI characterizes the evolutionary history of a given orthologous group (OG). It assesses the distribution of orthologs and paralogs and is defined as
\[EPI = 1 - \frac{H_{\alpha}}{\sqrt{D_{\alpha}}} \qquad (1)\]
where Dα represents the OG abundance and Hα the OG diversity. Low values of Dα combined with high values of Hα indicate an orthologous group of low plasticity, that is, few OG members distributed over many species. Such a profile also suggests that the OG has experienced few modifications (i.e. duplication and deletion episodes) during its evolution. Note that 0 ≤ Hα ≤ 1 and Dα ≥ 1; as a result, 0 ≤ EPI ≤ 1. For further information about the EPI, please see Dalmolin et al. (2011).
Figure 1. Toy examples showing the distribution of orthologous and paralogous genes in a given species tree. (a) OG of low abundance (Dα), high diversity (Hα) and consequently low plasticity (EPI). In this hypothetical case, the OG comprises orthologous genes observed in all species, without apparent deletion or duplication episodes. (b) In this example the OG is observed in many species, but not all, with many paralogs in some of them. Green numbers represent the number of orthologous genes in each species.
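The two scenarios of Figure 1 can be mimicked with a few lines of base R. This is only a sketch of equation (1), not the geneplast implementation: the gene counts are hypothetical, and two definitions are assumed rather than taken from this vignette, namely that abundance is the number of genes divided by the number of species in which the OG is present, and that diversity is the Shannon entropy of the per-species proportions normalized by the logarithm of the number of species in the tree.

```r
# Toy sketch of the quantities in equation (1); not the geneplast implementation.
# Assumptions (not confirmed by this vignette):
#   - abundance D = total genes / number of species in which the OG is present
#   - diversity H = Shannon entropy of the per-species gene proportions,
#     normalized by log(number of species in the tree)
og_abundance <- function(counts) {
  sum(counts) / sum(counts > 0)
}

og_diversity <- function(counts) {
  p <- counts[counts > 0] / sum(counts)   # proportions over species with genes
  -sum(p * log(p)) / log(length(counts))
}

og_epi <- function(counts) {
  1 - og_diversity(counts) / sqrt(og_abundance(counts))
}

# Hypothetical gene counts for 13 species
og_a <- rep(1, 13)                                 # one ortholog in every species
og_b <- c(0, 0, 6, 1, 0, 9, 1, 0, 0, 4, 1, 0, 0)   # absent in some, paralogs in others

toy <- data.frame(
  abundance  = c(og_abundance(og_a), og_abundance(og_b)),
  diversity  = c(og_diversity(og_a), og_diversity(og_b)),
  plasticity = c(og_epi(og_a), og_epi(og_b)),
  row.names  = c("og_a", "og_b")
)
print(round(toy, 4))
```

Under these assumptions og_a has abundance 1 and diversity 1, hence an EPI of 0, while og_b combines higher abundance with lower diversity and therefore scores a higher plasticity.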
geneplast also implements an algorithm called Bridge to infer the evolutionary root of a given gene based on the distribution of its orthologs. The Bridge algorithm assesses the probability that an ortholog of a given gene is present in each last common ancestor (LCA) of a given species (in a given species tree). As a result, this approach infers the evolutionary root representing the gene emergence. The method is designed to deal with large-scale queries in order to interrogate, for example, all genes annotated in a network (please refer to Castro et al. 2008 for a case study illustrating the advantages of using this approach).
To illustrate the rooting inference, consider the evolutionary scenarios presented in Figure 2 for the same hypothetical OGs. These OGs comprise a number of orthologous genes distributed among 13 species, and the pattern of presence or absence is indicated by green and grey colours, respectively. Observe that at least one ortholog is present in all extant species in Figure 2a. To explain this common genetic trait, one possible evolutionary scenario could assume that the ortholog was present in the LCA of all species and was genetically transmitted to its descendants. For this case, the evolutionary root might be placed at the bottom of the species tree (i.e. node g). The same reasoning applies to Figure 2b, but with the evolutionary root placed at node d. The geneplast rooting pipeline is designed to infer the most consistent rooting scenario for the observed orthologs in a given species tree. The pipeline provides a consistency score called Dscore, which estimates the stability of the inferred root, as well as an associated empirical p-value computed by permutation analysis.
Figure 2. Possible evolutionary rooting scenarios for the same toy examples depicted in Figure 1. (a, b) Red circles indicate the evolutionary roots that best explain the observed orthologs in this species tree.
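The reasoning behind Figure 2 can be sketched with a deliberately naive rule: place the root at the deepest LCA whose branch contains at least one species carrying an ortholog. This is not the Bridge algorithm (it involves no probabilities, no Dscore and no permutations); the species labels, the LCA numbering and the function naive_root are hypothetical and serve only to make the presence/absence logic concrete.

```r
# Naive presence/absence rooting; a simplified stand-in, NOT the Bridge algorithm.
# Hypothetical setup: each species is labelled with the LCA it shares with the
# reference species (1 = most recent ancestor, larger = deeper in the tree).
lca_of_species <- c(sp01 = 1, sp02 = 1, sp03 = 2, sp04 = 2, sp05 = 3,
                    sp06 = 3, sp07 = 4, sp08 = 4, sp09 = 5, sp10 = 5,
                    sp11 = 6, sp12 = 6, sp13 = 7)

naive_root <- function(present, lca) {
  # deepest LCA whose branch holds at least one species with an ortholog
  if (!any(present)) return(NA_real_)
  max(lca[present])
}

# Ortholog present in all 13 species (compare Figure 2a)
present_all <- rep(TRUE, 13)
# Ortholog present only in species that share LCAs 1 to 4 with the reference
present_some <- lca_of_species <= 4

naive_root(present_all, lca_of_species)    # 7
naive_root(present_some, lca_of_species)   # 4
```

With this rule a single ortholog observed in one distant species would move the root to the deepest node, which illustrates why a consistency score and a permutation-based p-value are useful when judging an inferred root.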
The orthology data required to run geneplast are available in the gpdata.gs dataset. This dataset includes four objects containing information about Clusters of Orthologous Groups derived from the STRING database, release 9.1. geneplast can also be used with other sources of orthology information, provided that the input is set according to the gpdata.gs data structure. Note that, in order to reduce the processing time, this example uses a subset of the STRING database.
library(geneplast)
data(gpdata.gs)
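When orthology data come from another source, the input might be assembled along the following lines. This is a sketch under assumptions: the three-column layout (protein, species, OG), the column names and all identifiers prefixed with my_ are illustrative and should be checked against the objects loaded above, for example with str(cogdata).

```r
# Hypothetical orthology table from another source. The three-column layout and
# the column names are assumptions; compare with str(cogdata) after data(gpdata.gs).
my_cogdata <- data.frame(
  protein_id = c("p1", "p2", "p3", "p4", "p5"),
  ssp_id     = c("9606", "9606", "10090", "7227", "7227"),
  cog_id     = c("OG_A", "OG_B", "OG_A", "OG_A", "OG_A"),
  stringsAsFactors = FALSE
)

# Species and OG identifiers derived from the table
my_sspids <- unique(my_cogdata$ssp_id)
my_cogids <- unique(my_cogdata$cog_id)

# Genes per OG and species: the raw counts behind abundance and diversity
og_by_species <- table(my_cogdata$cog_id, my_cogdata$ssp_id)
print(og_by_species)
```

Keeping identifiers as character strings avoids silent mismatches between numeric and character species IDs when the tables are later combined.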
The first step is to create an OGP object by running the gplast.preprocess function. This example uses 121 eukaryotic species from the STRING database and all OGs mapped to the genome stability gene network (Castro et al. 2008). Next, the gplast function performs the plasticity analysis and gplast.get returns the results:
1 - Create an object of class OGP.
ogp <- gplast.preprocess(cogdata=cogdata, sspids=sspids, cogids=cogids, verbose=FALSE)
2 - Run the gplast function.
ogp <- gplast(ogp, verbose=FALSE)
3 - Get results.
res <- gplast.get(ogp,what="results")
head(res)
## [1] abundance diversity plasticity
## <0 rows> (or 0-length row.names)
The results are returned in a 3-column data.frame with OG ids (cogids) identified in row.names. Columns are named abundance, diversity, and plasticity.
The abundance metric is the ratio of genes (orthologs and paralogs) to species. For example, KOG0011 comprises 201 genes distributed in 116 eukaryotic species, with a resulting abundance of 1.7328. An abundance of 1 indicates a one-to-one orthology relationship, while high abundance denotes many duplication episodes in the OG’s evolutionary history. Diversity is obtained by applying the normalized Shannon entropy to the distribution of orthologs, and plasticity is given by the EPI, as described in equation (1).
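The KOG0011 figures quoted above can be checked by hand. The snippet below reproduces the abundance and, as an illustration of equation (1), derives the range of EPI values that this abundance allows given that the diversity lies between 0 and 1; the variable names are illustrative only.

```r
# Abundance for KOG0011 from the figures quoted in the text
n_genes   <- 201   # genes in the OG
n_species <- 116   # species in which the OG is observed
abundance <- n_genes / n_species
round(abundance, 4)
# [1] 1.7328

# Range of EPI allowed by equation (1) for this abundance:
# diversity = 1 gives the lowest plasticity, diversity = 0 the highest
epi_range <- c(min = 1 - 1 / sqrt(abundance), max = 1 - 0 / sqrt(abundance))
print(epi_range)
```

For this abundance the EPI cannot fall below about 0.24, whatever the diversity, because the abundance term alone already departs from the one-to-one case.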
The rooting analysis starts with an OGR object created by the groot.preprocess function. This example uses all OGs mapped to the genome stability gene network, with H. sapiens as the reference species (Castro et al. 2008), and is set to perform 100 permutations for demonstration purposes (for a full analysis, please set nPermutations ≥ 1000). Next, the groot function performs the rooting analysis and the results are retrieved by groot.get, which returns a data.frame listing the root of each OG evaluated by the groot method. The pipeline also returns the consistency score (Dscore), which estimates the stability of the inferred root, as well as the associated empirical p-value. Additionally, the groot.plot function allows the visualization of the inferred root for a given OG (e.g. Figure 3) and the LCAs for the reference species (Figure 4).
1 - Create an object of class OGR.
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606", cogids=cogids, verbose=FALSE)
2 - Run the groot function.
set.seed(1)
ogr <- groot(ogr, nPermutations=100, verbose=FALSE)
3 - Get results.
res <- groot.get(ogr,what="results")
head(res)
## Root Dscore Pvalue AdjPvalue
## NOG251516 3 0.67 1.53e-07 2.17e-05
## NOG80202 4 1.00 2.15e-10 3.06e-08
## NOG72146 6 0.82 9.44e-07 1.34e-04
## NOG44788 6 0.56 8.51e-05 1.21e-02
## NOG39906 7 1.00 6.55e-08 9.30e-06
## NOG45364 9 0.83 2.13e-05 3.03e-03
4 - Check the inferred root of a given OG
groot.plot(ogr,whichOG="NOG40170")
## PDF file 'gproot_NOG40170_9606LCAs.pdf' has been generated!
5 - Visualization of the LCAs for the reference species in the analysis (i.e. H. sapiens)
groot.plot(ogr,plot.lcas = TRUE)
## PDF file 'gproot_9606LCAs.pdf' has been generated!
Figure 3. Inferred evolutionary rooting scenario for NOG40170. Monophyletic groups are ordered to show all branches of the tree below the queried species in the analysis.
Figure 4. Visualization of the LCAs for the reference species in the analysis.
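The table returned by groot.get in step 3 is an ordinary data.frame, so the usual base R tools apply. As a sketch, the six rows printed above are re-entered by hand and filtered by adjusted p-value; the 0.01 cutoff and the object names are hypothetical choices, not recommendations from the package.

```r
# First rows of the groot results shown above, re-entered by hand for illustration
res_head <- data.frame(
  Root      = c(3, 4, 6, 6, 7, 9),
  Dscore    = c(0.67, 1.00, 0.82, 0.56, 1.00, 0.83),
  Pvalue    = c(1.53e-07, 2.15e-10, 9.44e-07, 8.51e-05, 6.55e-08, 2.13e-05),
  AdjPvalue = c(2.17e-05, 3.06e-08, 1.34e-04, 1.21e-02, 9.30e-06, 3.03e-03),
  row.names = c("NOG251516", "NOG80202", "NOG72146",
                "NOG44788", "NOG39906", "NOG45364")
)

# Hypothetical cutoff; choose one appropriate to the study
alpha <- 0.01
supported <- res_head[res_head$AdjPvalue < alpha, ]

# Number of supported OGs per inferred root
print(table(supported$Root))
```

With this cutoff five of the six OGs are retained; NOG44788, whose adjusted p-value is 1.21e-02, is dropped.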
This example shows how to assess all OGs annotated for H. sapiens.
1 - Load orthology data from the geneplast.data.string.v91 package.
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install("geneplast.data.string.v91")
library(geneplast.data.string.v91)
data(gpdata_string_v91)
2 - Create an object of class ‘OGR’ for a reference ‘spid’.
ogr <- groot.preprocess(cogdata=cogdata, phyloTree=phyloTree, spid="9606")
3 - Run the groot function and infer the evolutionary roots.
Note: this step may take a long time to run due to the large number of OGs in the input data (also, the nPermutations argument is set to 100 for demonstration purposes only).
ogr <- groot(ogr, nPermutations=100, verbose=TRUE)
This example shows the evolutionary roots of a protein-protein interaction (PPI) network, mapping the appearance of each gene in a given species tree. The next steps show how to transfer evolutionary rooting information from geneplast to a graph model. Note: for this to work, the gene annotation available from the input PPI network needs to match the annotation available from the geneplast data (in this case, ENTREZ gene IDs are used to match the datasets).
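Before mapping, it can help to check how many network genes actually have a counterpart in the orthology data. The following base R sketch uses made-up ENTREZ-style identifiers and hypothetical variable names; in practice the two vectors would come from the network's vertex attributes and from the geneplast annotation.

```r
# Hypothetical ENTREZ-style gene IDs, for illustration only
network_ids   <- c("1017", "4609", "7157", "672", "99999")
geneplast_ids <- c("672", "1017", "7157", "5290")

shared    <- intersect(network_ids, geneplast_ids)  # genes that can be rooted
unmatched <- setdiff(network_ids, geneplast_ids)    # genes left without a root

# Fraction of network genes that can receive rooting information
coverage <- length(shared) / length(network_ids)
print(coverage)
print(unmatched)
```

Comparing IDs as character strings sidesteps mismatches between numeric and character representations of the same identifier.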
1 - Load a PPI network and required packages. The igraph object called ‘ppi.gs’ provides PPI information for apoptosis and genome-stability genes (Castro et al. 2008).
library(RedeR)
library(igraph)
library(RColorBrewer)
data(ppi.gs)
2 - Map rooting information on the igraph object.
g <- ogr2igraph(ogr, cogdata, ppi.gs, idkey = "ENTREZ")
3 - Adjust colors for rooting information.
pal <- brewer.pal(9, "RdYlBu")
color_col <- colorRampPalette(pal)(25) #set a color for each root!
g <- att.setv(g=g, from="Root", to="nodeColor", cols=color_col, na.col = "grey80", breaks = seq(1,25))
4 - Aesthetic adjustments for some graph attributes.
g <- att.setv(g = g, from = "SYMBOL", to = "nodeAlias")
E(g)$edgeColor <- "grey80"
V(g)$nodeLineColor <- "grey80"
5 - Send the igraph object to the RedeR interface.
rdp <- RedPort()
calld(rdp)
resetd(rdp)
addGraph(rdp, g)
addLegend.color(rdp, colvec=g$legNodeColor$scale, size=15, labvec=g$legNodeColor$legend, title="Roots represented in Fig4")
6 - Get apoptosis and genome-stability sub-networks.
g1 <- induced_subgraph(g=g, V(g)$name[V(g)$Apoptosis==1])
g2 <- induced_subgraph(g=g, V(g)$name[V(g)$GenomeStability==1])
7 - Group apoptosis and genome-stability genes into containers.
myTheme <- list(nestFontSize=25, zoom=80, isNest=TRUE, gscale=65, theme=2)
addGraph(rdp, g1, gcoord=c(25, 50), theme = c(myTheme, nestAlias="Apoptosis"))
addGraph(rdp, g2, gcoord=c(75, 50), theme = c(myTheme, nestAlias="Genome Stability"))
relax(rdp, p1=50, p2=50, p3=50, p4=50, p5= 50, ps = TRUE)
Figure 5. Inferred evolutionary roots of a protein-protein interaction network.
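The colour assignment in step 3 amounts to giving each root index its own colour and a neutral colour to genes without a root. The base R sketch below illustrates that idea only; it is not how att.setv is implemented, the three-colour ramp is a stand-in for the RdYlBu palette, and the gene names and root values are hypothetical.

```r
# Conceptual sketch of mapping root indices to colours with base R only.
# Hypothetical roots for four genes; NA means no root was inferred.
root <- c(geneA = 3, geneB = 25, geneC = NA, geneD = 1)

# One colour per root 1..25 (stand-in ramp; not the RdYlBu palette itself)
ramp <- colorRampPalette(c("red", "yellow", "blue"))(25)

# Index the ramp by root and fall back to grey for missing roots
node_color <- ramp[root]
node_color[is.na(root)] <- "grey80"
names(node_color) <- names(root)
print(node_color)
```

Fixing the number of colours to the number of possible roots keeps the colour of a given root identical across different networks drawn from the same analysis.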
This example shows the evolutionary roots of regulons (Fletcher et al. 2013). The idea is to map the appearance of each regulon (and the corresponding target genes) in a species tree. The next steps show how to transfer evolutionary rooting information from geneplast to a graph model. Note: for this to work, the gene annotation available from the input regulatory network needs to match the annotation available from the geneplast data (in this case, ENTREZ gene IDs are used to match the datasets).
1 - Load a TNI class object and required packages. The rtni1st object provides regulons available from the Fletcher2013b data package, computed from breast cancer data (Fletcher et al. 2013).
library(RTN)
library(Fletcher2013b)
library(RedeR)
library(igraph)
library(RColorBrewer)
data("rtni1st")
2 - Extract two regulons from rtni1st into an igraph object.
regs <- c("FOXM1","PTTG1")
g <- tni.graph(rtni1st, gtype = "rmap", tfs = regs)
3 - Map rooting information on the igraph object.
g <- ogr2igraph(ogr, cogdata, g, idkey = "ENTREZ")
4 - Adjust colors for rooting information.
pal <- brewer.pal(9, "RdYlBu")
color_col <- colorRampPalette(pal)(25) #set a color for each root!
g <- att.setv(g=g, from="Root", to="nodeColor", cols=color_col, na.col = "grey80", breaks = seq(1,25))
5 - Aesthetic adjustments for some graph attributes.
idx <- V(g)$SYMBOL %in% regs
V(g)$nodeFontSize[idx] <- 30
V(g)$nodeFontSize[!idx] <- 1
E(g)$edgeColor <- "grey80"
V(g)$nodeLineColor <- "grey80"
6 - Send the igraph object to the RedeR interface.
rdp <- RedPort()
calld(rdp)
resetd(rdp)
addGraph(rdp, g, layout=NULL)
addLegend.color(rdp, colvec=g$legNodeColor$scale, size=15, labvec=g$legNodeColor$legend, title="Roots represented in Fig4")
relax(rdp, 15, 100, 20, 50, 10, 100, 10, 2, ps=TRUE)
Figure 6. Inferred evolutionary roots of two regulators (FOXM1 and PTTG1) and the corresponding targets.
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] geneplast_1.16.0 BiocStyle_2.18.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 bookdown_0.21 lattice_0.20-41
## [4] ape_5.4-1 snow_0.4-3 digest_0.6.27
## [7] grid_4.0.3 nlme_3.1-150 magrittr_1.5
## [10] evaluate_0.14 rlang_0.4.8 stringi_1.5.3
## [13] data.table_1.13.2 rmarkdown_2.5 tools_4.0.3
## [16] stringr_1.4.0 igraph_1.2.6 parallel_4.0.3
## [19] xfun_0.18 yaml_2.2.1 compiler_4.0.3
## [22] pkgconfig_2.0.3 BiocManager_1.30.10 htmltools_0.5.0
## [25] knitr_1.30
Castro, Mauro AA, Rodrigo JS Dalmolin, Jose CF Moreira, Jose CM Mombach, and Rita MC de Almeida. 2008. “Evolutionary Origins of Human Apoptosis and Genome-Stability Gene Networks.” Nucleic Acids Research 36 (19):6269–83. https://doi.org/10.1093/nar/gkn636.
Dalmolin, Rodrigo JS, Mauro AA Castro, Jose Rybarczyk-Filho, Luis Souza, Rita MC de Almeida, and Jose CF Moreira. 2011. “Evolutionary Plasticity Determination by Orthologous Groups Distribution.” Biology Direct 6 (1):22. https://doi.org/10.1186/1745-6150-6-22.
Fletcher, Michael, Mauro Castro, Suet-Feung Chin, Oscar Rueda, Xin Wang, Carlos Caldas, Bruce Ponder, Florian Markowetz, and Kerstin Meyer. 2013. “Master Regulators of FGFR2 Signalling and Breast Cancer Risk.” Nature Communications 4:2464. https://doi.org/10.1038/ncomms3464.