BgeeDB, an R package for retrieval of curated expression datasets and for gene list enrichment tests

BgeeDB is a collection of functions to import data from the Bgee database (http://bgee.org/) directly into R, and to facilitate downstream analyses, such as gene set enrichment tests based on the expression of genes in anatomical structures. Bgee provides annotated and processed expression data and expression calls from curated wild-type healthy samples, from human and many other animal species.

The package retrieves the annotation of RNA-seq or Affymetrix experiments integrated into the Bgee database, and downloads into R the quantitative data and expression calls produced by the Bgee pipeline. The package can also run GO-like enrichment analyses on anatomical terms, in which genes are mapped to anatomical terms through their expression patterns; the analysis builds on the topGO package. This corresponds to the TopAnat web service available at http://bgee.org/?page=top_anat#/, but with more flexibility in the choice of parameters and developmental stages.

In summary, the BgeeDB package allows you to:

1. List the annotation of RNA-seq and microarray data available in the Bgee database
2. Download the processed gene expression data available in the Bgee database
3. Download the gene expression calls and use them to perform TopAnat analyses

If you find a bug or have any issue using BgeeDB, please file a report in our GitHub issue tracker at https://github.com/BgeeDB/BgeeDB_R/issues.

Installation

In R:

# Install BiocManager first if it is not already available
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install BgeeDB from Bioconductor
BiocManager::install("BgeeDB")

If you run BgeeDB on Windows, make sure that curl is installed. It is installed by default on Windows 10, version 1803 or later, and it also comes with Git for Windows. If curl is not installed, install it before running BgeeDB to avoid potential timeout errors when downloading files.
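A quick way to check from within R whether a curl executable is visible on the PATH is sketched below. It relies only on base R's Sys.which(), which returns an empty string when a program is not found. The variable names are illustrative, and checking the PATH is an assumption about how curl is located on your system.

```r
# Look up the curl executable on the PATH; Sys.which() returns "" if it is absent.
curl_path <- Sys.which("curl")
curl_found <- nzchar(curl_path)[[1]]

if (curl_found) {
  message("curl found at: ", curl_path)
} else {
  message("curl not found on the PATH; install it before downloading Bgee files.")
}
```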

How to use the BgeeDB package

Load the package

library(BgeeDB)

Running example: downloading and formatting processed RNA-seq data

List available species in Bgee

The listBgeeSpecies() function retrieves the species available in the Bgee database, along with the data types available for each species.

listBgeeSpecies()

## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 14_1...
## 
## Submitting URL to Bgee webservice... (https://r.bgee.org/bgee14_1/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.16.0)
## 
## Query to Bgee webservice successful!

##       ID           GENUS     SPECIES_NAME         COMMON_NAME AFFYMETRIX   EST
## 1   6239  Caenorhabditis          elegans            nematode       TRUE FALSE
## 2   7217      Drosophila        ananassae                          FALSE FALSE
## 3   7227      Drosophila     melanogaster           fruit fly       TRUE  TRUE
## 4   7230      Drosophila       mojavensis                          FALSE FALSE
## 5   7237      Drosophila    pseudoobscura                          FALSE FALSE
## 6   7240      Drosophila         simulans                          FALSE FALSE
## 7   7244      Drosophila          virilis                          FALSE FALSE
## 8   7245      Drosophila           yakuba                          FALSE FALSE
## 9   7955           Danio            rerio           zebrafish       TRUE  TRUE
## 10  8364         Xenopus       tropicalis western clawed frog      FALSE  TRUE
## 11  9031          Gallus           gallus             chicken      FALSE FALSE
## 12  9258 Ornithorhynchus         anatinus            platypus      FALSE FALSE
## 13  9365       Erinaceus        europaeus            hedgehog      FALSE FALSE
## 14  9544          Macaca          mulatta             macaque       TRUE FALSE
## 15  9593         Gorilla          gorilla             gorilla      FALSE FALSE
## 16  9597             Pan         paniscus              bonobo      FALSE FALSE
## 17  9598             Pan      troglodytes          chimpanzee      FALSE FALSE
## 18  9606            Homo          sapiens               human       TRUE  TRUE
## 19  9615           Canis lupus familiaris                 dog      FALSE FALSE
## 20  9685           Felis            catus                 cat      FALSE FALSE
## 21  9796           Equus         caballus               horse      FALSE FALSE
## 22  9823             Sus           scrofa                 pig      FALSE FALSE
## 23  9913             Bos           taurus              cattle      FALSE FALSE
## 24  9986     Oryctolagus        cuniculus              rabbit      FALSE FALSE
## 25 10090             Mus         musculus               mouse       TRUE  TRUE
## 26 10116          Rattus       norvegicus                 rat       TRUE FALSE
## 27 10141           Cavia        porcellus          guinea pig      FALSE FALSE
## 28 13616     Monodelphis        domestica             opossum      FALSE FALSE
## 29 28377          Anolis     carolinensis         green anole      FALSE FALSE
##    IN_SITU RNA_SEQ
## 1     TRUE    TRUE
## 2    FALSE    TRUE
## 3     TRUE    TRUE
## 4    FALSE    TRUE
## 5    FALSE    TRUE
## 6    FALSE    TRUE
## 7    FALSE    TRUE
## 8    FALSE    TRUE
## 9     TRUE    TRUE
## 10    TRUE    TRUE
## 11   FALSE    TRUE
## 12   FALSE    TRUE
## 13   FALSE    TRUE
## 14   FALSE    TRUE
## 15   FALSE    TRUE
## 16   FALSE    TRUE
## 17   FALSE    TRUE
## 18   FALSE    TRUE
## 19   FALSE    TRUE
## 20   FALSE    TRUE
## 21   FALSE    TRUE
## 22   FALSE    TRUE
## 23   FALSE    TRUE
## 24   FALSE    TRUE
## 25    TRUE    TRUE
## 26   FALSE    TRUE
## 27   FALSE    TRUE
## 28   FALSE    TRUE
## 29   FALSE    TRUE
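The printed result looks like an ordinary data frame, so it can presumably be filtered with standard R subsetting. The sketch below rebuilds three rows of the table above by hand, so that it runs without querying Bgee, and keeps the species that have both Affymetrix and RNA-seq data. In a real session one would assign the result of listBgeeSpecies() to a variable instead; the object names here are illustrative.

```r
# Three rows copied from the listing above; in practice: species <- listBgeeSpecies()
species <- data.frame(
  ID           = c(9031, 9606, 10090),
  GENUS        = c("Gallus", "Homo", "Mus"),
  SPECIES_NAME = c("gallus", "sapiens", "musculus"),
  AFFYMETRIX   = c(FALSE, TRUE, TRUE),
  RNA_SEQ      = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Keep species for which both data types are available
both <- species[species$AFFYMETRIX & species$RNA_SEQ, ]

# Build "Genus_species" names from the two name columns
species_names <- paste(both$GENUS, both$SPECIES_NAME, sep = "_")
print(species_names)
```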

It is possible to list all species from a specific release of Bgee with the release argument (see the listBgeeRelease() function), and to order the species according to a specific column with the ordering argument. For example:

listBgeeSpecies(release = "13.2", ordering = 2)

## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 13_2...
## 
## Submitting URL to Bgee webservice... (https://r.bgee.org/bgee13/?page=species&display_type=tsv&source=BgeeDB_R_package&source_version=2.16.0)
## 
## Query to Bgee webservice successful!

##       ID           GENUS SPECIES_NAME COMMON_NAME AFFYMETRIX   EST IN_SITU
## 17 28377          Anolis carolinensis      anolis      FALSE FALSE   FALSE
## 13  9913             Bos       taurus         cow      FALSE FALSE   FALSE
## 1   6239  Caenorhabditis      elegans   c.elegans       TRUE FALSE    TRUE
## 3   7955           Danio        rerio   zebrafish       TRUE  TRUE    TRUE
## 2   7227      Drosophila melanogaster    fruitfly       TRUE  TRUE    TRUE
## 5   9031          Gallus       gallus     chicken      FALSE FALSE   FALSE
## 8   9593         Gorilla      gorilla     gorilla      FALSE FALSE   FALSE
## 11  9606            Homo      sapiens       human       TRUE  TRUE   FALSE
## 7   9544          Macaca      mulatta     macaque      FALSE FALSE   FALSE
## 16 13616     Monodelphis    domestica     opossum      FALSE FALSE   FALSE
## 14 10090             Mus     musculus       mouse       TRUE  TRUE    TRUE
## 6   9258 Ornithorhynchus     anatinus    platypus      FALSE FALSE   FALSE
## 9   9597             Pan     paniscus      bonobo      FALSE FALSE   FALSE
## 10  9598             Pan  troglodytes  chimpanzee      FALSE FALSE   FALSE
## 15 10116          Rattus   norvegicus         rat      FALSE FALSE   FALSE
## 12  9823             Sus       scrofa         pig      FALSE FALSE   FALSE
## 4   8364         Xenopus   tropicalis     xenopus      FALSE  TRUE    TRUE
##    RNA_SEQ
## 17    TRUE
## 13    TRUE
## 1     TRUE
## 3    FALSE
## 2    FALSE
## 5     TRUE
## 8     TRUE
## 11    TRUE
## 7     TRUE
## 16    TRUE
## 14    TRUE
## 6     TRUE
## 9     TRUE
## 10    TRUE
## 15    TRUE
## 12    TRUE
## 4     TRUE

Create a new Bgee object

In the following example we focus on mouse ("Mus_musculus") RNA-seq data. Species can be specified by name or by NCBI taxonomic ID. To download RNA-seq data, set the dataType argument to "rna_seq". To download Affymetrix microarray data, set it to "affymetrix".

bgee <- Bgee$new(species = "Mus_musculus", dataType = "rna_seq")

## 
## Querying Bgee to get release information...
## 
## Building URL to query species in Bgee release 14_1...
## 
## Submitting URL to Bgee webservice... (https://r.bgee.org/bgee14_1/?page=r_package&action=get_all_species&display_type=tsv&source=BgeeDB_R_package&source_version=2.16.0)
## 
## Query to Bgee webservice successful!
## 
## API key built: 3fe87320130948088207ebd4acd1a7639f7b35b41e815d808675e552b0a66e36ea51adc79754706197a2b148890fce891ae232735ff9c5eeccc6bda5af816172

Note 1: It is possible to work with data from a specific release of Bgee by specifying the release argument, see listBgeeRelease() function.

Note 2: The functions of the package will store the downloaded files in a versioned folder created by default in the working directory. These cache files allow faster re-access to the data. The directory where data are stored can be changed with the pathToData argument.

Retrieve the annotation of mouse RNA-seq datasets

The getAnnotation() function will output the list of RNA-seq experiments and libraries available in Bgee for mouse.

annotation_bgee_mouse <- getAnnotation(bgee)

## 
## Saved annotation files in /tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Mus_musculus_Bgee_14_1 folder.

# list the first experiments and libraries
lapply(annotation_bgee_mouse, head)

## $sample.annotation
##   Experiment.ID Library.ID Anatomical.entity.ID Anatomical.entity.name
## 1     ERP104395 ERX2187486       UBERON:0000082 adult mammalian kidney
## 2     ERP104395 ERX2187487       UBERON:0000082 adult mammalian kidney
## 3     ERP104395 ERX2187488       UBERON:0000082 adult mammalian kidney
## 4     ERP104395 ERX2187498       UBERON:0000945                stomach
## 5     ERP104395 ERX2187499       UBERON:0000945                stomach
## 6     ERP104395 ERX2187500       UBERON:0000945                stomach
##         Stage.ID                Stage.name  Sex   Strain         Platform.ID
## 1 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
## 2 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
## 3 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
## 4 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
## 5 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
## 6 UBERON:0000113 post-juvenile adult stage male C57BL/6J Illumina HiSeq 2000
##   Library.type Library.orientation TMM.normalization.factor
## 1       single                  NA                 1.183867
## 2       single                  NA                 1.211967
## 3       single                  NA                 1.150353
## 4       single                  NA                 1.152138
## 5       single                  NA                 1.009166
## 6       single                  NA                 1.101729
##   TPM.expression.threshold FPKM.expression.threshold Read.count
## 1                 0.043961                  0.036158   22187319
## 2                 0.046119                  0.038656   29584966
## 3                 0.032360                  0.026583   25448291
## 4                 0.036161                  0.032999   25212673
## 5                 0.024721                  0.025872   25083482
## 6                 0.030623                  0.027480   23784442
##   Mapped.read.count Min.read.length Max.read.length All.genes.percent.present
## 1          19888728              50              50                     43.22
## 2          26170492              50              50                     45.06
## 3          23512513              50              50                     43.28
## 4          22594646              50              50                     45.31
## 5          22776315              50              50                     44.15
## 6          21418113              50              50                     43.96
##   Protein.coding.genes.percent.present Intergenic.regions.percent.present
## 1                                71.69                              15.55
## 2                                72.45                              16.37
## 3                                72.02                              15.31
## 4                                75.26                              16.87
## 5                                73.94                              15.40
## 6                                73.67                              15.72
##      Run.IDs Data.source
## 1 ERR2130634         SRA
## 2 ERR2130635         SRA
## 3 ERR2130636         SRA
## 4 ERR2130646         SRA
## 5 ERR2130647         SRA
## 6 ERR2130648         SRA
##                                                                              Data.source.URL
## 1 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187486
## 2 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187487
## 3 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187488
## 4 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187498
## 5 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187499
## 6 http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERX2187500
##                                                                                                                           Bgee.normalized.data.URL
## 1 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 2 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 3 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 4 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 5 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 6 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
##                                                  Raw.file.URL
## 1 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187486
## 2 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187487
## 3 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187488
## 4 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187498
## 5 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187499
## 6 https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=ERX2187500
## 
## $experiment.annotation
##   Experiment.ID
## 1     ERP104395
## 2      GSE30617
## 3     SRP028336
## 4      GSE41637
## 5      GSE30352
## 6     SRP061644
##                                                                                          Experiment.name
## 1                                                        An RNASeq normal tissue atlas for mouse and rat
## 2                                                                       [E-MTAB-599] Mouse Transcriptome
## 3                                            Large-scale multi-species survey of metabolome and lipidome
## 4                              Evolutionary dynamics of gene and isoform regulation in mammalian tissues
## 5                                            The evolution of gene expression levels in mammalian organs
## 6 The relationship between gene network structure and expression variation among individuals and species
##   Library.count Condition.count Organ.stage.count Organ.count Stage.count
## 1            38              13                13          13           1
## 2            36               6                 6           6           1
## 3            30              10                 5           5           1
## 4            26              26                 9           9           1
## 5            17              17                 6           6           1
## 6            17               6                 6           2           3
##   Sex.count Strain.count Data.source
## 1         1            2         SRA
## 2         0            1         GEO
## 3         2            1         SRA
## 4         1            6         GEO
## 5         2            2         GEO
## 6         0            1         SRA
##                                              Data.source.URL
## 1                  http://www.ncbi.nlm.nih.gov/sra/ERP104395
## 2 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30617
## 3                  http://www.ncbi.nlm.nih.gov/sra/SRP028336
## 4 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE41637
## 5 http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30352
## 6                  http://www.ncbi.nlm.nih.gov/sra/SRP061644
##                                                                                                                           Bgee.normalized.data.URL
## 1 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_ERP104395.tsv.tar.gz
## 2  ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30617.tsv.tar.gz
## 3 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP028336.tsv.tar.gz
## 4  ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE41637.tsv.tar.gz
## 5  ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_GSE30352.tsv.tar.gz
## 6 ftp://ftp.bgee.org/bgee_v14_1/download/processed_expr_values/rna_seq/Mus_musculus/Mus_musculus_RNA-Seq_read_counts_TPM_FPKM_SRP061644.tsv.tar.gz
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     Experiment.description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                The function of a gene is closely connected to its expression specificity across tissues and cell types. RNA-Seq is a powerful quantitative tool to explore genome wide expression. The aim of the present study is to provide a comprehensive RNA-Seq dataset across the same 13 tissues for mouse and rat, two of the most relevant species for biomedical research. The dataset provides the transcriptome across tissues from three male C57BL6 mice and three male Han Wistar rats. We also describe our bioinformatics pipeline to process and technically validate the data. Principal component analysis shows that tissue samples from both species cluster similarly. By comparative genomics we show that many genes with high sequence identity with respect to their human orthologues have also a highly correlated tissue distribution profile and are in agreement with manually curated literature data for human. These results make us confident that the present study provides a unique resource for comparative genomics and will facilitate the analysis of tissue specificity and cross-species conservation in higher organisms.
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Sequencing the transcriptome of DBAxC57BL/6J mice. To study the regulation of transcription, splicing and RNA turnover we have sequenced the transcriptomes of tissues collected DBAxC57BL/6J mice.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         This dataset was generated with the goal of comparative study of gene expression in three brain regions and two non-neural tissues of humans, chimpanzees, macaque monkeys and mice. Using this dataset, we performed studies of gene expression and gene splicing evolution across species and search of tissue-specific gene expression and splicing patterns. We also used the gene expression information of genes encoding metabolic enzymes in this dataset to support a larger comparative study of metabolome evolution in the same set of tissues and species. Overall design: 120 tissue samples of prefrontal cortex (PFC), primary visual cortex (VC), cerebellar cortex (CBC), kidney and skeletal muscle of humans, chimpanzees, macaques and mice. The data accompanies a large set of metabolite measurements of the same tissue samples. Enzyme expression was used to validate metabolite measurement variation among species.
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           Most mammalian genes produce multiple distinct mRNAs through alternative splicing, but the extent of splicing conservation is not clear. To assess tissue-specific transcriptome variation across mammals, we sequenced cDNA from 9 tissues from 4 mammals and one bird in biological triplicate, at unprecedented depth. We find that while tissue-specific gene expression programs are largely conserved, alternative splicing is well conserved in only a subset of tissues and is frequently lineage-specific. Thousands of novel, lineage-specific and conserved alternative exons were identified; widely conserved alternative exons had signatures of binding by MBNL, PTB, RBFOX, STAR and TIA family splicing factors, implicating them as ancestral mammalian splicing regulators. Our data also indicates that alternative splicing is often used to alter protein phosphorylatability, delimiting the scope of kinase signaling.
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Changes in gene expression are thought to underlie many of the phenotypic differences between species. However, large-scale analyses of gene expression evolution were until recently prevented by technological limitations. Here we report the sequencing of polyadenylated RNA from six organs across ten species that represent all major mammalian lineages (placentals, marsupials and monotremes) and birds (the evolutionary outgroup), with the goal of understanding the dynamics of mammalian transcriptome evolution. We show that the rate of gene expression evolution varies among organs, lineages and chromosomes, owing to differences in selective pressures: transcriptome change was slow in nervous tissues and rapid in testes, slower in rodents than in apes and monotremes, and rapid for the X chromosome right after its formation. Although gene expression evolution in mammals was strongly shaped by purifying selection, we identify numerous potentially selectively driven expression switches, which occurred at different rates across lineages and tissues and which probably contributed to the specific organ biology of various mammals. Our transcriptome data provide a valuable resource for functional and evolutionary analyses of mammalian genomes.
## 6 Variation among individuals is a prerequisite of evolution by natural selection. As such, identifying the origins of variation is a fundamental goal of biology. We investigated the link between gene interactions and variation in gene expression among individuals and species, using the mammalian limb as a model system. We first built interaction networks for key genes regulating early (outgrowth; E9.5-11) and late (expansion and elongation; E11-13) limb development in mouse. This resulted in an Early (ESN) and Late (LSN) Stage Network. Computational perturbations of these networks suggest that the ESN is more robust. We then quantified levels of the same key genes among mouse individuals, and found that they vary less at earlier limb stages and that variation in gene expression is heritable. Finally, we quantified variation in gene expression levels among four mammals with divergent limbs (bat, opossum, mouse and pig), and found that levels vary less among species at earlier limb stages. We also found that variation in gene expression levels among individuals and species are correlated for earlier and later limb development. In conclusion, results are consistent with the robustness of the ESN buffering among-individual variation in gene expression levels early in mammalian limb development, and constraining the evolution of early limb development among mammalian species. Overall design: Bat, mouse, opossum, and pig mRNA profiles at early and late developmental stages on each species fore and hind-limbs . Various replicates of each library were generated by single-end sequencing using Illumina HiSeq 2500. Please note that the De novo transcriptome assembly for bat (Trinity.fasta) was generated from pooled RNA-seq data of fore and hind-limbs at various embryonic developmental stages; Beginning stage (Wanek stage 2: 3 FL and 3 HL samples), early-stage (Wanek stage 3/4: 2 FL and 2 HL samples), and late_stage (Wanke stage 6: 2 FL and 2 HL samples).
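The annotation can then be used to choose which libraries or experiments to work with. As a minimal sketch, the code below recreates a few columns of the sample.annotation table shown above by hand, so that it runs offline, then selects the stomach libraries and counts libraries per anatomical entity. With real data, one would use annotation_bgee_mouse$sample.annotation in place of the hand-built data frame; the other object names are illustrative.

```r
# A hand-built excerpt of the sample annotation shown above
sample_annotation <- data.frame(
  Experiment.ID          = rep("ERP104395", 6),
  Library.ID             = c("ERX2187486", "ERX2187487", "ERX2187488",
                             "ERX2187498", "ERX2187499", "ERX2187500"),
  Anatomical.entity.ID   = rep(c("UBERON:0000082", "UBERON:0000945"), each = 3),
  Anatomical.entity.name = rep(c("adult mammalian kidney", "stomach"), each = 3),
  stringsAsFactors = FALSE
)

# Libraries annotated to stomach (UBERON:0000945)
stomach_libraries <- sample_annotation$Library.ID[
  sample_annotation$Anatomical.entity.ID == "UBERON:0000945"
]
print(stomach_libraries)

# Number of libraries per anatomical entity
libraries_per_entity <- table(sample_annotation$Anatomical.entity.name)
print(libraries_per_entity)
```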

Download the processed mouse RNA-seq data

The getData() function will download processed RNA-seq data from all mouse experiments in Bgee as a data frame.

# download all RNA-seq experiments from mouse
data_bgee_mouse <- getData(bgee)

## 
## NOTE: annotation files for this species were found in the download directory /tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Mus_musculus_Bgee_14_1. Data will not be redownloaded.

## downloading data from Bgee FTP...

## You tried to download more than 15 experiments, because of that all the Bgee data for this species will be downloaded.

## Downloading all expression data for species Mus_musculus

## Saved expression data file in/tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Mus_musculus_Bgee_14_1folder. Now untar file...

## Finished uncompress tar files

## Save data in local sqlite database

## Load queried data. The query is : SELECT * from rna_seq

# number of columns in the returned data frame
length(data_bgee_mouse)

## [1] 16

# check the downloaded data
lapply(data_bgee_mouse, head)

## $Experiment.ID
## [1] "SRP007655" "SRP007655" "SRP007655" "SRP007655" "SRP007655" "SRP007655"
## 
## $Library.ID
## [1] "SRX088978" "SRX088978" "SRX088978" "SRX088978" "SRX088978" "SRX088978"
## 
## $Library.type
## [1] "paired" "paired" "paired" "paired" "paired" "paired"
## 
## $Gene.ID
## [1] "ENSMUSG00000000001" "ENSMUSG00000000003" "ENSMUSG00000000028"
## [4] "ENSMUSG00000000031" "ENSMUSG00000000037" "ENSMUSG00000000049"
## 
## $Anatomical.entity.ID
## [1] "UBERON:0003902" "UBERON:0003902" "UBERON:0003902" "UBERON:0003902"
## [5] "UBERON:0003902" "UBERON:0003902"
## 
## $Anatomical.entity.name
## [1] "\"retinal neural layer\"" "\"retinal neural layer\""
## [3] "\"retinal neural layer\"" "\"retinal neural layer\""
## [5] "\"retinal neural layer\"" "\"retinal neural layer\""
## 
## $Stage.ID
## [1] "MmusDv:0000062" "MmusDv:0000062" "MmusDv:0000062" "MmusDv:0000062"
## [5] "MmusDv:0000062" "MmusDv:0000062"
## 
## $Stage.name
## [1] "\"2 month-old stage (mouse)\"" "\"2 month-old stage (mouse)\""
## [3] "\"2 month-old stage (mouse)\"" "\"2 month-old stage (mouse)\""
## [5] "\"2 month-old stage (mouse)\"" "\"2 month-old stage (mouse)\""
## 
## $Sex
## [1] "mixed" "mixed" "mixed" "mixed" "mixed" "mixed"
## 
## $Strain
## [1] "\"C57BL/6J - 129SvEvTac\"" "\"C57BL/6J - 129SvEvTac\""
## [3] "\"C57BL/6J - 129SvEvTac\"" "\"C57BL/6J - 129SvEvTac\""
## [5] "\"C57BL/6J - 129SvEvTac\"" "\"C57BL/6J - 129SvEvTac\""
## 
## $Read.count
## [1] 744.00000   0.00000  87.99996  10.00001  42.00001   4.00000
## 
## $TPM
## [1] 26.445908  0.000000  5.613535  0.738085  1.626799  0.401370
## 
## $FPKM
## [1] 14.988300  0.000000  3.181488  0.418312  0.921993  0.227478
## 
## $Detection.flag
## [1] "present" "absent"  "present" "present" "present" "present"
## 
## $Detection.quality
## [1] "high quality" "high quality" "high quality" "high quality" "high quality"
## [6] "high quality"
## 
## $State.in.Bgee
## [1] "Part of a call" "Part of a call" "Part of a call" "Part of a call"
## [5] "Part of a call" "Part of a call"

# isolate the rows belonging to the first experiment
data_bgee_experiment1 <- data_bgee_mouse[data_bgee_mouse$Experiment.ID == data_bgee_mouse$Experiment.ID[1], ]

The result of the getData() function is a data frame. Each row is a gene, and the expression levels are given as raw read counts, RPKMs (up to Bgee 13.2), TPMs (from Bgee 14.0), or FPKMs (from Bgee 14.0). A detection flag indicates whether the gene is significantly expressed above the background level of expression.

Note: If microarray data are downloaded, rows corresponding to probesets and log2 of expression intensities are available instead of read counts/RPKMs/TPMs/FPKMs.
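
Because the result is a single long-format data frame, the usual base R tools apply. The sketch below uses a small hand-made data frame that borrows a few column names from the output above (the identifiers and values are invented for illustration) to count experiments, split the rows by experiment, and keep only the genes flagged as present.

```r
# Toy stand-in for the data frame returned by getData(); all values are invented
toy_data <- data.frame(
  Experiment.ID  = c("EXP1", "EXP1", "EXP2", "EXP2"),
  Library.ID     = c("LIB1", "LIB1", "LIB2", "LIB2"),
  Gene.ID        = c("GENE_A", "GENE_B", "GENE_A", "GENE_B"),
  TPM            = c(26.4, 0, 5.6, 0.7),
  Detection.flag = c("present", "absent", "present", "present"),
  stringsAsFactors = FALSE
)

# number of distinct experiments (length() on a data frame counts columns)
n_experiments <- length(unique(toy_data$Experiment.ID))

# one data frame per experiment
by_experiment <- split(toy_data, toy_data$Experiment.ID)

# keep only the rows where the gene is called present
present_only <- toy_data[toy_data$Detection.flag == "present", ]

n_experiments
names(by_experiment)
nrow(present_only)
```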

Alternatively, you can choose to download data from one RNA-seq experiment from Bgee with the experimentId parameter:

# download data for GSE30617
data_bgee_mouse_gse30617 <- getData(bgee, experimentId = "GSE30617")

## 
## NOTE: annotation files for this species were found in the download directory /tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Mus_musculus_Bgee_14_1. Data will not be redownloaded.

## Load queried data. The query is : SELECT * from rna_seq WHERE [Experiment.ID] = "GSE30617"

It is possible to download data by combining filters:
* experimentId: one or more experiment IDs
* sampleId: one or more sample IDs (i.e., library ID for RNA-seq and chip ID for Affymetrix)
* anatEntityId: one or more anatomical entity IDs from the UBERON ontology (https://uberon.github.io/)
* stageId: one or more developmental stage IDs from the developmental stage ontologies (https://github.com/obophenotype/developmental-stage-ontologies)

# Examples of data downloading using different filtering combination
# retrieve mouse RNA-Seq data for heart or brain
data_bgee_mouse_filters <- getData(bgee, anatEntityId = c("UBERON:0000955","UBERON:0000948"))
# retrieve mouse RNA-Seq data for brain (UBERON:0000955) or heart (UBERON:0000948) part of the experiment GSE30617
data_bgee_mouse_filters <- getData(bgee, experimentId = "GSE30617", anatEntityId = c("UBERON:0000955","UBERON:0000948"))
# retrieve mouse RNA-Seq data for brain (UBERON:0000955) or heart (UBERON:0000948) from post-embryonic stage (UBERON:0000092)
data_bgee_mouse_filters <- getData(bgee, stageId = "UBERON:0000092", anatEntityId = c("UBERON:0000955","UBERON:0000948"))
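
Judging from the comments in the examples above, values given to the same filter are alternatives (one organ or the other), while different filters are applied together (a given experiment and one of the organs). The sketch below reproduces that selection rule on an invented data frame with base R subsetting. The experiment and organ identifiers are real, but their pairing in the rows is made up, and this is not how BgeeDB queries its local database.

```r
# Toy annotation table; the row contents are invented for illustration
toy_data <- data.frame(
  Experiment.ID        = c("GSE30617", "GSE30617", "GSE30617", "SRP007655"),
  Anatomical.entity.ID = c("UBERON:0000955", "UBERON:0000948",
                           "UBERON:0002107", "UBERON:0000955"),
  stringsAsFactors = FALSE
)

wanted_experiments <- "GSE30617"
wanted_organs <- c("UBERON:0000955", "UBERON:0000948")

# OR within a filter (%in%), AND between filters (&)
keep <- toy_data$Experiment.ID %in% wanted_experiments &
  toy_data$Anatomical.entity.ID %in% wanted_organs

selected <- toy_data[keep, ]
selected
```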

Format the RNA-seq data

It is sometimes easier to work with data organized as a matrix, where rows represent genes or probesets and columns represent different samples. The formatData() function reformats the data into an ExpressionSet object including:
* An expression data matrix, with genes or probesets as rows, and samples as columns (assayData slot). The stats argument selects whether the matrix is filled with read counts, RPKMs (up to Bgee 13.2), FPKMs (from Bgee 14.0), or TPMs (from Bgee 14.0) for RNA-seq data. For microarray data the matrix is filled with log2 expression intensities.
* A data frame listing the samples and their anatomical structure and developmental stage annotation (phenoData slot)
* For microarray data, the mapping from probesets to Ensembl genes (featureData slot)

The callType argument restricts the matrix to actively expressed genes or probesets when set to “present” or “present high quality”. Genes or probesets that are absent in a given sample are given NA values.

# use only present calls and fill expression matrix with FPKM values
gene.expression.mouse.fpkm <- formatData(bgee, data_bgee_mouse_gse30617, callType = "present", stats = "fpkm")

## 
## Extracting expression data matrix...
##   Keeping only present genes.
## 
## Extracting features information...
## 
## Extracting samples information...

gene.expression.mouse.fpkm

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 47729 features, 36 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: ERX012344 ERX012345 ... ERX012379 (36 total)
##   varLabels: Library.ID Anatomical.entity.ID ... Stage.name (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000109578 (47729 total)
##   fvarLabels: Gene.ID
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
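
The reshaping performed by formatData() can be pictured with base R alone. The sketch below is not the package's implementation: it takes an invented long-format data frame, blanks out the values of genes called absent, and fills a genes-by-libraries matrix, which is the layout of the expression matrix described above.

```r
# Toy long-format data; identifiers and values are invented
toy_data <- data.frame(
  Library.ID     = c("LIB1", "LIB1", "LIB2", "LIB2"),
  Gene.ID        = c("GENE_A", "GENE_B", "GENE_A", "GENE_B"),
  FPKM           = c(15.0, 0.0, 3.2, 0.4),
  Detection.flag = c("present", "absent", "present", "present"),
  stringsAsFactors = FALSE
)

# absent calls become NA, mimicking the effect of callType = "present"
values <- ifelse(toy_data$Detection.flag == "present", toy_data$FPKM, NA_real_)

genes <- unique(toy_data$Gene.ID)
libraries <- unique(toy_data$Library.ID)
expr_matrix <- matrix(NA_real_, nrow = length(genes), ncol = length(libraries),
                      dimnames = list(genes, libraries))

# fill each (gene, library) cell, addressing cells by row and column name
expr_matrix[cbind(toy_data$Gene.ID, toy_data$Library.ID)] <- values
expr_matrix
```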

Running example: TopAnat gene expression enrichment analysis

For some documentation on the TopAnat analysis, please refer to our publications, or to the web-tool page (http://bgee.org/?page=top_anat#/).

Create a new Bgee object

Similarly to the quantitative data download example above, the first step of a topAnat analysis is to build an object from the Bgee class. For this example, we will focus on zebrafish:

# Creating new Bgee class object
bgee <- Bgee$new(species = "Danio_rerio")

## 
## NOTE: You did not specify any data type. The argument dataType will be set to c("rna_seq","affymetrix","est","in_situ") for the next steps.
## 
## Querying Bgee to get release information...
## 
## NOTE: the file describing Bgee species information for release 14_1 was found in the download directory /tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes. Data will not be redownloaded.
## 
## API key built: 3fe87320130948088207ebd4acd1a7639f7b35b41e815d808675e552b0a66e36ea51adc79754706197a2b148890fce891ae232735ff9c5eeccc6bda5af816172

Note: Any data type of interest can be specified with the dataType argument, among rna_seq, affymetrix, est or in_situ, or even a combination of data types. If nothing is specified, as in the above example, all data types available for the targeted species are used. This is equivalent to specifying dataType=c("rna_seq","affymetrix","est","in_situ").

Download the data allowing to perform TopAnat analysis

The loadTopAnatData() function loads a mapping from genes to anatomical structures based on calls of expression in anatomical structures. It also loads the structure of the anatomical ontology and the names of anatomical structures.

# Loading calls of expression
myTopAnatData <- loadTopAnatData(bgee)

## 
## Building URLs to retrieve organ relationships from Bgee.........
##    URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entity_relations&display_type=tsv&species_list=7955&attr_list=SOURCE_ID&attr_list=TARGET_ID&api_key=3fe87320130948088207ebd4acd1a7639f7b35b41e815d808675e552b0a66e36ea51adc79754706197a2b148890fce891ae232735ff9c5eeccc6bda5af816172&source=BgeeDB_R_package&source_version=2.16.0)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Danio_rerio_Bgee_14_1"
## 
## Building URLs to retrieve organ names from Bgee.................
##    URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_anat_entities&display_type=tsv&species_list=7955&attr_list=ID&attr_list=NAME&api_key=3fe87320130948088207ebd4acd1a7639f7b35b41e815d808675e552b0a66e36ea51adc79754706197a2b148890fce891ae232735ff9c5eeccc6bda5af816172&source=BgeeDB_R_package&source_version=2.16.0)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Danio_rerio_Bgee_14_1"
## 
## Building URLs to retrieve mapping of gene to organs from Bgee...
##    URL successfully built (https://r.bgee.org/bgee14_1/?page=r_package&action=get_expression_calls&display_type=tsv&species_list=7955&attr_list=GENE_ID&attr_list=ANAT_ENTITY_ID&api_key=3fe87320130948088207ebd4acd1a7639f7b35b41e815d808675e552b0a66e36ea51adc79754706197a2b148890fce891ae232735ff9c5eeccc6bda5af816172&source=BgeeDB_R_package&source_version=2.16.0&data_qual=SILVER)
##    Submitting URL to Bgee webservice (can be long)
##    Got results from Bgee webservice. Files are written in "/tmp/RtmpWj2bWw/Rbuild6b1368972677/BgeeDB/vignettes/Danio_rerio_Bgee_14_1"
## 
## Parsing the results.............................................
## 
## Adding BGEE:0 as unique root of all terms of the ontology.......
## 
## Done.

# Look at the data
## str(myTopAnatData)

The stringency on the quality of expression calls can be changed with the confidence argument. Finally, if you are interested in expression data coming from a particular developmental stage or a group of stages, specify a Uberon stage ID in the stage argument.

## Loading silver and gold expression calls from affymetrix data made on embryonic samples only 
## This is just given as an example, but is not run in this vignette because only few data are returned
bgee <- Bgee$new(species = "Danio_rerio", dataType="affymetrix")
myTopAnatData <- loadTopAnatData(bgee, stage="UBERON:0000068", confidence="silver")

Note 1: As mentioned above, the downloaded data files are stored in a versioned folder that can be set with the pathToData argument when creating the Bgee class object (default is the working directory). If you query Bgee again with the exact same parameters, these cached files will be read instead of querying the web-service again. If you plan to reuse the same data for multiple parallel topAnat analyses, it is therefore important to rely on these cached files instead of redownloading the data for each analysis. The cached files also make it possible to repeat analyses offline.
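
The caching behaviour described in Note 1 follows a common pattern: look for a file in the data folder first, and only fetch when it is missing. The helper below, cached_fetch(), is hypothetical and not part of BgeeDB. It only sketches that pattern with base R, using a temporary folder in place of the pathToData folder and a dummy function in place of the web-service query.

```r
# Hypothetical helper: return the cached value if present, otherwise fetch and store it
cached_fetch <- function(path, fetch) {
  if (file.exists(path)) {
    return(readRDS(path))  # reuse the cached copy
  }
  value <- fetch()         # slow step, e.g. a web-service query
  saveRDS(value, path)
  value
}

cache_file <- file.path(tempdir(), "toy_cache_example.rds")
unlink(cache_file)  # start from a clean state

# dummy "download" that counts how many times it is really called
n_fetches <- 0
slow_fetch <- function() {
  n_fetches <<- n_fetches + 1
  c(a = 1, b = 2)
}

first  <- cached_fetch(cache_file, slow_fetch)
second <- cached_fetch(cache_file, slow_fetch)  # served from the cache
n_fetches
```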

Note 2: In releases up to Bgee 13.2, the allowed confidence values were low_quality or high_quality. From Bgee 14.0, the confidence values are gold or silver.

Prepare a topAnatData object allowing to perform TopAnat analysis with topGO

First we need to prepare a list of genes in the foreground and in the background. The input format is the same as the gene list required to build a topGOdata object in the topGO package: a vector with background genes as names, and 0 or 1 values depending on whether a gene is in the foreground or not. In this example we will look at genes with an annotated phenotype related to “pectoral fin”. We use the biomaRt package to retrieve this list of genes. We expect them to be enriched for expression in fins and related structures. The background list of genes is set to all genes with at least one phenotype annotation from ZFIN, which accounts for biases in which types of genes are more likely to receive a phenotype annotation.

# if (!requireNamespace("BiocManager", quietly=TRUE))
    # install.packages("BiocManager")
# BiocManager::install("biomaRt")
library(biomaRt)
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", dataset="drerio_gene_ensembl", host="mar2016.archive.ensembl.org")

# get the mapping of Ensembl genes to phenotypes. It corresponds to the background genes
universe <- getBM(filters=c("phenotype_source"), values=c("ZFIN"), attributes=c("ensembl_gene_id","phenotype_description"), mart=ensembl)

# select phenotypes related to pectoral fin
phenotypes <- grep("pectoral fin", unique(universe$phenotype_description), value=TRUE)

# Foreground genes are those with an annotated phenotype related to "pectoral fin" 
myGenes <- unique(universe$ensembl_gene_id[universe$phenotype_description %in% phenotypes])

# Prepare the gene list vector 
geneList <- factor(as.integer(unique(universe$ensembl_gene_id) %in% myGenes))
names(geneList) <- unique(universe$ensembl_gene_id)
summary(geneList)

# Prepare the topGO object
myTopAnatObject <-  topAnat(myTopAnatData, geneList)

The above code using the biomaRt package is not executed in this vignette to prevent building issues of our package in case of biomaRt downtime. Instead we use a geneList object saved in the data/ folder that we built using pre-downloaded data.

data(geneList)
myTopAnatObject <-  topAnat(myTopAnatData, geneList)

## 
## Checking topAnatData object.............
## 
## Checking gene list......................
## 
## Building most specific Ontology terms...  (  1174  Ontology terms found. )
## 
## Building DAG topology...................  (  2036  Ontology terms and  3883  relations. )
## 
## Annotating nodes (Can be long)..........  (  3005  genes annotated to the Ontology terms. )
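
The shape expected for geneList can be seen on a toy example. The gene identifiers below are invented; only the structure matters: a factor with levels 0 and 1, named by the background genes, where 1 marks the foreground.

```r
# Invented identifiers standing in for Ensembl gene IDs
background <- c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E")
foreground <- c("GENE_B", "GENE_E")

# 1 for foreground genes, 0 for the rest of the background
toy_gene_list <- factor(as.integer(background %in% foreground))
names(toy_gene_list) <- background

toy_gene_list
summary(toy_gene_list)
```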

Warning: This can be long, especially if the gene list is large, since the Uberon anatomical ontology is large and expression calls will be propagated through the whole ontology (e.g., expression in the forebrain will also be counted as expression in parent structures such as the brain, nervous system, etc). Consider running a script in batch mode if you have multiple analyses to do.
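
The propagation mentioned in the warning can be sketched on a miniature ontology. Everything below is a toy: the three-term hierarchy (forebrain within brain within nervous system, taken from the example in the text), the gene names and the helper ancestors_of() are made up for illustration, and the real computation is done by topGO on the full Uberon graph.

```r
# child -> parent relations of a toy ontology
parent_of <- c(forebrain = "brain", brain = "nervous system")

# hypothetical helper: walk up the hierarchy and collect all ancestors of a term
ancestors_of <- function(term) {
  found <- character(0)
  while (term %in% names(parent_of)) {
    term <- parent_of[[term]]
    found <- c(found, term)
  }
  found
}

# direct annotations: gene -> most specific term with expression
direct <- list(GENE_A = "forebrain", GENE_B = "brain")

# propagated annotations: the direct term plus all of its ancestors
propagated <- lapply(direct, function(term) c(term, ancestors_of(term)))
propagated

# number of genes counted for each term after propagation
genes_per_term <- table(unlist(propagated))
genes_per_term
```

Even in this tiny case the parent terms accumulate genes from their children, which is why the annotation step grows with the size of the ontology.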

Launch the enrichment test

For this step, see the vignette of the topGO package for more details, as you have to directly use the tests implemented in the topGO package, as shown in this example:

results <- runTest(myTopAnatObject, algorithm = 'weight', statistic = 'fisher')

## 
##           -- Weight Algorithm -- 
## 
##       The algorithm is scoring 1007 nontrivial nodes
##       parameters: 
##           test statistic: fisher : ratio

## 
##   Level 27:  1 nodes to be scored.

## 
##   Level 26:  1 nodes to be scored.

## 
##   Level 25:  1 nodes to be scored.

## 
##   Level 24:  4 nodes to be scored.

## 
##   Level 23:  4 nodes to be scored.

## 
##   Level 22:  5 nodes to be scored.

## 
##   Level 21:  4 nodes to be scored.

## 
##   Level 20:  8 nodes to be scored.

## 
##   Level 19:  23 nodes to be scored.

## 
##   Level 18:  23 nodes to be scored.

## 
##   Level 17:  27 nodes to be scored.

## 
##   Level 16:  39 nodes to be scored.

## 
##   Level 15:  63 nodes to be scored.

## 
##   Level 14:  63 nodes to be scored.

## 
##   Level 13:  74 nodes to be scored.

## 
##   Level 12:  95 nodes to be scored.

## 
##   Level 11:  120 nodes to be scored.

## 
##   Level 10:  115 nodes to be scored.

## 
##   Level 9:   92 nodes to be scored.

## 
##   Level 8:   75 nodes to be scored.

## 
##   Level 7:   67 nodes to be scored.

## 
##   Level 6:   44 nodes to be scored.

## 
##   Level 5:   27 nodes to be scored.

## 
##   Level 4:   21 nodes to be scored.

## 
##   Level 3:   6 nodes to be scored.

## 
##   Level 2:   4 nodes to be scored.

## 
##   Level 1:   1 nodes to be scored.

Warning: This can be long because of the size of the ontology. Consider running scripts in batch mode if you have multiple analyses to do.

Format the table of results after an enrichment test for anatomical terms

The makeTable function filters and formats the test results, and calculates FDR values.

# Display results significant at a 1% FDR threshold
tableOver <- makeTable(myTopAnatData, myTopAnatObject, results, cutoff = 0.01)

## 
## Building the results table for the 27 significant terms at FDR threshold of 0.01...
## Ordering results by pValue column in increasing order...
## Done

head(tableOver)

##                       organId                organName annotated significant
## UBERON:0000151 UBERON:0000151             pectoral fin       439          79
## UBERON:0004357 UBERON:0004357      paired limb/fin bud       198          48
## UBERON:2000040 UBERON:2000040          median fin fold        59          20
## UBERON:0003051 UBERON:0003051              ear vesicle       391          49
## UBERON:0005729 UBERON:0005729 pectoral appendage field        20          11
## UBERON:0004376 UBERON:0004376                 fin bone        34          12
##                expected foldEnrichment       pValue          FDR
## UBERON:0000151    21.48       3.677840 1.358300e-27 1.471039e-24
## UBERON:0004357     9.69       4.953560 5.187251e-23 2.808896e-20
## UBERON:2000040     2.89       6.920415 9.370662e-13 3.382809e-10
## UBERON:0003051    19.13       2.561422 5.501734e-11 1.489595e-08
## UBERON:0005729     0.98      11.224490 3.052286e-10 6.611252e-08
## UBERON:0004376     1.66       7.228916 2.603199e-08 4.698774e-06

At the time of building this vignette (June 2018), there were 27 significant anatomical structures. The first term is “pectoral fin”, and the second “paired limb/fin bud”. Other terms in the list, especially those with high enrichment folds, are clearly related to pectoral fins or substructures of fins. This analysis shows that genes with phenotypic effects on pectoral fins are specifically expressed in or next to these structures.

By default results are sorted by p-value, but this can be changed with the ordering parameter by specifying which column should be used to order the results (preceded by a “-” sign to indicate that ordering should be made in decreasing order). For example, it is often convenient to sort significant structures by decreasing enrichment fold, using ordering = -6. The full table of results can be obtained using cutoff = 1.
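
The foldEnrichment column is the ratio of the significant and expected columns, which can be checked by hand. The sketch below copies the six rows printed above into a plain data frame (it does not call makeTable()) and then sorts them by decreasing enrichment fold, the ordering suggested in the text.

```r
# Values copied from the head(tableOver) output shown above
top_terms <- data.frame(
  organId = c("UBERON:0000151", "UBERON:0004357", "UBERON:2000040",
              "UBERON:0003051", "UBERON:0005729", "UBERON:0004376"),
  significant = c(79, 48, 20, 49, 11, 12),
  expected = c(21.48, 9.69, 2.89, 19.13, 0.98, 1.66),
  foldEnrichment = c(3.677840, 4.953560, 6.920415,
                     2.561422, 11.224490, 7.228916),
  stringsAsFactors = FALSE
)

# fold enrichment recomputed as observed / expected
recomputed <- top_terms$significant / top_terms$expected

# largest absolute difference to the reported column
max(abs(recomputed - top_terms$foldEnrichment))

# same rows, sorted by decreasing enrichment fold
by_fold <- top_terms[order(-top_terms$foldEnrichment), ]
by_fold$organId
```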

Of note, it is possible to retrieve for a particular tissue the significant genes that were mapped to it.

# In order to retrieve significant genes mapped to the term "paired limb/fin bud"
term <- "UBERON:0004357"
termStat(myTopAnatObject, term)

##                Annotated Significant Expected
## UBERON:0004357       198          48     9.69

# 198 genes mapped to this term for Bgee 14.0 and Ensembl 84
genesInTerm(myTopAnatObject, term)

## $`UBERON:0004357`
##   [1] "ENSDARG00000001057" "ENSDARG00000001785" "ENSDARG00000002445"
##   [4] "ENSDARG00000002795" "ENSDARG00000002933" "ENSDARG00000002952"
##   [7] "ENSDARG00000003293" "ENSDARG00000003399" "ENSDARG00000004954"
##  [10] "ENSDARG00000005479" "ENSDARG00000005645" "ENSDARG00000005762"
##  [13] "ENSDARG00000006120" "ENSDARG00000006514" "ENSDARG00000007407"
##  [16] "ENSDARG00000007438" "ENSDARG00000007641" "ENSDARG00000008305"
##  [19] "ENSDARG00000008886" "ENSDARG00000009534" "ENSDARG00000011027"
##  [22] "ENSDARG00000011407" "ENSDARG00000011618" "ENSDARG00000012078"
##  [25] "ENSDARG00000012422" "ENSDARG00000012824" "ENSDARG00000013409"
##  [28] "ENSDARG00000013853" "ENSDARG00000013881" "ENSDARG00000014091"
##  [31] "ENSDARG00000014259" "ENSDARG00000014329" "ENSDARG00000014626"
##  [34] "ENSDARG00000014634" "ENSDARG00000014796" "ENSDARG00000015554"
##  [37] "ENSDARG00000015674" "ENSDARG00000016022" "ENSDARG00000016454"
##  [40] "ENSDARG00000016858" "ENSDARG00000017219" "ENSDARG00000018025"
##  [43] "ENSDARG00000018426" "ENSDARG00000018460" "ENSDARG00000018492"
##  [46] "ENSDARG00000018693" "ENSDARG00000018902" "ENSDARG00000019260"
##  [49] "ENSDARG00000019353" "ENSDARG00000019579" "ENSDARG00000019838"
##  [52] "ENSDARG00000019995" "ENSDARG00000020527" "ENSDARG00000021389"
##  [55] "ENSDARG00000021442" "ENSDARG00000021938" "ENSDARG00000022280"
##  [58] "ENSDARG00000024561" "ENSDARG00000024894" "ENSDARG00000025081"
##  [61] "ENSDARG00000025147" "ENSDARG00000025375" "ENSDARG00000025641"
##  [64] "ENSDARG00000025891" "ENSDARG00000028071" "ENSDARG00000029764"
##  [67] "ENSDARG00000030110" "ENSDARG00000030756" "ENSDARG00000030932"
##  [70] "ENSDARG00000031222" "ENSDARG00000031809" "ENSDARG00000031894"
##  [73] "ENSDARG00000031952" "ENSDARG00000033327" "ENSDARG00000033616"
##  [76] "ENSDARG00000034375" "ENSDARG00000035559" "ENSDARG00000035648"
##  [79] "ENSDARG00000036254" "ENSDARG00000036558" "ENSDARG00000037109"
##  [82] "ENSDARG00000037556" "ENSDARG00000037675" "ENSDARG00000037677"
##  [85] "ENSDARG00000038006" "ENSDARG00000038428" "ENSDARG00000038672"
##  [88] "ENSDARG00000038879" "ENSDARG00000038990" "ENSDARG00000038991"
##  [91] "ENSDARG00000040534" "ENSDARG00000040764" "ENSDARG00000041430"
##  [94] "ENSDARG00000041609" "ENSDARG00000041706" "ENSDARG00000041799"
##  [97] "ENSDARG00000042233" "ENSDARG00000042296" "ENSDARG00000043130"
## [100] "ENSDARG00000043559" "ENSDARG00000043923" "ENSDARG00000044511"
## [103] "ENSDARG00000044574" "ENSDARG00000052131" "ENSDARG00000052139"
## [106] "ENSDARG00000052344" "ENSDARG00000052494" "ENSDARG00000052652"
## [109] "ENSDARG00000053479" "ENSDARG00000053493" "ENSDARG00000054026"
## [112] "ENSDARG00000054030" "ENSDARG00000054619" "ENSDARG00000055026"
## [115] "ENSDARG00000055027" "ENSDARG00000055381" "ENSDARG00000055398"
## [118] "ENSDARG00000056995" "ENSDARG00000057830" "ENSDARG00000058115"
## [121] "ENSDARG00000058543" "ENSDARG00000058822" "ENSDARG00000059073"
## [124] "ENSDARG00000059233" "ENSDARG00000059276" "ENSDARG00000059279"
## [127] "ENSDARG00000059437" "ENSDARG00000060397" "ENSDARG00000060808"
## [130] "ENSDARG00000061328" "ENSDARG00000061345" "ENSDARG00000061600"
## [133] "ENSDARG00000062824" "ENSDARG00000068365" "ENSDARG00000068567"
## [136] "ENSDARG00000068732" "ENSDARG00000069105" "ENSDARG00000069473"
## [139] "ENSDARG00000069763" "ENSDARG00000069922" "ENSDARG00000070069"
## [142] "ENSDARG00000070670" "ENSDARG00000071336" "ENSDARG00000071560"
## [145] "ENSDARG00000071699" "ENSDARG00000073814" "ENSDARG00000074378"
## [148] "ENSDARG00000074597" "ENSDARG00000075713" "ENSDARG00000076010"
## [151] "ENSDARG00000076554" "ENSDARG00000076566" "ENSDARG00000076856"
## [154] "ENSDARG00000077121" "ENSDARG00000077353" "ENSDARG00000077473"
## [157] "ENSDARG00000078696" "ENSDARG00000078784" "ENSDARG00000078864"
## [160] "ENSDARG00000079027" "ENSDARG00000079570" "ENSDARG00000079922"
## [163] "ENSDARG00000079964" "ENSDARG00000080453" "ENSDARG00000087196"
## [166] "ENSDARG00000089805" "ENSDARG00000090820" "ENSDARG00000091161"
## [169] "ENSDARG00000092136" "ENSDARG00000092809" "ENSDARG00000095743"
## [172] "ENSDARG00000095859" "ENSDARG00000098359" "ENSDARG00000099088"
## [175] "ENSDARG00000099175" "ENSDARG00000099458" "ENSDARG00000099996"
## [178] "ENSDARG00000100236" "ENSDARG00000100312" "ENSDARG00000100725"
## [181] "ENSDARG00000101076" "ENSDARG00000101199" "ENSDARG00000101209"
## [184] "ENSDARG00000101244" "ENSDARG00000101701" "ENSDARG00000101766"
## [187] "ENSDARG00000101831" "ENSDARG00000102470" "ENSDARG00000102750"
## [190] "ENSDARG00000102824" "ENSDARG00000102995" "ENSDARG00000103432"
## [193] "ENSDARG00000103515" "ENSDARG00000103754" "ENSDARG00000104353"
## [196] "ENSDARG00000104815" "ENSDARG00000105230" "ENSDARG00000105357"

# 48 significant genes mapped to this term for Bgee 14.0 and Ensembl 84
annotated <- genesInTerm(myTopAnatObject, term)[["UBERON:0004357"]]
annotated[annotated %in% sigGenes(myTopAnatObject)]

##  [1] "ENSDARG00000002445" "ENSDARG00000002952" "ENSDARG00000003293"
##  [4] "ENSDARG00000008305" "ENSDARG00000011407" "ENSDARG00000012824"
##  [7] "ENSDARG00000013853" "ENSDARG00000013881" "ENSDARG00000014091"
## [10] "ENSDARG00000018426" "ENSDARG00000018693" "ENSDARG00000018902"
## [13] "ENSDARG00000019260" "ENSDARG00000019353" "ENSDARG00000019838"
## [16] "ENSDARG00000021389" "ENSDARG00000024894" "ENSDARG00000028071"
## [19] "ENSDARG00000030932" "ENSDARG00000031894" "ENSDARG00000036254"
## [22] "ENSDARG00000037677" "ENSDARG00000038006" "ENSDARG00000038672"
## [25] "ENSDARG00000041799" "ENSDARG00000042233" "ENSDARG00000042296"
## [28] "ENSDARG00000043559" "ENSDARG00000043923" "ENSDARG00000053493"
## [31] "ENSDARG00000054619" "ENSDARG00000058543" "ENSDARG00000060397"
## [34] "ENSDARG00000068567" "ENSDARG00000069473" "ENSDARG00000071336"
## [37] "ENSDARG00000073814" "ENSDARG00000076856" "ENSDARG00000077121"
## [40] "ENSDARG00000077353" "ENSDARG00000079027" "ENSDARG00000079570"
## [43] "ENSDARG00000087196" "ENSDARG00000095859" "ENSDARG00000099088"
## [46] "ENSDARG00000100312" "ENSDARG00000101831" "ENSDARG00000105357"

Warning: it is debated whether FDR correction is appropriate on enrichment test results, since tests on different terms of the ontologies are not independent. A nice discussion can be found in the vignette of the topGO package.
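
For readers who want to see what the correction does, the FDR values printed in the table above are numerically consistent with a Benjamini-Hochberg adjustment over 1083 tests. That number of tests is an assumption: it is inferred here from the ratio between the FDR and pValue of the top term and is not stated in the output. Under that assumption, base R's p.adjust() reproduces the six reported values:

```r
# p-values and FDR values copied from the head(tableOver) output shown above
p_values <- c(1.358300e-27, 5.187251e-23, 9.370662e-13,
              5.501734e-11, 3.052286e-10, 2.603199e-08)
reported_fdr <- c(1.471039e-24, 2.808896e-20, 3.382809e-10,
                  1.489595e-08, 6.611252e-08, 4.698774e-06)

# n = 1083 is an assumption inferred from FDR / pValue of the top term
adjusted <- p.adjust(p_values, method = "BH", n = 1083)

# largest relative difference to the reported FDR column
max(abs(adjusted - reported_fdr) / reported_fdr)
```

The Benjamini-Hochberg procedure scales each p-value by the number of tests divided by its rank, so it treats the terms as a flat list and ignores the parent-child structure of the ontology, which is the concern raised in the warning.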

Store expression data locally

Since version 2.14.0 (Bioconductor 3.11), BgeeDB stores downloaded expression data in a local RSQLite database. The advantages of this approach compared to the one used in previous BgeeDB versions are:
* a server with a lot of memory is no longer needed to access a subset of a huge dataset (e.g., the GTEx dataset)
* more granular filtering using arguments of the getData() function
* the same data are not downloaded twice
* fast access to data once integrated in the local database

This approach comes with some drawbacks:
* the local SQLite database uses more disk space than the previous compressed .rds approach
* the first access to a dataset takes more time (integration into the local SQLite database is time consuming)

It is possible to remove .rds files generated by previous versions of BgeeDB, which are no longer used since version 2.14.0. The function below deletes all .rds files for the selected species, for all data types.

bgee <- Bgee$new(species="Mus_musculus", release = "14.1")
# delete all old .rds files of species Mus musculus
deleteOldData(bgee)

As the new SQLite approach uses more disk space, it is now possible to delete all local data of one species from one release of Bgee.

bgee <- Bgee$new(species="Mus_musculus", release = "14.1")
# delete local SQLite database of species Mus musculus from Bgee 14.1
deleteLocalData(bgee)

Andrea Komljenovic, Julien Roux, Marc Robinson-Rechavi, Frederic Bastian

2020-10-27