The ExperimentHub server provides easy R / Bioconductor access to large files of data.
The ExperimentHub package provides a client interface to resources stored at the ExperimentHub web service. Its functionality is similar to that of the AnnotationHub package.
library(ExperimentHub)
The ExperimentHub package is straightforward to use. Create an ExperimentHub object:
eh = ExperimentHub()
## snapshotDate(): 2017-10-30
At this point you have already done everything needed to start retrieving experiment data. For most operations, using the ExperimentHub object should feel a lot like working with a familiar list or data.frame, and it has all of the functionality of a Hub object such as the AnnotationHub package's AnnotationHub object.
Let's take a minute to look at the show method for the hub object eh:
eh
## ExperimentHub with 866 records
## # snapshotDate(): 2017-10-30
## # $dataprovider: Eli and Edythe L. Broad Institute of Harvard and MIT, De...
## # $species: Homo Sapiens, Homo sapiens, Mus Musculus (E18 mice), Mus Musc...
## # $rdataclass: ExpressionSet, SummarizedExperiment, RaggedExperiment, Dat...
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH1"]]'
##
## title
## EH1 | RNA-Sequencing and clinical data for 7706 tumor samples from ...
## EH164 | RNA-Sequencing and clinical data for 9246 tumor samples from ...
## EH165 | RNA-Sequencing and clinical data for 741 normal samples from ...
## EH166 | ERR188297
## EH167 | ERR188088
## ... ...
## EH1032 | SKCM_Methylation-20160128
## EH1033 | SKCM_miRNASeqGene-20160128
## EH1034 | SKCM_Mutation-20160128
## EH1035 | SKCM_RNASeq2GeneNorm-20160128
## EH1036 | SKCM_RPPAArray-20160128
The summary gives an idea of the different types of data present in the hub. You can see where the data come from (dataprovider), which species have samples present (species), and what kinds of R data objects could be returned (rdataclass). We can take a closer look at all the data providers by looking at the contents of dataprovider as if it were a column of a data.frame, like this:
unique(eh$dataprovider)
## [1] "GEO"
## [2] "GEUVADIS"
## [3] "Allen Brain Atlas"
## [4] "ArrayExpress"
## [5] "Department of Psychology, Abdul Haq Campus, Federal Urdu University for Arts, Science and Technology, Karachi, Pakistan. shahiq_psy@yahoo.com"
## [6] "Department of Chemical and Biological Engineering, Chalmers University of Technology, SE-412 96 Gothenburg, Sweden."
## [7] "INRA, Institut National de la Recherche Agronomique, US1367 Metagenopolis, 78350 Jouy en Josas, France."
## [8] "Institute of Microbiology and Infection, University of Birmingham, Birmingham, England."
## [9] "1] Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark. [2] Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark. [3]."
## [10] "1] Department of Anthropology, University of Oklahoma, Dale Hall Tower, 521 Norman, Oklahoma 73019, USA [2] Universidad Científica del Sur, Lima 18, Perú [3] City of Hope, NCI-designated Comprehensive Cancer Center, Duarte, California 91010, USA."
## [11] "Translational and Functional Genomics Branch, National Human Genome Research Institute, NIH, Bethesda, Maryland 20892, USA."
## [12] "BGI-Shenzhen, Shenzhen 518083, China."
## [13] "1] State Key Laboratory for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, College of Medicine, Zhejiang University, 310003 Hangzhou, China [2] Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang University, 310003 Hangzhou, China [3]."
## [14] "Department of Pharmacy and Biotechnology, University of Bologna, Bologna 40126, Italy."
## [15] "NA"
## [16] "Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany."
## [17] "Centre for Integrative Biology, University of Trento, Trento, Italy."
## [18] "Computational Biology Institute, George Washington University , Ashburn, VA , USA ; Center for Bioinformatics and Integrative Biology, Universidad Andres Bello, Facultad de Ciencias Biologicas , Santiago , Chile."
## [19] "Genome Institute of Singapore, Singapore 138672, Singapore."
## [20] "[1] BGI-Shenzhen, Shenzhen 518083, China [2] Department of Biology, University of Copenhagen, Ole Maaloes Vej 5, 2200 Copenhagen, Denmark."
## [21] "Luxembourg Centre for Systems Biomedicine, 7 avenue des Hauts-Fourneaux, 4362 Esch-sur-Alzette, Luxembourg."
## [22] "Key Laboratory of Dairy Biotechnology and Engineering, Education Ministry of P. R. China, Department of Food Science and Engineering, Inner Mongolia Agricultural University, Hohhot 010018, China."
## [23] "[1] Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark. [2] Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark. [3]."
## [24] "[1] Department of Anthropology, University of Oklahoma, Dale Hall Tower, 521 Norman, Oklahoma 73019, USA [2] Universidad Cientifica del Sur, Lima 18, Peru [3] City of Hope, NCI-designated Comprehensive Cancer Center, Duarte, California 91010, USA."
## [25] "Centre de Recherche en Infectiologie, CHU de Quebec-Universite Laval, Quebec, Canada."
## [26] "Department of Microbiology and Immunology, McGill University, Montreal, Quebec, Canada."
## [27] "Division of Cancer Epidemiology & Genetics, National Cancer Institute, Bethesda, Maryland, United States of America."
## [28] "BGI-Shenzhen, Shenzhen 518083, China; China National Genebank-Shenzhen, BGI-Shenzhen, Shenzhen 518083, China."
## [29] "Department of Medicine & Therapeutics, State Key Laboratory of Digestive Disease, Institute of Digestive Disease, LKS Institute of Health Sciences, CUHK Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong."
## [30] "[1] State Key Laboratory for Diagnosis and Treatment of Infectious Disease, The First Affiliated Hospital, College of Medicine, Zhejiang University, 310003 Hangzhou, China [2] Collaborative Innovation Center for Diagnosis and Treatment of Infectious Diseases, Zhejiang University, 310003 Hangzhou, China [3]."
## [31] "1000 Genomes Project"
## [32] "yriMulti"
## [33] "10x Genomics"
## [34] "Illumina 450 methylation assay"
## [35] "GTex"
## [36] "Eli and Edythe L. Broad Institute of Harvard and MIT"
## [37] "Harmonized Cancer Datasets Genomic Data Commons Data Portal"
## [38] "[1] 1] BGI-Shenzhen, Shenzhen, China. [2] BGI Hong Kong Research Institute, Hong Kong, China. [3] School of Bioscience and Biotechnology, South China University of Technology, Guangzhou, China. [4]., [2] 1] BGI-Shenzhen, Shenzhen, China. [2]., [3] 1] BGI-Shenzhen, Shenzhen, China. [2] Department of Biology, University of Copenhagen, Copenhagen, Denmark. [3]., [4] European Molecular Biology Laboratory, Heidelberg, Germany., [5] 1] BGI-Shenzhen, Shenzhen, China. [2] European Molecular Biology Laboratory, Heidelberg, Germany. [3] The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark., [6] INRA, Institut National de la Recherche Agronomique, Metagenopolis, Jouy en Josas, France., [7] The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark., [8] Center for Biological Sequence Analysis, Technical University of Denmark, Kongens Lyngby, Denmark., [9] Digestive System Research Unit, University Hospital Vall d'Hebron, Ciberehd, Barcelona, Spain., [10] BGI-Shenzhen, Shenzhen, China., [11] 1] Department of Genetic Medicine, Faculty of Medicine, King Abdulaziz University (KAU), Jeddah, Saudi Arabia. [2] Princess Al-Jawhara AlBrahim Centre of Excellence in Research of Hereditary Disorders (PACER-HD), Faculty of Medicine, KAU, Jeddah, Saudi Arabia., [12] 1] Princess Al-Jawhara AlBrahim Centre of Excellence in Research of Hereditary Disorders (PACER-HD), Faculty of Medicine, KAU, Jeddah, Saudi Arabia. [2] Department of Biological Sciences, Faculty of Science, King Abdulaziz University (KAU), Jeddah, Saudi Arabia., [13] 1] BGI-Shenzhen, Shenzhen, China. [2] Princess Al-Jawhara AlBrahim Centre of Excellence in Research of Hereditary Disorders (PACER-HD), Faculty of Medicine, KAU, Jeddah, Saudi Arabia. [3] James D. 
Watson Institute of Genome Science, Hangzhou, China., [14] 1] BGI-Shenzhen, Shenzhen, China. [2] James D. Watson Institute of Genome Science, Hangzhou, China., [15] Department of Biology, University of Copenhagen, Copenhagen, Denmark., [16] NA, [17] 1] INRA, Institut National de la Recherche Agronomique, Metagenopolis, Jouy en Josas, France. [2] Centre for Host-Microbiome Interactions, Dental Institute Central Office, King's College London, Guy's Hospital, London Bridge, UK., [18] NA, [19] 1] BGI-Shenzhen, Shenzhen, China. [2] Department of Biology, University of Copenhagen, Copenhagen, Denmark. [3] The Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark. [4] Princess Al-Jawhara AlBrahim Centre of Excellence in Research of Hereditary Disorders (PACER-HD), Faculty of Medicine, KAU, Jeddah, Saudi Arabia. [5] Macau University of Science and Technology, Macau, China."
In the same way, you can also see data from different species inside the hub by looking at the contents of species like this:
head(unique(eh$species))
## [1] "Homo sapiens" "Mus musculus"
## [3] "Homo Sapiens" "Mus musculus (E18 mice)"
## [5] "Mus Musculus (E18 mice)" "Mus Musculus"
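The output above shows the same organism recorded under several spellings ("Homo sapiens" and "Homo Sapiens", for example), so an exact string comparison on species can miss records. The following is a minimal base-R sketch of one way to normalise such labels before comparing them. The vector simply copies the six values printed above, and the helper name normalise_species is hypothetical, not part of ExperimentHub.

```r
# Species labels copied from the head(unique(eh$species)) output above.
species <- c("Homo sapiens", "Mus musculus", "Homo Sapiens",
             "Mus musculus (E18 mice)", "Mus Musculus (E18 mice)",
             "Mus Musculus")

# Hypothetical helper: lower-case the label and drop a trailing
# parenthesised qualifier such as " (E18 mice)".
normalise_species <- function(x) {
    x <- tolower(x)
    trimws(sub("[[:space:]]*\\(.*\\)$", "", x))
}

normalised <- normalise_species(species)

# Distinct organisms after normalisation.
unique(normalised)
```

On a live hub the same helper could presumably be applied to eh$species to group records by organism regardless of how the label was capitalised.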
This also works for any of the other types of metadata present. You can learn which kinds of metadata are available by hitting the tab key after typing 'eh$'. In this way you can explore what kinds of data are present in the hub right from the command line. This interface also allows you to access the hub programmatically to extract data that match a particular set of criteria.
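As a sketch of such programmatic access, the metadata columns can be listed with names(mcols(eh)) and then used to build a logical index into the hub. The choice of "Homo sapiens" is only an example taken from the species values shown above; %in% is used so that missing values never produce NA in the index. Running this requires network access to the hub server.

```r
library(ExperimentHub)

eh <- ExperimentHub()

# Names of the metadata columns that can be accessed with 'eh$'.
names(mcols(eh))

# Logical index: TRUE for records whose species is exactly "Homo sapiens".
keep <- eh$species %in% "Homo sapiens"

# Subsetting with a logical vector returns a smaller hub object.
hs <- eh[keep]
length(hs)
```

Because the comparison is exact, records labelled "Homo Sapiens" are not selected by this index; query(), described later, matches case-insensitively.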
Another valuable type of metadata to pay attention to is rdataclass.
head(unique(eh$rdataclass))
## [1] "ExpressionSet" "SummarizedExperiment"
## [3] "GAlignmentPairs" "CellMapperList"
## [5] "gds.class" "RangedSummarizedExperiment"
The rdataclass allows you to see which kinds of R objects the hub will return to you. This information is valuable both as a means to filter results and as a way to explore and learn about the kinds of ExperimentHub objects that are widely available for the project. Right now this is a fairly short list, but it should grow over time as the hub supports more kinds of ExperimentHub objects.
Now let's try getting the data files associated with the alpineData package using the query method. The query method lets you search rows for specific strings, returning an ExperimentHub instance with just the rows matching the query. The preparerclass column of the metadata records which package is associated with the ExperimentHub data.
The records associated with alpineData can be retrieved with:
apData <- query(eh, "alpineData")
apData
## ExperimentHub with 4 records
## # snapshotDate(): 2017-10-30
## # $dataprovider: GEUVADIS
## # $species: Homo sapiens
## # $rdataclass: GAlignmentPairs
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH166"]]'
##
## title
## EH166 | ERR188297
## EH167 | ERR188088
## EH168 | ERR188204
## EH169 | ERR188317
The query has worked: the only records present are those provided by the alpineData package.
Individual metadata columns are available through $, and the full metadata table underlying this hub object can be retrieved with mcols():
apData$preparerclass
## [1] "alpineData" "alpineData" "alpineData" "alpineData"
df <- mcols(apData)
By default the show method displays only the first 5 and last 5 rows, but there are hundreds of records in the hub:
length(eh)
## [1] 866
Let's look at another example, where we pull down only the hub records that match “mus musculus”.
mm <- query(eh, "mus musculus")
mm
## ExperimentHub with 5 records
## # snapshotDate(): 2017-10-30
## # $dataprovider: 10x Genomics, ArrayExpress, GEO
## # $species: Mus Musculus (E18 mice), Mus Musculus, Mus musculus, Mus musc...
## # $rdataclass: RangedSummarizedExperiment, CellMapperList, DataFrame
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH173"]]'
##
## title
## EH173 | Pre-processed microarray data from the Affymetrix MG-U74Av2 p...
## EH552 | st100k
## EH553 | st400k
## EH554 | full_1Mneurons
## EH557 | tasicST6
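query() also accepts a character vector of patterns. The sketch below assumes the default behaviour inherited from AnnotationHub, in which matching is case-insensitive and every pattern must match (pattern.op = `&`). The second search term is only an illustration taken from the dataprovider values shown above, and the code needs network access to the hub server.

```r
library(ExperimentHub)

eh <- ExperimentHub()

# Single pattern, as in the example above.
mm <- query(eh, "mus musculus")

# Two patterns: a record is kept only if both match its metadata.
mm10x <- query(eh, c("mus musculus", "10x Genomics"))

# Adding a pattern can only narrow the result.
length(mm)
length(mm10x)
mm10x$title
```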
We can also look at the ExperimentHub object in a browser using the display() function. We can then filter the ExperimentHub object using the Global search field in the top right corner of the page or the in-column search fields.
d <- display(eh)
Using ExperimentHub to retrieve data
Looking back at our alpineData file example, if we are interested in the first file, we can get its metadata using
apData
## ExperimentHub with 4 records
## # snapshotDate(): 2017-10-30
## # $dataprovider: GEUVADIS
## # $species: Homo sapiens
## # $rdataclass: GAlignmentPairs
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH166"]]'
##
## title
## EH166 | ERR188297
## EH167 | ERR188088
## EH168 | ERR188204
## EH169 | ERR188317
apData["EH166"]
## ExperimentHub with 1 record
## # snapshotDate(): 2017-10-30
## # names(): EH166
## # package(): alpineData
## # $dataprovider: GEUVADIS
## # $species: Homo sapiens
## # $rdataclass: GAlignmentPairs
## # $rdatadateadded: 2016-07-21
## # $title: ERR188297
## # $description: Subset of aligned reads from sample ERR188297
## # $taxonomyid: 9606
## # $genome: GRCh38
## # $sourcetype: FASTQ
## # $sourceurl: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR188/ERR188297/ERR1882...
## # $sourcesize: NA
## # $tags: c("Sequencing", "RNASeq", "GeneExpression",
## # "Transcription")
## # retrieve record with 'object[["EH166"]]'
We can download the file using
apData[["EH166"]]
## see ?alpineData and browseVignettes('alpineData') for documentation
## loading from cache '/home/biocbuild//.ExperimentHub/166'
## Loading required package: GenomicAlignments
## Loading required package: S4Vectors
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:base':
##
## expand.grid
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: SummarizedExperiment
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'Biobase'
## The following object is masked from 'package:ExperimentHub':
##
## cache
## The following object is masked from 'package:AnnotationHub':
##
## cache
## Loading required package: DelayedArray
## Loading required package: matrixStats
##
## Attaching package: 'matrixStats'
## The following objects are masked from 'package:Biobase':
##
## anyMissing, rowMedians
##
## Attaching package: 'DelayedArray'
## The following objects are masked from 'package:matrixStats':
##
## colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
## The following object is masked from 'package:base':
##
## apply
## Loading required package: Biostrings
## Loading required package: XVector
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:DelayedArray':
##
## type
## The following object is masked from 'package:base':
##
## strsplit
## Loading required package: Rsamtools
## GAlignmentPairs object with 25531 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges --
## <Rle> <Rle> : <IRanges> --
## [1] 1 + : [108560389, 108560463] --
## [2] 1 - : [108560454, 108560528] --
## [3] 1 + : [108560534, 108600608] --
## [4] 1 - : [108569920, 108569994] --
## [5] 1 - : [108587954, 108588028] --
## ... ... ... ... ... ...
## [25527] X + : [119790596, 119790670] --
## [25528] X + : [119790988, 119791062] --
## [25529] X + : [119791037, 119791111] --
## [25530] X + : [119791348, 119791422] --
## [25531] X + : [119791376, 119791450] --
## ranges
## <IRanges>
## [1] [108560454, 108560528]
## [2] [108560383, 108560457]
## [3] [108600626, 108606236]
## [4] [108569825, 108569899]
## [5] [108587881, 108587955]
## ... ...
## [25527] [119790717, 119790791]
## [25528] [119791086, 119791160]
## [25529] [119791142, 119791216]
## [25530] [119791475, 119791549]
## [25531] [119791481, 119791555]
## -------
## seqinfo: 194 sequences from an unspecified genome
Each file is retrieved from the ExperimentHub server and is also cached locally, so the next time you need it, it is loaded from the local cache rather than downloaded again.
ExperimentHub objects
When you create the ExperimentHub object, it is set up with some default settings. See ?ExperimentHub for ways to customize the hub source, the local cache, and other instance-specific options, and ?getExperimentHubOption to get or set package-global options for use across sessions.
If you look at the object you will see some helpful information about it such as where the data is cached and where online the hub server is set to.
eh
## ExperimentHub with 866 records
## # snapshotDate(): 2017-10-30
## # $dataprovider: Eli and Edythe L. Broad Institute of Harvard and MIT, De...
## # $species: Homo Sapiens, Homo sapiens, Mus Musculus (E18 mice), Mus Musc...
## # $rdataclass: ExpressionSet, SummarizedExperiment, RaggedExperiment, Dat...
## # additional mcols(): taxonomyid, genome, description,
## # coordinate_1_based, maintainer, rdatadateadded, preparerclass,
## # tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["EH1"]]'
##
## title
## EH1 | RNA-Sequencing and clinical data for 7706 tumor samples from ...
## EH164 | RNA-Sequencing and clinical data for 9246 tumor samples from ...
## EH165 | RNA-Sequencing and clinical data for 741 normal samples from ...
## EH166 | ERR188297
## EH167 | ERR188088
## ... ...
## EH1032 | SKCM_Methylation-20160128
## EH1033 | SKCM_miRNASeqGene-20160128
## EH1034 | SKCM_Mutation-20160128
## EH1035 | SKCM_RNASeq2GeneNorm-20160128
## EH1036 | SKCM_RPPAArray-20160128
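In the output shown above, the cache directory and server address are not printed explicitly. The following is a minimal sketch of asking for them directly. It assumes the hubCache() and hubUrl() accessors that ExperimentHub objects inherit from AnnotationHub, and the "CACHE" and "URL" option names described in ?getExperimentHubOption; check the help pages of your installed version before relying on them.

```r
library(ExperimentHub)

eh <- ExperimentHub()

# Instance-level settings: where this hub object caches files locally
# and which server it talks to.
cache_dir <- hubCache(eh)
server <- hubUrl(eh)

# Package-global defaults used when a new hub object is created.
default_cache <- getExperimentHubOption("CACHE")
default_url <- getExperimentHubOption("URL")

cache_dir
server
```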
By default the ExperimentHub object is set to the latest snapshotDate and a snapshot version that matches the version of Bioconductor that you are using. You can also learn about these data with the appropriate methods.
snapshotDate(eh)
## [1] "2017-10-30"
If you are interested in using an older version of a snapshot, you can list previous versions with possibleDates(), like this:
pd <- possibleDates(eh)
pd
## [1] "2016-02-23" "2016-06-07" "2016-07-14" "2016-07-21" "2016-08-08"
## [6] "2016-10-01" "2017-06-09" "2017-08-25" "2017-10-06" "2017-10-10"
## [11] "2017-10-12" "2017-10-16" "2017-10-19" "2017-10-26" "2017-10-30"
## [16] "2017-10-30"
Set the date like this:
snapshotDate(eh) <- pd[1]
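Changing the snapshot date changes which records the hub object exposes, so it can be useful to remember the original date and restore it afterwards. This is a short sketch under the assumption that the earliest date returned by possibleDates() is still accepted by the installed package version; it needs network access to the hub server.

```r
library(ExperimentHub)

eh <- ExperimentHub()

# Remember the current (latest) snapshot and its record count.
latest <- snapshotDate(eh)
n_latest <- length(eh)

# Switch to the earliest available snapshot.
pd <- possibleDates(eh)
snapshotDate(eh) <- pd[1]
n_oldest <- length(eh)

# Restore the original snapshot so later code sees the full hub again.
snapshotDate(eh) <- latest

c(oldest = n_oldest, latest = n_latest)
```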
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] GenomicAlignments_1.14.0 Rsamtools_1.30.0
## [3] Biostrings_2.46.0 XVector_0.18.0
## [5] SummarizedExperiment_1.8.0 DelayedArray_0.4.0
## [7] matrixStats_0.52.2 Biobase_2.38.0
## [9] GenomicRanges_1.30.0 GenomeInfoDb_1.14.0
## [11] IRanges_2.12.0 S4Vectors_0.16.0
## [13] alpineData_1.3.0 ExperimentHub_1.4.0
## [15] AnnotationHub_2.10.0 BiocGenerics_0.24.0
## [17] BiocStyle_2.6.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.13 compiler_3.4.2
## [3] BiocInstaller_1.28.0 zlibbioc_1.24.0
## [5] bitops_1.0-6 tools_3.4.2
## [7] digest_0.6.12 bit_1.1-12
## [9] lattice_0.20-35 RSQLite_2.0
## [11] evaluate_0.10.1 memoise_1.1.0
## [13] tibble_1.3.4 pkgconfig_2.0.1
## [15] rlang_0.1.2 Matrix_1.2-11
## [17] shiny_1.0.5 DBI_0.7
## [19] curl_3.0 yaml_2.1.14
## [21] GenomeInfoDbData_0.99.1 httr_1.3.1
## [23] stringr_1.2.0 knitr_1.17
## [25] grid_3.4.2 rprojroot_1.2
## [27] bit64_0.9-7 R6_2.2.2
## [29] AnnotationDbi_1.40.0 BiocParallel_1.12.0
## [31] rmarkdown_1.6 bookdown_0.5
## [33] blob_1.1.0 magrittr_1.5
## [35] backports_1.1.1 htmltools_0.3.6
## [37] mime_0.5 interactiveDisplayBase_1.16.0
## [39] xtable_1.8-2 httpuv_1.3.5
## [41] stringi_1.1.5 RCurl_1.95-4.8