The NCBI Gene Expression Omnibus (GEO) represents the largest repository of microarray data in existence. One difficulty in dealing with GEO is finding the microarray data that is of interest. As part of the NCBI Entrez search system, GEO can be searched online via web pages or using NCBI Eutils. However, the web search is not as full-featured as it could be, particularly for programmatic access. NCBI Eutils offers another option for finding data within the vast stores of GEO, but it is cumbersome to use, often requiring multiple complicated Eutils calls to get at the relevant information. We have found it absolutely critical to have ready access not just to the microarray data, but to the metadata describing the microarray experiments. To this end we have created GEOmetadb.
If you use GEOmetadb or the associated SQLite database, please cite as follows:
citation("GEOmetadb")
##
## Please cite the following if utilizing the GEOmetadb software:
##
## Zhu Y, Davis S, Stephens R, Meltzer PS, Chen Y. GEOmetadb:
## powerful alternative search engine for the Gene Expression
## Omnibus. Bioinformatics. 2008 Dec 1;24(23):2798-800. doi:
## 10.1093/bioinformatics/btn520. Epub 2008 Oct 7. PubMed PMID:
## 18842599; PubMed Central PMCID: PMC2639278.
##
## A BibTeX entry for LaTeX users is
##
## @Article{,
## author = {Yuelin Zhu and Sean Davis and Robert Stephens and Paul S. Meltzer and Yidong Chen},
## title = {GEOmetadb: powerful alternative search engine for the Gene Expression Omnibus.},
## journal = {Bioinformatics (Oxford, England)},
## year = {2008},
## month = {Dec},
## day = {01},
## volume = {24},
## number = {23},
## pages = {2798--2800},
## abstract = {The NCBI Gene Expression Omnibus (GEO) represents the largest public repository of microarray data. However, finding data in GEO can be challenging. We have developed GEOmetadb in an attempt to make querying the GEO metadata both easier and more powerful. All GEO metadata records as well as the relationships between them are parsed and stored in a local MySQL database. A powerful, flexible web search interface with several convenient utilities provides query capabilities not available via NCBI tools. In addition, a Bioconductor package, GEOmetadb that utilizes a SQLite export of the entire GEOmetadb database is also available, rendering the entire GEO database accessible with full power of SQL-based queries from within R.},
## issn = {1367-4811},
## doi = {10.1093/bioinformatics/btn520},
## url = {http://www.ncbi.nlm.nih.gov/pubmed/18842599},
## language = {eng},
## }
The GEOmetadb package is an attempt to make querying the metadata describing microarray experiments, platforms, and datasets both easier and more powerful. At the heart of GEOmetadb is a SQLite database that stores nearly all the metadata associated with all GEO data types including GEO samples (GSM), GEO platforms (GPL), GEO data series (GSE), and curated GEO datasets (GDS), as well as the relationships between these data types. This database is generated by our server by parsing all the records in GEO and needs to be downloaded via a simple helper function to the user’s local machine before GEOmetadb is useful. Once this is done, the entire GEO database is accessible with simple SQL-based queries. With the GEOmetadb database, queries that are simply not possible using NCBI tools or web pages are often quite simple.
The relationships between the tables in the GEOmetadb SQLite database can be seen in the following entity-relationship diagram.
A very typical problem for large-scale consumers of GEO data is to determine the relationships between various GEO accession types. As examples, consider the following questions:
Because these types of questions are common, GEOmetadb contains the function geoConvert
that addresses these questions directly and efficiently.
We have faithfully parsed and maintained in GEO when creating GEOmetadb. This means that limitations inherent to GEO are also inherent in GEOmetadb. We have made no attempt to curate, semantically recode, or otherwise “clean up”" GEO; to do so would require significant resources, which we do not have.
GEOmetadb does not contain any microarray data. For access to microarray data from within R/Bioconductor, please look at the GEOquery package. In fact, we would expect that many users will find that the combination of GEOmetadb and GEOquery is quite powerful.
Once GEOmetadb is installed (see the Bioconductor website for full installation instructions), we are ready to begin.
This package does not come with a pre-installed version of the database. This has the advantage that the user will get the most up-to-date version of the database to start; the database can be re-downloaded using the same command as often as desired. First, load the library.
library(GEOmetadb)
The download and uncompress steps are done automatically with a single command, getSQLiteFile
.
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
## Unzipping...
## Metadata associate with downloaded file:
## name value
## 1 schema version 1.0
## 2 creation timestamp 2016-10-15 17:17:08
## [1] "/tmp/Rtmpd1n2fB/Rbuild40684d8013c8/GEOmetadb/vignettes/GEOmetadb.sqlite"
The default storage location is in the current working directory and the default filename is “GEOmetadb.sqlite”; it is best to leave the name unchanged unless there is a pressing reason to change it.
Since this SQLite file is of key importance in GEOmetadb, it is perhaps of some interest to know some details about the file itself.
file.info('GEOmetadb.sqlite')
## size isdir mode
## GEOmetadb.sqlite 5204996096 FALSE 644
## mtime
## GEOmetadb.sqlite 2016-10-17 18:23:11
## ctime
## GEOmetadb.sqlite 2016-10-17 18:23:11
## atime uid gid
## GEOmetadb.sqlite 2016-10-17 18:23:11 1004 1004
## uname grname
## GEOmetadb.sqlite biocbuild biocbuild
Now, the SQLite file is available for connection. The standard DBI functionality as implemented in RSQLite function dbConnect
makes the connection to the database. The dbDisconnect
function disconnects the connection.
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
dbDisconnect(con)
## [1] TRUE
The variable con
is an RSQLite connection object.
The Structured Query Language, or SQL, is a very powerful and standard way of working with relational data. GEO is composed of several data types, all of which are related to each other; in fact, NCBI uses a relational SQL database for metadata storage and querying. SQL databases and SQL itself are designed specifically to work efficiently with just such data. While the goal of many programming projects and programmers is to hide the details of SQL from the user, we are of the opinion that such efforts may be counterproductive, particularly with complex data and the need for ad hoc queries, both of which are characteristics with GEO metadata. We have taken the view that exposing the power of SQL will enable users to maximally utilize the vast data repository that is GEO. We understand that many users are not accustomed to working with SQL and, therefore, have devoted a large section of the vignette to working examples. Our goal is not to teach SQL, so a quick tutorial of SQL is likely to be beneficial to those who have not used it before. Many such tutorials are available online and can be completed in 30 minutes or less.
The functionality covered in this section is covered in much more detail in the DBI and RSQLite package documentation. We cover enough here only to be useful.
Again, we connect to the database.
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
The dbListTables
function lists all the tables in the SQLite database handled by the connection object con
.
geo_tables <- dbListTables(con)
geo_tables
## [1] "gds" "gds_subset"
## [3] "geoConvert" "geodb_column_desc"
## [5] "gpl" "gse"
## [7] "gse_gpl" "gse_gsm"
## [9] "gsm" "metaInfo"
## [11] "sMatrix"
There is also the dbListFields
function that can list database fields associated with a table.
dbListFields(con,'gse')
## [1] "ID" "title"
## [3] "gse" "status"
## [5] "submission_date" "last_update_date"
## [7] "pubmed_id" "summary"
## [9] "type" "contributor"
## [11] "web_link" "overall_design"
## [13] "repeats" "repeats_sample_list"
## [15] "variable" "variable_description"
## [17] "contact" "supplementary_file"
Sometimes it is useful to get the actual SQL schema associated with a table. Here, we get the table schema for the gpl
table.
dbGetQuery(con,'PRAGMA TABLE_INFO(gpl)')
## cid name type notnull dflt_value pk
## 1 0 ID REAL 0 <NA> 0
## 2 1 title TEXT 0 <NA> 0
## 3 2 gpl TEXT 0 <NA> 0
## 4 3 status TEXT 0 <NA> 0
## 5 4 submission_date TEXT 0 <NA> 0
## 6 5 last_update_date TEXT 0 <NA> 0
## 7 6 technology TEXT 0 <NA> 0
## 8 7 distribution TEXT 0 <NA> 0
## 9 8 organism TEXT 0 <NA> 0
## 10 9 manufacturer TEXT 0 <NA> 0
## 11 10 manufacture_protocol TEXT 0 <NA> 0
## 12 11 coating TEXT 0 <NA> 0
## 13 12 catalog_number TEXT 0 <NA> 0
## 14 13 support TEXT 0 <NA> 0
## 15 14 description TEXT 0 <NA> 0
## 16 15 web_link TEXT 0 <NA> 0
## 17 16 contact TEXT 0 <NA> 0
## 18 17 data_row_count REAL 0 <NA> 0
## 19 18 supplementary_file TEXT 0 <NA> 0
## 20 19 bioc_package TEXT 0 <NA> 0
Select 5 records from the gse table and show the first 7 columns.
rs <- dbGetQuery(con,'select * from gse limit 5')
rs[,1:7]
## ID
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## title
## 1 NHGRI_Melanoma_class
## 2 Cerebellar development
## 3 Renal Cell Carcinoma Differential Expression
## 4 Diurnal and Circadian-Regulated Genes in Arabidopsis
## 5 Global profile of germline gene expression in C. elegans
## gse status submission_date
## 1 GSE1 Public on Jan 22 2001 2001-01-22
## 2 GSE2 Public on Apr 26 2001 2001-04-19
## 3 GSE3 Public on Jul 19 2001 2001-07-19
## 4 GSE4 Public on Jul 20 2001 2001-07-20
## 5 GSE5 Public on Jul 24 2001 2001-07-24
## last_update_date pubmed_id
## 1 2012-02-23 10952317
## 2 2012-02-23 NA
## 3 2012-02-23 11691851
## 4 2012-02-23 11158533
## 5 2012-02-23 11030340
Get the GEO series accession and title from GEO series that were submitted by “Sean Davis”. The “%” sign is used in combination with the “like” operator to do a “wildcard” search for the name “Sean Davis” with any number of characters before or after or between “Sean” and “Davis”.
rs <- dbGetQuery(con,paste("select gse,title from gse where",
"contributor like '%Sean%Davis%'",sep=" "))
rs
## gse
## 1 GSE2553
## 2 GSE4406
## 3 GSE5357
## 4 GSE7376
## 5 GSE7882
## 6 GSE8486
## 7 GSE9328
## 8 GSE14543
## 9 GSE15621
## 10 GSE16087
## 11 GSE16088
## 12 GSE16091
## 13 GSE16102
## 14 GSE18544
## 15 GSE19063
## 16 GSE20016
## 17 GSE22520
## 18 GSE25127
## 19 GSE25164
## 20 GSE29428
## 21 GSE41795
## 22 GSE5481
## 23 GSE33249
## title
## 1 NHGRI_Sarcoma_Baird
## 2 Gene expression profiling of CD4+ T-cells and GM6990 lymphoblastoid cell lines
## 3 NHGRI Menin ChIP-Chip
## 4 Detection of novel amplification units in prostate cancer
## 5 Gene Expression and Comparative Genomic Hybridization of Ductal Carcinoma In Situ of the Breast
## 6 Whole genome DNAse hypersensitivity in human CD4+ T-cells
## 7 ATF2 knockout in papillomas
## 8 A molecular function map of Ewing’s Sarcoma
## 9 Acute Lymphocytic Leukemia versus associated xenografts
## 10 Gene expression profiles of canine osteosarcoma
## 11 Gene expression profiles of human osteosarcoma
## 12 Gene expression profiles of human osteosarcoma, set2
## 13 Gene expression profiles of canine and human osteosarcoma
## 14 Expression Profiling of a Mouse Xenograft Model of “Triple-Negative” Breast Cancer Brain Metastases With Vorinostat
## 15 Genome-wide map of PAX3-FKHR binding sites in rhabdomyosarcoma
## 16 Analyses of Human Brain Metastases of Breast Cancer Reveal the Association between HK2 Up-Regulation and Poor Prognosis
## 17 Mouse Models of Alveolar/Embryonal Rhabdomyosarcoma & Spindle Cell Sarcomas
## 18 Ewing Sarcoma cell lines treated with mithramycin
## 19 UV effects in mouse melanocytes
## 20 A methyl-deviator epigenotype of estrogen-receptor-positive breast carcinoma is associated with malignant biology
## 21 Accelerated high-yield generation of limb-innervating motor neurons from human stem cells
## 22 BRAF siRNA NHGRI_Liang_BRAF
## 23 High resolution comparative genomic hybridisation analysis of ductal carcinoma in situ of the breast
As another example, GEOmetadb can find all samples on GPL96 (Affymetrix hgu133a) that have .CEL files available for download.
rs <- dbGetQuery(con,paste("select gsm,supplementary_file",
"from gsm where gpl='GPL96'",
"and supplementary_file like '%CEL.gz'"))
dim(rs)
## [1] 27958 2
But why limit to only GPL96? Why not look for all Affymetrix arrays that have .CEL files? And list those with their associated GPL information, as well as the Bioconductor annotation package name?
rs <- dbGetQuery(con,paste("select gpl.bioc_package,gsm.gpl,",
"gsm,gsm.supplementary_file",
"from gsm join gpl on gsm.gpl=gpl.gpl",
"where gpl.manufacturer='Affymetrix'",
"and gsm.supplementary_file like '%CEL.gz' "))
rs[1:5,]
## bioc_package gpl gsm
## 1 hu6800 GPL80 GSM575
## 2 hu6800 GPL80 GSM576
## 3 hu6800 GPL80 GSM577
## 4 hu6800 GPL80 GSM578
## 5 hu6800 GPL80 GSM579
## supplementary_file
## 1 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM575/suppl/GSM575.cel.gz
## 2 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM576/suppl/GSM576.cel.gz
## 3 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM577/suppl/GSM577.cel.gz
## 4 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM578/suppl/GSM578.cel.gz
## 5 ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSMnnn/GSM579/suppl/GSM579.cel.gz
Of course, we can combine programming and data access. A simple sapply
example shows how to query each of the tables for number of records.
getTableCounts <- function(tableName,conn) {
sql <- sprintf("select count(*) from %s",tableName)
return(dbGetQuery(conn,sql)[1,1])
}
do.call(rbind,sapply(geo_tables,getTableCounts,con,simplify=FALSE))
## [,1]
## gds 3867
## gds_subset 23357
## geoConvert 8974514
## geodb_column_desc 104
## gpl 16616
## gse 73687
## gse_gpl 88876
## gse_gsm 2325649
## gsm 1966676
## metaInfo 2
## sMatrix 49161
Large-scale consumers of GEO data might want to convert GEO entity type from one to others, e.g. finding all GSM and GSE associated with ‘GPL96’. Function goeConvert does the conversion with a very fast mapping between entity types.
Covert ‘GPL96’ to other possible types in the GEOmetadb.sqlite.
conversion <- geoConvert('GPL96')
Check what GEO types and how many entities in each type in the conversion.
lapply(conversion, dim)
## $gse
## [1] 1062 2
##
## $gsm
## [1] 39514 2
##
## $gds
## [1] 359 2
##
## $sMatrix
## [1] 1015 2
conversion$gse[1:5,]
## from_acc to_acc
## 1 GPL96 GSE1000
## 2 GPL96 GSE10024
## 3 GPL96 GSE10043
## 4 GPL96 GSE10072
## 5 GPL96 GSE10089
conversion$gsm[1:5,]
## from_acc to_acc
## 1 GPL96 GSM1003058
## 2 GPL96 GSM1003059
## 3 GPL96 GSM1003060
## 4 GPL96 GSM1003061
## 5 GPL96 GSM1003062
conversion$gds[1:5,]
## from_acc to_acc
## 1 GPL96 GDS1023
## 2 GPL96 GDS1036
## 3 GPL96 GDS1050
## 4 GPL96 GDS1062
## 5 GPL96 GDS1063
conversion$sMatrix[1:5,]
## from_acc to_acc
## 1 GPL96 GSE1000_series_matrix.txt.gz
## 2 GPL96 GSE10024_series_matrix.txt.gz
## 3 GPL96 GSE10043_series_matrix.txt.gz
## 4 GPL96 GSE10072_series_matrix.txt.gz
## 5 GPL96 GSE10089_series_matrix.txt.gz
The function getBiocPlatformMap is to get GPL information of a given list of Bioconductor microarry annotation packages. Note currently the GEOmetadb does not contains all the mappings, but we are trying to construct a relative complete list.
Get GPL information of ‘hgu133a’ and ‘hgu95av2’:
getBiocPlatformMap(con, bioc=c('hgu133a','hgu95av2'))
## title
## 1 [HG-U133A] Affymetrix Human Genome U133A Array
## 2 [HG_U95A] Affymetrix Human Genome U95A Array
## 3 [HG_U95Av2] Affymetrix Human Genome U95 Version 2 Array
## gpl bioc_package manufacturer organism
## 1 GPL96 hgu133a Affymetrix Homo sapiens
## 2 GPL91 hgu95av2 Affymetrix Homo sapiens
## 3 GPL8300 hgu95av2 Affymetrix Homo sapiens
## data_row_count
## 1 22283
## 2 12626
## 3 12625
Now, for something a bit more complicated, we would like to find all the human breast cancer-related Affymetrix gene expression GEO series.
sql <- paste("SELECT DISTINCT gse.title,gse.gse",
"FROM",
" gsm JOIN gse_gsm ON gsm.gsm=gse_gsm.gsm",
" JOIN gse ON gse_gsm.gse=gse.gse",
" JOIN gse_gpl ON gse_gpl.gse=gse.gse",
" JOIN gpl ON gse_gpl.gpl=gpl.gpl",
"WHERE",
" gsm.molecule_ch1 like '%total RNA%' AND",
" gse.title LIKE '%breast cancer%' AND",
" gpl.organism LIKE '%Homo sapiens%'",sep=" ")
rs <- dbGetQuery(con,sql)
dim(rs)
## [1] 1052 2
print(rs[1:5,],right=FALSE)
## title
## 1 Human mammary epithelium and breast cancer
## 2 Cell-type specific responses to chemotherapeutics in breast cancer
## 3 E2 and SERM Treatment of MCF-7 Breast Cancer Cells
## 4 Estradiol Treated Breast Cancer Cells Expressing Mutant Estrogen Receptors
## 5 Breast Cancer Cell Line Experiment
## gse
## 1 GSE53
## 2 GSE763
## 3 GSE848
## 4 GSE1045
## 5 GSE1299
Since we have all the data from NCBI GEO available, why not look around a bit. The next code block creates a wordcloud using the tm and wordcloud packages based on the titles of all GSE records.
library(tm)
library(wordcloud)
gseTitles = dbGetQuery(con,"select title from gse")
corp = VCorpus(VectorSource(gseTitles))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, function(x)removeWords(x,stopwords()))
term.matrix <- TermDocumentMatrix(corp)
term.matrix <- as.matrix(term.matrix)
v <- sort(rowSums(term.matrix),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
n = 100
wordcloud(d[1:n,]$word,d[1:n,]$freq)
The dplyr package provides a very streamlined functional approach to accessing SQL databases.
Load the dplyr library and connect to the GEOmetadb.sqlite
database. The dplyr package is written around manipulation of tabular data, so we will work with only one GEOmetadb table, the gse
table. Instead of using SQL, we can use dplyr verbs to get information from tables of interest.
library(dplyr)
db = src_sqlite('GEOmetadb.sqlite')
gse = tbl(db,'gse')
filter(gse,gse=='GSE2553')
## Source: query [?? x 18]
## Database: sqlite 3.8.6 [GEOmetadb.sqlite]
##
## ID title gse
## <dbl> <chr> <chr>
## 1 2151 NHGRI_Sarcoma_Baird GSE2553
## # ... with 15 more variables: status <chr>,
## # submission_date <chr>, last_update_date <chr>,
## # pubmed_id <int>, summary <chr>, type <chr>,
## # contributor <chr>, web_link <chr>,
## # overall_design <chr>, repeats <chr>,
## # repeats_sample_list <chr>, variable <chr>,
## # variable_description <chr>, contact <chr>,
## # supplementary_file <chr>
Finally, it is probably a good idea to close the connection. Please see the DBI package for details on this point.
dbDisconnect(con)
## [1] TRUE
If you want to remove old GEOmetadb.sqlite file before retrieve a new version from the server, execute the following codes:
file.remove('GEOmetadb.sqlite')
sessionInfo()
## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8
## [2] LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8
## [4] LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8
## [8] LC_NAME=C
## [9] LC_ADDRESS=C
## [10] LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8
## [12] LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats graphics grDevices utils
## [6] datasets methods base
##
## other attached packages:
## [1] dplyr_0.5.0 wordcloud_2.5
## [3] RColorBrewer_1.1-2 tm_0.6-2
## [5] NLP_0.1-9 GEOmetadb_1.34.0
## [7] RSQLite_1.0.0 DBI_0.5-1
## [9] GEOquery_2.40.0 Biobase_2.34.0
## [11] BiocGenerics_0.20.0 knitr_1.14
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.7 magrittr_1.5 R6_2.2.0
## [4] stringr_1.1.0 httr_1.2.1 tools_3.3.1
## [7] htmltools_0.3.5 lazyeval_0.2.0 yaml_2.1.13
## [10] assertthat_0.1 digest_0.6.10 tibble_1.2
## [13] formatR_1.4 bitops_1.0-6 codetools_0.2-15
## [16] RCurl_1.95-4.8 evaluate_0.10 slam_0.1-38
## [19] rmarkdown_1.1 stringi_1.1.2 XML_3.98-1.4