Generating `curatedAdipoChIP`

Search starategy & data collection

The term “3T3-L1” was used to search the NCBI SRA repository. The results were sent to the run selector. 1,176 runs were viewed. The runs were faceted by Assay Type and the “chip-seq” which resulted in 739 runs. Only 235 samples from 20 different studies were included after being manually reviewed to fit the following criteria: * The raw data is available from GEO and has a GEO identifier (GSM#) * The raw data is linked to a published publicly available article * The protocols for generating the data sufficiently describe the origin of the cell line, the differentiation medium and the time points when the samples were collected. * In case the experimental designs included treatment other than the differentiation medias, the control (non-treated) samples were included.

Note: The data quality and the platform discrepancies are not included in these criteria

Pre-processing

The scripts to download and process the raw data are located in inst/scripts/ and are glued together to run sequentially by the GNU make file Makefile. The following is basically a description of the recipes in the Makefile with emphasis on the software versions, options, inputs and outputs.

1. Downloading data `download_fastq`

Program: wget (1.18)
Input: run.csv, the URLs column
Output: *.fastq.gz
Options: -N

2. Making a genome index `make_index`

Program: bowtie2-build (2.3.0)
Input: URL for mm10 mouse genome fasta files
Output: *.bt2 bowtie2 index for the mouse genome
Options: defaults

3. Dowinloading annotations `get_annotation`

Program: wget (1.18)
Input: URL for mm10 gene annotation file
Output: annotation.gtf
Options: -N

4. Aligning reads `align_reads`

Program: bowtie2 (2.3.0)
Input: *.fastq.gz and mm10/ bowtie2 index for the mouse genome
Output: *.sam
Options: --no-unal

5. Calling peaks `peak_calling`

Program: macs2 (2.1.2)
Input: *.bam and
Output: peaks.bed
Options: -B --nomodel --SPMR

6. Making a peakset `get_peakset`

Program: bedtools (2.26.0)
Input: *.bed
Output: peakset.bed
Options: -c 4,4 -o count_distinct,distinct

7. Counting reads in peaks `count_features`

Program: bedtools multicov (2.26.0)
Input: *.bam and peakset.bed
Output: peakset.txt
Options: defaults

8. Quality assessment `fastqc`

Program: fastqc (0.11.5)
Input: *.fastq.gz and *.sam
Output: *_fastqc.zip
Option: defaults

Processing

The aim of this step is to construct a self-contained object with minimal manipulations of the pre-processed data followed by a simple exploration of the data in the next section.

Making a summarized experiment object `peak_object`

The required steps to make this object from the pre-processed data are documented in the script and are supposed to be fully reproducible when run through this package. The output is a RangedSummarizedExperiment object containing the peak counts and the phenotype and features data and metadata.

The RangedSummarizedExperiment contains * The gene counts matrix peak_counts * The phenotype data colData. The column name links each peak to one or more samples * The feature data rowRanges * The metadata metadata which contain a data.frame of the studies from which the samples were collected.

Exploring the `peak_counts` object

In this section, we conduct a simple exploration of the data objects to show the content of the package and how they can be loaded and used.

# loading required libraries
library(ExperimentHub)
library(SummarizedExperiment)
library(S4Vectors)
library(fastqcr)
library(DESeq2)
library(dplyr)
library(tidyr)
library(ggplot2)

# query package resources on ExperimentHub
eh <- ExperimentHub()
#> snapshotDate(): 2020-10-02
query(eh, "curatedAdipoChIP")
#> ExperimentHub with 5 records
#> # snapshotDate(): 2020-10-02
#> # $dataprovider: GEO, SRA
#> # $species: Mus musculus
#> # $rdataclass: SummarizedExperiment, RangedSummarizedExperiment
#> # additional mcols(): taxonomyid, genome, description,
#> #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#> #   rdatapath, sourceurl, sourcetype 
#> # retrieve records with, e.g., 'object[["EH2258"]]' 
#> 
#>            title                                                              
#>   EH2258 | A Curated ChIP-Seq Dataset of MDI-induced Differentiated Adipocy...
#>   EH3286 | A Curated Microarrays Dataset  of MDI-induced Differentiated Adi...
#>   EH3287 | A Curated Microarrays Dataset (processed) of MDI-induced Differe...
#>   EH3288 | A Curated Microarrays Dataset  of MDI-induced Differentiated Adi...
#>   EH3289 | A Curated Microarrays Dataset (processed) of MDI-induced Differe...

# load data from ExperimentHub
peak_counts <- query(eh, "curatedAdipoChIP")[[1]]
#> see ?curatedAdipoChIP and browseVignettes('curatedAdipoChIP') for documentation
#> loading from cache

# print object
peak_counts
#> class: RangedSummarizedExperiment 
#> dim: 837030 207 
#> metadata(1): studies
#> assays(1): peak_counts
#> rownames(837030): peak_1 peak_2 ... peak_837855 peak_837856
#> rowData names(10): peak name ... geneId distanceToTSS
#> colnames(207): GSM1017630 GSM1017631 ... GSM838021 GSM838022
#> colData names(11): id runs ... control_type qc

The count matrix can be accessed using assay. Here we show the first five entries of the first five samples.

# print count matrix
assay(peak_counts)[1:5, 1:5]
#>        GSM1017630 GSM1017631 GSM1017632 GSM1017633 GSM1017634
#> peak_1          0          0          7          2          3
#> peak_2          0         10         28         24          5
#> peak_3          0          1          9         13          2
#> peak_4          4          9          9         20          6
#> peak_5          1          1         14         10          2

The phenotype/samples data is a data.frame, It can be accessed using colData. The time and stage columns encode the time point in hours and stage of differentiation respectively. The column factor records the ChIP antibody used in the sample.

# names of the coldata object
names(colData(peak_counts))
#>  [1] "id"           "runs"         "study"        "pmid"         "factor"      
#>  [6] "time"         "stage"        "bibtexkey"    "control_id"   "control_type"
#> [11] "qc"

# table of times column
table(colData(peak_counts)$time)
#> 
#> -48   0   2   4   6  24  48  72  96 144 168 
#>   2  46   1  43   1   4  21   3   9   7  22

# table of stage column
table(colData(peak_counts)$stage)
#> 
#>  0  1  2  3 
#> 67 73  9 40

# table of factor column
table(colData(peak_counts)$factor)
#> 
#>      ATF2      ATF7     CEBPA     CEBPB     CEBPD     CREB1      CTCF      E2F4 
#>         2         1         2         9         2         3        10         1 
#>     EP300     FOSL2      GPS2   H3K27ac  H3K27me3  H3K36me3   H3K4me1   H3K4me2 
#>         6         2         2         6         6         3         7         5 
#>   H3K4me3  H3K79me2  H3K79me3   H3K9me2   H3K9me3  H4K20me1     HDAC2     HDAC3 
#>        12         1         1         1         4         1         4         4 
#>       JUN      JUND     KDM1A     KDM5A     KDM5C      KLF4      KLF5     KMT2B 
#>         2         1         2         6         2         1         1         1 
#>      MBD1      MED1     NCOR1     NR3C1      NRF1      PBX1    POLR2A     PPARG 
#>         2        14         5         1         1         1        12        13 
#>     PSMB1      RXRG    SETDB1   SMARCA4     SMC1A     STAT1    STAT5A      TCF7 
#>         1         8         3         1        10         1         1         1 
#> Ubiquitin       VDR 
#>         2         1

Other columns in colData are selected information about the samples/runs or identifiers to different databases. The following table provides the description of each of these columns.

col_name	description
id	The GEO sample identifier.
runs	A list of tibbles with information about individual run files.
study	The SRA study identifier.
pmid	The PubMed ID of the article where the data were published originally.
factor	The name of the ChIP antibody that was used in the sample.
time	The time point of the sample when collected in hours. The time is recorded from the beginning of the protocol as 0 hours.
stage	The stage of differentiation of the sample when collected. Possible values are 0 to 3; 0 for non-differentiated; 1 for differentiating; and 2/3 for maturating samples.
bibtexkey	The key of the study where the data were published originally. This maps to the studies object of the metadata which records the study information in bibtex format.
control_id	The GEO sample identifier of the control (input) sample.
control_type	The method for assigning the control samples. Possible values are “provided” when the same study has control sample/s or “other” when control samples were selected from other studies.
library_layout	The type of RNA library. Possible values are SINGLE for single-end and PAIRED for paired-end runs.
instrument_model	The name of the sequencing machine that was used to obtain the sequence reads.
qc	The quality control output of fastqc on the separate files/runs.

The features data are a GRanges object and can be accessed using rowRanges.

# print GRanges object
rowRanges(peak_counts)
#> GRanges object with 837030 ranges and 10 metadata columns:
#>               seqnames            ranges strand |
#>                  <Rle>         <IRanges>  <Rle> |
#>        peak_1     chr1   3029191-3029410      * |
#>        peak_2     chr1   3031394-3031660      * |
#>        peak_3     chr1   3037771-3037971      * |
#>        peak_4     chr1   3049860-3050379      * |
#>        peak_5     chr1   3051655-3051881      * |
#>           ...      ...               ...    ... .
#>   peak_837852     chrY 90830107-90830335      * |
#>   peak_837853     chrY 90830368-90830710      * |
#>   peak_837854     chrY 90834938-90835190      * |
#>   peak_837855     chrY 90835545-90836986      * |
#>   peak_837856     chrY 90839338-90840071      * |
#>                                               peak        name
#>                                    <CharacterList> <character>
#>        peak_1                GSM2515964,GSM2515966      peak_1
#>        peak_2 GSM1370450,GSM1893625,GSM1893626,...      peak_2
#>        peak_3                           GSM2515948      peak_3
#>        peak_4 GSM1412513,GSM2515948,GSM2515949,...      peak_4
#>        peak_5                           GSM2515966      peak_5
#>           ...                                  ...         ...
#>   peak_837852                            GSM340808 peak_837852
#>   peak_837853                            GSM340808 peak_837853
#>   peak_837854                            GSM686970 peak_837854
#>   peak_837855 GSM1368009,GSM1368013,GSM1368014,... peak_837855
#>   peak_837856 GSM1370470,GSM1412512,GSM2233369,... peak_837856
#>                      annotation   geneChr geneStart   geneEnd geneLength
#>                     <character> <integer> <integer> <integer>  <integer>
#>        peak_1 Distal Intergenic         1   3214482   3671498     457017
#>        peak_2 Distal Intergenic         1   3214482   3671498     457017
#>        peak_3 Distal Intergenic         1   3214482   3671498     457017
#>        peak_4 Distal Intergenic         1   3214482   3671498     457017
#>        peak_5 Distal Intergenic         1   3214482   3671498     457017
#>           ...               ...       ...       ...       ...        ...
#>   peak_837852 Distal Intergenic        21  90785442  90816465      31024
#>   peak_837853 Distal Intergenic        21  90785442  90816465      31024
#>   peak_837854 Distal Intergenic        21  90785442  90816465      31024
#>   peak_837855 Distal Intergenic        21  90785442  90816465      31024
#>   peak_837856 Distal Intergenic        21  90785442  90816465      31024
#>               geneStrand      geneId distanceToTSS
#>                <integer> <character>     <numeric>
#>        peak_1          2        Xkr4        642088
#>        peak_2          2        Xkr4        639838
#>        peak_3          2        Xkr4        633527
#>        peak_4          2        Xkr4        621119
#>        peak_5          2        Xkr4        619617
#>           ...        ...         ...           ...
#>   peak_837852          1       Erdr1         44665
#>   peak_837853          1       Erdr1         44926
#>   peak_837854          1       Erdr1         49496
#>   peak_837855          1       Erdr1         50103
#>   peak_837856          1       Erdr1         53896
#>   -------
#>   seqinfo: 64 sequences from an unspecified genome; no seqlengths

Notice there are two types of data in this object. The first is the coordinates of the identified peaks ranges(peak_counts). The second is the annotation of the these regions mcols(peak_counts). The following table show the description of the second annotation item. Except for the first three columns, all annotations were obtained using ChIPSeeker::annotatePeak as described in the inst/scripts.

col_name	description
peak	The GEO sample identifier of the samples containing this peak in a CharacterList.
name	The name of the peak in the set of unique merged peaks.
annotation	genomic feature of the peak, for instance if the peak is located in 5’UTR, it will annotated by 5’UTR. Possible annotation is Promoter-TSS, Exon, 5’ UTR, 3’ UTR, Intron, and Intergenic.
geneChr	Chromosome of the nearest gene
geneStart	Gene start
geneEnd	Gene end
geneLength	Gene length
geneStrand	Gene Strand
geneId	Gene symbol
distanceToTSS	Distance from peak to gene TSS

The metadata is a list of one object. studies is a data.frame containing the bibliography information of the studies from which the data were collected. Here we show the first entry in the studies object.

# print data of first study
metadata(peak_counts)$studies[1,]
#> # A tibble: 1 x 34
#>   CATEGORY BIBTEXKEY ADDRESS ANNOTE AUTHOR BOOKTITLE CHAPTER CROSSREF EDITION
#>   <chr>    <chr>     <chr>   <chr>  <list> <chr>     <chr>   <chr>    <chr>  
#> 1 ARTICLE  Brier2017 <NA>    <NA>   <chr … <NA>      <NA>    <NA>     <NA>   
#> # … with 25 more variables: EDITOR <list>, HOWPUBLISHED <chr>,
#> #   INSTITUTION <chr>, JOURNAL <chr>, KEY <chr>, MONTH <chr>, NOTE <chr>,
#> #   NUMBER <chr>, ORGANIZATION <chr>, PAGES <chr>, PUBLISHER <chr>,
#> #   SCHOOL <chr>, SERIES <chr>, TITLE <chr>, TYPE <chr>, VOLUME <chr>,
#> #   YEAR <dbl>, ARCHIVEPREFIX <chr>, ARXIVID <chr>, DOI <chr>, EPRINT <chr>,
#> #   ISBN <chr>, ISSN <chr>, PMID <chr>, URL <chr>

Summary of the studies in the dataset

#> `summarise()` ungrouping output (override with `.groups` argument)

GEO series ID	PubMed ID	Num. of Samples	Time (hr)	Differentiation Stage	Factor
SRP000630	18981474	18	0/ 24/ 48/ 72/ 96/ 144	0/1/2/3	PPARG/ RXRG/ POLR2A
SRP002283	20442865	1	NA	3	E2F4
SRP002337	20887899	15	-48/ 0/ 48/ 168	0/1/3	H3K4me3/ H3K27me3/ H3K36me3/ H3K4me2/ H3K4me1/ H3K27ac/ PPARG
SRP002507	20478996	2	0/ 6	0/1	CEBPB
SRP006001	21427703	13	0/ 2/ 4/ 48/ 144	0/1/3	CEBPB/ CEBPD/ NR3C1/ STAT5A/ RXRG/ PPARG/ POLR2A
SRP008061	21914845	1	24	1	TCF7
SRP009613	24315104	6	NA	0	PPARG/ JUN/ CREB1/ PSMB1/ Ubiquitin
SRP016054	23178591	4	NA/ 168	0/3	H3K4me3/ H3K27me3/ H3K9me2/ H3K9me3
SRP028367	23885096	7	168	3	PPARG/ MED1/ CEBPA/ POLR2A/ CREB1
SRP029985	24912735	3	0/ 168	0/3	KDM1A/ NRF1
SRP041129	24788520	10	NA	3	MED1/ CREB1/ EP300/ NCOR1/ CEBPA/ CEBPB/ ATF2/ JUND/ FOSL2
SRP041249	24857652	19	4	1	ATF2/ ATF7/ JUN/ FOSL2/ KLF4/ KLF5/ PBX1/ STAT1/ VDR/ RXRG/ MED1/ EP300/ SMARCA4/ H3K27ac/ H3K4me1/ H3K4me2
SRP042079	24953653	2	0	0	GPS2
SRP043216	25503565	8	NA	0	H3K27ac/ H3K27me3/ H3K36me3/ H3K4me1/ H3K4me3/ H3K79me2/ H3K79me3/ H4K20me1
SRP064188	26590716	11	0/ 144	0/3	H3K27me3/ H3K9me3/ SETDB1/ MBD1/ POLR2A
SRP065028	28398509	1	168	3	KMT2B
SRP078506	27899593	14	0/ 4/ 48	0/1	H3K4me3/ KDM5A/ KDM5C
SRP080809	28107648	2	NA	0	CEBPB
SRP100871	28475875	52	4/ 0/ 48/ 96/ 168	1/0/2/3	CTCF/ H3K27ac/ H3K4me1/ H3K4me2/ HDAC2/ HDAC3/ MED1/ NCOR1/ EP300/ SMC1A

Some of the samples included in this object are control samples.

# show the number of control samples
table(is.na(peak_counts$control_id))
#> 
#> FALSE  TRUE 
#>   189    18

Also, some entries are recorded as NA such as in the column time when the the data are missing or couldn’t be determined from the metadata.

Example of using `curatedAdipoChIP`

Motivation

All the samples in this dataset come from the 3T3-L1 cell line. The MDI induction media, were used to induce adipocyte differentiation. The two important variables in the dataset are time and stage, which correspond to the time point and stage of differentiation when the sample were captured. Ideally, this dataset should be treated as a time course. However, for the purposes of this example, we only used samples from two differentiation stages 0 and 1 for the transcription factor (CEBPB) and treated them as independent groups. The goal of this example is to show how a typical differential binding analysis can be applied in the dataset. The main focus is to explain how the the data and metadata in peak_counts fit in each main piece of the analysis. We started by sub-setting the object to the samples of interest, filtering the low quality samples and low count genes. Then we applied the DESeq2 method with the default values.

Subsetting the object to samples and peaks of interest

First, we subset the peak_counts object to all samples in the differentiation stage 0 or 1 for the transcription factor CEBPB. Using the samples IDs we get chose only the features/peaks that was found in at least one of samples. Notice that other criteria for choosing peaks can be applied depending on the purpose of the analysis. The total number of samples is 8 with 4 samples in each stage. The total number of peaks is 49028.

# subset peaks_count object
# select samples for the factor CEBPB and stage 0 and 1
sample_ind <- (peak_counts$factor == 'CEBPB') & (peak_counts$stage %in% c(0, 1))
sample_ids <- colnames(peak_counts)[sample_ind]

# select peaks from the selected samples
peak_ind <- lapply(mcols(peak_counts)$peak, function(x) sum(sample_ids %in% x))
peak_ind <- unlist(peak_ind) > 2

# subset the object
se <- peak_counts[peak_ind, sample_ind]

# show the number of samples in each group
table(se$stage)
#> 
#> 0 1 
#> 4 4

# show the number of peak in each sample
table(unlist(mcols(se)$peak))[sample_ids]
#> 
#> GSM2257705 GSM2257706  GSM544720  GSM544721  GSM686970  GSM686971  GSM686972 
#>        749        677       1806      14251      14074      15381       7653 
#>  GSM686973 
#>       3367

Filtering low quality samples

Since the quality metrics are reported per run file, we need to get the SSR* id for each of the samples. Notice that, some samples would have more than one file. In this case because some of the samples are paired-end, so each of them would have two files SRR\*_1 and SRR\*_2. This kind of information is available in colData.

# check the number of files in qc
qc <- se$qc
table(lengths(qc))
#> 
#> 1 2 
#> 6 2

# flattening qc list
qc <- unlist(qc, recursive = FALSE)
length(qc)
#> [1] 10

The qc object of the metadata contains the output of fastqc in a qc_read class. More information on this object can be accessed by calling ?fastqcr::qc_read. Here, we only use the per_base_sequence_quality to filter out low quality samples. This is by no means enough quality control but it should drive the point home.

# extracting per_base_sequence_quality
per_base <- lapply(qc, function(x) {
  df <- x[['per_base_sequence_quality']]
  df %>%
    dplyr::select(Base, Mean) %>%
    transform(Base = strsplit(as.character(Base), '-')) %>%
    unnest(Base) %>%
    mutate(Base = as.numeric(Base))
}) %>%
  bind_rows(.id = 'run')

After tidying the data, we get a data.frame with three columns; run, Mean and Base for the run ID, the mean quality score and the base number in each read. fastqc provide thorough documentation of this quality control module and others.

# a quick look at quality scores
summary(per_base)
#>      run                 Base           Mean      
#>  Length:362         Min.   : 1.0   Min.   :28.24  
#>  Class :character   1st Qu.:10.0   1st Qu.:31.90  
#>  Mode  :character   Median :19.0   Median :34.80  
#>                     Mean   :18.6   Mean   :33.96  
#>                     3rd Qu.:28.0   3rd Qu.:35.57  
#>                     Max.   :37.0   Max.   :36.50

To identify the low quality samples, we categorize the runs run_average which correspond to the average of the per base mean scores. The following figure should make it easier to see why this cutoff were used in this case.

# find low quality runs
per_base <- per_base %>%
  group_by(run) %>%
  mutate(run_average = mean(Mean) > 34)

# plot average per base quality
per_base %>%
  ggplot(aes(x = Base, y = Mean, group = run, color = run_average)) +
  geom_line() +
  lims(y = c(0,40))

The run IDs of the “bad” samples is then used to remove them from the dataset.

# get run ids of low quality samples
bad_runs <- unique(per_base$run[per_base$run_average == FALSE])
bad_samples <- lapply(se$runs, function(x) sum(x$run %in% bad_runs))

# subset the counts object
se2 <- se[, unlist(bad_samples) == 0]

# show remaining samples by stage
table(se2$stage) 
#> 
#> 0 1 
#> 3 3

Filtering peaks with low counts

To identify the low count feature/peaks, we keep only the features with at least 10 reads in 2 or more samples. Then we subset the object to exclude the rest.

# filtering low count genes
low_counts <- apply(assay(se2), 1, function(x) length(x[x>10])>=2)
table(low_counts)
#> low_counts
#> FALSE  TRUE 
#>    45 15435

# subsetting the count object
se3 <- se2[low_counts,]

Check sample reproducibility

One expects that samples from the same group should reflect similar biology. In fact, this is also expected for the differential binding analysis to be reliable. In the context of ChIP-Seq, several measures were proposed to quantify the similarities and discrepancies between the samples, check this article for more. Here, we use scatter plots of the pairs of samples in each group to show that there is a general agreement in terms of the counts in the peakset.

# plot scatters of samples from each group
par(mfrow = c(2,3))
lapply(split(colnames(se3), se3$stage), function(x) {
  # get counts of three samples
  y1 <- assay(se3)[, x[1]]
  y2 <- assay(se3)[, x[2]]
  y3 <- assay(se3)[, x[3]]
  
  # plot scatters of pairs of samples
  plot(log10(y1 + 1), log10(y2 + 1), xlab = x[1], ylab = x[2])
  plot(log10(y1 + 1), log10(y3 + 1), xlab = x[1], ylab = x[3])
  plot(log10(y2 + 1), log10(y3 + 1), xlab = x[2], ylab = x[3])
})

#> $`0`
#> NULL
#> 
#> $`1`
#> NULL

Applying differential binding using `DESeq2`

DESeq2 is a well documented and widely used R package for the differential binding analysis of ChIP-Seq data. Here we use the default values of DESeq to find the peaks which are deferentially bound in stage 1 compared to 0.

# differential binding analysis
colData(se3) <- colData(se3)[, -2]
se3$stage <- factor(se3$stage)
dds <- DESeqDataSet(se3, ~stage)
#> renaming the first element in assays to 'counts'
#> converting counts to integer mode
dds <- DESeq(dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
res <- results(dds)
table(res$padj < .1)
#> 
#> FALSE  TRUE 
#> 13297  1240

Next!

In this example, we didn’t attempt to correct for the between study factors that might confound the results. To show how is this possible, we use the PCA plots with a few of these factors in the following graphs. The first uses the stage factor which is the factor of interest in this case. We see that the DESeq transformation did a good job separating the samples to their expected groups. However, it also seems that the stage is not the only factor in play. For example, we show in the second and the third graphs two other factors library_layout and instrument_model which might explain some of the variance between the samples. This is expected because the data were collected from different studies using slightly different protocols and different sequencing machines. Therefore, it is necessary to account for these differences to obtain reliable results. There are multiple methods to do that such as Removing Unwanted Variation (RUV) and Surrogate Variable Analysis (SVA).

# explaining variabce 
plotPCA(rlog(dds), intgroup = 'stage')

plotPCA(rlog(dds), intgroup = 'study')

Citing the studies in this subset of the data

Speaking of studies, as mentioned earlier the studies object contains full information of the references of the original studies in which the data were published. Please cite them when using this dataset.

# keys of the studies in this subset of the data
unique(se3$bibtexkey)
#> [1] "luo_parp-1_2017"          "siersbaek_extensive_2011"

Citing `curatedAdipoChIP`

For citing the package use:

# citing the package
citation("curatedAdipoChIP")
#> 
#> To cite package 'curatedAdipoChIP' in publications use:
#> 
#>   Mahmoud Ahmed (2020). curatedAdipoChIP: A Curated ChIP-Seq Dataset of
#>   MDI-induced Differentiated Adipocytes (3T3-L1). R package version
#>   1.6.0. https://github.com/MahShaaban/curatedAdipoChIP
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {curatedAdipoChIP: A Curated ChIP-Seq Dataset of MDI-induced Differentiated Adipocytes
#> (3T3-L1)},
#>     author = {Mahmoud Ahmed},
#>     year = {2020},
#>     note = {R package version 1.6.0},
#>     url = {https://github.com/MahShaaban/curatedAdipoChIP},
#>   }

Session Info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.3 (2020-10-10)
#>  os       Ubuntu 18.04.5 LTS          
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  C                           
#>  ctype    en_US.UTF-8                 
#>  tz       America/New_York            
#>  date     2020-10-29                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package                * version  date       lib source        
#>  annotate                 1.68.0   2020-10-28 [2] Bioconductor  
#>  AnnotationDbi            1.52.0   2020-10-28 [2] Bioconductor  
#>  AnnotationHub          * 2.22.0   2020-10-28 [2] Bioconductor  
#>  assertthat               0.2.1    2019-03-21 [2] CRAN (R 4.0.3)
#>  backports                1.1.10   2020-09-15 [2] CRAN (R 4.0.3)
#>  Biobase                * 2.50.0   2020-10-28 [2] Bioconductor  
#>  BiocFileCache          * 1.14.0   2020-10-28 [2] Bioconductor  
#>  BiocGenerics           * 0.36.0   2020-10-28 [2] Bioconductor  
#>  BiocManager              1.30.10  2019-11-16 [2] CRAN (R 4.0.3)
#>  BiocParallel             1.24.0   2020-10-28 [2] Bioconductor  
#>  BiocVersion              3.12.0   2020-10-28 [2] Bioconductor  
#>  bit                      4.0.4    2020-08-04 [2] CRAN (R 4.0.3)
#>  bit64                    4.0.5    2020-08-30 [2] CRAN (R 4.0.3)
#>  bitops                   1.0-6    2013-08-17 [2] CRAN (R 4.0.3)
#>  blob                     1.2.1    2020-01-20 [2] CRAN (R 4.0.3)
#>  callr                    3.5.1    2020-10-13 [2] CRAN (R 4.0.3)
#>  cli                      2.1.0    2020-10-12 [2] CRAN (R 4.0.3)
#>  colorspace               1.4-1    2019-03-18 [2] CRAN (R 4.0.3)
#>  crayon                   1.3.4    2017-09-16 [2] CRAN (R 4.0.3)
#>  curatedAdipoChIP       * 1.6.0    2020-10-29 [1] Bioconductor  
#>  curl                     4.3      2019-12-02 [2] CRAN (R 4.0.3)
#>  DBI                      1.1.0    2019-12-15 [2] CRAN (R 4.0.3)
#>  dbplyr                 * 1.4.4    2020-05-27 [2] CRAN (R 4.0.3)
#>  DelayedArray             0.16.0   2020-10-28 [2] Bioconductor  
#>  desc                     1.2.0    2018-05-01 [2] CRAN (R 4.0.3)
#>  DESeq2                 * 1.30.0   2020-10-28 [2] Bioconductor  
#>  devtools                 2.3.2    2020-09-18 [2] CRAN (R 4.0.3)
#>  digest                   0.6.27   2020-10-24 [2] CRAN (R 4.0.3)
#>  dplyr                  * 1.0.2    2020-08-18 [2] CRAN (R 4.0.3)
#>  ellipsis                 0.3.1    2020-05-15 [2] CRAN (R 4.0.3)
#>  evaluate                 0.14     2019-05-28 [2] CRAN (R 4.0.3)
#>  ExperimentHub          * 1.16.0   2020-10-28 [2] Bioconductor  
#>  fansi                    0.4.1    2020-01-08 [2] CRAN (R 4.0.3)
#>  farver                   2.0.3    2020-01-16 [2] CRAN (R 4.0.3)
#>  fastmap                  1.0.1    2019-10-08 [2] CRAN (R 4.0.3)
#>  fastqcr                * 0.1.2    2019-01-03 [2] CRAN (R 4.0.3)
#>  fs                       1.5.0    2020-07-31 [2] CRAN (R 4.0.3)
#>  genefilter               1.72.0   2020-10-28 [2] Bioconductor  
#>  geneplotter              1.68.0   2020-10-28 [2] Bioconductor  
#>  generics                 0.0.2    2018-11-29 [2] CRAN (R 4.0.3)
#>  GenomeInfoDb           * 1.26.0   2020-10-28 [2] Bioconductor  
#>  GenomeInfoDbData         1.2.4    2020-10-12 [2] Bioconductor  
#>  GenomicRanges          * 1.42.0   2020-10-28 [2] Bioconductor  
#>  ggplot2                * 3.3.2    2020-06-19 [2] CRAN (R 4.0.3)
#>  glue                     1.4.2    2020-08-27 [2] CRAN (R 4.0.3)
#>  gtable                   0.3.0    2019-03-25 [2] CRAN (R 4.0.3)
#>  highr                    0.8      2019-03-20 [2] CRAN (R 4.0.3)
#>  htmltools                0.5.0    2020-06-16 [2] CRAN (R 4.0.3)
#>  httpuv                   1.5.4    2020-06-06 [2] CRAN (R 4.0.3)
#>  httr                     1.4.2    2020-07-20 [2] CRAN (R 4.0.3)
#>  interactiveDisplayBase   1.28.0   2020-10-28 [2] Bioconductor  
#>  IRanges                * 2.24.0   2020-10-28 [2] Bioconductor  
#>  knitr                    1.30     2020-09-22 [2] CRAN (R 4.0.3)
#>  labeling                 0.4.2    2020-10-20 [2] CRAN (R 4.0.3)
#>  later                    1.1.0.1  2020-06-05 [2] CRAN (R 4.0.3)
#>  lattice                  0.20-41  2020-04-02 [2] CRAN (R 4.0.3)
#>  lifecycle                0.2.0    2020-03-06 [2] CRAN (R 4.0.3)
#>  locfit                   1.5-9.4  2020-03-25 [2] CRAN (R 4.0.3)
#>  magrittr                 1.5      2014-11-22 [2] CRAN (R 4.0.3)
#>  Matrix                   1.2-18   2019-11-27 [2] CRAN (R 4.0.3)
#>  MatrixGenerics         * 1.2.0    2020-10-28 [2] Bioconductor  
#>  matrixStats            * 0.57.0   2020-09-25 [2] CRAN (R 4.0.3)
#>  memoise                  1.1.0    2017-04-21 [2] CRAN (R 4.0.3)
#>  mime                     0.9      2020-02-04 [2] CRAN (R 4.0.3)
#>  munsell                  0.5.0    2018-06-12 [2] CRAN (R 4.0.3)
#>  pillar                   1.4.6    2020-07-10 [2] CRAN (R 4.0.3)
#>  pkgbuild                 1.1.0    2020-07-13 [2] CRAN (R 4.0.3)
#>  pkgconfig                2.0.3    2019-09-22 [2] CRAN (R 4.0.3)
#>  pkgload                  1.1.0    2020-05-29 [2] CRAN (R 4.0.3)
#>  prettyunits              1.1.1    2020-01-24 [2] CRAN (R 4.0.3)
#>  processx                 3.4.4    2020-09-03 [2] CRAN (R 4.0.3)
#>  promises                 1.1.1    2020-06-09 [2] CRAN (R 4.0.3)
#>  ps                       1.4.0    2020-10-07 [2] CRAN (R 4.0.3)
#>  purrr                    0.3.4    2020-04-17 [2] CRAN (R 4.0.3)
#>  R6                       2.5.0    2020-10-28 [2] CRAN (R 4.0.3)
#>  rappdirs                 0.3.1    2016-03-28 [2] CRAN (R 4.0.3)
#>  RColorBrewer             1.1-2    2014-12-07 [2] CRAN (R 4.0.3)
#>  Rcpp                     1.0.5    2020-07-06 [2] CRAN (R 4.0.3)
#>  RCurl                    1.98-1.2 2020-04-18 [2] CRAN (R 4.0.3)
#>  remotes                  2.2.0    2020-07-21 [2] CRAN (R 4.0.3)
#>  rlang                    0.4.8    2020-10-08 [2] CRAN (R 4.0.3)
#>  rmarkdown                2.5      2020-10-21 [2] CRAN (R 4.0.3)
#>  rprojroot                1.3-2    2018-01-03 [2] CRAN (R 4.0.3)
#>  RSQLite                  2.2.1    2020-09-30 [2] CRAN (R 4.0.3)
#>  S4Vectors              * 0.28.0   2020-10-28 [2] Bioconductor  
#>  scales                   1.1.1    2020-05-11 [2] CRAN (R 4.0.3)
#>  sessioninfo              1.1.1    2018-11-05 [2] CRAN (R 4.0.3)
#>  shiny                    1.5.0    2020-06-23 [2] CRAN (R 4.0.3)
#>  stringi                  1.5.3    2020-09-09 [2] CRAN (R 4.0.3)
#>  stringr                  1.4.0    2019-02-10 [2] CRAN (R 4.0.3)
#>  SummarizedExperiment   * 1.20.0   2020-10-28 [2] Bioconductor  
#>  survival                 3.2-7    2020-09-28 [2] CRAN (R 4.0.3)
#>  testthat                 2.3.2    2020-03-02 [2] CRAN (R 4.0.3)
#>  tibble                   3.0.4    2020-10-12 [2] CRAN (R 4.0.3)
#>  tidyr                  * 1.1.2    2020-08-27 [2] CRAN (R 4.0.3)
#>  tidyselect               1.1.0    2020-05-11 [2] CRAN (R 4.0.3)
#>  usethis                  1.6.3    2020-09-17 [2] CRAN (R 4.0.3)
#>  utf8                     1.1.4    2018-05-24 [2] CRAN (R 4.0.3)
#>  vctrs                    0.3.4    2020-08-29 [2] CRAN (R 4.0.3)
#>  withr                    2.3.0    2020-09-22 [2] CRAN (R 4.0.3)
#>  xfun                     0.18     2020-09-29 [2] CRAN (R 4.0.3)
#>  XML                      3.99-0.5 2020-07-23 [2] CRAN (R 4.0.3)
#>  xtable                   1.8-4    2019-04-21 [2] CRAN (R 4.0.3)
#>  XVector                  0.30.0   2020-10-28 [2] Bioconductor  
#>  yaml                     2.2.1    2020-02-01 [2] CRAN (R 4.0.3)
#>  zlibbioc                 1.36.0   2020-10-28 [2] Bioconductor  
#> 
#> [1] /tmp/RtmpQCZwZx/Rinst70953f18acb
#> [2] /home/biocbuild/bbs-3.12-bioc/R/library

Using curatedAdipoChIP

Mahmoud Ahmed

2020-10-29

curatedAdipoChIP

Overview

Introduction

What is `curatedAdipoChIP`?

What is contained in `curatedAdipoChIP`?

What is `curatedAdipoChIP` for?

Installation

Docker image

Generating `curatedAdipoChIP`

Search starategy & data collection

Pre-processing

1. Downloading data `download_fastq`

2. Making a genome index `make_index`

3. Dowinloading annotations `get_annotation`

4. Aligning reads `align_reads`

5. Calling peaks `peak_calling`

6. Making a peakset `get_peakset`

7. Counting reads in peaks `count_features`

8. Quality assessment `fastqc`

Processing

Making a summarized experiment object `peak_object`

Exploring the `peak_counts` object

Summary of the studies in the dataset

Example of using `curatedAdipoChIP`

Motivation

Subsetting the object to samples and peaks of interest

Filtering low quality samples

Filtering peaks with low counts

Check sample reproducibility

Applying differential binding using `DESeq2`

Next!

Citing the studies in this subset of the data

Citing `curatedAdipoChIP`

Session Info

Using curatedAdipoChIP

Mahmoud Ahmed

2020-10-29

curatedAdipoChIP

Overview

Introduction

What is curatedAdipoChIP?

What is contained in curatedAdipoChIP?

What is curatedAdipoChIP for?

Installation

Docker image

Generating curatedAdipoChIP

Search starategy & data collection

Pre-processing

1. Downloading data download_fastq

2. Making a genome index make_index

3. Dowinloading annotations get_annotation

4. Aligning reads align_reads

5. Calling peaks peak_calling

6. Making a peakset get_peakset

7. Counting reads in peaks count_features

8. Quality assessment fastqc

Processing

Making a summarized experiment object peak_object

Exploring the peak_counts object

Summary of the studies in the dataset

Example of using curatedAdipoChIP

Motivation

Subsetting the object to samples and peaks of interest

Filtering low quality samples

Filtering peaks with low counts

Check sample reproducibility

Applying differential binding using DESeq2

Next!

Citing the studies in this subset of the data

Citing curatedAdipoChIP

Session Info

What is `curatedAdipoChIP`?

What is contained in `curatedAdipoChIP`?

What is `curatedAdipoChIP` for?

Generating `curatedAdipoChIP`

1. Downloading data `download_fastq`

2. Making a genome index `make_index`

3. Dowinloading annotations `get_annotation`

4. Aligning reads `align_reads`

5. Calling peaks `peak_calling`

6. Making a peakset `get_peakset`

7. Counting reads in peaks `count_features`

8. Quality assessment `fastqc`

Making a summarized experiment object `peak_object`

Exploring the `peak_counts` object

Example of using `curatedAdipoChIP`

Applying differential binding using `DESeq2`

Citing `curatedAdipoChIP`