This document gives an introduction to the R package Damsel, for use in DamID analysis; from BAM file input to gene ontology analysis.
Designed for use with DamID data, the Damsel methodology could be modified for use on any similar technology that seeks to identify enriched regions relative to a control sample.
Utilising the power of edgeR for differential analysis and goseq for gene ontology bias correction, Damsel provides a unique end to end analysis for DamID.
The DamID example data used in this vignette is available in the package and has been taken from Vissers et al., (2018), ‘The Scalloped and Nerfin-1 Transcription Factors Cooperate to Maintain Neuronal Cell Fate’. The fastq files were downloaded from SRA, aligned using Rsubread::index
and Rsubread::align
, before sorting and making bai files with Samtools.
if (!require("BiocManager", quietly = TRUE)) {
install.packages("BiocManager")
}
BiocManager::install("Damsel")
library(Damsel)
As a DamID analysis tool, Damsel requires a GATC region file for analysis. These regions serve as a guide to extract counts from the BAM files.
It can be generated with getGatcRegions()
using a BSGenome
object or a FASTA file.
It is a GRangesList
with the consecutive GATC regions across the genome - representing the region (or the length) between GATC sites, as well as the positions of the sites themselves.
If you have another species of DamID data or would prefer to make your own region file, you can use the following function, providing a BSgenome object or a FASTA file.
library(BSgenome.Dmelanogaster.UCSC.dm6)
#> Loading required package: BSgenome
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> Filter, Find, Map, Position, Reduce, anyDuplicated, aperm, append,
#> as.data.frame, basename, cbind, colnames, dirname, do.call,
#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,
#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,
#> pmin.int, rank, rbind, rownames, sapply, setdiff, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> Loading required package: stats4
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:utils':
#>
#> findMatches
#> The following objects are masked from 'package:base':
#>
#> I, expand.grid, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: GenomicRanges
#> Loading required package: Biostrings
#> Loading required package: XVector
#>
#> Attaching package: 'Biostrings'
#> The following object is masked from 'package:base':
#>
#> strsplit
#> Loading required package: BiocIO
#> Loading required package: rtracklayer
#>
#> Attaching package: 'rtracklayer'
#> The following object is masked from 'package:BiocIO':
#>
#> FileForFormat
regions_and_sites <- getGatcRegions(BSgenome.Dmelanogaster.UCSC.dm6)
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in .local(x, row.names, optional, ...): 'optional' argument was ignored
#> Warning in GenomeInfoDb::renameSeqlevels(x = df_, value = newStyle): invalid
#> seqlevels 'chrM' ignored
regions <- regions_and_sites$regions
knitr::kable(head(regions))
seqnames | start | end | width | strand | Position |
---|---|---|---|---|---|
2L | 82 | 230 | 149 | * | chr2L-82 |
2L | 231 | 371 | 141 | * | chr2L-231 |
2L | 372 | 539 | 168 | * | chr2L-372 |
2L | 540 | 688 | 149 | * | chr2L-540 |
2L | 689 | 829 | 141 | * | chr2L-689 |
2L | 830 | 997 | 168 | * | chr2L-830 |
knitr::kable(head(regions_and_sites$sites))
seqnames | start | end | width | strand | Position |
---|---|---|---|---|---|
2L | 80 | 83 | 4 | * | chr2L-80 |
2L | 229 | 232 | 4 | * | chr2L-229 |
2L | 370 | 373 | 4 | * | chr2L-370 |
2L | 538 | 541 | 4 | * | chr2L-538 |
2L | 687 | 690 | 4 | * | chr2L-687 |
2L | 828 | 831 | 4 | * | chr2L-828 |
If you already have your own GATC region file, ensure that it has the same format with 6 columns:
[plyranges::as_granges()])
Note: Damsel requires BAM files that have been mapped to the reference genome.
Provided the path to a folder of BAM files (and their .bai files) and the appropriate GATC region file, the function countBamInGATC()
will extract the counts for each region for each available BAM and add them as columns to a data frame. The columns will be named by the BAM file name - please rename them before running the function if they do not make sense.
path_to_bams <- system.file("extdata", package = "Damsel")
counts.df <- countBamInGATC(path_to_bams,
regions = regions
)
knitr::kable(head(counts.df))
X | Position | seqnames | start | end | width | strand | dam_1_SRR7948872.BAM | sd_1_SRR7948874.BAM | dam_2_SRR7948876.BAM | sd_2_SRR7948877.BAM |
---|---|---|---|---|---|---|---|---|---|---|
1 | chr2L-82 | chr2L | 82 | 230 | 149 | * | 1.0 | 0.33 | 0.0 | 0.0 |
2 | chr2L-231 | chr2L | 231 | 371 | 141 | * | 1.5 | 5.67 | 87.0 | 57.5 |
3 | chr2L-372 | chr2L | 372 | 539 | 168 | * | 2.5 | 6.17 | 88.0 | 58.5 |
4 | chr2L-540 | chr2L | 540 | 688 | 149 | * | 2.0 | 4.83 | 0.0 | 0.0 |
5 | chr2L-689 | chr2L | 689 | 829 | 141 | * | 0.0 | 0.00 | 0.5 | 0.5 |
6 | chr2L-830 | chr2L | 830 | 997 | 168 | * | 0.0 | 1.33 | 4.5 | 3.5 |
This example data is also directly available as a counts file via data
.
data("dros_counts")
Do not remove the .bam extension on the column names as this is used as a check in later functions to ensure only the BAM files are selected from the data frame.
The DamID data captures the ~75bp region extending from each GATC site, so although regions are of differing widths, there is a null to minimal length bias present on the data, and does not require length correction.
At this stage, the similarities and differences between the samples can be analysed via correlation.
plotCorrHeatmap
plots the correlation of all available BAM files Dam and Fusion, to visualise the similarity between files.
The default for all Damsel correlation analysis is the non-parametric “spearman’s” correlation.
The correlation between Dam_1 and Fusion_1 can be expected to reach ~ 0.7, whereas the correlation between Dam_1 & Dam_3 or Fusion_1 & Fusion_2 would be expected to be closer to ~0.9
plotCorrHeatmap(df = counts.df, method = "spearman")
Two specific samples can also be compared using ggscatter
which plots a scatterplot of the two samples, overlaid with the correlation results. [ggpubr::ggscatter()]
A specific region can be selected to view the counts across samples.
plotCounts(counts.df,
seqnames = "chr2L",
start_region = 1,
end_region = 40000,
layout = "spread"
)
As shown in the following plots, the default layout is "stacked"
, where the replicates are overlaid. The counts can also be displayed in a log2 ratio with log2_scale=TRUE
The goal with DamID analysis is to identify regions that are enriched in the fusion sample relative to the control. In Damsel, this step is referred to as differential methylation analysis, and makes use of [edgeR
].
For ease of use, Damsel has four main edgeR based functions which compile different steps and functions from within edgeR.
makeDGE
sets up the edgeR analysis for differential methylation testing. Taking the data frame of samples and regions as input, it conducts the following steps:
dge <- makeDGE(counts.df)
head(dge)
#> An object of class "DGEList"
#> $counts
#> dam_1_SRR7948872.BAM sd_1_SRR7948874.BAM dam_2_SRR7948876.BAM
#> chr2L-231 1.50 5.67 87.0
#> chr2L-372 2.50 6.17 88.0
#> chr2L-3578 4.00 1.50 6.0
#> chr2L-4494 1.42 7.83 8.0
#> chr2L-4952 3.58 1.83 5.5
#> chr2L-5120 4.42 3.00 6.0
#> sd_2_SRR7948877.BAM
#> chr2L-231 57.50
#> chr2L-372 58.50
#> chr2L-3578 5.00
#> chr2L-4494 6.00
#> chr2L-4952 5.33
#> chr2L-5120 5.83
#>
#> $samples
#> group lib.size norm.factors
#> dam_1_SRR7948872.BAM Dam 2238419 1.7342299
#> sd_1_SRR7948874.BAM Fusion 2027147 0.8980197
#> dam_2_SRR7948876.BAM Dam 10752265 1.2158660
#> sd_2_SRR7948877.BAM Fusion 11540604 0.5281068
#>
#> $genes
#> Position seqnames start end
#> chr2L-231 chr2L-231 chr2L 231 371
#> chr2L-372 chr2L-372 chr2L 372 539
#> chr2L-3578 chr2L-3578 chr2L 3578 3745
#> chr2L-4494 chr2L-4494 chr2L 4494 4661
#> chr2L-4952 chr2L-4952 chr2L 4952 5119
#> chr2L-5120 chr2L-5120 chr2L 5120 5153
#>
#> $design
#> (Intercept) group V2
#> 1 1 1 1
#> 2 1 2 1
#> 3 1 3 0
#> 4 1 4 0
#> attr(,"assign")
#> [1] 0 1 2
#>
#> $common.dispersion
#> [1] 0.02265938
#>
#> $trended.dispersion
#> [1] 1.471754e-02 1.502779e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#>
#> $tagwise.dispersion
#> [1] 1.550797e-02 1.560039e-02 9.765625e-05 9.765625e-05 9.765625e-05
#> [6] 9.765625e-05
#>
#> $AveLogCPM
#> [1] 2.583582549 2.620713615 0.004159108 0.388971263 -0.007506044
#> [6] 0.175425153
#>
#> $trend.method
#> [1] "locfit"
#>
#> $prior.df
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#>
#> $prior.n
#> [1] 115.1677 116.6016 122.7404 109.0151 122.7404 122.7404
#>
#> $span
#> [1] 0.2767543
The output from this step is a DGEList containing all of the information from the steps.
It’s important to visualise the differences between the samples.
You would expect the Dam samples to cluster together, and for the Fusion samples to cluster together. You would expect the majority of the variation to be within the 1st dimension (the x axis), and less variation in the 2nd dimension (y axis)
group <- dge$samples$group %>% as.character()
limma::plotMDS(dge, col = as.numeric(factor(group)))
After exploring the data visually, it’s time to identify the enriched regions. testDmRegions
compiles the edgeR functions for differential testing with one key modification - it outputs the results with the adjusted p values as well as the raw p values.
testDmRegions
conducts the following key steps:
These results are incorporated with the region data, providing a result for every region. The regions excluded from edgeR analysis are given logFC = 0, and adjust.p = 1
Setting plot=TRUE will plot an [edgeR::plotSmear()
] alongside the results
dm_results <- testDmRegions(dge, p.value = 0.05, lfc = 1, regions = regions, plot = TRUE)
#> Warning in plot.xy(xy.coords(x, y), type = type, ...): "panel.first" is not a
#> graphical parameter
dm_results %>%
dplyr::group_by(meth_status) %>%
dplyr::summarise(n = dplyr::n())
#> # A tibble: 3 × 2
#> meth_status n
#> <chr> <int>
#> 1 No_sig 30415
#> 2 Not_included 344362
#> 3 Upreg 8877
knitr::kable(head(dm_results), digits = 32)
Position | seqnames | start | end | width | strand | number | dm | logFC | PValue | adjust.p | meth_status |
---|---|---|---|---|---|---|---|---|---|---|---|
chr2L-82 | chr2L | 82 | 230 | 149 | * | 1 | NA | 0.0000000 | 1.00000000 | 1.0000000 | Not_included |
chr2L-231 | chr2L | 231 | 371 | 141 | * | 2 | 0 | 0.7452035 | 0.09146510 | 0.1415904 | No_sig |
chr2L-372 | chr2L | 372 | 539 | 168 | * | 3 | 0 | 0.7348633 | 0.07698549 | 0.1227444 | No_sig |
chr2L-540 | chr2L | 540 | 688 | 149 | * | 4 | NA | 0.0000000 | 1.00000000 | 1.0000000 | Not_included |
chr2L-689 | chr2L | 689 | 829 | 141 | * | 5 | NA | 0.0000000 | 1.00000000 | 1.0000000 | Not_included |
chr2L-830 | chr2L | 830 | 997 | 168 | * | 6 | NA | 0.0000000 | 1.00000000 | 1.0000000 | Not_included |
The edgeR results can be plotted alongside the counts as well.
plotCounts(counts.df,
seqnames = "chr2L",
start_region = 1,
end_region = 40000,
log2_scale = FALSE
) +
geom_dm(dm_results.df = dm_results)
Only regions that are fully contained within the provided boundaries will be plotted.
gatc_sites <- regions_and_sites$sites
knitr::kable(head(gatc_sites))
seqnames | start | end | width | strand | Position |
---|---|---|---|---|---|
2L | 80 | 83 | 4 | * | chr2L-80 |
2L | 229 | 232 | 4 | * | chr2L-229 |
2L | 370 | 373 | 4 | * | chr2L-370 |
2L | 538 | 541 | 4 | * | chr2L-538 |
2L | 687 | 690 | 4 | * | chr2L-687 |
2L | 828 | 831 | 4 | * | chr2L-828 |
plotCounts(counts.df,
seqnames = "chr2L",
start_region = 1,
end_region = 40000,
log2_scale = FALSE
) +
geom_dm(dm_results) +
geom_gatc(gatc_sites)
As you could see from the plot of the differential methylation results, there are 10s of 1000s of enriched regions. To reduce the scale of this data to something that can be more biologically meaningul, enriched regions can be compiled into peaks.
Damsel identifies peaks by aggregating regions of enrichment. As DamID sequencing generally sequences the 75 bp from the GATC site, regions smaller than 150 bp are mostly non-significant in statistical testing. Because of this, gaps between peaks of less than or equal to 150 bp are combined into one longer peak.
The FDR and logFC for each peak is calculated via the theory of [csaw::getBestTest()] where the ‘best’ (smallest) p-value in the regions that make up the peak is selected as representative of the peak. The logFC is therefore the corresponding logFC from the selected region.
peaks <- identifyPeaks(dm_results)
nrow(peaks)
#> [1] 697
knitr::kable(head(peaks), digits = 32)
peak_id | seqnames | start | end | width | strand | rank_p | logFC_match | FDR | multiple_peaks | region_pos | n_regions_dm | n_regions_not_dm |
---|---|---|---|---|---|---|---|---|---|---|---|---|
PS_1159 | chr2L | 10488545 | 10489575 | 1031 | * | 1 | 7.408994 | 0.0001833906 | 1 | chr2L-10488545 | 3 | 0 |
PM_749 | chr2L | 6963496 | 6964804 | 1309 | * | 2 | 6.345275 | 0.0001833906 | 2 | chr2L-6963496 | 5 | 1 |
PS_1411 | chr2L | 13204670 | 13207893 | 3224 | * | 3 | 5.084738 | 0.0001833906 | 1 | chr2L-13206327 | 8 | 0 |
PS_1233 | chr2L | 11304923 | 11306613 | 1691 | * | 4 | 5.067651 | 0.0001833906 | 1 | chr2L-11305689 | 8 | 0 |
PM_923 | chr2L | 8402774 | 8404891 | 2118 | * | 5 | 4.995820 | 0.0001833906 | 2 | chr2L-8403504 | 11 | 1 |
PS_911 | chr2L | 8301895 | 8305706 | 3812 | * | 6 | 4.822946 | 0.0001833906 | 1 | chr2L-8303352 | 13 | 0 |
A peak plot layer can be added to our graph
plotCounts(counts.df,
seqnames = "chr2L",
start_region = 1,
end_region = 40000
) +
geom_dm(dm_results) +
geom_peak(peaks, peak.label = TRUE) +
geom_gatc(gatc_sites)
The default version will not plot the peak.label
The peak information itself - while interesting, has no biological meaning. As the peaks represent a region that the Fusion protein interacted with on the DNA, likely as a transcription factor, we wish to identify the gene that is being affected. To do so, we need to associate the peaks with a potential “target” gene.
Note: any gene identified here is only a potential target that must be validated in laboratory procedures. There is no method available that is able to accurately predict the location and target genes of enhancers, so a key and potentially incorrect assumption in this part of the analysis is that all peaks represent binding to a local enhancer or promoter - that it is close or overlapping to the target gene.
It must also be noted that the Drosophila melanogaster genome and transcription factor interactions are different to that of mammals and using the same assumptions means results must be taken cautiously. While mammalian genes are generally spread out with little overlap, there is a large amount of overlap between Drosophila genes, requiring some intuitive interpretation of which gene the peak is potentially targeting.
In the Damsel methodology, peaks are considered to associate with genes if they are within 5kb upstream or downstream. If multiple genes are within these criteria, they are all listed, with the closest gene given the primary position.
The function collateGenes()
uses two different mechanisms to create a list of genes. It allows for the use of a TxDb object/annotation package, or can access biomaRt.
The simplest approach is to use a TxDb annotation package. A TxDb package is available for most species and has information about the genes, exons, cds, promoters, etc - which can all be accessed using the GenomicFeatures package. This presentation has a lot of slides on how to use TxDb objects to access different data types: https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#10
However, TxDb libraries contain only the Ensembl gene Ids and not the gene symbol or name. Instead we need to access an org.Db package to transfer them over. org.Db packages contain information about model organisms genome annotation, and can be used to extract various information about the gene name etc. More information can be found here https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicFeatures_In_Bioconductor.html#46
BiocManager::install("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
BiocManager::install("org.Dm.eg.db")
library("TxDb.Dmelanogaster.UCSC.dm6.ensGene")
#> Loading required package: GenomicFeatures
#> Loading required package: AnnotationDbi
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
txdb <- TxDb.Dmelanogaster.UCSC.dm6.ensGene::TxDb.Dmelanogaster.UCSC.dm6.ensGene
library("org.Dm.eg.db")
#>
genes <- collateGenes(genes = txdb, regions = regions, org.Db = org.Dm.eg.db)
#> 1 gene was dropped because it has exons located on both strands of the
#> same reference sequence or on more than one reference sequence, so
#> cannot be represented by a single genomic range.
#> Use 'single.strand.genes.only=FALSE' to get all the genes in a
#> GRangesList object, or use suppressMessages() to suppress this message.
#> 'select()' returned 1:many mapping between keys and columns
#> TSS taken as start of gene, taking strand into account
knitr::kable(head(genes))
seqnames | start | end | width | strand | ensembl_gene_id | gene_name | TSS | n_regions |
---|---|---|---|---|---|---|---|---|
chr2L | 7529 | 9484 | 1956 | + | FBgn0031208 | CR11023 | 7529 | 3 |
chr2L | 9839 | 21376 | 11538 | - | FBgn0002121 | l(2)gl | 21376 | 33 |
chr2L | 21823 | 25155 | 3333 | - | FBgn0031209 | Ir21a | 25155 | 8 |
chr2L | 21952 | 24237 | 2286 | + | FBgn0263584 | asRNA:CR43609 | 21952 | 6 |
chr2L | 25402 | 65404 | 40003 | - | FBgn0051973 | Cda5 | 65404 | 93 |
chr2L | 54817 | 55767 | 951 | + | FBgn0267987 | lncRNA:CR46254 | 54817 | 1 |
Using the TxDb package, Damsel assumes that the TSS (transcription start site) is the same as the start site of the gene, taking the strand into account.
Alternatively, the name of a species listed in biomaRt can be provided, and the version of the genome. The advantage of biomaRt is that a greater amount of information is able to be uncovered, including the canonical transcript. A guide to understanding more about how biomaRt functions is here: https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html
BiocManager::install("biomaRt")
library(biomaRt)
collateGenes(genes = "dmelanogaster_gene_ensembl", regions = regions, version = 109)
#> GRanges object with 23844 ranges and 5 metadata columns:
#> seqnames ranges strand | ensembl_gene_id gene_name
#> <Rle> <IRanges> <Rle> | <character> <character>
#> [1] 2L 7529-9484 + | FBgn0031208 CR11023
#> [2] 2L 9726-9859 + | FBti0060580
#> [3] 2L 9839-21376 - | FBgn0002121 l(2)gl
#> [4] 2L 9888-9949 - | FBti0059810
#> [5] 2L 21823-25155 - | FBgn0031209 Ir21a
#> ... ... ... ... . ... ...
#> [23840] Y 3370887-3371602 + | FBgn0267426 CR45779
#> [23841] Y 3376224-3376984 + | FBgn0267427 CR45780
#> [23842] Y 3539437-3540722 - | FBgn0046698 Pp1-Y2
#> [23843] Y 3561745-3618339 + | FBgn0046323 ORY
#> [23844] Y 3649996-3666928 + | FBgn0267592 CCY
#> ensembl_transcript_id TSS n_regions
#> <character> <integer> <numeric>
#> [1] FBtr0475186 7529 3
#> [2] FBti0060580-RA 9726 1
#> [3] FBtr0306592 21376 33
#> [4] FBti0059810-RA 9949 0
#> [5] FBtr0113008 25155 8
#> ... ... ... ...
#> [23840] FBtr0346754 3370887 0
#> [23841] FBtr0346755 3376224 0
#> [23842] FBtr0346673 3540722 6
#> [23843] FBtr0346720 3561745 114
#> [23844] FBtr0347413 3649996 28
#> -------
#> seqinfo: 7 sequences from an unspecified genome; no seqlengths
As stated above, Damsel associates genes with peaks if they are within 5 kb upstream or downstream. This maximum distance is an adjustable parameter within the annotatePeaksGenes()
function. If set to NULL
it will output all possible combinations as defined by plyranges::pair_nearest
. The nature of this function means that the closest gene will be found for every peak, even if that distance is in the millions. If the user sets max_distance=NULL
, we recommend undergoing some filtering to remove those associations.
To respect that some species have genes with more overlap than others, annotatePeaksGenes
outputs a list of data frames. The first, closest, outputs information for every peak and it’s closest gene. The second data frame, top_five, outputs a string of the top five genes (if available) and their distances from each peak. The final data frame, all, provides the raw results and all possible gene and peak associations, as well as all available statistical results.
annotation <- annotatePeaksGenes(peaks, genes, regions = regions, max_distance = 5000)
knitr::kable(head(annotation$closest), digits = 32)
seqnames | start | end | width | total_regions | n_regions_dm | peak_id | rank_p | gene_position | ensembl_gene_id | gene_name | midpoint_is | position |
---|---|---|---|---|---|---|---|---|---|---|---|---|
chr2L | 5154 | 9484 | 4331 | 8 | 3 | PS_2 | 586 | chr2L:+:7529-9484 | FBgn0031208 | CR11023 | Upstream | Peak_upstream |
chr2L | 9839 | 21376 | 11538 | 33 | 3 | PM_4 | 615 | chr2L:-:9839-21376 | FBgn0002121 | l(2)gl | Upstream | Peak_within_gene |
chr2L | 9839 | 21376 | 11538 | 33 | 5 | PS_6 | 46 | chr2L:-:9839-21376 | FBgn0002121 | l(2)gl | Upstream | Peak_within_gene |
chr2L | 9839 | 21643 | 11805 | 34 | 5 | PS_7 | 14 | chr2L:-:9839-21376 | FBgn0002121 | l(2)gl | Upstream | Peak_overlap_downstream |
chr2L | 21823 | 25155 | 3333 | 8 | 3 | PS_8 | 546 | chr2L:-:21823-25155 | FBgn0031209 | Ir21a | Upstream | Peak_within_gene |
chr2L | 25402 | 65404 | 40003 | 93 | 3 | PM_9 | 544 | chr2L:-:25402-65404 | FBgn0051973 | Cda5 | Upstream | Peak_within_gene |
knitr::kable(head(annotation$top_five), digits = 32)
seqnames | start | end | peak_id | rank_p | n_genes | list_ensembl | list_gene | position | distance_TSS | min_distance |
---|---|---|---|---|---|---|---|---|---|---|
chr2L | 5154 | 9484 | PS_2 | 586 | 1 | FBgn0031208 | CR11023 | Peak_upstream | 1512 | 649 |
chr2L | 9839 | 21376 | PM_4 | 615 | 1 | FBgn0002121 | l(2)gl | Peak_within_gene | 8236 | 0 |
chr2L | 9839 | 21376 | PS_6 | 46 | 1 | FBgn0002121 | l(2)gl | Peak_within_gene | 3489 | 0 |
chr2L | 9839 | 21643 | PS_7 | 14 | 1 | FBgn0002121 | l(2)gl | Peak_overlap_downstream | 660 | 267 |
chr2L | 21823 | 25155 | PS_8 | 546 | 2 | FBgn0031209, FBgn0263584 | Ir21a, asRNA:CR43609 | Peak_within_gene, Peak_within_gene | 2771, -432 | 0, 0 |
chr2L | 25402 | 65404 | PM_9 | 544 | 1 | FBgn0051973 | Cda5 | Peak_within_gene | 30177 | 0 |
knitr::kable(str(annotation$all), digits = 32)
#> tibble [1,535 × 33] (S3: tbl_df/tbl/data.frame)
#> $ seqnames : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ start : int [1:1535] 5154 9839 9839 9839 21823 21952 25402 65609 65609 65609 ...
#> $ end : int [1:1535] 9484 21376 21376 21643 25155 24237 65404 68330 68330 71390 ...
#> $ width : num [1:1535] 4331 11538 11538 11805 3333 ...
#> $ gene_seqnames : Factor w/ 1870 levels "chr2L","chr2R",..: 1 1 1 1 1 1 1 1 1 1 ...
#> $ gene_start : int [1:1535] 7529 9839 9839 9839 21823 21952 25402 65999 66318 66482 ...
#> $ gene_end : int [1:1535] 9484 21376 21376 21376 25155 24237 65404 66242 66524 71390 ...
#> $ gene_width : int [1:1535] 1956 11538 11538 11538 3333 2286 40003 244 207 4909 ...
#> $ gene_strand : Factor w/ 3 levels "+","-","*": 1 2 2 2 2 1 2 1 1 1 ...
#> $ peak_seqnames : Factor w/ 2 levels "chr2L","chr2R": 1 1 1 1 1 1 1 1 1 1 ...
#> $ peak_start : int [1:1535] 5154 12715 16957 19789 22097 22097 33758 65609 65609 65609 ...
#> $ peak_end : int [1:1535] 6880 13565 18817 21643 22671 22671 36696 68330 68330 68330 ...
#> $ peak_width : int [1:1535] 1727 851 1861 1855 575 575 2939 2722 2722 2722 ...
#> $ peak_strand : Factor w/ 3 levels "+","-","*": 3 3 3 3 3 3 3 3 3 3 ...
#> $ ensembl_gene_id : chr [1:1535] "FBgn0031208" "FBgn0002121" "FBgn0002121" "FBgn0002121" ...
#> $ gene_name : chr [1:1535] "CR11023" "l(2)gl" "l(2)gl" "l(2)gl" ...
#> $ TSS : int [1:1535] 7529 21376 21376 21376 25155 21952 65404 65999 66318 66482 ...
#> $ n_regions : num [1:1535] 3 33 33 33 8 6 93 0 0 15 ...
#> $ peak_id : chr [1:1535] "PS_2" "PM_4" "PS_6" "PS_7" ...
#> $ rank_p : int [1:1535] 586 615 46 14 546 546 544 159 159 159 ...
#> $ logFC_match : num [1:1535] 1.39 1.66 3.83 4.47 2.5 ...
#> $ FDR : num [1:1535] 0.001747 0.002356 0.000183 0.000183 0.001344 ...
#> $ multiple_peaks : num [1:1535] 1 2 1 1 1 1 2 1 1 1 ...
#> $ region_pos : chr [1:1535] "chr2L-5303" "chr2L-12715" "chr2L-17450" "chr2L-20391" ...
#> $ n_regions_dm : num [1:1535] 3 3 5 5 3 3 3 7 7 7 ...
#> $ n_regions_not_dm: num [1:1535] 0 1 0 0 0 0 1 0 0 0 ...
#> $ peak_midpoint : num [1:1535] 6017 13140 17887 20716 22384 ...
#> $ distance_TSS : num [1:1535] 1512 8236 3489 660 2771 ...
#> $ midpoint_is : chr [1:1535] "Upstream" "Upstream" "Upstream" "Upstream" ...
#> $ n_genes : int [1:1535] 1 1 1 1 2 2 1 4 4 4 ...
#> $ position : chr [1:1535] "Peak_upstream" "Peak_within_gene" "Peak_within_gene" "Peak_overlap_downstream" ...
#> $ min_distance : num [1:1535] 649 0 0 267 0 0 0 0 0 873 ...
#> $ total_regions : int [1:1535] 8 33 33 34 8 6 93 7 7 18 ...
Now that we have the genes from collateGenes()
, this can be added as a layer to the previous plots. This plot requires the gene positions as a guide for a Txdb
or EnsDb
object, building off the autoplot generic built by ggbio
.
plotCounts(counts.df,
seqnames = "chr2L",
start_region = 1,
end_region = 40000
) +
geom_dm(dm_results) +
geom_peak(peaks) +
geom_gatc(gatc_sites) +
geom_genes_tx(genes, txdb)
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Registered S3 method overwritten by 'GGally':
#> method from
#> +.gg ggplot2
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
plot.height
argument.plotWrap(
id = peaks[1, ]$peak_id,
counts.df = counts.df,
dm_results.df = dm_results,
peaks.df = peaks,
gatc_sites.df = gatc_sites,
genes.df = genes, txdb = txdb
)
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> [[1]]
plotWrap(
id = genes[1, ]$ensembl_gene_id,
counts.df = counts.df,
dm_results.df = dm_results,
peaks.df = peaks,
gatc_sites.df = gatc_sites,
genes.df = genes, txdb = txdb
)
#> No data available for this region
#> If gene is disproportional to the plot, use gene_limits = c(y1,y2). If gene is too large, recommend setting to c(0,2) and adjusting the plot.height accordingly.
#> Parsing transcripts...
#> Parsing exons...
#> Parsing cds...
#> Parsing utrs...
#> ------exons...
#> ------cdss...
#> ------introns...
#> ------utr...
#> aggregating...
#> Done
#> Constructing graphics...
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
#> [[1]]
One of the last steps in a typical DamID analysis is gene ontology analysis. However, a key mistake made in this analysis using any common data type - including RNA-seq, is the lack of bias correction. We correct for this by utilising the [goseq] package. Without bias correction, the ontology results would be biased towards longer genes - the longer the gene, the more likely it would be to have a peak associated with it, and therefore be called as significant. We can see this in the plot below.
testGeneOntology
identifies the over-represented GO terms from the peak data, correcting for the number of GATC regions matching to each gene.
3 outputs Plot of goodness of fit of model signif_results: data.frame of significant GO category results, ordered by p-value. prob_weights: data.frame of probability weights for each gene
ontology <- testGeneOntology(annotation$all, genes, regions = regions, extend_by = 2000)
#> Bias will be n_regions that are contained within the gene length
#>
#> Warning in pcls(G): initial point very close to some inequality constraints
#> Fetching GO annotations...
#> For 4456 genes, we could not find any categories. These genes will be excluded.
#> To force their use, please run with use_genes_without_cat=TRUE (see documentation).
#> This was the default behavior for version 1.15.1 and earlier.
#> Calculating the p-values...
#> 'select()' returned 1:1 mapping between keys and columns
The goodness of fit plot above shows us that there is a length based bias to the data. The x axis shows the number of GATC regions in each gene. The y axis shows the proportion of genes that have that amount of GATC regions and have been identified as significant. And it shows that as the number of GATC regions in the gene increase, as does the proportion of genes that are significant.
knitr::kable(head(ontology$signif_results), digits = 32)
category | over_represented_pvalue | under_represented_pvalue | numDEInCat | numInCat | term | ontology | FDR |
---|---|---|---|---|---|---|---|
GO:0043231 | 3.644279e-10 | 1.0000000 | 362 | 5090 | intracellular membrane-bounded organelle | CC | 4.335599e-06 |
GO:0043227 | 7.916815e-10 | 1.0000000 | 372 | 5282 | membrane-bounded organelle | CC | 4.709317e-06 |
GO:0043229 | 1.218388e-08 | 1.0000000 | 398 | 5867 | intracellular organelle | CC | 3.773409e-05 |
GO:0043226 | 1.268692e-08 | 1.0000000 | 405 | 5987 | organelle | CC | 3.773409e-05 |
GO:0005622 | 3.412559e-08 | 1.0000000 | 472 | 7277 | intracellular anatomical structure | CC | 8.119842e-05 |
GO:0048608 | 5.732192e-07 | 0.9999999 | 20 | 81 | reproductive structure development | BP | 9.742270e-04 |
knitr::kable(head(ontology$prob_weights), digits = 32)
DEgenes | bias.data | pwf | |
---|---|---|---|
FBgn0031208 | 1 | 9 | 0.03967155 |
FBgn0002121 | 1 | 46 | 0.06596038 |
FBgn0031209 | 1 | 15 | 0.04437219 |
FBgn0263584 | 1 | 13 | 0.04281526 |
FBgn0051973 | 1 | 100 | 0.08398254 |
FBgn0267987 | 0 | 10 | 0.04046071 |
As expected, significant gene ontology terms surround developmental processes, which is expected as the fusion gene in the example data (Scalloped) is well known to be involved in development.
plotGeneOntology
can be used to plot the top 10 results.
plotGeneOntology(ontology$signif_results)
As shown above, the plot has the category on the y-axis, the FDR on the x-axis, the size of the dot being the number of genes in the GO category, and the colour of the dot being the ontology (Biological Process, Cellular Component, and Molecular Function).
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: America/New_York
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] biomaRt_2.60.1
#> [2] org.Dm.eg.db_3.19.1
#> [3] TxDb.Dmelanogaster.UCSC.dm6.ensGene_3.12.0
#> [4] GenomicFeatures_1.56.0
#> [5] AnnotationDbi_1.66.0
#> [6] Biobase_2.64.0
#> [7] BSgenome.Dmelanogaster.UCSC.dm6_1.4.1
#> [8] BSgenome_1.72.0
#> [9] rtracklayer_1.64.0
#> [10] BiocIO_1.14.0
#> [11] Biostrings_2.72.1
#> [12] XVector_0.44.0
#> [13] GenomicRanges_1.56.1
#> [14] GenomeInfoDb_1.40.1
#> [15] IRanges_2.38.1
#> [16] S4Vectors_0.42.1
#> [17] BiocGenerics_0.50.0
#> [18] Damsel_1.0.2
#> [19] BiocStyle_2.32.1
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 rstudioapi_0.16.0
#> [3] jsonlite_1.8.8 magrittr_2.0.3
#> [5] magick_2.8.4 farver_2.1.2
#> [7] rmarkdown_2.28 zlibbioc_1.50.0
#> [9] vctrs_0.6.5 memoise_2.0.1
#> [11] Rsamtools_2.20.0 RCurl_1.98-1.16
#> [13] base64enc_0.1-3 tinytex_0.52
#> [15] htmltools_0.5.8.1 S4Arrays_1.4.1
#> [17] progress_1.2.3 curl_5.2.2
#> [19] SparseArray_1.4.8 Formula_1.2-5
#> [21] sass_0.4.9 bslib_0.8.0
#> [23] htmlwidgets_1.6.4 plyr_1.8.9
#> [25] httr2_1.0.3 cachem_1.1.0
#> [27] GenomicAlignments_1.40.0 lifecycle_1.0.4
#> [29] pkgconfig_2.0.3 Matrix_1.7-0
#> [31] R6_2.5.1 fastmap_1.2.0
#> [33] GenomeInfoDbData_1.2.12 MatrixGenerics_1.16.0
#> [35] digest_0.6.37 colorspace_2.1-1
#> [37] GGally_2.2.1 OrganismDbi_1.46.0
#> [39] patchwork_1.2.0 goseq_1.56.0
#> [41] Hmisc_5.1-3 RSQLite_2.3.7
#> [43] filelock_1.0.3 labeling_0.4.3
#> [45] fansi_1.0.6 mgcv_1.9-1
#> [47] httr_1.4.7 abind_1.4-5
#> [49] compiler_4.4.1 bit64_4.0.5
#> [51] withr_3.0.1 backports_1.5.0
#> [53] htmlTable_2.4.3 BiocParallel_1.38.0
#> [55] DBI_1.2.3 ggstats_0.6.0
#> [57] BiasedUrn_2.0.12 highr_0.11
#> [59] geneLenDataBase_1.40.1 rappdirs_0.3.3
#> [61] DelayedArray_0.30.1 rjson_0.2.22
#> [63] tools_4.4.1 foreign_0.8-87
#> [65] nnet_7.3-19 glue_1.7.0
#> [67] restfulr_0.0.15 nlme_3.1-166
#> [69] grid_4.4.1 checkmate_2.3.2
#> [71] cluster_2.1.6 reshape2_1.4.4
#> [73] generics_0.1.3 gtable_0.3.5
#> [75] ensembldb_2.28.1 tidyr_1.3.1
#> [77] data.table_1.16.0 hms_1.1.3
#> [79] xml2_1.3.6 utf8_1.2.4
#> [81] pillar_1.9.0 stringr_1.5.1
#> [83] limma_3.60.4 splines_4.4.1
#> [85] dplyr_1.1.4 BiocFileCache_2.12.0
#> [87] lattice_0.22-6 bit_4.0.5
#> [89] biovizBase_1.52.0 RBGL_1.80.0
#> [91] tidyselect_1.2.1 GO.db_3.19.1
#> [93] locfit_1.5-9.10 knitr_1.48
#> [95] gridExtra_2.3 bookdown_0.40
#> [97] ProtGenerics_1.36.0 ggbio_1.52.0
#> [99] edgeR_4.2.1 SummarizedExperiment_1.34.0
#> [101] xfun_0.47 statmod_1.5.0
#> [103] matrixStats_1.3.0 stringi_1.8.4
#> [105] UCSC.utils_1.0.0 lazyeval_0.2.2
#> [107] yaml_2.3.10 evaluate_0.24.0
#> [109] codetools_0.2-20 tibble_3.2.1
#> [111] graph_1.82.0 BiocManager_1.30.25
#> [113] cli_3.6.3 rpart_4.1.23
#> [115] munsell_0.5.1 jquerylib_0.1.4
#> [117] dichromat_2.0-0.1 Rcpp_1.0.13
#> [119] dbplyr_2.5.0 png_0.1-8
#> [121] XML_3.99-0.17 parallel_4.4.1
#> [123] ggplot2_3.5.1 blob_1.2.4
#> [125] prettyunits_1.2.0 AnnotationFilter_1.28.0
#> [127] plyranges_1.24.0 bitops_1.0-8
#> [129] txdbmaker_1.0.1 VariantAnnotation_1.50.0
#> [131] scales_1.3.0 purrr_1.0.2
#> [133] crayon_1.5.3 rlang_1.1.4
#> [135] KEGGREST_1.44.1