1 Introduction

This package provides basic functions for analyzing next-generation sequencing data of cell-free DNA (cfDNA). The package focuses on extracting the length of cfDNA sample and visualization of genome-wide enrichment of short-fragmented cfDNA. The aberration of fragmentation profile, denoted modified ctDNA estimation score (CES), allows quantification of circulating tumor DNA (ctDNA). We recommend using this package to analysis shallow whole-genome sequencing data (~0.3X or more). This package was complemented by Bioconductor packages e.g. QDNAseq, Rsamtools and GenomicRanges which could further expand the functionality of this package in the future.

1.1 Installation

1.1.1 Install via the Bioconductor repository

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")

BiocManager::install("cfdnakit")

1.1.2 Install the latest version via github

To install this package is via this github repository, please follow the instruction below.

Install prerequisites packages

if(! "devtools" %in% rownames(installed.packages()))
    install.packages("devtools")
if(! "BiocManager" %in% rownames(installed.packages()))
    install.packages("BiocManager")

Install cfdnakit package

library(devtools)  ### use devtools
install_github("Pitithat-pu/cfdnakit") ### install cfDNAKit 

The installation should work fine without non-zero exit status. Try load cfdnakit package into current R session

library(cfdnakit) ### Load cfdnakit package

2 Prepare input BAM

A coordination-sorted BAM file of cfDNA from any liquid biopsy source (e.g. blood-plasma, CSF, urine) should be applicable. You should examine if the sequencing coverage reach coverage threshold before the analysis. Please make sure that the sequencing reads are mapped onto the following version of reference genome. cfdnakit supports the human reference genome GRCh37(hg19) and GRCh38(hg38). Preliminary fragment-length analysis can be performed using mouse cfDNA when mapping onto the mouse reference genome GRCm38(mm10).

3 Read the BAM file with read_bamfile

Let’s read sequence alignment file (.bam) using function read_bamfile. A BAM index file (.bai) is necessary for efficiently reading the file. If it doesn’t exist in the same directory, this function will automatically create one. This function will split sequencing reads into equal-size non-overlapping windows. Possible size of bin are 100, 500, and 1000 KB. A path to the input file is given to the function read_bamfile. A SampleBam object will be created as the result.

For demonstration, we read an example sequence file ex.plasma.bam.

library(cfdnakit)
sample_bamfile <- system.file("extdata",
                             "ex.plasma.bam",
                package = "cfdnakit")
plasma_SampleBam <- read_bamfile(sample_bamfile,
                                apply_blacklist = FALSE)
#> The BAM index file (.bai) is missing. Creating an index file
#> Bam index file created.
#> Reading bamfile

By default, running read_bamfile will split reads into 1000 KB non-overlapping bins based-on the human reference genome GRCh37.

Reading the file should take a while depending on the size of your BAM. We recommend to save the result as RDS file using saveRDS function.

### Optional
saveRDS(plasma_SampleBam, file = "patientcfDNA_SampleBam.RDS")

4 Analyse the Fragment Length Distribution

Fragment-length distribution of the sample can be visualized with function plot_fragment_dist. In the top-right legend, the modal length (bp) is shown in the parenthesis behind each sample name. The x-axis is the length of cfDNA; y-axis is the proportion of cfDNA fragment having a specific length.

plot_fragment_dist(list("Plasma.Sample"=plasma_SampleBam))

This document will demonstate by using SampleBam object (example_patientcfDNA_SampleBam.RDS). Now we load this file into R session.

example_RDS <- "example_patientcfDNA_SampleBam.RDS"
example_RDS_file <-
    system.file("extdata",example_RDS,
                package = "cfdnakit")
sample_bins <- readRDS(example_RDS_file)

In general, plasma cfDNA should show non-random fragmentation pattern. The modal length of cfDNA is 167 bp and 10-bp periodic peaks. However, tumor-derived cfDNA are observed to be shorter (<150 bp) than cfDNA of non-tumor origin. Here, we compare the fragment length distribution of patient’s cfDNA with healthy individual. We derived a healthy cfDNA from Snyder et al. (2016) and create a RData file “BH01_chr15.RDS”. This file can be loaded in R environment with readRDS function.

control_rds<-"BH01_CHR15.SampleBam.rds"
control_RDS_file <-
    system.file("extdata",control_rds,
                package = "cfdnakit")
control_bins <-
    readRDS(control_RDS_file)

To provide visual comparison of cell-free DNA fragmentation, cfdnakit provide a function that allows plotting multiple distribution plots. We create a list of SampleBam files and plot their distribution together with function plot_fragment_dist. Each element in the list must be given a distinct sample name (e.g. Healthy.cfDNA).

comparing_list <- list("Healthy.cfDNA"=control_bins,
                      "Patient.1"=sample_bins)
plot_fragment_dist(comparing_list)

Optional Parameters

maximum_length = Maximum length of cfDNA (Default 550)

minimum_length = Minimum length of cfDNA (Default 20)

5 Quantification of Short Fragmented CfDNA

We can extract genome-wide fragment-length profile. First we define the short fragment as fragments having size (by default) between 100 - 150 base and 151 - 250 as the long fragment. Providing a SampleBam object to the function get_fragment_profile will return a SampleFragment object.

sample_profile <- 
  get_fragment_profile(sample_bins,
                       sample_id = "Patient.1")

This SampleFragment contains a dataframe named sample_profile. This table contains information about the BAM file and the ratio number of Short/Long fragments. The table below describes important variables.

sample_profile$sample_profile
#>   Sample.ID Total.Fragments Read.Pairs.in.range Read.Pairs.in.range..corrected.
#> 1 Patient.1           68094               65421                        62091.41
#>   N.Short N.Long Mode Median   Mean   Mad S.L.Ratio S.L.Ratio_corrected
#> 1   23503  41918  167    159 163.23 20.76      0.56                0.56
#>   Bin.Size.KB.
#> 1         1000
Variable Description
Total.Fragments Number of DNA fragments (not the number of reads)
Read.Pairs.in.range Number of fragments within the defined range (short and long)
Mode Fragment length of the majority (bp)
Median Median Fragment length (bp)
Mean Average Fragment length (bp)
Mad Fragment length median absolute deviation (MAD)
N.Short Number of short fragments (n)
N.Long Number of long fragments (n)
S.L.Ratio N.Short/N.Long; Ratio of short fragment over long fragment
S.L.Ratio_corrected Ratio of short fragment over long fragment after GC-bias correction
Bin.Size.KB. Size of genomic bin of the analysis (KB)

6 Plot Genome-wide Short-fragmented Ratio

We can plot genome-wide short-fragment ratio with the function plot_sl_ratio. Given short-fragment profile, short-fragment ratio per bin infer contribution of circulating tumor DNA (ctDNA) into cfDNA. The enrichment of short-fragment cfDNA in a large genomic region could infer the copy-number aberration status. The higher short-fragment ratio indicate amplification event whereas deletion would have relatively lower short-fragment ratio.

## For this demenstration, we load a real patient-derived cfDNA profile.
patient.SampleFragment.file <-
    system.file("extdata",
                "example_patientcfDNA_SampleFragment.RDS",
                package = "cfdnakit")
patient.SampleFragment <- readRDS(patient.SampleFragment.file)
plot_sl_ratio(patient.SampleFragment)

The range of S.L.Ratio is broad. It could be as high as 2 or more in sample with high contribution of ctDNA. Adjusting the plot is possible with parameter ylim or with other ggplot functions.

7 Save SampleFragment as RDS file

Up to this step, we rather save the SampleFragment object as RDS for later use and for creation of a Panel-of-Normal.

destination_dir <- "~/cfdnakit.result"
saveRDS(sample_profile,
        file = paste0(destination_dir,
        "/plasma.SampleFragment.RDS"))

8 Create Panel of Normal (PoN)

8.1 Creating list of PoN files

Making a Panel-of-Normal is necessary for downstream analysis as we want to compare fragment profile between a cfDNA from patient with pooled of healthy individuals. First, we create a text file where each line is a full path to rds file of the SampleFragment object.

Example content of PoN list file (let’s name it Pon-list.txt)

### Example content of Pon-list.txt
Path.to/SampleFragment_healthyplasma-01.rds
Path.to/SampleFragment_healthyplasma-02.rds
Path.to/SampleFragment_healthyplasma-03.rds
Path.to/SampleFragment_healthyplasma-04.rds
Path.to/SampleFragment_healthyplasma-05.rds

8.2 Creating a PoN dataset

create_PoN function will read through all fragment profile files and create a set of PoN profile. We save this PoN profile and load into analysis later.

PoN.profiles <- create_PoN("Path.to/Pon-list.txt")

saveRDS(PoN.profiles, "PoN.rds")

9 Inferring CNV from short fragment cfDNA

9.1 Normalizing Short-fragmented Ratio

The contribution of short-fragmented cfDNA into genomic regions could infer copy-number aberration harbored in the tumor cells. For this demonstration, we load ready-to-use PoN profile (ex.PoN.rds).

PoN_rdsfile <- system.file("extdata",
                          "ex.PoN.rds",
                          package = "cfdnakit")

PoN.profiles <- readRDS(PoN_rdsfile)

cfdnakit transforms S.L.Ratio of each genomic windows with the distribution of samples in PoN by function get_zscore_profile.

sample_zscore <- 
  get_zscore_profile(patient.SampleFragment,
                     PoN.profiles)

9.2 Circular Binary Segmentation (CBS)

cfdnakit implement the circular binary segmentation algorithm. Function segmentByCBS of package PSCBS were used through function segmentByPSCB. We can visualize both transformed S.L.Ratio and the segmentation by function plot_transformed_sl.

sample_zscore_segment <- segmentByPSCB(sample_zscore)

plot_transformed_sl(sample_zscore,sample_zscore_segment)

Optional parameters ylim = Y-axis of plot (Default c(-20,20))

9.3 Estimating tumor fraction (TF) and CNV calling

Base on hypothesis that short-fragmented cfDNA originate from tumor cells. Enrichment of short-fragmented cfDNA in a large genomic segments correlates with the absolute copy in the tumor origin. cfdnakit heuristically search for fittest solution per absolute ploidy of 2, 3 and 4. The distance of each solution is calculated similarly to ACE (Poell JB et.al 2019) method.

Function call_cnv returns 3 solutions of the CNV calling and TF estimation. The analysis perform only on autosomal chromosomes. We can plot the distance matrix of all solution where solutions with * are optimal solutions (minimum distance) per absolute genome ploidy.

sample_cnv <- call_cnv(sample_zscore_segment,sample_zscore)

plot_distance_matrix(sample_cnv)

Function call_cnv returns 3 solutions of the CNV calling and TF estimation. The analysis perform only on autosomal chromosomes. We can plot the distance matrix of all solution where solutions with * are optimal solutions (minimum distance) per absolute genome ploidy.

The available solution can be obtained with function get_solution_table. The table shows the best solution (minimum distance) per absolute genome ploidy and rank them by the distance.

solution_table <- get_solution_table(sample_cnv)

solution_table
#>     TF rounded_ploidy ploidy    distance rank
#> 1 0.21              2   2.05 0.006484562    1
#> 2 0.25              3   3.00 0.007383756    2

9.4 Plot optimal CNV profile

We can plot copy-number result by function plot_cnv_solution. By default, this function will produce the plot of the best solution (rank 1) and can be changed by specifying solution number to parameter selected_solution.

plot_cnv_solution(sample_cnv,selected_solution = 1)

10 Copy-number Abnormality Score

As the result of copy-number solution fitting, the tumor fraction (tf ) indicates the estimated quantity of ctDNA from the amplitude of signals. cfdnakit also implements the ctDNA estimation score (CES) score to quantify the tumor-derived cell-free DNA from the segmentation and the SLRatio.

calculate_CES_score(sample_zscore_segment)
#> [1] 122.457

This score is robust to coverage bias and noisy fragmented signals. Briefly, the Gaussian noise does not affect the score because the z-scores of segments, instead of the z-score of bins, are considered. Second, the average segment length is used as a penalty for sample quality. The signal of a bad quality sample does not strongly affect the score whereas a true highly unstable genome would overcome this penalty.

11 Session info

Output of sessionInfo on the system on which this document was compiled:

sessionInfo()
#> R version 4.4.0 beta (2024-04-15 r86425)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 22.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: America/New_York
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] DNAcopy_1.78.0   cfdnakit_1.2.0   BiocStyle_2.32.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] gtable_0.3.5            xfun_0.43               bslib_0.7.0            
#>  [4] ggplot2_3.5.1           Biobase_2.64.0          vctrs_0.6.5            
#>  [7] tools_4.4.0             bitops_1.0-7            generics_0.1.3         
#> [10] stats4_4.4.0            parallel_4.4.0          tibble_3.2.1           
#> [13] fansi_1.0.6             highr_0.10              pkgconfig_2.0.3        
#> [16] R.oo_1.26.0             PSCBS_0.67.0            S4Vectors_0.42.0       
#> [19] lifecycle_1.0.4         GenomeInfoDbData_1.2.12 R.cache_0.16.0         
#> [22] farver_2.1.1            compiler_4.4.0          Rsamtools_2.20.0       
#> [25] Biostrings_2.72.0       tinytex_0.50            munsell_0.5.1          
#> [28] codetools_0.2-20        GenomeInfoDb_1.40.0     htmltools_0.5.8.1      
#> [31] sass_0.4.9              yaml_2.3.8              pillar_1.9.0           
#> [34] crayon_1.5.2            jquerylib_0.1.4         R.utils_2.12.3         
#> [37] BiocParallel_1.38.0     cachem_1.0.8            magick_2.8.3           
#> [40] parallelly_1.37.1       tidyselect_1.2.1        digest_0.6.35          
#> [43] future_1.33.2           dplyr_1.1.4             bookdown_0.39          
#> [46] listenv_0.9.1           labeling_0.4.3          fastmap_1.1.1          
#> [49] grid_4.4.0              colorspace_2.1-0        cli_3.6.2              
#> [52] magrittr_2.0.3          utf8_1.2.4              withr_3.0.0            
#> [55] scales_1.3.0            UCSC.utils_1.0.0        aroma.light_3.34.0     
#> [58] rmarkdown_2.26          XVector_0.44.0          httr_1.4.7             
#> [61] matrixStats_1.3.0       globals_0.16.3          R.methodsS3_1.8.2      
#> [64] evaluate_0.23           knitr_1.46              GenomicRanges_1.56.0   
#> [67] IRanges_2.38.0          rlang_1.1.3             Rcpp_1.0.12            
#> [70] glue_1.7.0              BiocManager_1.30.22     BiocGenerics_0.50.0    
#> [73] jsonlite_1.8.8          R6_2.5.1                zlibbioc_1.50.0