1 Introduction

The MethPed classifier of pediatric brain tumors presented in the MethPed article [1] can be accessed through the MethPed package. This package includes the probes that were used for building the classification algorithm, the MethPed classifier and a published data set that we have used as an example.

2 Necessary packages and installation guide

The MethPed package can be installed through Bioconductor

source("http://bioconductor.org/biocLite.R")
biocLite("MethPed")

The MethPed classifier depends on the randomForest package for random forest algorithm. If necessary this can be installed from CRAN.

install.packages("randomForest")

If there are missing values present in the data, the impute package can be installed for missing data imputation. This is however optional and the user can use any algorithm for missing value imputation.

source("http://bioconductor.org/biocLite.R")
biocLite("impute")

3 Input data type and format for the MethPed classifier

3.1 Input data type

Data for running MethPed is generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays. MethPed assumes that the input data are beta values (\(\beta\)). Beta values (\(\beta\)) are the estimate of methylation level using the ratio of intensities between methylated and unmethylated alleles.

\[\beta~=~\frac{Methylated~allele~intensity~(M)}{(Unmethylated~allele~intensity~(U)~+~Methylated~allele~intensity~(M)~+~offset)}\]

The “offset” is chosen to avoid dividing with small values. Illumina uses a default of 100. \(\beta\) are between 0 and 1 with 0 being unmethylated and 1 fully methylated.

3.2 Input data format

MethPed assumes that the input data are in 3 specific data classes 1) ExpressionSet class, 2) matrix class and 3) data.frame class.

3.2.1 ExpressionSet class

The ExpressionSet class is designed to combine several different sources of information into a single convenient structure. This type of class can be manipulated (e.g., subsetted, copied) conveniently, and is the input or output from many Bioconductor functions. Input data in ExpressionSet class is as

## Data Structure

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1 features, 2 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Tumor.A Tumor.B
##   varLabels: Patient_ID
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: IlluminaHumanMethylation450k

## Data class

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

3.2.2 matrix class

In the matrix class formatted data, all the values of the matrix are beta values (\(\beta\)). In the matrix, the probes are found in the rows and the samples in the columns. Input data in “matrix” class is as

## Data Structure

##               Tumor.A    Tumor.B
## cg00000957 0.63605834 0.84604594
## cg00001349 0.79438393 0.59826915
## cg00001583 0.02574772 0.06137504
## cg00002028 0.03303346 0.03328133
## cg00002719 0.01361386 0.01655172
## cg00002837 0.47871573 0.54919749

## Data class

## [1] "matrix"

3.2.3 data.frame class

In the data.frame class formatted data, one column of the dataset are the probes and there is no order specificity of this column. This column can be anywhere in the dataset. The rest of the columns will be the beta values (\(\beta\)) and each column represents an individual sample. The header of the data frame is the name of the corresponding column (Probe column name i.e. “TargetID” and sample column name i.e. “Tumor.A”, “Tumor.B”). Input data in data.frame class is as

## Data Structure

##      Tumor.A    Tumor.B   TargetID
## 1 0.63605834 0.84604594 cg00000957
## 2 0.79438393 0.59826915 cg00001349
## 3 0.02574772 0.06137504 cg00001583
## 4 0.03303346 0.03328133 cg00002028
## 5 0.01361386 0.01655172 cg00002719
## 6 0.47871573 0.54919749 cg00002837

## Data class

## [1] "data.frame"

4 Workflow of MethPed classifier

The MethPed classifier uses the Random Forest (RF) algorithm to classify unknown pediatric brain tumor samples into sub-types. The classification proceeds with the selection of the beta values (\(\beta\)) needed for the classification.

The computational process proceeded in two stages. The first stage commences with a reduction of the probe pool or building the training probe dataset for classification. We have named this dataset as “predictors”. The second stage is to apply the RF algorithm to classify the probe data of interest based on the training probe dataset (predictors).

For the construction of the training probe pool (predictors), methylation data generated by the Illumina Infinium HumanMethylation 450 BeadChip arrays were downloaded from the Gene Expression Omnibus (GEO). Four hundred seventy-two cases were available, representing several brain tumor diagnoses (DIPG, glioblastoma, ETMR, medulloblastoma, ependymoma, pilocytic astrocytoma) and their further subgroups. Corresponding GEO accession number are given in the following table

***Datasets used to build the MethPed classifier***
Diagnosis	GEO accession
DIPG	GSE50022
Glioblastoma	GSE55712, GSE36278
ETMR	GSE52556
Medulloblastoma	GSE54880
Ependymoma	GSE45353
Pilocytic astrocytoma	GSE44684

The data sets were merged and probes that did not appear in all data sets were filtered away. In addition, about 190,000 CpGs were removed due to SNPs, repeats and multiple mapping sites. The final data set contained 206,823 unique probes and nine tumor classes including the medulloblastoma subgroups. K–neighbor imputation was used for missing probe data.

After that, a large number of regression analyses were performed to select the 100 probes per tumor class that had the highest predictive power (AUC values). Based on the identified 900 methylation sites, the nine pediatric brain tumor types could be accurately classified using the multiclass classification algorithm MethPed [1]. The complete list of 900 probes can be observed as

# List of 900 probes in predictors
library(MethPed)
data(MethPed_900probes)
head(MethPed_900probes)

##     TargetID
## 1 cg00001930
## 2 cg00013409
## 3 cg00015639
## 4 cg00036119
## 5 cg00061695
## 6 cg00085790

5 A working example of MethPed

5.1 Dataset without missing probe values

An example dataset without missing probe beta values (\(\beta\)), is provided with the MethPed package in ExpressionSet class. This dataset consists of 2 patients and 468,821 probes. At first we can take a look at the data:

# Loading package
library(MethPed)
# Loading data set
data(MethPed_sample)
head(MethPed_sample)

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 1 features, 2 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: Tumor.A Tumor.B
##   varLabels: Patient_ID
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: IlluminaHumanMethylation450k

class(MethPed_sample)

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

5.1.1 Check missing value in data

Additionally, it is good to check if there are missing probe values. This will identify if any values are missing from each column/sample in the probe data set. For this operation we can use the checkNA command form the MethPed package. Running this function might require some time for large data sets.

## Check for missing value 
missingIndex <- checkNA(MethPed_sample)
missingIndex

##   Tumor.A   Tumor.B probeName 
##         0         0         0

In this example, there are no missing values (probe names or beta values) in the data set. Now we are set to run the MethPed algorithm.

5.1.2 Apply the MethPed classifier

The function MethPed has the argument prob which can take two values TRUE or FALSE. The value TRUE will return the conditional probability of a sample belonging to a specific group. FALSE will return the binary classification (0 or 1) of a sample belonging to a specific group (based on maximum conditional probability). By default MethPed will consider TRUE for prob argument.

During running the MethPed algorithm, additional information on data and progress of data analysis are seen. The analysis will take longer time depending on the amount of data.

# Run the MethPed classifier
myClassification <- MethPed(MethPed_sample)

## 
## Probe's column name in data                 : TargetID

## Probe name integrity of data with predictor : OK

## Missing value in data                       : No missing data point

## Data summary                                : 468821 Probes and 2 Sample

## 
## Initiating  data analysis......

## Classification is being processed on 1000 tree

## ntree      OOB      1      2      3      4      5      6      7      8      9
##   100:   2.84% 10.71%  4.17%  0.00%  2.25% 10.81%  1.80%  0.00%  0.00%  1.72%
##   200:   2.66% 10.71%  4.17%  0.00%  2.25%  8.11%  1.80%  0.00%  0.00%  1.72%
##   300:   2.49% 10.71%  4.17%  0.00%  1.69%  8.11%  1.80%  0.00%  0.00%  1.72%
##   400:   2.31% 10.71%  4.17%  0.00%  2.25%  8.11%  0.00%  0.00%  0.00%  1.72%
##   500:   2.31% 10.71%  4.17%  0.00%  2.25%  8.11%  0.00%  0.00%  0.00%  1.72%
##   600:   2.13% 10.71%  2.08%  0.00%  2.25%  8.11%  0.00%  0.00%  0.00%  1.72%
##   700:   2.31% 10.71%  2.08%  0.00%  2.25%  8.11%  0.90%  0.00%  0.00%  1.72%
##   800:   1.95% 10.71%  2.08%  0.00%  1.69%  8.11%  0.00%  0.00%  0.00%  1.72%
##   900:   2.13% 10.71%  2.08%  0.00%  1.69%  8.11%  0.90%  0.00%  0.00%  1.72%
##  1000:   1.95% 10.71%  2.08%  0.00%  1.69%  8.11%  0.00%  0.00%  0.00%  1.72%

## 
## Missing probes: 1 out of 900 probes are missing

## Finished analysis data

First message will show the name of the probe column in the data (Default: “TargetID”).
Second message will show “OK” if at least one probe is common between input dataset and predictors (900 probes are used in the predictor [1]. See Workflow of MethPed classifier section for details).
Third message will report about missing value in the data.
Fourth message will list how many probes and samples are in the input data.
Fifth will show the progress of RF analysis.
Sixth will show how many probes are missing in the input data compared with the 900 probes in predictors.

The output of the algorithm is partitioned in 6 parts. The full output can be seen by typing myClassification (Variable name defined by user)

myClassification

## $target_id
## [1] "TargetID"
## 
## $probes
## [1] 468821
## 
## $sample
## [1] 2
## 
## $probes_missing
## [1] "cg14310162"
## 
## $oob_err.rate
##      OOB 
## 1.953819 
## 
## $predictions
##          DIPG Ependymoma  ETMR   GBM MB_Gr3 MB_Gr4 MB_SHH MB_WNT PiloAstro
## Tumor.A 0.002      0.067 0.008 0.036  0.796  0.057  0.012  0.017     0.005
## Tumor.B 0.008      0.038 0.000 0.181  0.001  0.000  0.001  0.000     0.771
## 
## attr(,"class")
## [1] "methped"

The first part contains the name of the probe column, the second part contains the number of probes and the third part the total number of samples in the main data. Each information can be extracted as

# First part
myClassification$target_id

## [1] "TargetID"

# Second part
myClassification$probes

## [1] 468821

# Third part
myClassification$sample

## [1] 2

The fourth part contains the names of the probes that were included in the original classifier but are missing from the data at hand (input). This can be accessed through the command:

# Fourth part
myClassification$probes_missing

## [1] "cg14310162"

In the example, one probe that was included in the original MethPed classifier, is missing from the present data set. If a large number of probes are missing, this could potentially lead to a drop in the predictive accuracy of the model. However, as can be seen by the fifth part of the result:

# Fifth part
myClassification$oob_err.rate

##      OOB 
## 1.953819

the out-of-bag error is 1.95%; only slightly higher than the 1.7% in the original MethPed classifier. For details see the MethPed article [1].

The last output contains the predictions. We chose not to assign tumors to one or another group but to give probabilities of belonging to each group.

# Sixth part
myClassification$predictions

##          DIPG Ependymoma  ETMR   GBM MB_Gr3 MB_Gr4 MB_SHH MB_WNT PiloAstro
## Tumor.A 0.002      0.067 0.008 0.036  0.796  0.057  0.012  0.017     0.005
## Tumor.B 0.008      0.038 0.000 0.181  0.001  0.000  0.001  0.000     0.771

In the example Tumor.A preserves the highest prediction probability, which is 0.796 for subtype MB_Gr3. This should be interpreted as a conditional probability. The probability that Tumor.A belongs to the subtype MB_Gr3 for the given data is 0.796. Similarly, Tumor.B preserves the highest prediction probability, which is 0.771 for the subtype PiloAstro.

5.1.3 Output summary

Alternatively, the summary function can be used to get the result from the MethPed output.

summary(myClassification)

##          DIPG Ependymoma  ETMR   GBM MB_Gr3 MB_Gr4 MB_SHH MB_WNT PiloAstro
## Tumor.A 0.002      0.067 0.008 0.036  0.796  0.057  0.012  0.017     0.005
## Tumor.B 0.008      0.038 0.000 0.181  0.001  0.000  0.001  0.000     0.771

If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument prob.

myClassification_max <- MethPed(MethPed_sample,prob=FALSE)
summary(myClassification_max)

##         DIPG Ependymoma ETMR GBM MB_Gr3 MB_Gr4 MB_SHH MB_WNT PiloAstro
## Tumor.A    0          0    0   0  0.796      0      0      0     0.000
## Tumor.B    0          0    0   0  0.000      0      0      0     0.771

5.1.4 Output visualization

The output of the classification can also be visualized in a bar plot

par(mai = c(1, 1, 1, 2), xpd=TRUE)
mat<-t(myClassification$predictions)
mycols <- c("green",rainbow(nrow(mat),start=0,end=1)[nrow(mat):1],"red")
barplot(mat,  col = mycols,
                beside=FALSE,axisnames=TRUE,
                ylim=c(0,1),xlab= "Sample",ylab="Probability")
legend( ncol(mat)+0.5,1,
          legend = rownames(mat),fill = mycols,xpd=TRUE, cex = 0.6)

or the plot command to used to get the same bar plot

myClassification <- MethPed(MethPed_sample,prob=TRUE)
plot(myClassification)

If only the maximum conditional probability of samples is preferred, this can be obtained by supplying the FALSE value to the argument prob.

myClassification_max <- MethPed(MethPed_sample,prob=FALSE)
plot(myClassification_max)

5.1.5 Count missing probes

The name of the probes that were included in the original MethPed classifier (900 probes) but are missing from the data at hand, can be accessed by the command probeMis.

probeMis(myClassification)

## [1] "cg14310162"

5.2 Dataset with missing probe values

In the previous section MethPed classification on a dataset without missing values were shown. In this section, MethPed classification on dataset with missing values are shown.

5.2.1 Missing beta value in the data set

The dataset “MethPed_sample” provided with the package is a non-missing valued dataset. As an example we will now generate a data set with missing probe values.

# Loading dataset
data(MethPed_sample)
class(MethPed_sample)

## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"

As the provided dataset is in the ExpressionSet class, we have to extract the matrix of beta values for further process

# Loading library
library(Biobase)
# exprs function is loaded from the package Biobase
MethPed_sample_matrix<-Biobase::exprs(MethPed_sample)
head(MethPed_sample_matrix)

##               Tumor.A    Tumor.B
## cg00000957 0.63605834 0.84604594
## cg00001349 0.79438393 0.59826915
## cg00001583 0.02574772 0.06137504
## cg00002028 0.03303346 0.03328133
## cg00002719 0.01361386 0.01655172
## cg00002837 0.47871573 0.54919749

class(MethPed_sample_matrix)

## [1] "matrix"

We can check for missing value in the extracted matrix of beta values

checkNA(MethPed_sample_matrix)

##   Tumor.A   Tumor.B probeName 
##         0         0         0

There are no missing values in the extracted matrix of beta values. Now we will remove some beta values (Replace with NA) from this matrix to make a missing valued dataset.

MethPed_sample_missing<-MethPed_sample_matrix
MethPed_sample_missing[c(1,10,200),2]<-NA
MethPed_sample_missing[c(4,600,500,1000),1]<-NA
head(MethPed_sample_missing,10)

##               Tumor.A    Tumor.B
## cg00000957 0.63605834         NA
## cg00001349 0.79438393 0.59826915
## cg00001583 0.02574772 0.06137504
## cg00002028         NA 0.03328133
## cg00002719 0.01361386 0.01655172
## cg00002837 0.47871573 0.54919749
## cg00003202 0.01878770 0.01040884
## cg00003287 0.39978051 0.12231161
## cg00004121 0.54321178 0.51326286
## cg00007036 0.94417774         NA

This dataset is now ready for MethPed classification.

5.2.2 Missing values imputation

MethPed cannot be applied on data sets with missing value. Before applying the MethPed classifier, it’s important to check if there are missing values of probes in the dataset. To check missing values, one can use the ´checkNA´ function

checkNA(MethPed_sample_missing)

##   Tumor.A   Tumor.B probeName 
##         4         3         0

In this example there are 2 tumors with in total 7 missing beta values in the data. The classifier requires that all probes that were used for training the model are present for prediction. If this condition is not met, no prediction will be returned for the tumors with missing probes. One possible solution is to discard probes that have missing data. This might not be efficient if a large number of probe’s values are missing in one or more samples.

A more feasible approach can be to impute the missing values. Here we illustrate imputation by the means of K-nearest neighbors. For technical details and relevant references we refer readers to the documentation of the impute package. It should be noted that, this package was developed for gene expression data, and not for methylation data. However, most likely the method performs well on methylation data as well [3].

The impute package depends on a random value for initiation. Therefore, to assure randomness, the eventual set of random seeds should be removed.

if(exists(".Random.seed")) rm(.Random.seed)

As mentioned before, the dataset contains 2 tumors. So the imputation will be run on two column of the methylation data.

# Loading library
library(impute)
# Apply imputation
imputedData <- impute.knn(MethPed_sample_missing)
# extract the imputed matrix form object "data"
imputedData <- imputedData$data
# Check for missing value
checkNA(imputedData)

##   Tumor.A   Tumor.B probeName 
##         0         0         0

Now we are set to run the MethPed algorithm on the imputated matrix.

myClassification <- MethPed(imputedData)
summary(myClassification)

##          DIPG Ependymoma  ETMR   GBM MB_Gr3 MB_Gr4 MB_SHH MB_WNT PiloAstro
## Tumor.A 0.002      0.067 0.008 0.036  0.796  0.057  0.012  0.017     0.005
## Tumor.B 0.008      0.038 0.000 0.181  0.001  0.000  0.001  0.000     0.771

6 Contact and Citation

6.1 Contact

For bug-reports, comments and feature requests please write to helenacarenlab+methped@gmail.com.

6.2 Citation

When using MethPed in your work, please cite as:

citation('MethPed')

## 
## To cite package 'MethPed' in publications use:
## 
##   Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and
##   Helena Carén (2016). MethPed: A DNA methylation classifier tool
##   for the identification of pediatric brain tumor subtypes. R
##   package version 1.2.0.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Manual{,
##     title = {MethPed: A DNA methylation classifier tool for the identification of
## pediatric brain tumor subtypes},
##     author = {Mohammad Tanvir Ahamed and Anna Danielsson and Szilárd Nemes and Helena Carén},
##     year = {2016},
##     note = {R package version 1.2.0},
##   }

7 Authors information

Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and Helena Carén, University of Gothenburg, Sweden.

8 Session info

sessionInfo()

## R version 3.3.1 (2016-06-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.1 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] impute_1.48.0       MethPed_1.2.0       Biobase_2.34.0     
## [4] BiocGenerics_0.20.0 BiocStyle_2.2.0    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7         randomForest_4.6-12 digest_0.6.10      
##  [4] assertthat_0.1      formatR_1.4         magrittr_1.5       
##  [7] evaluate_0.10       stringi_1.1.2       rmarkdown_1.1      
## [10] tools_3.3.1         stringr_1.1.0       yaml_2.1.13        
## [13] htmltools_0.3.5     knitr_1.14          tibble_1.2

9 References

[1] Anna Danielsson, Szilárd Nemes, Magnus Tisell, Birgitta Lannering, Claes Nordborg, Magnus Sabel, and Helena Carén. “MethPed: A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes”. Clinical Epigenetics 2015, 7:62, 2015

[2] Breiman, Leo (2001). “Random Forests”. Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324

[3] Troyanskaya, O., M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. Altman. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17.6 (2001): 520-25.

MethPed A DNA Methylation Classifier Tool for the Identification of Pediatric Brain Tumor Subtypes

Mohammad Tanvir Ahamed, Anna Danielsson, Szilárd Nemes and Helena Carén University of Gothenburg, Sweden

Date: 2016-10-17 Version: 1.0.0

Contents