library(ZIBR)
#> Loading required package: statmod
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(nlme)
#>
#> Attaching package: 'nlme'
#> The following object is masked from 'package:dplyr':
#>
#> collapse
set.seed(19683)
The longitudinal microbiome compositional data are highly skewed, bounded in [0,1), and often sparse with many zeros. In addition, the observations from repeated measures are correlated. We propose a two-part zero-inflated Beta regression model with random effects (ZIBR) for testing the association between microbial abundance and clinical covariates for longitudinal microbiome data. The model includes a logistic component to model presence/absence of the microbe in samples and a Beta component to model non-zero microbial abundance and each component includes a random effect to take into account the correlation among repeated measurements on the same subject.
The ZIBR model combines the logistic regression and Beta regression in one model. Each regression part includes random effects to account for correlations acorss time points. We call these two regressions in ZIBR model as logistic component and Beta component. These two components model two different aspects of the data. The logistic component models presence/absence of the microbe and Beta component models non-zero microbial abundance.
Accordingly, we can test three biologically relevant null
hypotheses:
- H0: α_j = 0. This is to test the coefficients in the logistic
component, if the covariates are associated with the bacterial taxon by
affecting its presence or absence;
- H0: β_j = 0. This is to test the coefficients in the Beta component,
if the taxon is associated with the covariates by showing different
abundances;
- H0: α_j = 0 and β_j = 0 for each covariate X_j and Z_j. This is to
joinly test the coefficients in both logistic and Beta components, if
the covariates affect the taxon both in terms of presence/absence and
its non-zero abundance.
The following function will simulate some data according to the zero-inflated beta random effect model. We specify the covariates in the logistic component (X) and covariates in the Beta component (Z) to be the same (i.e set Z=X).
sim <- simulate_zero_inflated_beta_random_effect_data(
subject_n = 100, time_n = 5,
X = as.matrix(c(rep(0, 50 * 5), rep(1, 50 * 5))),
Z = as.matrix(c(rep(0, 50 * 5), rep(1, 50 * 5))),
alpha = as.matrix(c(-0.5, 1)),
beta = as.matrix(c(-0.5, 0.5)),
s1 = 1, s2 = 0.8,
v = 5,
sim_seed = 100
)
names(sim)
#> [1] "Y" "X" "Z" "b" "c"
#> [6] "u" "v" "alpha" "beta" "s1"
#> [11] "s2" "subject_ind" "time_ind"
The simulation function returns the the bacterial abundance (Y) simulated according to the above parameter settings. This function also returns the other variables such as X, Z, alpha, beta etc. that we just specified. It also returns two variables subject.ind and time.ind, which are subject IDs and time points for each subject.
We can run the zibr function to fit the zero-inflated beta random effect model on the simulated data.
zibr_fit <- zibr(
logistic_cov = sim$X, beta_cov = sim$Z, Y = sim$Y,
subject_ind = sim$subject_ind, time_ind = sim$time_ind
)
zibr_fit
#> $logistic_est_table
#> Estimate Pvalue
#> intersept -0.5719003 0.006957793
#> var1 0.8280726 0.005470065
#>
#> $logistic_s1_est
#> [1] 1.068014
#>
#> $beta_est_table
#> Estimate Pvalue
#> intersept -0.5930906 4.181627e-05
#> var1 0.5917456 1.699453e-03
#>
#> $beta_s2_est
#> [1] 0.6303862
#>
#> $beta_v_est
#> [1] 4.734064
#>
#> $loglikelihood
#> [1] NA
#>
#> $joint_p
#> [1] NA
If we use the same covariates for logistic and beta components, ZIBR
can jointly test the covariate in both components. In the following
example, we use the same covariate X in both components. We can obtain
the joint.p
as the pvalues of the joint test.
zibr_fit <- zibr(
logistic_cov = sim$X, beta_cov = sim$X, Y = sim$Y,
subject_ind = sim$subject_ind, time_ind = sim$time_ind
)
zibr_fit
#> $logistic_est_table
#> Estimate Pvalue
#> intersept -0.5719003 0.006957793
#> var1 0.8280726 0.005470065
#>
#> $logistic_s1_est
#> [1] 1.068014
#>
#> $beta_est_table
#> Estimate Pvalue
#> intersept -0.5930906 4.181627e-05
#> var1 0.5917456 1.699453e-03
#>
#> $beta_s2_est
#> [1] 0.6303862
#>
#> $beta_v_est
#> [1] 4.734064
#>
#> $loglikelihood
#> [1] -286.5855
#>
#> $joint_p
#> var1
#> 0.0001533285
Let’s try anohter example on the real data. I will use a dataset from a longitudinal human microbiome study containing the bacterial abundance and clinical information from Lewis and Chen et al.. I only include the abundance from one genus.
head(ibd, n = 20)
#> Sample Subject Time Treatment Abundance
#> 1 5001-01 5001 1 0 0.000000e+00
#> 2 5001-02 5001 2 0 0.000000e+00
#> 3 5001-03 5001 3 0 0.000000e+00
#> 4 5001-04 5001 4 0 0.000000e+00
#> 5 5002-01 5002 1 0 3.961764e-03
#> 6 5002-02 5002 2 0 1.008613e-02
#> 7 5002-03 5002 3 0 0.000000e+00
#> 8 5002-04 5002 4 0 0.000000e+00
#> 9 5003-01 5003 1 0 2.545083e-03
#> 10 5003-02 5003 2 0 6.901096e-03
#> 11 5003-03 5003 3 0 3.038140e-04
#> 12 5003-04 5003 4 0 1.739832e-02
#> 13 5006-01 5006 1 0 2.052549e-03
#> 14 5006-02 5006 2 0 4.636235e-04
#> 15 5006-03 5006 3 0 3.403491e-04
#> 16 5006-04 5006 4 0 2.840756e-03
#> 17 5007-01 5007 1 0 1.634215e-03
#> 18 5007-02 5007 2 0 1.061013e-03
#> 19 5007-03 5007 3 0 5.522727e-05
#> 20 5007-04 5007 4 0 1.986368e-01
Note that each subject has four time points. In the Treatment column, 0 is for antiTNF treatment and 1 is for EEN treatment. Abundance column is the relative abundance for Eubacterium. The abundance is in the range of [0,1).
The current model can not handle missing data. That is, each subject must have the same number of time points. If any time point is missing in your data, you can (1) remove some other time points so that all subject have the same time points (2) impute the missing data, for example, use the mean or median value from other subjects at the same time point in the same covariate group to replace the missing value. I’m currently working on the missing data problem and hope that our model can handle missing data soon.
We can run the zibr function to the real data. Here, I’m interested in comparing the two treatments and use treatment as the only covariate in both logistic and beta component. Depending on the scientific questions you are interested in, you can also include time and treament-time interaction in the covariates.
zibr_fit <- zibr(
logistic_cov = data.frame(Treatment = ibd$Treatment),
beta_cov = data.frame(Treatment = ibd$Treatment),
Y = ibd$Abundance, subject_ind = ibd$Subject,
time_ind = ibd$Time
)
zibr_fit
#> $logistic_est_table
#> Estimate Pvalue
#> intersept 2.6973234 6.171054e-06
#> Treatment 0.2000181 8.877107e-01
#>
#> $logistic_s1_est
#> [1] 3.278764
#>
#> $beta_est_table
#> Estimate Pvalue
#> intersept -2.7860540 0.0000000
#> Treatment -0.3233427 0.2063002
#>
#> $beta_s2_est
#> [1] 0.5389234
#>
#> $beta_v_est
#> [1] 7.622384
#>
#> $loglikelihood
#> [1] 321.9018
#>
#> $joint_p
#> Treatment
#> 0.4454947
Neither of the logsitic or beta component show significant pvalues. This indicates the abundance of this genus is not different between the two treatments, considering all time points.
This follows the analysis from the paper.
### Load the raw data
PLEASE.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Raw_Data/MetaPhlAn/PLEASE/G_Remove_unclassfied_Renormalized_Merge_Rel_MetaPhlAn_Result.xls" # nolint
PLEASE.raw <- read.table(PLEASE.file,
sep = "\t", header = TRUE, row.names = 1,
check.names = FALSE, stringsAsFactors = FALSE
)
taxa.raw <- t(PLEASE.raw)
### Make sure you load the data correctly
cat("samples", "taxa", dim(taxa.raw), "\n")
#> samples taxa 335 105
taxa.raw[1:3, 1:3]
#> g__Bacteroides g__Ruminococcus g__Faecalibacterium
#> 5001-01 0.42722310 0.000000000 0
#> 5001-02 0.23188470 0.000000000 0
#> 5001-03 0.02041114 0.006130341 0
### Load total non-human read counts
human.read.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Raw_Data/MetaPhlAn/Human_Reads/please_combo_human_reads.xls" # nolint
human.read <- read.table(human.read.file,
sep = "\t", header = TRUE,
row.names = 1, stringsAsFactors = FALSE
)
### Filter low depth samples (low non human reads)
low.depth.samples <- subset(human.read, NonHumanReads < 10000)
low.depth.samples[, 1:5]
#> NonHumanReads TotalReads HumanReads HumanPer GroupFcp
#> 5010-02 1014 1104 90 8.15217391 TNF.N
#> 5018-03 6101 6121 20 0.32674400 Diet.R
#> 5023-04 1954 2679 725 27.06233669 TNF.R
#> 6007-01 1809 4249 2440 57.42527654 TNF.R
#> 7001-02 9965 9971 6 0.06017451 Diet.N
#> 7001-03 566 613 47 7.66721044 Diet.N
#> 7001-04 626 637 11 1.72684458 Diet.N
#> 7002-04 683 696 13 1.86781609 Diet.N
#> 7003-02 4681 5719 1038 18.15002623 Diet.NR
#> 7003-03 4935 5049 114 2.25787285 Diet.NR
#> 7009-02 1895 1916 21 1.09603340 Diet.N
#> 7009-03 5319 5361 42 0.78343593 Diet.N
#> 7010-02 2375 2400 25 1.04166667 Diet.N
#> 7010-03 6312 6495 183 2.81755196 Diet.N
### Delete these samples from PLEASE data.
rownames(taxa.raw)[which(rownames(taxa.raw) %in% rownames(low.depth.samples))]
#> [1] "5018-03" "7001-02" "7003-02" "7003-03" "7009-03" "7010-03"
### Before deletion
dim(taxa.raw)
#> [1] 335 105
### After deletion
taxa.raw <- taxa.raw[-which(rownames(taxa.raw) %in% rownames(low.depth.samples)), ]
dim(taxa.raw)
#> [1] 329 105
### Filter low abundant bacterial data
filter.index1 <- apply(taxa.raw, 2, function(X) {
sum(X > 0) > 0.4 * length(X)
})
filter.index2 <- apply(taxa.raw, 2, function(X) {
quantile(X, 0.9) > 1
})
taxa.filter <- taxa.raw[, filter.index1 & filter.index2]
taxa.filter <- 100 * sweep(taxa.filter, 1, rowSums(taxa.filter), FUN = "/")
cat("after filter:", "samples", "taxa", dim(taxa.filter), "\n")
#> after filter: samples taxa 329 18
cat(colnames(taxa.filter), "\n")
#> g__Bacteroides g__Ruminococcus g__Faecalibacterium g__Bifidobacterium g__Escherichia g__Clostridium g__Dialister g__Eubacterium g__Roseburia g__Streptococcus g__Dorea g__Parabacteroides g__Lactobacillus g__Veillonella g__Haemophilus g__Alistipes g__Collinsella g__Coprobacillus
head(rowSums(taxa.filter))
#> 5001-01 5001-02 5001-03 5001-04 5002-01 5002-02
#> 100 100 100 100 100 100
###
taxa.data <- taxa.filter
dim(taxa.data)
#> [1] 329 18
#### Load sample information
sample.info.file <- "https://raw.githubusercontent.com/chvlyl/PLEASE/master/1_Data/Processed_Data/Sample_Information/2015_02_13_Processed_Sample_Information.csv" # nolint
sample.info <- read.csv(sample.info.file, row.names = 1)
#### create covariates,
#### Time, antiTNF+EEN
reg.cov <-
data.frame(Sample = rownames(taxa.data), stringsAsFactors = FALSE) %>%
left_join(add_rownames(sample.info, var = "Sample"), by = "Sample") %>%
dplyr::filter(Treatment.Specific != "PEN") %>%
dplyr::select(Sample, Time, Subject, Response, Treatment.Specific) %>%
group_by(Subject) %>%
summarise(count = n()) %>%
dplyr::filter(count == 4) %>%
dplyr::select(Subject) %>%
left_join(add_rownames(sample.info, var = "Sample"), by = "Subject") %>%
mutate(Treat = ifelse(Treatment.Specific == "antiTNF", 1, 0)) %>%
dplyr::select(Sample, Subject, Time, Response, Treat) %>%
dplyr::mutate(Subject = paste("S", Subject, sep = "")) %>%
dplyr::mutate(Time = ifelse(Time == "1", 0,
ifelse(Time == "2", 1,
ifelse(Time == "3", 4,
ifelse(Time == "4", 8, NA))))) %>%
dplyr::mutate(Time.X.Treatment = Time * Treat) %>%
as.data.frame()
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
### take out first time point
reg.cov.t1 <- subset(reg.cov, Time == 0)
rownames(reg.cov.t1) <- reg.cov.t1$Subject
reg.cov.t234 <- subset(reg.cov, Time != 0)
reg.cov.t234 <- data.frame(
baseline.sample = reg.cov.t1[reg.cov.t234$Subject, "Sample"],
baseline.subject = reg.cov.t1[reg.cov.t234$Subject, "Subject"],
reg.cov.t234,
stringsAsFactors = FALSE
)
#### Fit ZIBR model on the real data
spe.all <- colnames(taxa.data)
p.species.list.zibr <- list()
p.species.list.lme <- list()
for (spe in spe.all) {
# spe = 'g__Collinsella'
###### create covariates
X <- data.frame(
Baseline = taxa.data[reg.cov.t234$baseline.sample, spe] / 100,
# reg.cov.t234[,c('log.days','Delivery','Delivery.X.log.days')]
reg.cov.t234[, c("Time", "Treat")]
)
rownames(X) <- reg.cov.t234$Sample
Z <- X
subject.ind <- reg.cov.t234$Subject
time.ind <- reg.cov.t234$Time
####
# cat(spe,'\n')
Y <- taxa.data[reg.cov.t234$Sample, spe] / 100
# cat('Zeros/All',sum(Y==0),'/',length(Y),'\n')
####################
## fit linear random effect model with arcsin square transformation on Y
tdata <- data.frame(Y.tran = asin(sqrt(Y)), X, SID = subject.ind)
lme.fit <- nlme::lme(Y.tran ~ Baseline + Time + Treat, random = ~ 1 | SID, data = tdata)
coef.mat <- summary(lme.fit)$tTable[-1, c(1, 5)]
p.species.list.lme[[spe]] <- coef.mat[, 2]
###################
if (sum(Y > 0) < 10 || sum(Y == 0) < 10 || max(Y) < 0.01) {
p.species.list.zibr[[spe]] <- p.species.list.lme[[spe]]
next
} else {
est <- zibr(
logistic_cov = X, beta_cov = Z, Y = Y,
subject_ind = subject.ind,
time_ind = time.ind,
quad_n = 30, verbose = TRUE
)
p.species.list.zibr[[spe]] <- est$joint.p
}
# break
}
#### adjust p values
p.species.zibr <- t(as.data.frame(p.species.list.zibr))
p.species.zibr.adj <-
add_rownames(as.data.frame(p.species.zibr), var = "Species") %>% mutate_each(funs(p.adjust(., "fdr")), -Species)
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> i Please use a list of either functions or lambdas:
#>
#> # Simple named list: list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#>
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `mutate_each_()` was deprecated in dplyr 0.7.0.
#> i Please use `across()` instead.
#> i The deprecated feature was likely used in the dplyr package.
#> Please report the issue at <https://github.com/tidyverse/dplyr/issues>.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# write.csv(p.species.zibr.adj,file=paste('4_Results/Real_Data_Estimation_Results_antiTNF_EEN_ZIBR.csv',sep=''))
p.species.lme <- t(as.data.frame(p.species.list.lme))
p.species.lme.adj <-
add_rownames(as.data.frame(p.species.lme), var = "Species") %>% mutate_each(funs(p.adjust(., "fdr")), -Species)
#> Warning: `funs()` was deprecated in dplyr 0.8.0.
#> i Please use a list of either functions or lambdas:
#>
#> # Simple named list: list(mean = mean, median = median)
#>
#> # Auto named with `tibble::lst()`: tibble::lst(mean, median)
#>
#> # Using lambdas list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
#> Warning: `add_rownames()` was deprecated in dplyr 1.0.0.
#> i Please use `tibble::rownames_to_column()` instead.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# write.csv(p.species.lme.adj,file=paste('4_Results/Real_Data_Estimation_Results_antiTNF_EEN_LME.csv',sep=''))
####
intersect(
p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"],
p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>
setdiff(
p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"],
p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 7 x 1
#> Species
#> <chr>
#> 1 g__Ruminococcus
#> 2 g__Faecalibacterium
#> 3 g__Bifidobacterium
#> 4 g__Dialister
#> 5 g__Streptococcus
#> 6 g__Haemophilus
#> 7 g__Alistipes
setdiff(
p.species.zibr.adj[p.species.zibr.adj$Treat < 0.05, "Species"],
p.species.lme.adj[p.species.lme.adj$Treat < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>
intersect(
p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 0 x 1
#> # i 1 variable: Species <chr>
setdiff(
p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 1 x 1
#> Species
#> <chr>
#> 1 g__Eubacterium
setdiff(
p.species.lme.adj[p.species.lme.adj$Time < 0.05, "Species"],
p.species.zibr.adj[p.species.zibr.adj$Time < 0.05, "Species"]
)
#> # A tibble: 1 x 1
#> Species
#> <chr>
#> 1 g__Eubacterium