J. Taroni 2018
Pathway Level Information ExtractoR (PLIER) (Mao, et al. bioRxiv. 2017.) is a framework that explicitly aligns latent variables (LVs) with prior knowledge in the form of (often curated) gene sets. Comparisons of PLIER to other methods (e.g., sparse PCA) and other evaluations can be found in the PLIER preprint.
We’re going to explore the recount2 dataset and the corresponding PLIER model. (See greenelab/rheum-plier-data for the processing code.) We’re interested in coming up with ways to characterize PLIER models (and eventually compare them).
`%>%` <- dplyr::`%>%`
# custom functions
source(file.path("util", "plier_util.R"))
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "02")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "02")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
# PLIER model
plier.results <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
# data that was prepped for use with PLIER
recount.list <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_data_prep_PLIER.RDS"))
If the prior information coefficient matrix, U, has a low number of positive entries for each LV, biological interpretation should be more straightforward. This is one of the constraints in the PLIER model.
For each latent variable (i.e., not just those significantly associated with prior information), how many of the pathways/genesets have a positive entry?
num.lvs <- nrow(plier.results$B)
u.sparsity.all <- CalculateUSparsity(plier.results = plier.results,
significant.only = FALSE)
ggplot2::ggplot(as.data.frame(u.sparsity.all),
ggplot2::aes(x = u.sparsity.all)) +
ggplot2::geom_density(fill = "blue", alpha = 0.5) +
ggplot2::theme_bw() +
ggplot2::labs(x = "proportion of positive entries in U") +
ggplot2::ggtitle(paste("All LVs, n =", num.lvs))
png.file <- file.path(plot.dir, "recount2_prop_pos_entries_U_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
summary(u.sparsity.all)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000000 0.001592 0.003480 0.004777 0.015924
What proportion of entries in the U matrix for each LV are significantly associated with that LV?
u.sparsity.sig <- CalculateUSparsity(plier.results,
significant.only = TRUE,
fdr.cutoff = 0.05)
ggplot2::ggplot(as.data.frame(u.sparsity.sig),
ggplot2::aes(x = u.sparsity.sig)) +
ggplot2::geom_density(fill = "blue", alpha = 0.5) +
ggplot2::theme_bw() +
ggplot2::labs(x = "proportion of positive entries in U") +
ggplot2::ggtitle("Significant pathways only")
png.file <- file.path(plot.dir,
"recount2_prop_pos_entries_U_significant.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
summary(u.sparsity.sig)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000000 0.000000 0.000000 0.000747 0.000000 0.012739
We’re interested in how the LVs output from PLIER are related to the genesets input to PLIER.
coverage.results <- GetPathwayCoverage(plier.results = plier.results)
What proportion of the pathways input into PLIER are significantly associated (FDR cutoff = 0.05) with LVs?
# Pathway coverage results
coverage.results$pathway
[1] 0.4187898
What proportion of the PLIER LVs have a gene set associated with them?
# LVs
coverage.results$lv
[1] 0.2016211
We reconstruct gene expression data from the gene loadings and LVs.
# reconstructed recount2 expression data from PLIER model
recount.recon <- GetReconstructedExprs(z.matrix = as.matrix(plier.results$Z),
b.matrix = as.matrix(plier.results$B))
# write reconstructed expression to results
recon.mat.file <- file.path(results.dir,
"recount2_recount2_model_recon_exprs.RDS")
saveRDS(recount.recon, file = recon.mat.file)
# input expression data from intermediate file
recount.input.exprs <- recount.list$rpkm.cm
# calculate reconstruction error (per sample)
recon.error <- GetReconstructionMASE(true.mat = recount.input.exprs,
recon.mat = recount.recon)
# density plot
ggplot2::ggplot(as.data.frame(recon.error), ggplot2::aes(x = recon.error)) +
ggplot2::geom_density(fill = "blue", alpha = 0.4) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Sample MASE",
title = "Input vs. PLIER reconstructed recount2 data",
subtitle = paste("All LVs, n =", num.lvs)) +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
png.file <- file.path(plot.dir,
"recount2_recon_MASE_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
Spearman correlation between input and reconstructed values was used as an evaluation in Cleary, et al. As noted in the 01-PLIER_util_proof-of-concept_notebook
:
If correlation between the input and the reconstructed data is high, that suggests that reconstruction is “successful.” Given the different constraints in PLIER, we would not expect to perfectly (
rho = 1
) reconstruct the input data. This particular evaluation will be most useful when we look at applying a trained PLIER model to a test dataset.
# calculate correlation
recon.cor <- GetReconstructionCorrelation(true.mat = recount.input.exprs,
recon.mat = recount.recon)
# density plot
ggplot2::ggplot(as.data.frame(recon.cor), ggplot2::aes(x = recon.cor)) +
ggplot2::geom_density(fill = "blue", alpha = 0.4) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Sample Spearman Correlation",
title = "Input vs. PLIER reconstructed recount2 data",
subtitle = paste("All LVs, n =", num.lvs)) +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
png.file <- file.path(plot.dir,
"recount2_recon_spearman_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
We expect that samples that are highly correlated pre- and post-PLIER should have low MASE.
ggplot2::ggplot(as.data.frame(cbind(recon.cor, recon.error)),
ggplot2::aes(x = recon.cor,
y = recon.error)) +
ggplot2::geom_point(alpha = 0.2) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Spearman Correlation",
y = "MASE",
title = paste("All LVs, n =", num.lvs))
png.file <- file.path(plot.dir,
"recount2_error_cor_scatter_all_lvs.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
Here, we’ll filter the Z and B matrices to only include LVs that are significantly associated with a pathway/gene set that was supplied during the training of the PLIER model. We’ll use an FDR cutoff of 0.05 (as we did for CalculateUSparsity
above).
plier.summary <- plier.results$summary
sig.summary <- plier.summary %>%
dplyr::filter(FDR < 0.05)
sig.lvs <- unique(sig.summary$`LV index`)
# drop columns (LVs) from Z that are not significantly associated with prior
# info
z.mat <- plier.results$Z
sig.z.mat <- z.mat[, as.integer(sig.lvs)]
# drop rows (LVs) from B that are not significantly associated with prior info
b.mat <- plier.results$B
sig.b.mat <- b.mat[as.integer(sig.lvs), ]
# the reconstruction itself only with significant LVs
sig.recon <- GetReconstructedExprs(z.matrix = sig.z.mat,
b.matrix = sig.b.mat)
# write to results
sig.recon.mat.file <-
file.path(results.dir, "recount2_recount2_model_sig_lvs_recon_exprs.RDS")
saveRDS(sig.recon, file = sig.recon.mat.file)
# calculate reconstruction error (per sample)
sig.recon.error <- GetReconstructionMASE(true.mat = recount.input.exprs,
recon.mat = sig.recon)
# calculate correlation
sig.recon.cor <- GetReconstructionCorrelation(true.mat = recount.input.exprs,
recon.mat = sig.recon)
# tidy format
recon.eval.df <-
rbind(cbind(colnames(recount.input.exprs), recon.error, recon.cor,
rep(paste("All, n =", num.lvs), length(recon.error))),
cbind(colnames(recount.input.exprs), sig.recon.error, sig.recon.cor,
rep(paste("Pathway-associated, n =", length(sig.lvs)),
length(sig.recon.error))))
colnames(recon.eval.df) <- c("Sample", "MASE",
"Spearman correlation",
"LVs used in reconstruction")
recon.eval.df <-
as.data.frame(recon.eval.df) %>%
dplyr::mutate(MASE = as.numeric(as.character(MASE)),
`Spearman correlation` =
as.numeric(as.character(`Spearman correlation`)))
recon.eval.file <- file.path(results.dir,
"recount2_recount2_model_recon_eval_df.tsv")
readr::write_tsv(recon.eval.df,
path = recon.eval.file )
MASE plot
# density plot
ggplot2::ggplot(recon.eval.df,
ggplot2::aes(x = MASE, group = `LVs used in reconstruction`,
fill = `LVs used in reconstruction`)) +
ggplot2::geom_density(alpha = 0.4) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "black")) +
ggplot2::labs(title = "Input vs. PLIER reconstructed recount2 data") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
png.file <- file.path(plot.dir,
"recount2_recon_MASE.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
Correlation plot
# density plot
ggplot2::ggplot(recon.eval.df,
ggplot2::aes(x = `Spearman correlation`,
group = `LVs used in reconstruction`,
fill = `LVs used in reconstruction`)) +
ggplot2::geom_density(alpha = 0.4) +
ggplot2::theme_bw() +
ggplot2::scale_fill_manual(values = c("white", "black")) +
ggplot2::labs(title = "Input vs. PLIER reconstructed recount2 data") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
png.file <- file.path(plot.dir,
"recount2_recon_spearman.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 7, height = 5, units = "in")
Scatterplot
ggplot2::ggplot(recon.eval.df,
ggplot2::aes(x = `Spearman correlation`,
y = MASE,
color = `LVs used in reconstruction`,
group = `LVs used in reconstruction`)) +
ggplot2::geom_point(alpha = 0.1) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Sample Spearman Correlation",
y = "Sample MASE",
title = "Input vs. PLIER reconstructed recount2 data") +
ggplot2::scale_color_grey() +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, face = "bold"))
png.file <- file.path(plot.dir,
"recount2_recon_scatter.png")
ggplot2::ggsave(filename = png.file, plot = ggplot2::last_plot(),
width = 10, height = 5, units = "in")