J. Taroni 2018
In 06-sle-wb_cell_type
, we examined how LVs from the SLE WB PLIER model (results/05/SLE-WB_PLIER_model.RDS
) compared to known patterns in the Banchereau, et al. data set that was included in the SLE WB compendium (and therefore, training).
Specifically, we asked:
The answer to both of these questions is yes. This demonstrated the utility of PLIER for extracting cell type patterns from heterogeneous (at least in terms of patient population and platform) gene expression data from a single tissue (whole blood) that had been assembled into a compendium (greenelab/rheum-plier-data/sle-wb
).
SLE is not a rare condition – though the molecular heterogeneity and multi-tissue nature of the disease warrant sophisticated approaches to the analysis of transcriptomic data. (This SLE WB compendium is comprised of 1640 samples and is incomplete; we picked 7 data sets based on the platform used and the availability of raw data.) Thus, we can use data-intensive approaches (e.g., PLIER) that are inappropriate for smaller data sets and, by extension, rare or understudied diseases.
Using a “shovel-ready” data set like recount2 to train a PLIER model (see greenelab/rheum-plier-data/recount2
) and applying this model to smaller data sets may “get around” the data-intensive requirement, but more investigation is warranted.
In this notebook, we examine whether we can use the recount2 PLIER model (data/recount2_PLIER_data/recount_PLIER_model.RDS
) to analyze neutrophil and plasma cell patterns in the Banchereau, et al. data. If the recount2 model performance is similar to that of the SLE model (examined in 06-sle-wb_cell_type
), that is evidence supporting the validity of this transfer learning approach.
`%>%` <- dplyr::`%>%`
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
library(PLIER)
Loading required package: RColorBrewer
Loading required package: gplots
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
Loading required package: pheatmap
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-13
Loading required package: knitr
Loading required package: rsvd
Loading required package: qvalue
# custom functions
source(file.path("util", "plier_util.R"))
# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "07")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "07")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)
recount.plier <- readRDS(file.path("data", "recount2_PLIER_data",
"recount_PLIER_model.RDS"))
symbol.file <-
file.path("data", "expression_data",
"SLE_WB_all_microarray_QN_zto_before_with_GeneSymbol.pcl")
sle.exprs.df <- readr::read_tsv(symbol.file, progress = FALSE)
Parsed with column specification:
cols(
.default = col_double(),
EntrezID = col_integer(),
GeneSymbol = col_character()
)
See spec(...) for full column specifications.
# matrix with gene symbol as rownames
exprs.mat <- dplyr::select(sle.exprs.df, -EntrezID)
rownames(exprs.mat) <- exprs.mat$GeneSymbol
Setting row names on a tibble is deprecated.
exprs.mat <- as.matrix(dplyr::select(exprs.mat, -GeneSymbol))
From 06-sle-wb_cell_type
neutro.file <- file.path("results", "06",
"Banchereau_count_SLE_model_neutrophil_LV.tsv")
sle.neutro.df <- readr::read_tsv(neutro.file)
Parsed with column specification:
cols(
Sample = col_character(),
LV2 = col_double(),
LV27 = col_double(),
LV34 = col_double(),
LV87 = col_double(),
Neutrophil.Count = col_double()
)
plasma.file <- file.path("results", "06",
"Banchereau_DA_group_SLE_model_plasma_cell_LVs.tsv")
sle.plasma.df <- readr::read_tsv(plasma.file)
Parsed with column specification:
cols(
Sample = col_character(),
LV52 = col_double(),
LV136 = col_double(),
Disease.Activity = col_character()
)
# summary information from the recount2 PLIER model
recount.summary <- recount.plier$summary
Are there recount2 LVs associated with neutrophil gene sets?
recount.summary %>%
dplyr::filter(grepl("Neutrophil", pathway),
FDR < 0.05)
The following latent variables look most promising: LV524
, LV603
, LV985
(AUC > 0.75
for both SVM Neutrophils
and IRIS_Neutrophil-Resting
)
What other gene sets are associated with these LVs?
recount.summary %>%
dplyr::filter(`LV index` %in% c(524, 603, 985),
FDR < 0.05) %>%
dplyr::arrange(`LV index`, desc(AUC))
Some notes on these results:
LV524
is also associated with the following monocyte-related gene sets: SVM Monocytes
, DMAP_MONO1
, and DMAP_MONO2
. Again, we would like something that tells us about neutrophils only if at all possible.LV603
is associated with two other gene sets, PID_IL8CXCR2_PATHWAY
and SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
. It’s worth noting that interleukin 8 (IL8
) is a neutrophil chemoattractant, and SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
may just be related to PI3K signaling & chemotaxis in general.LV985
is also associated with SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES
and PID_FCER1PATHWAY
. FCER1
refers to Fc-epsilon receptor 1 which is mainly expressed on other granulocytes.PLIER::plotU(plierRes = recount.plier,
pval.cutoff = 1e-06,
indexCol = c(524, 603, 985),
top = 10)
png(file.path(plot.dir, "recount2_neutrophil_Uplot.png"),
res = 300, width = 7, height = 7, units = "in")
PLIER::plotU(plierRes = recount.plier,
pval.cutoff = 1e-06,
indexCol = c(524, 603, 985),
top = 10)
dev.off()
null device
1
Are there LVs associated with plasma cell gene sets in the recount2 model?
recount.summary %>%
dplyr::filter(grepl("Plasma", pathway),
FDR < 0.05)
What other gene sets are associated with LV951
?
recount.summary %>%
dplyr::filter(`LV index` == 951)
KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION
is not a significant association (FDR = 0.13
) There were 2 LVs from the SLE WB PLIER model that were associated with plasma cell gene sets. Those LVs were also associated with other processes (e.g., REACTOME_UNFOLDED_PROTEIN_RESPONSE
, KEGG_HEDGEHOG_SIGNALING_PATHWAY
)
sle.recount.b <- GetNewDataB(exprs.mat = exprs.mat,
plier.model = recount.plier)
b.file <- file.path(results.dir, "SLE-WB_B_matrix_recount2_model.RDS")
saveRDS(sle.recount.b, file = b.file)
neutro.lv.df <- as.data.frame(cbind(colnames(sle.recount.b),
t(sle.recount.b[c(524, 603, 985), ])))
colnames(neutro.lv.df) <- c("Sample", "recount2_LV524", "recount2_LV603",
"recount2_LV985")
# join with sle wb results & neutrophil counts
neutro.df <- dplyr::inner_join(sle.neutro.df, neutro.lv.df) %>%
dplyr::mutate(recount2_LV524 = as.numeric(as.character(recount2_LV524)),
recount2_LV603 = as.numeric(as.character(recount2_LV603)),
recount2_LV985 = as.numeric(as.character(recount2_LV985)))
Joining, by = "Sample"
Column `Sample` joining character vector and factor, coercing into character vector
head(neutro.df)
neutro.out.file <- file.path(results.dir, "neutrophil_count_LV_both_models.tsv")
readr::write_tsv(neutro.df, path = neutro.out.file)
# function for scatter plots, given lv (a string indicating the LV of interest)
# make a scatter plot where the LV is the x variable, neutrophil count is the
# y variable & fit a line with geom_smooth(method = "lm")
# also will annotate the plot with the supplied r-squared value (rsq arg)
# where the text is placed is automatically chosen from the x and y values
# needs to be used in this global environment
LVScatter <- function(lv, rsq) {
y.var <- "Neutrophil.Count"
# calculate where to put the r-squared value
x.range <- max(neutro.df[, lv]) - min(neutro.df[, lv])
x.coord <- min(neutro.df[, lv]) + (x.range * 0.15)
y.range <- max(neutro.df[, y.var]) - min(neutro.df[, y.var])
y.coord <- max(neutro.df[, y.var]) - (y.range * 0.15)
ggplot2::ggplot(neutro.df, ggplot2::aes_string(x = lv, y = y.var)) +
ggplot2::geom_point(alpha = 0.7) +
ggplot2::geom_smooth(method = "lm") +
ggplot2::theme_bw() +
ggplot2::labs(y = "Neutrophil Count") +
ggplot2::theme(legend.position = "none",
text = ggplot2::element_text(size = 15)) +
ggplot2::annotate("text", x = x.coord, y = y.coord,
label = paste("r-squared =", rsq))
}
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV524))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV524)
Residuals:
Min 1Q Median 3Q Max
-8.8588 -1.1612 -0.2481 0.9283 10.7119
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.12742 0.06649 62.08 <2e-16 ***
neutro.df$recount2_LV524 20.67546 0.97663 21.17 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.94 on 851 degrees of freedom
Multiple R-squared: 0.345, Adjusted R-squared: 0.3442
F-statistic: 448.2 on 1 and 851 DF, p-value: < 2.2e-16
LVScatter(lv = "recount2_LV524", rsq = 0.34)
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV603))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV603)
Residuals:
Min 1Q Median 3Q Max
-4.1245 -1.1945 -0.2527 0.8522 12.0900
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.00409 0.06612 60.56 <2e-16 ***
neutro.df$recount2_LV603 11.09268 0.50550 21.94 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.916 on 851 degrees of freedom
Multiple R-squared: 0.3614, Adjusted R-squared: 0.3606
F-statistic: 481.5 on 1 and 851 DF, p-value: < 2.2e-16
LVScatter(lv = "recount2_LV603", rsq = 0.36)
summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV985))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV985)
Residuals:
Min 1Q Median 3Q Max
-4.7147 -1.2945 -0.3547 0.8808 12.6068
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.06785 0.06814 59.70 <2e-16 ***
neutro.df$recount2_LV985 14.77640 0.74475 19.84 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.982 on 851 degrees of freedom
Multiple R-squared: 0.3163, Adjusted R-squared: 0.3155
F-statistic: 393.7 on 1 and 851 DF, p-value: < 2.2e-16
LVScatter(lv = "recount2_LV985", rsq = 0.32)
The neutrophil-associated LVs from the recount2 model generally perform as well as the neutrophil-associated LVs from the SLE WB PLIER model.
We’ll plot the results from SLE WB LV87
and recount2 LV603
together because these are both LVs that are not significantly associated with monocyte signatures and have high AUC for the neutrophil gene sets.
sle.plot <- LVScatter("LV87", rsq = 0.29) +
ggplot2::labs(title = "SLE WB PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold"))
recount.plot <- LVScatter("recount2_LV603", rsq = 0.36) +
ggplot2::labs(title = "recount2 PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold"))
pdf(file.path(plot.dir, "Neutrophil_LV_model_comparison.pdf"), width = 14,
height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()
null device
1
summary(lm(neutro.df$LV87 ~ neutro.df$recount2_LV603))
Call:
lm(formula = neutro.df$LV87 ~ neutro.df$recount2_LV603)
Residuals:
Min 1Q Median 3Q Max
-0.66051 -0.07107 0.01107 0.08593 0.33998
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.016814 0.004702 -3.576 0.000369 ***
neutro.df$recount2_LV603 2.753708 0.035945 76.610 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1362 on 851 degrees of freedom
Multiple R-squared: 0.8734, Adjusted R-squared: 0.8732
F-statistic: 5869 on 1 and 851 DF, p-value: < 2.2e-16
ggplot2::ggplot(neutro.df, ggplot2::aes(x = recount2_LV603,
y = LV87)) +
ggplot2::geom_point(alpha = 0.7) +
ggplot2::geom_smooth(method = "lm") +
ggplot2::theme_bw() +
ggplot2::labs(x = "recount2 LV603", y = "SLE WB LV87") +
ggplot2::theme(legend.position = "none",
text = ggplot2::element_text(size = 15)) +
ggplot2::annotate(geom = "text", x = -0.375, y = 1,
label = "r-squared = 0.87")
plot.file <-
file.path(plot.dir, "Banchereau_neutrophil_count_LV87_v_recLV603_scatter.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
Saving 7 x 7 in image
There are 987 latent variables from the recount2 PLIER model. Could any one of them be well-correlated with the Banchereau, et al. neutrophil counts by chance?
recount.b.t <- as.data.frame(t(sle.recount.b))
recount.b.t <- tibble::rownames_to_column(recount.b.t, "Sample")
head(recount.b.t)
b.count.df <- dplyr::inner_join(x = neutro.df[, c("Sample",
"Neutrophil.Count")],
y = recount.b.t, by = "Sample")
rsq.all.lv <- (cor(x = b.count.df$Neutrophil.Count,
y = b.count.df[, 3:ncol(b.count.df)])) ^ 2
rsq.df <- as.data.frame(matrix(rsq.all.lv, ncol = 1))
colnames(rsq.df) <- "r.squared"
ggplot2::ggplot(rsq.df, ggplot2::aes(x = r.squared)) +
ggplot2::geom_density() +
ggplot2::geom_segment(mapping = ggplot2::aes(x = rsq.df[603, ],
xend = rsq.df[603, ],
y = 14, yend = 0),
arrow = grid::arrow(),
colour = "blue") +
ggplot2::theme_bw() +
ggplot2::theme(text = ggplot2::element_text(size = 13)) +
ggplot2::labs(x = "R-squared", subtitle = "Neutrophil Count ~ LV Value",
title = "Distribution of R-squared values") +
ggplot2::annotate(geom = "text", x = 0.36, y = 15, colour = "blue",
label = "LV603", size = 5)
plot.file <-
file.path(plot.dir, "Banchereau_neutrophil_count_recount2_lv_rsq.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())
Saving 7 x 7 in image
which.max(rsq.all.lv)
[1] 603
recount2 LV603
has the highest R-squared with the Banchereau, et al. neutrophil counts.
# get LV951 into data.frame form
plasma.lv.df <- as.data.frame(cbind(colnames(sle.recount.b),
sle.recount.b[951, ]))
colnames(plasma.lv.df) <- c("Sample", "recount2_LV951")
# join with SLE WB results
plasma.df <- dplyr::inner_join(sle.plasma.df, plasma.lv.df, by = "Sample") %>%
dplyr::mutate(recount2_LV951 = as.numeric(as.character(recount2_LV951)))
Column `Sample` joining character vector and factor, coercing into character vector
head(plasma.df)
plasma.file <- file.path(results.dir, "plasma_cell_LVs_both_models.tsv")
readr::write_tsv(plasma.df, plasma.file)
plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = recount2_LV951)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity") +
ggplot2::theme(text = ggplot2::element_text(size = 15))
# check for statistical significance
# (pairwise t-test is consistent with original publication as far as I can tell)
pairwise.t.test(x = plasma.df$recount2_LV951,
g = plasma.df$Disease.Activity,
p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: plasma.df$recount2_LV951 and plasma.df$Disease.Activity
DA1 DA2
DA2 0.047 -
DA3 5.0e-13 2.3e-06
P value adjustment method: bonferroni
This is highly similar to what we saw in 06-sle-wb_cell_type
.
Now we’ll plot the recount2 (recount2_LV951
) and SLE WB model (LV52
) results side by side as we did above with the neutrophil scatterplots.
sle.plot <- plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = LV52)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity",
title = "SLE WB PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold")) +
ggplot2::theme(text = ggplot2::element_text(size = 15))
recount.plot <- plasma.df %>%
ggplot2::ggplot(ggplot2::aes(x = Disease.Activity,
y = recount2_LV951)) +
ggplot2::geom_boxplot(notch = TRUE) +
ggplot2::theme_bw() +
ggplot2::labs(x = "Disease Activity",
title = "recount2 PLIER model") +
ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5,
face = "bold")) +
ggplot2::theme(text = ggplot2::element_text(size = 15))
pdf(file.path(plot.dir, "Plasma_cell_LV_model_comparison.pdf"), width = 14,
height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()
null device
1