J. Taroni 2018
In 06-sle-wb_cell_type, we examined how LVs from the SLE WB PLIER model (results/05/SLE-WB_PLIER_model.RDS) compared to known patterns in the Banchereau, et al. data set that was included in the SLE WB compendium (and therefore, training).
Specifically, we asked:
The answer to both of these questions is yes. This demonstrated the utility of PLIER for extracting cell type patterns from heterogeneous (at least in terms of patient population and platform) gene expression data from a single tissue (whole blood) that had been assembled into a compendium (greenelab/rheum-plier-data/sle-wb).
SLE is not a rare condition – though the molecular heterogeneity and multi-tissue nature of the disease warrant sophisticated approaches to the analysis of transcriptomic data. (This SLE WB compendium is comprised of 1640 samples and is incomplete; we picked 7 data sets based on the platform used and the availability of raw data.) Thus, we can use data-intensive approaches (e.g., PLIER) that are inappropriate for smaller data sets and, by extension, rare or understudied diseases.
Using a “shovel-ready” data set like recount2 to train a PLIER model (see greenelab/rheum-plier-data/recount2) and applying this model to smaller data sets may “get around” the data-intensive requirement, but more investigation is warranted.
In this notebook, we examine whether we can use the recount2 PLIER model (data/recount2_PLIER_data/recount_PLIER_model.RDS) to analyze neutrophil and plasma cell patterns in the Banchereau, et al. data. If the recount2 model performance is similar to that of the SLE model (examined in 06-sle-wb_cell_type), that is evidence supporting the validity of this transfer learning approach.
`%>%` <- dplyr::`%>%`Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  EOF within quoted stringlibrary(PLIER)Loading required package: RColorBrewer
Loading required package: gplots
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
    lowess
Loading required package: pheatmap
Loading required package: glmnet
Loading required package: Matrix
Loading required package: foreach
Loaded glmnet 2.0-13
Loading required package: knitr
Loading required package: rsvd
Loading required package: qvalue# custom functions
source(file.path("util", "plier_util.R"))# plot and result directory setup for this notebook
plot.dir <- file.path("plots", "07")
dir.create(plot.dir, recursive = TRUE, showWarnings = FALSE)
results.dir <- file.path("results", "07")
dir.create(results.dir, recursive = TRUE, showWarnings = FALSE)recount.plier <- readRDS(file.path("data", "recount2_PLIER_data", 
                                   "recount_PLIER_model.RDS"))symbol.file <- 
  file.path("data", "expression_data", 
            "SLE_WB_all_microarray_QN_zto_before_with_GeneSymbol.pcl")
sle.exprs.df <- readr::read_tsv(symbol.file, progress = FALSE)Parsed with column specification:
cols(
  .default = col_double(),
  EntrezID = col_integer(),
  GeneSymbol = col_character()
)
See spec(...) for full column specifications.# matrix with gene symbol as rownames
exprs.mat <- dplyr::select(sle.exprs.df, -EntrezID)
rownames(exprs.mat) <- exprs.mat$GeneSymbolSetting row names on a tibble is deprecated.exprs.mat <- as.matrix(dplyr::select(exprs.mat, -GeneSymbol))From 06-sle-wb_cell_type
neutro.file <- file.path("results", "06",
                         "Banchereau_count_SLE_model_neutrophil_LV.tsv")
sle.neutro.df <- readr::read_tsv(neutro.file)Parsed with column specification:
cols(
  Sample = col_character(),
  LV2 = col_double(),
  LV27 = col_double(),
  LV34 = col_double(),
  LV87 = col_double(),
  Neutrophil.Count = col_double()
)plasma.file <- file.path("results", "06", 
                         "Banchereau_DA_group_SLE_model_plasma_cell_LVs.tsv")
sle.plasma.df <- readr::read_tsv(plasma.file)Parsed with column specification:
cols(
  Sample = col_character(),
  LV52 = col_double(),
  LV136 = col_double(),
  Disease.Activity = col_character()
)# summary information from the recount2 PLIER model
recount.summary <- recount.plier$summaryAre there recount2 LVs associated with neutrophil gene sets?
recount.summary %>%
  dplyr::filter(grepl("Neutrophil", pathway), 
                FDR < 0.05)The following latent variables look most promising: LV524, LV603, LV985 (AUC > 0.75 for both SVM Neutrophils and IRIS_Neutrophil-Resting)
What other gene sets are associated with these LVs?
recount.summary %>%
  dplyr::filter(`LV index` %in% c(524, 603, 985),
                FDR < 0.05) %>%
  dplyr::arrange(`LV index`, desc(AUC))Some notes on these results:
LV524 is also associated with the following monocyte-related gene sets: SVM Monocytes, DMAP_MONO1, and DMAP_MONO2. Again, we would like something that tells us about neutrophils only if at all possible.LV603 is associated with two other gene sets, PID_IL8CXCR2_PATHWAY and SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES. It’s worth noting that interleukin 8 (IL8) is a neutrophil chemoattractant, and SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES may just be related to PI3K signaling & chemotaxis in general.LV985 is also associated with SIG_PIP3_SIGNALING_IN_B_LYMPHOCYTES and PID_FCER1PATHWAY. FCER1 refers to Fc-epsilon receptor 1 which is mainly expressed on other granulocytes.PLIER::plotU(plierRes = recount.plier,
             pval.cutoff = 1e-06,
             indexCol = c(524, 603, 985),
             top = 10)png(file.path(plot.dir, "recount2_neutrophil_Uplot.png"), 
    res = 300, width = 7, height = 7, units = "in")
PLIER::plotU(plierRes = recount.plier,
             pval.cutoff = 1e-06,
             indexCol = c(524, 603, 985),
             top = 10)
dev.off()null device 
          1 Are there LVs associated with plasma cell gene sets in the recount2 model?
recount.summary %>%
  dplyr::filter(grepl("Plasma", pathway), 
                FDR < 0.05)What other gene sets are associated with LV951?
recount.summary %>%
  dplyr::filter(`LV index` == 951)KEGG_INTESTINAL_IMMUNE_NETWORK_FOR_IGA_PRODUCTION is not a significant association (FDR = 0.13) There were 2 LVs from the SLE WB PLIER model that were associated with plasma cell gene sets. Those LVs were also associated with other processes (e.g., REACTOME_UNFOLDED_PROTEIN_RESPONSE, KEGG_HEDGEHOG_SIGNALING_PATHWAY)
sle.recount.b <- GetNewDataB(exprs.mat = exprs.mat,
                             plier.model = recount.plier)
b.file <- file.path(results.dir, "SLE-WB_B_matrix_recount2_model.RDS")
saveRDS(sle.recount.b, file = b.file)neutro.lv.df <- as.data.frame(cbind(colnames(sle.recount.b), 
                                    t(sle.recount.b[c(524, 603, 985), ])))
colnames(neutro.lv.df) <- c("Sample", "recount2_LV524", "recount2_LV603", 
                            "recount2_LV985")# join with sle wb results & neutrophil counts
neutro.df <- dplyr::inner_join(sle.neutro.df, neutro.lv.df) %>%
              dplyr::mutate(recount2_LV524 = as.numeric(as.character(recount2_LV524)), 
                            recount2_LV603 = as.numeric(as.character(recount2_LV603)),
                            recount2_LV985 = as.numeric(as.character(recount2_LV985)))Joining, by = "Sample"
Column `Sample` joining character vector and factor, coercing into character vectorhead(neutro.df)neutro.out.file <- file.path(results.dir, "neutrophil_count_LV_both_models.tsv")
readr::write_tsv(neutro.df, path = neutro.out.file)# function for scatter plots, given lv (a string indicating the LV of interest)
# make a scatter plot where the LV is the x variable, neutrophil count is the
# y variable & fit a line with geom_smooth(method = "lm")
# also will annotate the plot with the supplied r-squared value (rsq arg)
# where the text is placed is automatically chosen from the x and y values
# needs to be used in this global environment
LVScatter <- function(lv, rsq) {
  y.var <- "Neutrophil.Count"
  
  # calculate where to put the r-squared value
  x.range <- max(neutro.df[, lv]) - min(neutro.df[, lv])
  x.coord <- min(neutro.df[, lv]) + (x.range * 0.15)
  y.range <- max(neutro.df[, y.var]) - min(neutro.df[, y.var])
  y.coord <- max(neutro.df[, y.var]) - (y.range * 0.15)
  
  ggplot2::ggplot(neutro.df, ggplot2::aes_string(x = lv, y = y.var)) +
    ggplot2::geom_point(alpha = 0.7) +
    ggplot2::geom_smooth(method = "lm") +
    ggplot2::theme_bw() +
    ggplot2::labs(y = "Neutrophil Count") +
    ggplot2::theme(legend.position = "none", 
                   text = ggplot2::element_text(size = 15)) +
    ggplot2::annotate("text", x = x.coord, y = y.coord, 
                      label = paste("r-squared =", rsq))
}summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV524))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV524)
Residuals:
    Min      1Q  Median      3Q     Max 
-8.8588 -1.1612 -0.2481  0.9283 10.7119 
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.12742    0.06649   62.08   <2e-16 ***
neutro.df$recount2_LV524 20.67546    0.97663   21.17   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.94 on 851 degrees of freedom
Multiple R-squared:  0.345, Adjusted R-squared:  0.3442 
F-statistic: 448.2 on 1 and 851 DF,  p-value: < 2.2e-16LVScatter(lv = "recount2_LV524", rsq = 0.34)summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV603))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV603)
Residuals:
    Min      1Q  Median      3Q     Max 
-4.1245 -1.1945 -0.2527  0.8522 12.0900 
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.00409    0.06612   60.56   <2e-16 ***
neutro.df$recount2_LV603 11.09268    0.50550   21.94   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.916 on 851 degrees of freedom
Multiple R-squared:  0.3614,    Adjusted R-squared:  0.3606 
F-statistic: 481.5 on 1 and 851 DF,  p-value: < 2.2e-16LVScatter(lv = "recount2_LV603", rsq = 0.36)summary(lm(neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV985))
Call:
lm(formula = neutro.df$Neutrophil.Count ~ neutro.df$recount2_LV985)
Residuals:
    Min      1Q  Median      3Q     Max 
-4.7147 -1.2945 -0.3547  0.8808 12.6068 
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               4.06785    0.06814   59.70   <2e-16 ***
neutro.df$recount2_LV985 14.77640    0.74475   19.84   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.982 on 851 degrees of freedom
Multiple R-squared:  0.3163,    Adjusted R-squared:  0.3155 
F-statistic: 393.7 on 1 and 851 DF,  p-value: < 2.2e-16LVScatter(lv = "recount2_LV985", rsq = 0.32)The neutrophil-associated LVs from the recount2 model generally perform as well as the neutrophil-associated LVs from the SLE WB PLIER model.
We’ll plot the results from SLE WB LV87 and recount2 LV603 together because these are both LVs that are not significantly associated with monocyte signatures and have high AUC for the neutrophil gene sets.
sle.plot <- LVScatter("LV87", rsq = 0.29) +
  ggplot2::labs(title = "SLE WB PLIER model") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, 
                                                    face = "bold"))recount.plot <- LVScatter("recount2_LV603", rsq = 0.36) +
  ggplot2::labs(title = "recount2 PLIER model") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, 
                                                    face = "bold"))pdf(file.path(plot.dir, "Neutrophil_LV_model_comparison.pdf"), width = 14,
    height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()null device 
          1 summary(lm(neutro.df$LV87 ~ neutro.df$recount2_LV603))
Call:
lm(formula = neutro.df$LV87 ~ neutro.df$recount2_LV603)
Residuals:
     Min       1Q   Median       3Q      Max 
-0.66051 -0.07107  0.01107  0.08593  0.33998 
Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              -0.016814   0.004702  -3.576 0.000369 ***
neutro.df$recount2_LV603  2.753708   0.035945  76.610  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1362 on 851 degrees of freedom
Multiple R-squared:  0.8734,    Adjusted R-squared:  0.8732 
F-statistic:  5869 on 1 and 851 DF,  p-value: < 2.2e-16ggplot2::ggplot(neutro.df, ggplot2::aes(x = recount2_LV603, 
                                        y = LV87)) +
    ggplot2::geom_point(alpha = 0.7) +
    ggplot2::geom_smooth(method = "lm") +
    ggplot2::theme_bw() +
    ggplot2::labs(x = "recount2 LV603", y = "SLE WB LV87") +
    ggplot2::theme(legend.position = "none", 
                   text = ggplot2::element_text(size = 15)) +
    ggplot2::annotate(geom = "text", x = -0.375, y = 1,  
                      label = "r-squared = 0.87") plot.file <- 
  file.path(plot.dir, "Banchereau_neutrophil_count_LV87_v_recLV603_scatter.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())Saving 7 x 7 in imageThere are 987 latent variables from the recount2 PLIER model. Could any one of them be well-correlated with the Banchereau, et al. neutrophil counts by chance?
recount.b.t <- as.data.frame(t(sle.recount.b))
recount.b.t <- tibble::rownames_to_column(recount.b.t, "Sample")
head(recount.b.t)b.count.df <- dplyr::inner_join(x = neutro.df[, c("Sample", 
                                                  "Neutrophil.Count")],
                                y = recount.b.t, by = "Sample")rsq.all.lv <- (cor(x = b.count.df$Neutrophil.Count, 
                   y = b.count.df[, 3:ncol(b.count.df)])) ^ 2
rsq.df <- as.data.frame(matrix(rsq.all.lv, ncol = 1))
colnames(rsq.df) <- "r.squared"ggplot2::ggplot(rsq.df, ggplot2::aes(x = r.squared)) +
  ggplot2::geom_density() +
  ggplot2::geom_segment(mapping = ggplot2::aes(x = rsq.df[603, ], 
                                               xend = rsq.df[603, ],
                                               y = 14, yend = 0),
                        arrow = grid::arrow(),
                        colour = "blue") +
  ggplot2::theme_bw() +
  ggplot2::theme(text = ggplot2::element_text(size = 13)) +
  ggplot2::labs(x = "R-squared", subtitle = "Neutrophil Count ~ LV Value", 
                title = "Distribution of R-squared values") +
  ggplot2::annotate(geom = "text", x = 0.36, y = 15, colour = "blue", 
                    label = "LV603", size = 5) plot.file <- 
  file.path(plot.dir, "Banchereau_neutrophil_count_recount2_lv_rsq.png")
ggplot2::ggsave(plot.file, plot = ggplot2::last_plot())Saving 7 x 7 in imagewhich.max(rsq.all.lv)[1] 603recount2 LV603 has the highest R-squared with the Banchereau, et al. neutrophil counts.
# get LV951 into data.frame form
plasma.lv.df <- as.data.frame(cbind(colnames(sle.recount.b), 
                                    sle.recount.b[951, ]))
colnames(plasma.lv.df) <- c("Sample", "recount2_LV951")# join with SLE WB results
plasma.df <- dplyr::inner_join(sle.plasma.df, plasma.lv.df, by = "Sample") %>%
  dplyr::mutate(recount2_LV951 = as.numeric(as.character(recount2_LV951)))Column `Sample` joining character vector and factor, coercing into character vectorhead(plasma.df)plasma.file <- file.path(results.dir, "plasma_cell_LVs_both_models.tsv")
readr::write_tsv(plasma.df, plasma.file)plasma.df %>%
  ggplot2::ggplot(ggplot2::aes(x = Disease.Activity, 
                               y = recount2_LV951)) +
  ggplot2::geom_boxplot(notch = TRUE) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Disease Activity") +
  ggplot2::theme(text = ggplot2::element_text(size = 15))# check for statistical significance 
# (pairwise t-test is consistent with original publication as far as I can tell)
pairwise.t.test(x = plasma.df$recount2_LV951, 
                g = plasma.df$Disease.Activity, 
                p.adjust.method = "bonferroni")
    Pairwise comparisons using t tests with pooled SD 
data:  plasma.df$recount2_LV951 and plasma.df$Disease.Activity 
    DA1     DA2    
DA2 0.047   -      
DA3 5.0e-13 2.3e-06
P value adjustment method: bonferroni This is highly similar to what we saw in 06-sle-wb_cell_type.
Now we’ll plot the recount2 (recount2_LV951) and SLE WB model (LV52) results side by side as we did above with the neutrophil scatterplots.
sle.plot <- plasma.df %>%
  ggplot2::ggplot(ggplot2::aes(x = Disease.Activity, 
                               y = LV52)) +
  ggplot2::geom_boxplot(notch = TRUE) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Disease Activity",
                title = "SLE WB PLIER model") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, 
                                                    face = "bold")) +
  ggplot2::theme(text = ggplot2::element_text(size = 15))recount.plot <- plasma.df %>%
  ggplot2::ggplot(ggplot2::aes(x = Disease.Activity, 
                               y = recount2_LV951)) +
  ggplot2::geom_boxplot(notch = TRUE) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Disease Activity",
                title = "recount2 PLIER model") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5, 
                                                    face = "bold")) +
  ggplot2::theme(text = ggplot2::element_text(size = 15))pdf(file.path(plot.dir, "Plasma_cell_LV_model_comparison.pdf"), width = 14,
    height = 7)
gridExtra::grid.arrange(sle.plot, recount.plot, ncol = 2)
dev.off()null device 
          1