J. Taroni 2018

In this notebook, we’ll clean diffuse intrinsic pontine glioma (DIPG) data. There is no DIPG data in recount2, so this is another use case for this project.

We’ll be working with two datasets stored in the greenelab/rheum-plier-data repository:

Citations:

Paugh BS, Broniscer A, Qu C, et al. Genome-wide analyses identify recurrent amplifications of receptor tyrosine kinases and cell-cycle regulatory genes in diffuse intrinsic pontine glioma. J Clin Oncol. 2011;29(30):3999-4006.

Buczkowicz P, Hoeman C, Rakopoulos P, et al. Genomic analysis of diffuse intrinsic pontine gliomas identifies three molecular subgroups and recurrent activating ACVR1 mutations. Nat Genet. 2014;46(5):451-6.

Set up

# magrittr pipe
`%>%` <- dplyr::`%>%`
# we need the function that aggregates duplicate gene identifiers to the
# mean value
source(file.path("util", "test_LV_differences.R"))

Directory setup

# directory that holds the gene expression files
exprs.dir <- file.path("data", "expression_data")
# directory that holds the sample metadata
sample.info.dir <- file.path("data", "sample_info")

Read in and clean data

GSE50021

We have the series matrix for GSE50021, which contains both the expression values and the sample metadata.

series.mat.file <- file.path(exprs.dir, "GSE50021_series_matrix.txt")
# expression matrix -- everything but the comment lines that begin with !
ma.data <- 
  readr::read_delim(series.mat.file, 
                    delim = "\t", 
                    comment = "!",
                    col_names = TRUE, 
                    skip = 1)
Parsed with column specification:
cols(
  .default = col_double(),
  ID_REF = col_character()
)
See spec(...) for full column specifications.
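
To make the role of the comment character concrete, here is a minimal sketch on a toy series-matrix-style file. It uses base R's read.delim rather than readr, and the file contents, accessions, and probe IDs are invented for illustration only.

```r
# Toy series-matrix-style file: metadata lines begin with "!" and the
# expression table sits between the table_begin/table_end markers.
# All values below are made up for illustration.
toy.lines <- c(
  "!Series_title\t\"toy series\"",
  "!Sample_geo_accession\t\"GSM0000001\"\t\"GSM0000002\"",
  "!series_matrix_table_begin",
  "\"ID_REF\"\t\"GSM0000001\"\t\"GSM0000002\"",
  "\"ILMN_0000001\"\t7.1\t6.9",
  "\"ILMN_0000002\"\t5.4\t5.8",
  "!series_matrix_table_end"
)
toy.file <- tempfile(fileext = ".txt")
writeLines(toy.lines, toy.file)

# Treating "!" as the comment character drops every metadata line, so only
# the header row and the expression values are parsed
toy.exprs <- read.delim(toy.file, comment.char = "!",
                        stringsAsFactors = FALSE)
print(toy.exprs)
```

Only the tab-delimited table survives; the metadata lines have to be read separately, which is what the sample metadata section of this notebook does.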

Gene identifier conversion

# The GPL information from GEO, which was made public on Jul 18, 2011 and 
# last updated on Jan 18, 2013
gpl.info.df <- readr::read_tsv(file.path(exprs.dir, "GPL13938-11302.txt"),
                               comment = "#")
Parsed with column specification:
cols(
  .default = col_character(),
  Entrez_Gene_ID = col_integer(),
  GI = col_integer(),
  Array_Address_Id = col_integer(),
  Probe_Start = col_integer()
)
See spec(...) for full column specifications.
number of columns of result is not a multiple of vector length (arg 1)
6 parsing failures.
# A tibble: 5 x 5
    row col   expected   actual     file
  <int> <chr> <chr>      <chr>      <chr>
1  6771 NA    29 columns 27 columns 'data/expression_data/GPL13938-11302.txt'
2  6772 NA    29 columns 27 columns 'data/expression_data/GPL13938-11302.txt'
3  8712 NA    29 columns 27 columns 'data/expression_data/GPL13938-11302.txt'
4  9671 NA    29 columns 27 columns 'data/expression_data/GPL13938-11302.txt'
5 19712 NA    29 columns 27 columns 'data/expression_data/GPL13938-11302.txt'

See problems(...) for more details.
annot.gse50021.df <- gpl.info.df %>%
  # from the GEO information, grab just the probe identifier and the gene
  # symbol columns
  dplyr::select(c(ID, Symbol)) %>%
  # keep only the probes (ILMN IDs) present in both the annotation and the
  # expression data
  dplyr::inner_join(ma.data, by = c("ID" = "ID_REF")) %>%
  # collapsing duplicate symbols later will require the symbols to be in the
  # first column, called "Gene", with no additional columns
  dplyr::mutate(Gene = Symbol) %>%
  dplyr::select(-ID, -Symbol) %>%
  dplyr::select(Gene, dplyr::everything())
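
The annotation step above can be sketched on toy data. This is a base R analogue using merge (an inner join by default), not the dplyr pipeline itself; the probe IDs, symbols, and values are invented for illustration.

```r
# Toy probe annotation (ID -> Symbol); two probes map to the same symbol
toy.annot <- data.frame(
  ID = c("ILMN_0000001", "ILMN_0000002", "ILMN_0000003"),
  Symbol = c("GENE_A", "GENE_B", "GENE_A"),
  stringsAsFactors = FALSE
)
# Toy expression data keyed by probe; ILMN_0000004 has no annotation
toy.exprs <- data.frame(
  ID_REF = c("ILMN_0000001", "ILMN_0000003", "ILMN_0000004"),
  sample1 = c(1.5, 2.5, 9.0),
  stringsAsFactors = FALSE
)

# merge() keeps only the keys found in both tables (all = FALSE by default)
joined <- merge(toy.annot, toy.exprs, by.x = "ID", by.y = "ID_REF")

# Put the symbol first in a column called "Gene" and drop the probe columns
sample.cols <- setdiff(names(joined), c("ID", "Symbol"))
toy.gene.df <- data.frame(Gene = joined$Symbol,
                          joined[, sample.cols, drop = FALSE],
                          stringsAsFactors = FALSE)
print(toy.gene.df)
```

Both surviving rows carry the same symbol, which is why duplicate symbols still need to be collapsed afterwards.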

Collapse duplicate symbols and write to file.

# summarize to mean
annot.mean.df <- PrepExpressionDF(annot.gse50021.df)
readr::write_tsv(annot.mean.df, file.path(exprs.dir, "GSE50021_mean_agg.pcl"))
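
PrepExpressionDF is defined in the sourced utility file, so its implementation is not shown here. As a rough sketch of what "collapse duplicate symbols to the mean" amounts to, the hypothetical helper CollapseToMean below does the equivalent with base R's aggregate on invented values; it is an illustration under that assumption, not the project's function.

```r
# Toy expression table with the gene symbol in the first column ("Gene");
# GENE_A appears twice. Values are made up for illustration.
toy.df <- data.frame(
  Gene = c("GENE_A", "GENE_B", "GENE_A", "GENE_C"),
  sample1 = c(2, 4, 4, 1),
  sample2 = c(6, 5, 8, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: average every sample column within each gene symbol
CollapseToMean <- function(exprs.df) {
  stats::aggregate(exprs.df[, -1, drop = FALSE],
                   by = list(Gene = exprs.df$Gene),
                   FUN = mean)
}

collapsed.df <- CollapseToMean(toy.df)
print(collapsed.df)
```

After collapsing, each symbol appears once, and the two GENE_A rows are replaced by their per-sample means.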

Sample metadata

As mentioned above, the series matrix file also contains the sample metadata. We extract it one line at a time, choosing lines based on their relevance to any downstream analysis we might do (contact information, for example, does not help us in this context).

We’ll write a custom function specifically for this context and environment.

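# given the number of lines to skip in the series matrix file
# (series.mat.file, taken from the global environment), read the single
# line that follows and return its values as a one-row data.frame
GetSampleAttributes <- function(skip.value) {
  conn <- file(series.mat.file)
  open(conn)
  # skip the first skip.value lines, then read exactly one line
  sample.attributes <- read.table(conn, skip = skip.value, nrows = 1)
  close(conn)
  return(sample.attributes)
}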
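
The skip-then-read-one-line idea can be sketched on a toy file. The helper name ReadOneLine, the file contents, and the accessions are hypothetical, and unlike the function above this sketch sets sep and stringsAsFactors explicitly.

```r
# Toy metadata lines in series matrix style; values are made up
toy.lines <- c(
  "!Sample_title\t\"toy_1\"\t\"toy_2\"",
  "!Sample_geo_accession\t\"GSM0000001\"\t\"GSM0000002\"",
  "!Sample_characteristics_ch1\t\"tissue: toy\"\t\"tissue: toy\""
)
toy.file <- tempfile(fileext = ".txt")
writeLines(toy.lines, toy.file)

# Hypothetical helper: skip `skip.value` lines, then read a single line.
# read.table's default comment character is "#", so "!" lines are kept.
ReadOneLine <- function(path, skip.value) {
  read.table(path, skip = skip.value, nrows = 1, sep = "\t",
             stringsAsFactors = FALSE)
}

# skipping one line lands on the accession line
accession <- ReadOneLine(toy.file, skip.value = 1)
# the first field is the label; the remaining fields are per-sample values
accession.values <- unname(unlist(accession[1, -1]))
print(accession.values)
```

Because the first field of every line is its label, that field has to be dropped once the lines are combined.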
# sample accession e.g., GSMXXXXX
sample.accession <- GetSampleAttributes(skip.value = 79)
# source name
source.name <- GetSampleAttributes(skip.value = 85)
# tissue
tissue <- GetSampleAttributes(skip.value = 88)
# gender
gender <- GetSampleAttributes(skip.value = 89)
# age at diagnosis
age.dx <- GetSampleAttributes(skip.value = 90)
# overall survival
survival <- GetSampleAttributes(skip.value = 91)
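# get those lines into data.frame format: bind the lines together, transpose
# so that each sample is a row, and drop the first row, which holds the
# field labels rather than sample values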
smpl.info.df <- as.data.frame(t(dplyr::bind_rows(sample.accession, 
                                                 source.name, 
                                                 tissue,
                                                 gender,
                                                 age.dx,
                                                 survival))[-1, ]) 
Unequal factor levels: coercing to character
binding character and factor vector, coercing into character vector
vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into 
character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, 
coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vectorbinding character and factor vector, coercing into character vector
colnames(smpl.info.df) <- c("sample_accession", "source_name", "tissue",
                            "gender", "age_at_diagnosis_yrs", 
                            "overall_survival_yrs")
# strip extraneous strings
smpl.info.df <- smpl.info.df %>%
  dplyr::mutate(tissue = gsub("cell type: ", "", tissue),
                gender = gsub("gender: ", "", gender),
                age_at_diagnosis_yrs = gsub("age at dx \\(years\\): ", "", 
                                            age_at_diagnosis_yrs),
                overall_survival_yrs = gsub("os \\(years\\): ", "", 
                                            overall_survival_yrs))
# change "N/A" to NA
smpl.info.df[which(smpl.info.df == "N/A", arr.ind = TRUE)] <- NA
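
After the prefixes are stripped and "N/A" is set to NA, the age and survival columns are most likely still character vectors. The following is a minimal sketch, on a toy data.frame whose values are invented for illustration (not taken from GSE50021), of how such columns could be converted to numeric:

```r
# toy metadata with the same kind of prefixes; values are invented
toy.df <- data.frame(
  gender = c("gender: F", "gender: M", "gender: N/A"),
  age_at_diagnosis_yrs = c("age at dx (years): 6.5",
                           "age at dx (years): N/A",
                           "age at dx (years): 9"),
  stringsAsFactors = FALSE
)
# strip the prefixes; fixed = TRUE avoids having to escape the parentheses
toy.df$gender <- sub("gender: ", "", toy.df$gender, fixed = TRUE)
toy.df$age_at_diagnosis_yrs <- sub("age at dx (years): ", "",
                                   toy.df$age_at_diagnosis_yrs, fixed = TRUE)
# change "N/A" to NA
toy.df[toy.df == "N/A"] <- NA
# the column is still character at this point, so convert it explicitly
toy.df$age_at_diagnosis_yrs <- as.numeric(toy.df$age_at_diagnosis_yrs)
str(toy.df)
```

Whether the numeric conversion is wanted before writing the TSV is a judgment call, since readr will typically guess the column type again when the file is read back in.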

Write the cleaned metadata to a TSV file.

readr::write_tsv(smpl.info.df, 
                 file.path(sample.info.dir, "GSE50021_cleaned_metadata.tsv"))

Clean up the workspace a bit before working with the next dataset.

to.keep <- c("%>%", "exprs.dir", "sample.info.dir", "PrepExpressionDF")
rm(list = setdiff(ls(), to.keep))

E-GEOD-26576

We processed this data set with a BrainArray package that uses Entrez gene IDs, but PLIER requires gene symbols, so we need to convert the identifiers.

# SCANfast processed PCL
gse26576.file <- file.path(exprs.dir, 
                           "DIPG_E-GEOD-26576_hgu133plus2_SCANfast.pcl")
# read in the PCL file, strip the trailing "_at" that Brainarray appends to the
# Entrez gene identifiers, drop the original identifier column, and reorder
# such that the gene identifiers are the first column
gse26576.df <- readr::read_tsv(gse26576.file) %>%
  dplyr::mutate(EntrezID = sub("_at$", "", X1)) %>%
  dplyr::select(-X1) %>%
  dplyr::select(EntrezID, dplyr::everything())
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  .default = col_double(),
  X1 = col_character()
)
See spec(...) for full column specifications.
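
Anchoring the pattern to the end of the string ensures that only the suffix is removed. Below is a small sketch with made-up identifiers; whether control probe sets such as the last one appear in this particular PCL file is an assumption:

```r
# toy identifiers in the style of Brainarray probe set names
ids <- c("1_at", "10_at", "100_at", "AFFX-BioB-5_at")
# "_at$" only matches the suffix at the very end of each identifier
stripped <- sub("_at$", "", ids)
stripped
# anything that is not purely numeric is not an Entrez ID
is.entrez <- grepl("^[0-9]+$", stripped)
stripped[is.entrez]
```

Non-numeric identifiers would have no match in an Entrez-keyed annotation table, so an inner join on such a table would drop them.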

Gene identifier conversion

# extract the Entrez ID to gene symbol mapping from org.Hs.eg.db
symbol.obj <- org.Hs.eg.db::org.Hs.egSYMBOL
mapped.genes <- AnnotationDbi::mappedkeys(symbol.obj)
symbol.list <- as.list(symbol.obj[mapped.genes])
symbol.df <- as.data.frame(cbind(names(symbol.list), unlist(symbol.list)))
colnames(symbol.df) <- c("EntrezID", "GeneSymbol")

Join the annotation data.frame to the expression data.frame

annot.gse26576.df <- symbol.df %>%
  dplyr::inner_join(gse26576.df, by = "EntrezID")
Column `EntrezID` joining factor and character vector, coercing into character vector
rm(symbol.df)
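
Because this is an inner join, Entrez IDs with no gene symbol in the annotation are dropped without any message. Here is a sketch of how one might list them, using base `merge()` on toy tables as a stand-in for `dplyr::inner_join()` (the expression values and the unmatched ID are invented):

```r
toy.symbol.df <- data.frame(EntrezID = c("1", "2", "9"),
                            GeneSymbol = c("A1BG", "A2M", "NAT1"),
                            stringsAsFactors = FALSE)
toy.exprs.df <- data.frame(EntrezID = c("1", "2", "99999999"),
                           sample_a = c(0.1, 0.5, -0.3),
                           stringsAsFactors = FALSE)
# merge() with the default all = FALSE keeps only IDs present in both tables
toy.annot.df <- merge(toy.symbol.df, toy.exprs.df, by = "EntrezID")
# expression rows that did not receive a gene symbol
dropped.ids <- setdiff(toy.exprs.df$EntrezID, toy.annot.df$EntrezID)
dropped.ids
```

In the notebook itself the same comparison would have to be made before `symbol.df` is removed, or directly between `gse26576.df$EntrezID` and `annot.gse26576.df$EntrezID`.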

Are there any duplicates?

any(duplicated(annot.gse26576.df$GeneSymbol))
[1] FALSE

No, so we don’t need to aggregate duplicate gene symbols. Write the data.frame that includes gene symbols to file.

gse26576.output.file <- 
  file.path(exprs.dir, 
            "DIPG_E-GEOD-26576_hgu133plus2_SCANfast_with_GeneSymbol.pcl")
readr::write_tsv(annot.gse26576.df, gse26576.output.file)
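
Had the duplicate check returned TRUE, the repeated gene symbols would need to be collapsed before use with PLIER. This is a minimal sketch of averaging duplicates with base `aggregate()` on toy data. It is an illustration only, not the project's own helper function for this step, which is not shown here:

```r
# toy expression data with one duplicated gene symbol; values are invented
toy.df <- data.frame(GeneSymbol = c("GENE1", "GENE2", "GENE1"),
                     sample_a = c(1, 4, 3),
                     sample_b = c(2, 6, 4),
                     stringsAsFactors = FALSE)
any(duplicated(toy.df$GeneSymbol))
# average every sample column within each gene symbol
mean.df <- aggregate(. ~ GeneSymbol, data = toy.df, FUN = mean)
mean.df
```

Note that the formula interface of `aggregate()` omits rows containing NA by default, which may or may not be the desired behavior for real expression data.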

Sample Metadata

meta.gse26576.file <- file.path(sample.info.dir, "E-GEOD-26576.sdrf.txt")
meta.gse26576.df <- readr::read_tsv(meta.gse26576.file)
Duplicated column names deduplicated: 'Term Source REF' => 'Term Source REF_1' [13], 'Term Accession Number' => 'Term Accession Number_1' [14], 'Term Source REF' => 'Term Source REF_2' [16], 'Term Accession Number' => 'Term Accession Number_2' [17], 'Protocol REF' => 'Protocol REF_1' [24], 'Protocol REF' => 'Protocol REF_2' [27], 'Term Source REF' => 'Term Source REF_3' [30], 'Protocol REF' => 'Protocol REF_3' [31], 'Protocol REF' => 'Protocol REF_4' [32], 'Protocol REF' => 'Protocol REF_5' [35], 'Term Source REF' => 'Term Source REF_4' [43], 'Term Accession Number' => 'Term Accession Number_3' [44], 'Term Source REF' => 'Term Source REF_5' [46], 'Term Accession Number' => 'Term Accession Number_4' [47]
Parsed with column specification:
cols(
  .default = col_character(),
  `Characteristics[age at diagnosis (years)]` = col_double(),
  `Characteristics[age at diagonosis (years)]` = col_double(),
  `Characteristics[survival (years)]` = col_double()
)
See spec(...) for full column specifications.

The sample-data relationship files from ArrayExpress (specifically, their column names) are pretty tidyverse-unfriendly.

cleaned.meta.df <- data.frame(
  sample_id = gsub(" 1", "", meta.gse26576.df$`Source Name`),
  sample_file = meta.gse26576.df$`Array Data File`,
  sample_title = meta.gse26576.df$`Comment [Sample_title]`,
  age_at_diagnosis = meta.gse26576.df$`Characteristics[age at diagnosis (years)]`,
  disease_state = meta.gse26576.df$`Characteristics[disease]`,
  histology = meta.gse26576.df$`Characteristics[histology]`,
  sample_collection = meta.gse26576.df$`Characteristics[sample collection]`,
  material_type = meta.gse26576.df$`Material Type`
) %>%
  # get rid of the genomic DNA samples
  dplyr::filter(material_type == "total RNA") %>%
  dplyr::select(-material_type)
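
Column names that contain spaces or brackets are non-syntactic in R, so they have to be wrapped in backticks or accessed with `[[`. The sketch below uses a toy data.frame (the column names mimic the SDRF style and the values are invented) to show both access styles and a one-off rename:

```r
toy.sdrf <- data.frame(c("s1 1", "s2 1"), c("total RNA", "genomic DNA"),
                       stringsAsFactors = FALSE)
colnames(toy.sdrf) <- c("Source Name", "Material Type")
# backticks and [[ ]] are equivalent ways to reach a non-syntactic name
identical(toy.sdrf$`Material Type`, toy.sdrf[["Material Type"]])
# renaming once makes the later steps easier to write
colnames(toy.sdrf) <- c("sample_id", "material_type")
rna.df <- toy.sdrf[toy.sdrf$material_type == "total RNA", , drop = FALSE]
rna.df
```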

There is information in the sample_title field that can help us fill in the disease_state blanks.

cleaned.meta.df <- cleaned.meta.df %>%
  # conditions are checked in order; titles that match none of them become NA
  dplyr::mutate(disease_state = dplyr::case_when(
    grepl("normal", sample_title) ~ "normal",
    grepl("low", sample_title) ~ "LGG",
    grepl("DIPG", sample_title) ~ "DIPG",
    grepl("Glioblastoma", sample_title) ~ "Glioblastoma"
  ))
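
`dplyr::case_when()` returns NA for any row that matches none of the conditions, and the `grepl()` patterns are case-sensitive, so it may be worth checking for samples left without a label. Here is a sketch of the same first-match-wins logic and that check in base R, on invented titles:

```r
toy.titles <- c("DIPG tumor 1", "normal brainstem", "low-grade glioma 3",
                "Glioblastoma 2", "other tumor 5")
patterns <- c(normal = "normal", LGG = "low", DIPG = "DIPG",
              Glioblastoma = "Glioblastoma")
# the first matching pattern wins, mirroring the order of the conditions
toy.state <- vapply(toy.titles, function(title) {
  hits <- names(patterns)[vapply(patterns, grepl, logical(1), x = title)]
  if (length(hits) > 0) hits[1] else NA_character_
}, character(1), USE.NAMES = FALSE)
# titles that no pattern matched
toy.titles[is.na(toy.state)]
table(toy.state, useNA = "ifany")
```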

The sample_file field will match the headers of the PCL file. Write the cleaned metadata to file.

readr::write_tsv(cleaned.meta.df, 
                 file.path(sample.info.dir, 
                           "E-GEOD-26576_cleaned_metadata.tsv"))

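One way to confirm that sample_file lines up with the PCL headers is a set comparison in both directions. This is a sketch on toy objects (the file names and values are invented):

```r
toy.pcl.df <- data.frame(GeneSymbol = c("GENE1", "GENE2"),
                         GSM1.CEL = c(0.2, 0.4),
                         GSM2.CEL = c(0.1, 0.3),
                         stringsAsFactors = FALSE)
toy.meta.df <- data.frame(sample_file = c("GSM1.CEL", "GSM2.CEL", "GSM3.CEL"),
                          stringsAsFactors = FALSE)
# every column except the gene identifier column is a sample
sample.cols <- setdiff(colnames(toy.pcl.df), "GeneSymbol")
# metadata rows without an expression column, and vice versa
missing.from.pcl <- setdiff(toy.meta.df$sample_file, sample.cols)
missing.from.meta <- setdiff(sample.cols, toy.meta.df$sample_file)
missing.from.pcl
missing.from.meta
```

Two empty results would indicate a one-to-one match between the metadata rows and the expression columns.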