Dear forum,
I am working on a comparison study involving TCGA RNA-seq data.
What I wanted to do is a comparison of cancer vs normal data using the HTSEQ-count data as downloadable from TCGA GDC portal. I would like to use the vst normalized data for downstream analysis (exploratory clustering etc.).
Following the standard DESeq2 workflow as described in the current vignette, I get a list of differentially expressed genes and a vst-normalized dataset.
If I plot that data, I get the following plot:
meanSDPlot of vst normalized data
To me, this does not look like a more or less straight line at all. Which means there appears to be a problem with heteroscedasticity.
So the question is: I do understand, that differential gene expression analysis in DESeq2 operates on the unnormalized count data and is not dependent on the assumption, that data is homoscedastic, meaning the above finding is not a problem for calling DE genes (correct me, if this is wrong, please).
However, as I would like to use the vst normalized data for downstream analysis (clustering, etc) -do I need to worry about the output here? And if so, what can I do to improve the situation?
What follows is a condensed step-by-step description of what I did to arrive at this plot -read on if you want to crosscheck to spot any errors:
Libraries
library("tidyverse")
library("readr")
library("stringr")
library("DESeq2")
library("TCGAbiolinks")
library("DEGreport")
library("RColorBrewer")
library("pheatmap")
library("vsn")
Data download
I downloaded the available RNA-seq data for TCGA-GBM and TCGA-LGG:
GDCDIRECTORY <- "/home/jovyan/data//gdcdata/GDCdata"
GDCCATEGORY <- "Transcriptome Profiling"
GDCDATATYPE <- "Gene Expression Quantification"
GDCWORKFLOW <- "HTSeq - Counts"
GDCSTRATEGY <- "RNA-Seq"
TCGA-GBM Glioblastoma multiforme primary tumor HTSeq count data
query <- GDCquery(project = "TCGA-GBM",
data.category = GDCCATEGORY,
sample.type = c("Primary solid Tumor"),
data.type = GDCDATATYPE,
workflow.type = GDCWORKFLOW,
experimental.strategy = GDCSTRATEGY,
legacy = FALSE)
GDCdownload(query,
directory = GDCDIRECTORY,
method = "api",
files.per.chunk = 10)
gbm_primary_tumor_counts <- as_tibble(GDCprepare(query,
summarizedExperiment = FALSE,
directory = GDCDIRECTORY
)
)
Data is being converted to a tibble for filtering and postprocessing
gbm_primary_tumor_counts_tl <- gbm_primary_tumor_counts[-(1:5),] %>%
gather(key = sample, value = count, -(1:1)) %>%
rename(gene_id = X1) %>%
mutate(log2_count = log2(count + 1)) %>%
mutate(sample_type = c("Primary Solid Tumor")) %>%
mutate(project = c("TCGA_GBM")) %>%
mutate(condition = str_extract(sample_type, "Tumor|Normal")) %>%
select(project, condition, sample_type, sample, gene_id, count, log2_count)
Some of the columns are converted to factors for plotting purposes
gbm_primary_tumor_counts_tl$project <- factor(gbm_primary_tumor_counts_tl$project, levels = c("PB_LGG", "TCGA_LGG", "TCGA_GBM"))
gbm_primary_tumor_counts_tl$condition <- factor(gbm_primary_tumor_counts_tl$condition, levels = c("Normal", "Tumor"))
gbm_primary_tumor_counts_tl$sample_type <- factor(gbm_primary_tumor_counts_tl$sample_type, levels = c("Solid Tissue Normal", "Primary Solid Tumor", "Recurrent Solid Tumor"))
And written to file, so I don't need to redownload and reprocess everything
write_rds(gbm_primary_tumor_counts_tl, "gbm_primary_tumor_counts_tl.rds")
TCGA-GBM Normal tissue HTSeq count data
[...] The above mentioned steps are repeated for the other datasets below, I won't print this in full, as it is repetitive code (could be a function).
write_rds(gbm_normal_counts_tl, "gbm_normal_counts_tl.rds")
TCGA-LGG Lower grade glioma primary tumor HTSeq count data
write_rds(lgg_primary_tumor_counts_tl, "lgg_primary_tumor_counts_tl.rds")
TCGA-LGG Lower grade glioma recurrent tumor HTSeq count data
write_rds(lgg_recurrent_tumor_counts_tl, "lgg_recurrent_tumor_counts_tl.rds")
Undisclosed tumor dataset
write_rds(pb_primary_tumor_counts_tl, "pb_primary_tumor_counts_tl.rds")
Merging and preprocessing
union_all_counts <- gbm_primary_tumor_counts_tl %>%
bind_rows(gbm_normal_counts_tl) %>%
bind_rows(lgg_primary_tumor_counts_tl) %>%
bind_rows(lgg_recurrent_tumor_counts_tl) %>%
bind_rows(pb_primary_tumor_counts_tl)
Some overview
summary(union_all_counts)
project condition sample_type
PB_LGG : 60483 Normal: 302415 Solid Tissue Normal : 302415
TCGA_LGG:31995507 Tumor :41491338 Primary Solid Tumor :40402644
TCGA_GBM: 9737763 Recurrent Solid Tumor: 1088694
sample gene_id count log2_count
Length:41793753 Length:41793753 Min. : 0 Min. : 0.000
Class :character Class :character 1st Qu.: 0 1st Qu.: 0.000
Mode :character Mode :character Median : 1 Median : 1.000
Mean : 927 Mean : 3.408
3rd Qu.: 80 3rd Qu.: 6.340
Max. :5905385 Max. :22.494
Extract and transform to matrix for DESeq2
Transform the merged data and extract a matrix ready to be used by DESeq2 (see https://github.com/tidyverse/tibble/issues/123#issuecomment-427114589)
deseq2_input_count_matrix <- union_all_counts %>%
select(gene_id, sample, count) %>%
spread(sample, count) %>%
remove_rownames() %>%
column_to_rownames(var="gene_id") %>%
as.matrix()
Prepare the coldata object
coldata <- distinct(union_all_counts, condition, sample, sample_type, project) %>%
.[match(colnames(deseq2_input_count_matrix), .$sample),] %>%
select(sample, project, sample_type, condition) %>%
remove_rownames() %>%
column_to_rownames(var="sample")
Crosscheck that count and order is correct as per https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input
all(rownames(coldata) %in% colnames(deseq2_input_count_matrix))
TRUE
and
all(rownames(coldata) == colnames(deseq2_input_count_matrix))
TRUE
Overview
summary(coldata)
project sample_type condition
PB_LGG : 1 Solid Tissue Normal : 5 Normal: 5
TCGA_LGG:529 Primary Solid Tumor :668 Tumor :686
TCGA_GBM:161 Recurrent Solid Tumor: 18
Run DESeq2
Prepare DESeq2 as per https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input with a somewhat more complex design, which could tease apart the condition and the projects (GBM, LGG, PB).
dds <- DESeqDataSetFromMatrix(countData = deseq2_input_count_matrix,
colData = coldata,
design = ~ project + condition
)
dds
converting counts to integer mode
class: DESeqDataSet
dim: 60483 691
metadata(1): version
assays(1): counts
rownames(60483): ENSG00000000003.13 ENSG00000000005.5 ...
ENSGR0000280767.1 ENSGR0000281849.1
rowData names(0):
colnames(691): PB-201711-0025-1 TCGA-02-0047-01A-01R-1849-01 ...
TCGA-WY-A85D-01A-11R-A36H-07 TCGA-WY-A85E-01A-11R-A36H-07
colData names(3): project sample_type condition
Compute differentially expressed genes
dds <- DESeq(dds, parallel = TRUE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 6 workers
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 6 workers
-- replacing outliers and refitting for 43867 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Check results
res <- results(dds)
res
log2 fold change (MLE): condition Tumor vs Normal
Wald test p-value: condition Tumor vs Normal
DataFrame with 60483 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue
<numeric> <numeric> <numeric> <numeric> <numeric>
ENSG00000000003.13 3563.38508 2.9848976 0.2807142 10.633227 2.087615e-26
ENSG00000000005.5 13.13047 2.2069554 0.7913884 2.788713 5.291787e-03
ENSG00000000419.11 1157.15673 0.4207727 0.1483583 2.836192 4.565498e-03
ENSG00000000457.12 678.17249 -0.1772486 0.1182657 -1.498732 1.339432e-01
ENSG00000000460.15 309.54244 1.8416618 0.2669081 6.899986 5.200784e-12
... ... ... ... ... ...
ENSGR0000275287.3 0 NA NA NA NA
ENSGR0000276543.3 0 NA NA NA NA
ENSGR0000277120.3 0 NA NA NA NA
ENSGR0000280767.1 0 NA NA NA NA
ENSGR0000281849.1 0 NA NA NA NA
padj
<numeric>
ENSG00000000003.13 2.879038e-23
ENSG00000000005.5 1.564804e-02
ENSG00000000419.11 1.378057e-02
ENSG00000000457.12 2.305606e-01
ENSG00000000460.15 1.786943e-10
... ...
ENSGR0000275287.3 NA
ENSGR0000276543.3 NA
ENSGR0000277120.3 NA
ENSGR0000280767.1 NA
ENSGR0000281849.1 NA
This seems OK to me. Did I miss anything here?
summary(res)
out of 57953 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up) : 10443, 18%
LFC < 0 (down) : 8708, 15%
outliers [1] : 25, 0.043%
low counts [2] : 17935, 31%
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
Also OK?
Save lists of differentially expressed genes to file
As the DESeq2 computation was quite lengthy for that large amount of data, we write results to intermediate files again:
write_rds(res, "87_prep_final_output_differentially_expressed_genes.rds")
Normalize data
Normalize the input using information about the samples from the experimental design. Because rlog takes way too long and can't be parallelized, we use the vst transform.
vsd <- vst(dds, blind = FALSE)
Write it to file as well, so we don't have to recompute
write_rds(vsd, "87_prep_final_deseq2_vst_normalized_counts.rds")
Plot mean variance distribution
To arrive at the plot shown above:
meanSdPlot(assay(vsd))
(Congrats, if you lasted until here. :-))
Environment
All computations done in a docker container based on https://jupyter-docker-stacks.readthedocs.io/en/latest/
sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /opt/conda/lib/R/lib/libRblas.so
LAPACK: /opt/conda/lib/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] hexbin_1.27.1 bindrcpp_0.2.2
[3] vsn_3.46.0 pheatmap_1.0.12
[5] RColorBrewer_1.1-2 DEGreport_1.14.1
[7] quantreg_5.38 SparseM_1.77
[9] TCGAbiolinks_2.11.3 DESeq2_1.18.1
[11] SummarizedExperiment_1.8.1 DelayedArray_0.4.1
[13] matrixStats_0.54.0 Biobase_2.38.0
[15] GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
[17] IRanges_2.12.0 S4Vectors_0.16.0
[19] BiocGenerics_0.24.0 stringr_1.3.1
[21] dplyr_0.7.8 purrr_0.3.0
[23] readr_1.3.1 tidyr_0.8.2
[25] tibble_2.0.1 ggplot2_3.1.0
[27] tidyverse_1.1.1
[...]
Thank you!
Thanks for the answer:
I would guess, that the general assumption for RNA-seq (or microarray experiments) (i.e. most genes do not change) hold for most of the samples. From what is known about that type of cancer, advanced stages show a high degree of mutations and therefore deregulation of many pathways and genes. I would also assume, that there are differences in the samples from different centers (TCGA being a multi-center study).
I do not expect a straight line therefore, but the shape of the curve and distribution of the values made me worry a bit, that something is not quite right here.
I could add e.g. "center" to the design to see, if this would improve the situation.
Or do you think, that this data may be usable for downstream analysis using clustering tools as is?
It looks fine to me actually. What you don’t want is to have high variance coming from single count genes because of log(x+1) inducing a large data spread near 0.
Thank you for your advice. :-)
Marc: one thing you could do in addition to the mean-sd plots is look at MA-plots of pairs of samples: i.e., sample many pairs out of all possible pairs randomly, or look especially at replicates or very similar samples.
Hi Wolfgang,
Thank you for you advice. I may try this.