I am working on a comparison study involving TCGA RNA-seq data.
What I wanted to do is a comparison of cancer vs normal data using the HTSEQ-count data as downloadable from TCGA GDC portal. I would like to use the vst normalized data for downstream analysis (exploratory clustering etc.).
Following the standard DESeq2 workflow as described in the current vignette, I get a list of differentially expressed genes and a vst-normalized dataset.
If I plot that data, I get the following plot:
meanSDPlot of vst normalized data
To me, this does not look like a more or less straight line at all. Which means there appears to be a problem with heteroscedasticity.
So the question is: I do understand, that differential gene expression analysis in DESeq2 operates on the unnormalized count data and is not dependent on the assumption, that data is homoscedastic, meaning the above finding is not a problem for calling DE genes (correct me, if this is wrong, please).
However, as I would like to use the vst normalized data for downstream analysis (clustering, etc) -do I need to worry about the output here? And if so, what can I do to improve the situation?
What follows is a condensed step-by-step description of what I did to arrive at this plot -read on if you want to crosscheck to spot any errors:
library("tidyverse") library("readr") library("stringr") library("DESeq2") library("TCGAbiolinks") library("DEGreport") library("RColorBrewer") library("pheatmap") library("vsn")
I downloaded the available RNA-seq data for TCGA-GBM and TCGA-LGG:
GDCDIRECTORY <- "/home/jovyan/data//gdcdata/GDCdata" GDCCATEGORY <- "Transcriptome Profiling" GDCDATATYPE <- "Gene Expression Quantification" GDCWORKFLOW <- "HTSeq - Counts" GDCSTRATEGY <- "RNA-Seq"
TCGA-GBM Glioblastoma multiforme primary tumor HTSeq count data
query <- GDCquery(project = "TCGA-GBM", data.category = GDCCATEGORY, sample.type = c("Primary solid Tumor"), data.type = GDCDATATYPE, workflow.type = GDCWORKFLOW, experimental.strategy = GDCSTRATEGY, legacy = FALSE) GDCdownload(query, directory = GDCDIRECTORY, method = "api", files.per.chunk = 10) gbm_primary_tumor_counts <- as_tibble(GDCprepare(query, summarizedExperiment = FALSE, directory = GDCDIRECTORY ) )
Data is being converted to a tibble for filtering and postprocessing
gbm_primary_tumor_counts_tl <- gbm_primary_tumor_counts[-(1:5),] %>% gather(key = sample, value = count, -(1:1)) %>% rename(gene_id = X1) %>% mutate(log2_count = log2(count + 1)) %>% mutate(sample_type = c("Primary Solid Tumor")) %>% mutate(project = c("TCGA_GBM")) %>% mutate(condition = str_extract(sample_type, "Tumor|Normal")) %>% select(project, condition, sample_type, sample, gene_id, count, log2_count)
Some of the columns are converted to factors for plotting purposes
gbm_primary_tumor_counts_tl$project <- factor(gbm_primary_tumor_counts_tl$project, levels = c("PB_LGG", "TCGA_LGG", "TCGA_GBM")) gbm_primary_tumor_counts_tl$condition <- factor(gbm_primary_tumor_counts_tl$condition, levels = c("Normal", "Tumor")) gbm_primary_tumor_counts_tl$sample_type <- factor(gbm_primary_tumor_counts_tl$sample_type, levels = c("Solid Tissue Normal", "Primary Solid Tumor", "Recurrent Solid Tumor"))
And written to file, so I don't need to redownload and reprocess everything
TCGA-GBM Normal tissue HTSeq count data
[...] The above mentioned steps are repeated for the other datasets below, I won't print this in full, as it is repetitive code (could be a function).
TCGA-LGG Lower grade glioma primary tumor HTSeq count data
TCGA-LGG Lower grade glioma recurrent tumor HTSeq count data
Undisclosed tumor dataset
Merging and preprocessing
union_all_counts <- gbm_primary_tumor_counts_tl %>% bind_rows(gbm_normal_counts_tl) %>% bind_rows(lgg_primary_tumor_counts_tl) %>% bind_rows(lgg_recurrent_tumor_counts_tl) %>% bind_rows(pb_primary_tumor_counts_tl)
summary(union_all_counts) project condition sample_type PB_LGG : 60483 Normal: 302415 Solid Tissue Normal : 302415 TCGA_LGG:31995507 Tumor :41491338 Primary Solid Tumor :40402644 TCGA_GBM: 9737763 Recurrent Solid Tumor: 1088694 sample gene_id count log2_count Length:41793753 Length:41793753 Min. : 0 Min. : 0.000 Class :character Class :character 1st Qu.: 0 1st Qu.: 0.000 Mode :character Mode :character Median : 1 Median : 1.000 Mean : 927 Mean : 3.408 3rd Qu.: 80 3rd Qu.: 6.340 Max. :5905385 Max. :22.494
Extract and transform to matrix for DESeq2
Transform the merged data and extract a matrix ready to be used by DESeq2 (see https://github.com/tidyverse/tibble/issues/123#issuecomment-427114589)
deseq2_input_count_matrix <- union_all_counts %>% select(gene_id, sample, count) %>% spread(sample, count) %>% remove_rownames() %>% column_to_rownames(var="gene_id") %>% as.matrix()
Prepare the coldata object
coldata <- distinct(union_all_counts, condition, sample, sample_type, project) %>% .[match(colnames(deseq2_input_count_matrix), .$sample),] %>% select(sample, project, sample_type, condition) %>% remove_rownames() %>% column_to_rownames(var="sample")
Crosscheck that count and order is correct as per https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input
all(rownames(coldata) %in% colnames(deseq2_input_count_matrix)) TRUE
all(rownames(coldata) == colnames(deseq2_input_count_matrix)) TRUE
summary(coldata) project sample_type condition PB_LGG : 1 Solid Tissue Normal : 5 Normal: 5 TCGA_LGG:529 Primary Solid Tumor :668 Tumor :686 TCGA_GBM:161 Recurrent Solid Tumor: 18
Prepare DESeq2 as per https://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#count-matrix-input with a somewhat more complex design, which could tease apart the condition and the projects (GBM, LGG, PB).
dds <- DESeqDataSetFromMatrix(countData = deseq2_input_count_matrix, colData = coldata, design = ~ project + condition ) dds converting counts to integer mode class: DESeqDataSet dim: 60483 691 metadata(1): version assays(1): counts rownames(60483): ENSG00000000003.13 ENSG00000000005.5 ... ENSGR0000280767.1 ENSGR0000281849.1 rowData names(0): colnames(691): PB-201711-0025-1 TCGA-02-0047-01A-01R-1849-01 ... TCGA-WY-A85D-01A-11R-A36H-07 TCGA-WY-A85E-01A-11R-A36H-07 colData names(3): project sample_type condition
Compute differentially expressed genes
dds <- DESeq(dds, parallel = TRUE) estimating size factors estimating dispersions gene-wise dispersion estimates: 6 workers mean-dispersion relationship final dispersion estimates, fitting model and testing: 6 workers -- replacing outliers and refitting for 43867 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing
res <- results(dds) res log2 fold change (MLE): condition Tumor vs Normal Wald test p-value: condition Tumor vs Normal DataFrame with 60483 rows and 6 columns baseMean log2FoldChange lfcSE stat pvalue <numeric> <numeric> <numeric> <numeric> <numeric> ENSG00000000003.13 3563.38508 2.9848976 0.2807142 10.633227 2.087615e-26 ENSG00000000005.5 13.13047 2.2069554 0.7913884 2.788713 5.291787e-03 ENSG00000000419.11 1157.15673 0.4207727 0.1483583 2.836192 4.565498e-03 ENSG00000000457.12 678.17249 -0.1772486 0.1182657 -1.498732 1.339432e-01 ENSG00000000460.15 309.54244 1.8416618 0.2669081 6.899986 5.200784e-12 ... ... ... ... ... ... ENSGR0000275287.3 0 NA NA NA NA ENSGR0000276543.3 0 NA NA NA NA ENSGR0000277120.3 0 NA NA NA NA ENSGR0000280767.1 0 NA NA NA NA ENSGR0000281849.1 0 NA NA NA NA padj <numeric> ENSG00000000003.13 2.879038e-23 ENSG00000000005.5 1.564804e-02 ENSG00000000419.11 1.378057e-02 ENSG00000000457.12 2.305606e-01 ENSG00000000460.15 1.786943e-10 ... ... ENSGR0000275287.3 NA ENSGR0000276543.3 NA ENSGR0000277120.3 NA ENSGR0000280767.1 NA ENSGR0000281849.1 NA
This seems OK to me. Did I miss anything here?
summary(res) out of 57953 with nonzero total read count adjusted p-value < 0.1 LFC > 0 (up) : 10443, 18% LFC < 0 (down) : 8708, 15% outliers  : 25, 0.043% low counts  : 17935, 31% (mean count < 0)  see 'cooksCutoff' argument of ?results  see 'independentFiltering' argument of ?results
Save lists of differentially expressed genes to file
As the DESeq2 computation was quite lengthy for that large amount of data, we write results to intermediate files again:
Normalize the input using information about the samples from the experimental design. Because rlog takes way too long and can't be parallelized, we use the vst transform.
vsd <- vst(dds, blind = FALSE)
Write it to file as well, so we don't have to recompute
Plot mean variance distribution
To arrive at the plot shown above:
(Congrats, if you lasted until here. :-))
All computations done in a docker container based on https://jupyter-docker-stacks.readthedocs.io/en/latest/
sessionInfo() R version 3.4.2 (2017-09-28) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS Matrix products: default BLAS: /opt/conda/lib/R/lib/libRblas.so LAPACK: /opt/conda/lib/R/lib/libRlapack.so locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  parallel stats4 stats graphics grDevices utils datasets  methods base other attached packages:  hexbin_1.27.1 bindrcpp_0.2.2  vsn_3.46.0 pheatmap_1.0.12  RColorBrewer_1.1-2 DEGreport_1.14.1  quantreg_5.38 SparseM_1.77  TCGAbiolinks_2.11.3 DESeq2_1.18.1  SummarizedExperiment_1.8.1 DelayedArray_0.4.1  matrixStats_0.54.0 Biobase_2.38.0  GenomicRanges_1.30.3 GenomeInfoDb_1.14.0  IRanges_2.12.0 S4Vectors_0.16.0  BiocGenerics_0.24.0 stringr_1.3.1  dplyr_0.7.8 purrr_0.3.0  readr_1.3.1 tidyr_0.8.2  tibble_2.0.1 ggplot2_3.1.0  tidyverse_1.1.1 [...]
Thanks for the answer:
I would guess, that the general assumption for RNA-seq (or microarray experiments) (i.e. most genes do not change) hold for most of the samples. From what is known about that type of cancer, advanced stages show a high degree of mutations and therefore deregulation of many pathways and genes. I would also assume, that there are differences in the samples from different centers (TCGA being a multi-center study).
I do not expect a straight line therefore, but the shape of the curve and distribution of the values made me worry a bit, that something is not quite right here.
I could add e.g. "center" to the design to see, if this would improve the situation.
Or do you think, that this data may be usable for downstream analysis using clustering tools as is?
It looks fine to me actually. What you don’t want is to have high variance coming from single count genes because of log(x+1) inducing a large data spread near 0.
Thank you for your advice. :-)
Marc: one thing you could do in addition to the mean-sd plots is look at MA-plots of pairs of samples: i.e., sample many pairs out of all possible pairs randomly, or look especially at replicates or very similar samples.
Thank you for you advice. I may try this.