DESeq2 vst normalization: Do I have a heteroscedasticity problem and if so, what can I do about it?
Entering edit mode
Marc Saric ▴ 70
Last seen 4.7 years ago

Dear forum,

I am working on a comparison study involving TCGA RNA-seq data.

What I wanted to do is a comparison of cancer vs normal data using the HTSEQ-count data as downloadable from TCGA GDC portal. I would like to use the vst normalized data for downstream analysis (exploratory clustering etc.).

Following the standard DESeq2 workflow as described in the current vignette, I get a list of differentially expressed genes and a vst-normalized dataset.

If I plot that data, I get the following plot:

meanSDPlot of vst normalized data

To me, this does not look like a more or less straight line at all. Which means there appears to be a problem with heteroscedasticity.

So the question is: I do understand, that differential gene expression analysis in DESeq2 operates on the unnormalized count data and is not dependent on the assumption, that data is homoscedastic, meaning the above finding is not a problem for calling DE genes (correct me, if this is wrong, please).

However, as I would like to use the vst normalized data for downstream analysis (clustering, etc) -do I need to worry about the output here? And if so, what can I do to improve the situation?

What follows is a condensed step-by-step description of what I did to arrive at this plot -read on if you want to crosscheck to spot any errors:



Data download

I downloaded the available RNA-seq data for TCGA-GBM and TCGA-LGG:

GDCDIRECTORY <- "/home/jovyan/data//gdcdata/GDCdata"
GDCCATEGORY <- "Transcriptome Profiling"
GDCDATATYPE <- "Gene Expression Quantification"
GDCWORKFLOW <- "HTSeq - Counts"

TCGA-GBM Glioblastoma multiforme primary tumor HTSeq count data

query <- GDCquery(project = "TCGA-GBM",
                  data.category = GDCCATEGORY,
                  sample.type = c("Primary solid Tumor"),
                  data.type = GDCDATATYPE,
                  workflow.type = GDCWORKFLOW,
                  experimental.strategy = GDCSTRATEGY,
                  legacy = FALSE)
            directory = GDCDIRECTORY,
            method = "api",
            files.per.chunk = 10)
gbm_primary_tumor_counts <- as_tibble(GDCprepare(query,
                                                 summarizedExperiment = FALSE,
                                                 directory = GDCDIRECTORY

Data is being converted to a tibble for filtering and postprocessing

gbm_primary_tumor_counts_tl <- gbm_primary_tumor_counts[-(1:5),] %>% 
    gather(key = sample, value = count, -(1:1)) %>%
    rename(gene_id = X1) %>%
    mutate(log2_count = log2(count + 1)) %>%
    mutate(sample_type = c("Primary Solid Tumor")) %>%
    mutate(project = c("TCGA_GBM")) %>%
    mutate(condition = str_extract(sample_type, "Tumor|Normal")) %>%
    select(project, condition, sample_type, sample, gene_id, count, log2_count)

Some of the columns are converted to factors for plotting purposes

gbm_primary_tumor_counts_tl$project <- factor(gbm_primary_tumor_counts_tl$project, levels = c("PB_LGG", "TCGA_LGG", "TCGA_GBM"))
gbm_primary_tumor_counts_tl$condition <- factor(gbm_primary_tumor_counts_tl$condition, levels = c("Normal", "Tumor"))
gbm_primary_tumor_counts_tl$sample_type <- factor(gbm_primary_tumor_counts_tl$sample_type, levels = c("Solid Tissue Normal", "Primary Solid Tumor", "Recurrent Solid Tumor"))

And written to file, so I don't need to redownload and reprocess everything

write_rds(gbm_primary_tumor_counts_tl, "gbm_primary_tumor_counts_tl.rds")

TCGA-GBM Normal tissue HTSeq count data

[...] The above mentioned steps are repeated for the other datasets below, I won't print this in full, as it is repetitive code (could be a function).

write_rds(gbm_normal_counts_tl, "gbm_normal_counts_tl.rds")

TCGA-LGG Lower grade glioma primary tumor HTSeq count data

write_rds(lgg_primary_tumor_counts_tl, "lgg_primary_tumor_counts_tl.rds")

TCGA-LGG Lower grade glioma recurrent tumor HTSeq count data

write_rds(lgg_recurrent_tumor_counts_tl, "lgg_recurrent_tumor_counts_tl.rds") 

Undisclosed tumor dataset

write_rds(pb_primary_tumor_counts_tl, "pb_primary_tumor_counts_tl.rds")

Merging and preprocessing

union_all_counts <- gbm_primary_tumor_counts_tl %>%
    bind_rows(gbm_normal_counts_tl) %>% 
    bind_rows(lgg_primary_tumor_counts_tl) %>%
    bind_rows(lgg_recurrent_tumor_counts_tl) %>%

Some overview


    project          condition                       sample_type      
 PB_LGG  :   60483   Normal:  302415   Solid Tissue Normal  :  302415  
 TCGA_LGG:31995507   Tumor :41491338   Primary Solid Tumor  :40402644  
 TCGA_GBM: 9737763                     Recurrent Solid Tumor: 1088694  

    sample            gene_id              count           log2_count    
 Length:41793753    Length:41793753    Min.   :      0   Min.   : 0.000  
 Class :character   Class :character   1st Qu.:      0   1st Qu.: 0.000  
 Mode  :character   Mode  :character   Median :      1   Median : 1.000  
                                       Mean   :    927   Mean   : 3.408  
                                       3rd Qu.:     80   3rd Qu.: 6.340  
                                       Max.   :5905385   Max.   :22.494  

Extract and transform to matrix for DESeq2

Transform the merged data and extract a matrix ready to be used by DESeq2 (see

deseq2_input_count_matrix <- union_all_counts %>%
    select(gene_id, sample, count) %>%
    spread(sample, count) %>%
    remove_rownames() %>% 
    column_to_rownames(var="gene_id") %>%

Prepare the coldata object

coldata <- distinct(union_all_counts, condition, sample, sample_type, project) %>% 
    .[match(colnames(deseq2_input_count_matrix), .$sample),] %>%
    select(sample, project, sample_type, condition) %>%
    remove_rownames() %>% 

Crosscheck that count and order is correct as per

all(rownames(coldata) %in% colnames(deseq2_input_count_matrix))


all(rownames(coldata) == colnames(deseq2_input_count_matrix))



         project                   sample_type   condition  
 PB_LGG  :  1   Solid Tissue Normal  :  5   Normal:  5  
 TCGA_LGG:529   Primary Solid Tumor  :668   Tumor :686  
 TCGA_GBM:161   Recurrent Solid Tumor: 18         

Run DESeq2

Prepare DESeq2 as per with a somewhat more complex design, which could tease apart the condition and the projects (GBM, LGG, PB).

dds <- DESeqDataSetFromMatrix(countData = deseq2_input_count_matrix,
                              colData = coldata,
                              design = ~ project + condition

converting counts to integer mode

class: DESeqDataSet 
dim: 60483 691 
metadata(1): version
assays(1): counts
rownames(60483): ENSG00000000003.13 ENSG00000000005.5 ...
  ENSGR0000280767.1 ENSGR0000281849.1
rowData names(0):
colnames(691): PB-201711-0025-1 TCGA-02-0047-01A-01R-1849-01 ...
  TCGA-WY-A85D-01A-11R-A36H-07 TCGA-WY-A85E-01A-11R-A36H-07
colData names(3): project sample_type condition

Compute differentially expressed genes

dds <- DESeq(dds, parallel = TRUE)
estimating size factors
estimating dispersions
gene-wise dispersion estimates: 6 workers
mean-dispersion relationship
final dispersion estimates, fitting model and testing: 6 workers
-- replacing outliers and refitting for 43867 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

Check results

res <- results(dds)
log2 fold change (MLE): condition Tumor vs Normal 
Wald test p-value: condition Tumor vs Normal 
DataFrame with 60483 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat       pvalue
                    <numeric>      <numeric> <numeric> <numeric>    <numeric>
ENSG00000000003.13 3563.38508      2.9848976 0.2807142 10.633227 2.087615e-26
ENSG00000000005.5    13.13047      2.2069554 0.7913884  2.788713 5.291787e-03
ENSG00000000419.11 1157.15673      0.4207727 0.1483583  2.836192 4.565498e-03
ENSG00000000457.12  678.17249     -0.1772486 0.1182657 -1.498732 1.339432e-01
ENSG00000000460.15  309.54244      1.8416618 0.2669081  6.899986 5.200784e-12
...                       ...            ...       ...       ...          ...
ENSGR0000275287.3           0             NA        NA        NA           NA
ENSGR0000276543.3           0             NA        NA        NA           NA
ENSGR0000277120.3           0             NA        NA        NA           NA
ENSGR0000280767.1           0             NA        NA        NA           NA
ENSGR0000281849.1           0             NA        NA        NA           NA
ENSG00000000003.13 2.879038e-23
ENSG00000000005.5  1.564804e-02
ENSG00000000419.11 1.378057e-02
ENSG00000000457.12 2.305606e-01
ENSG00000000460.15 1.786943e-10
...                         ...
ENSGR0000275287.3            NA
ENSGR0000276543.3            NA
ENSGR0000277120.3            NA
ENSGR0000280767.1            NA
ENSGR0000281849.1            NA

This seems OK to me. Did I miss anything here?


out of 57953 with nonzero total read count
adjusted p-value < 0.1
LFC > 0 (up)     : 10443, 18% 
LFC < 0 (down)   : 8708, 15% 
outliers [1]     : 25, 0.043% 
low counts [2]   : 17935, 31% 
(mean count < 0)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results

Also OK?

Save lists of differentially expressed genes to file

As the DESeq2 computation was quite lengthy for that large amount of data, we write results to intermediate files again:

write_rds(res, "87_prep_final_output_differentially_expressed_genes.rds")

Normalize data

Normalize the input using information about the samples from the experimental design. Because rlog takes way too long and can't be parallelized, we use the vst transform.

vsd <- vst(dds, blind = FALSE)

Write it to file as well, so we don't have to recompute

write_rds(vsd, "87_prep_final_deseq2_vst_normalized_counts.rds")

Plot mean variance distribution

To arrive at the plot shown above:


(Congrats, if you lasted until here. :-))


All computations done in a docker container based on

R version 3.4.2 (2017-09-28)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /opt/conda/lib/R/lib/
LAPACK: /opt/conda/lib/R/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] hexbin_1.27.1              bindrcpp_0.2.2            
 [3] vsn_3.46.0                 pheatmap_1.0.12           
 [5] RColorBrewer_1.1-2         DEGreport_1.14.1          
 [7] quantreg_5.38              SparseM_1.77              
 [9] TCGAbiolinks_2.11.3        DESeq2_1.18.1             
[11] SummarizedExperiment_1.8.1 DelayedArray_0.4.1        
[13] matrixStats_0.54.0         Biobase_2.38.0            
[15] GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[17] IRanges_2.12.0             S4Vectors_0.16.0          
[19] BiocGenerics_0.24.0        stringr_1.3.1             
[21] dplyr_0.7.8                purrr_0.3.0               
[23] readr_1.3.1                tidyr_0.8.2               
[25] tibble_2.0.1               ggplot2_3.1.0             
[27] tidyverse_1.1.1           

Thank you!

deseq2 normalization rna-seq TCGA vst • 3.6k views
Entering edit mode
Last seen 7 hours ago
United States

Are there biological differences in the VST data you are plotting? From the vignette:

“Note that the vertical axis in such plots is the square root of the variance over all samples, so including the variance due to the experimental conditions. While a flat curve of the square root of variance over the mean may seem like the goal of such transformations, this may be unreasonable in the case of datasets with many true differences due to the experimental conditions.”

Entering edit mode

Thanks for the answer:

I would guess, that the general assumption for RNA-seq (or microarray experiments) (i.e. most genes do not change) hold for most of the samples. From what is known about that type of cancer, advanced stages show a high degree of mutations and therefore deregulation of many pathways and genes. I would also assume, that there are differences in the samples from different centers (TCGA being a multi-center study).

I do not expect a straight line therefore, but the shape of the curve and distribution of the values made me worry a bit, that something is not quite right here.

I could add e.g. "center" to the design to see, if this would improve the situation.

Or do you think, that this data may be usable for downstream analysis using clustering tools as is?

Entering edit mode

It looks fine to me actually. What you don’t want is to have high variance coming from single count genes because of log(x+1) inducing a large data spread near 0.

Entering edit mode

Thank you for your advice. :-)

Entering edit mode

Marc: one thing you could do in addition to the mean-sd plots is look at MA-plots of pairs of samples: i.e., sample many pairs out of all possible pairs randomly, or look especially at replicates or very similar samples.

Entering edit mode

Hi Wolfgang,

Thank you for you advice. I may try this.


Login before adding your answer.

Traffic: 338 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6