Comparing means of VST tranformed RNA-Seq counts in different groups of samples with boxplots
Entering edit mode
Last seen 6 months ago
University of Salerno, Salerno, Italy

Dear All,

I would like to ask a very specific question about data transformation and
the appropriate comparison of groups of samples in gene expression data. In detail, based on
raw RNA-Seq data gene counts, I implemented the VST transformation from the DESeq2 R
package, for various clustering methodologies, and I ended up with a specific group of
genes, that show expression patterns that separate interestingly my samples into groups of
studied phenotype, based on heatmap plots.

My next goal is to perform some complementary pairwise boxplots of some specific pre-
defined cluster groups, based on a subset of these genes, in order to provide some extra
evidence of a significant difference in the relative expression of each gene in these groups.

Thus, my questions are the following:
1) Are VST transformed RNA-Seq counts appropriate for the creation of additional
relative boxplots? In order to compare the groups means for each selected gene?
Moreover, for adding p-values and significance levels, a simple test in this case, like
a t-test or an ANOVA test for more than 2 groups would be fine?

2) Or, VST transformed counts are not appropriate for comparing means, even for a
very small number of genes, and I should follow a different transformation?

For example, use my matrix object of counts:

xx <- estimateSizeFactorsForMatrix(counts=matrix.count)

and afterwards use the function:

xx2 <- normTransform(xx, f = log2, pc = 1) ?
VST deseq2 boxplot rnaseq • 3.2k views
Entering edit mode
Last seen 5 days ago
United States

It sounds like you obtained the list of 20 genes from a clustering procedure, and then you just want to plot their values up and down. Given that you use VST counts for the clustering, I would expect one to also show boxplots of VST counts to show changes over groups. I would show the same transformed data that was used to generate the clusters.

However, later you ask about calculating p-values. You have to be very careful here. Selecting a set of genes and then generating p-values can lead to loss of control of false positives unless the filtering is done in a way that is independent of the test statistic. There is a reference on this in the DESeq2 vignette. Why do you need p-values if you are performing only a clustering analysis? If you want to generate p-values, I would strongly recommend you use a standard DESeq2 analysis, including all the genes. This is important for proper parameter estimation.

Entering edit mode

Dear Michael,

thank you very much for your answer !! Please excuse me for any misguidance and not provided enough information for my goals-so briefly, these 20 genes, have been identified and selected from another dataset with the same phenotype, based on DE analysis and feature selection. On this premise, we wanted to test on a different transcriptomic with the same disease, if these genes have any interesting biological implications in clustering patients and relative survival, which has shown good results with VST and downstream analysis. That's why, in my original thread, my collaborators asked if possible, to provide some extra figures with plots such as boxplots with the resulted clusters and -if the possible p-values-in order to further enhance the interesting patterns, that the plethora of these genes show in the relative heatmap plot of those group patients. Overall, in order to summarize:

1) Firstly, your opinion about my approach with  estimateSizeFactorsForMatrix() & normTransform()

from above ? and then use a simple test for the needed boxplots? like the t.test ? as the minimum number of group samples is 30 ?

2) Concerning the implementation of DESeq2 for these genes:

my R pipeline for the creation of the VST transformed counts and patient clustering/survival was the following:

head(dat.exp) # matrix of raw counts

                TCGA-DD-AAE4-01A-11R-A41C-07 TCGA-BD-A3EP-01A-11R-A22L-07

ENSG00000000003                         1976                         4455

ENSG00000000005                            2                            1
library(clusterProfiler) # for annotation
eg =,



                        OrgDb="")) #drop=TRUE

eg <- eg[!duplicated(eg$SYMBOL),]
dd <- match(eg.ensembl,rownames(dat.exp))
dat <- dat.exp[dd,]
rownames(dat) <- as.character(eg$SYMBOL)
 dat.filt <- TCGAanalyze_Filtering(tabDF = dat,
method = "quantile",qnt.cut =  0.25) # small intensity filtering from TCGAbiolinks
dat.trans <- vst(object=dat.filt) # vst transformation

And then my last step was to inspect if the 20 wanted genes were remained after the initial filtering :
cc <- intersect(rownames(dat.trans),signature) # inspect
sel.dat <- dat.trans[signature,]

and afterwards clustering, etc..

Thus, in your opinion, if I should implement DESeq2 to inspect for DE/relative p-values regarding these genes, how I should formulate my code?

in order to have a similarity with my above approach, as also to use the same transcripts?

In detail, start with the object dat.filt, which has unique gene symbols as row names, as also been filtered?

and to create a data frame with the various cluster memberships?

For example:

condition <- factor(rep(rep(c("cluster1","cluster2","cluster3"),each=..),..))

But then how should I run DESeq for DE in order to inspect and isolate the pairwise adjusted p-values for these 20 genes, even not significant?

Also, any other comments or suggestions will help !!



Entering edit mode

If you just want to plot and do simple testing for a few genes, I suppose there’s no problem with your first approach given you have many samples. I wouldn’t recommend performing different analyses though, but stick with a single analysis plan from the start.

Entering edit mode

Dear Michael,

thank you for your updated answer !! To summarize, in order to confirm your comment and be correct, for the first approach you pinpointed, you mean that i can use  estimateSizeFactorsForMatrix() & normTransform() 

for the total dataset, and then subset to the genes of interest, in order to create the relative boxplots with the these "normalized" values, correct ? as also to perform a simple test for the relative p-values, right ?

But at the same time, the cluster/grouping membership for the patients-samples, that will be used for the boxplots would be from the initial VST transformed counts, again correct ?

Entering edit mode

This is all up to you as the analyst. We have a detailed workflow and a vignette. But it's up to you exactly what steps you want to do in your analysis.

You can use any of our transformations that are described in the vignette. For all of these, you should run them over all genes, and then subset afterward.

Entering edit mode

Dear Michael,

thank you for your overall suggestions to my questions thus far. Please excuse me to return one last time, but I would like to ask you some quick important questions about the DESeq2 pipeline and the specific functions I have mentioned above, in order to use correctly the normTransform() function. In detail, my small code chunk:


pd.deseq <- # the initial phenodata slot with all the variables

head(coad_filt,2) # the matrix container of raw filtered counts

       TCGA-3L-AA1B-01A-11R-A37K-07 TCGA-DM-A1D8-01A-11R-A155-07

TSPAN6                         7280                        10395

TNMD                             23                            1

dds <- DESeqDataSetFromMatrix(countData = coad_filt,

colData = pd.deseq,design= ~ 1)

xx2 <- normTransform(dds, f = log2, pc = 4)

So, my final questions:

1) Because my purpose of constructing the DESeq object, is for downstream analysis (clustering, boxplots etc), is it a problem that I used ~1 intersept ?

as no variable into the colData is of any interest? and I used it just for the function to run appropriately ? and won't have any effect as my purpose is not DE analysis ?

2) In the normTransform() function, you would suggest a pseudocount bigger than 1 ? like 4 ?

3) Finally, the output of normTransform function, can also be directly used-as VST transformed counts we have discussed- for hierarchical clustering ? and similar applications ?

Thank you,


Entering edit mode

~1 as a design is fine for normTransform.

You can try out different pseudocounts, or just use the VST, which is our solution to avoid fiddling with pseudocounts.

Entering edit mode
Last seen 14 months ago
United States

Hi Konstantinos, 

in order to assess differential expression of genes I would use DESeq2's built-in function DESeq. It takes as input a DESeqDataset object, which is similar to a summarized experiment; it's all described well in the vignette. DESeq internally normalizes the samples (to library size) and log2-transforms them, so you should use untransformed count data for creating the DESeqDataset. After running DESeq you can extract results for differential expression analysis using the results function on the object (results(DESeqDatasetobject)). If you have several groups you can specify the contrast of choice in the results call. The returned data frame gives you both a p-value and an FDR-adjusted p-value. You can read the vignette to DESeq2 if you want to know more about the underlying statistics. 

I use VST- or rlog-transformed counts only for sample clustering. 

For plotting results for individual genes you can use DESeq's plotCounts function (e.g., plotCounts(DESeqdataset, gene = "geneOfInterest"), which will use the same DESeqDataset object and again internally normalizes for library size. If you want to have more fine-grained control over the resulting plot you can set returnData = TRUE to extract the data and feed it e.g. into ggplot. 

If you are interested in ordering your resulting differentially expressed genes e.g. for log2FoldChange make sure to use logfold shrinkage by running the function lfcShrink on the DESeqDataset. So overall there is a few ways how to modify your data:

- normalize to library size: taken care of internally by DESeq (you can also access these counts directly with counts(DESeqdataset, normalized = TRUE) and plotCounts

- log-transform: again taken care of internally by DESeq

- vst/rlog transform: use for sample clustering

- logfold-shrinkage: use on DESeq object to order differentially expressed genes

Hope this helps!


Entering edit mode

Hi Benjamin, and thank you for your comprehensive answer and suggestions!! Actually, I have never use the DESeq workflow-only the VST transformation for sample clustering-but I have read the extensive vignette, so my updated comments are:


1)      Because I have only about 20 genes, is it worth to apply the DESeqDataset function and relative downstream procedure? As also I have read that the function results() applies an independent filtering for the tested genes? That’s why I was looking for a less complicated approach, by my above alternative function estimateSizeFactorsForMatrix() & normTransform(), and then use a simple test like a t.test, but I don’t know if it is correct ?


2)      Actually, I have never also used the plotCounts() function, but my main purpose again is to accompany my boxplots when comparing means, with the relative p-value, because these are the plots that were asked by my collaborators. So, perhaps to extract individually the Wald p-values, but without the independent filtering? One other issue, is that I don’t have the relative sample information for the clusters described, until using these 20 genes, so perhaps I have to repeat the annotation  and the DESeq procedure from the start..

Entering edit mode

Hi Konstantinos, 

regarding 1: In my view it is totally worth to apply DESeq2 - it's just a couple of lines of code and there's less room to make mistakes than doing the statistics yourself. If you really want to you can also switch off independent filtering in the results function (independentFiltering = FALSE). However, I feel it's important to be aware as to why you limit your analysis to these 20 genes: did you decide a priori to limit the analysis to these? In that case I would probably run DESeq and then calculate the FDR-adjusted p-value manually (e.g. by taking the returned p-values and running the p.adjust function). On the other hand it would of course not be advisable to tinker with the p-value adjustment if you decided to look at these 20 genes only after inspecting the results you got. 

Regarding 2: I'm not quite sure I fully understand what you are trying to accomplish - make box plots that automatically add the adjusted p-value? I would use plotCounts() with returnData set to TRUE and then customize from there with ggplot; you can also run your list of genes as a loop. Automatically adding the p-value will take a few lines of code though to extract it from the results table and add it in the right place - it'd be more straightforward to just do it manually in an illustration program if it's a one-time thing. 



Entering edit mode

A simple t-test has less power than the Bioconductor methods for RNA-seq count data when you have in the range of ~3-5 replicates per biological group.


Login before adding your answer.

Traffic: 514 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6