Question: DESeq2: combine replicates following rlog transformation for visualization, clustering
4.6 years ago by
snf0
UC Berkeley
snf0 wrote:

Hello -

I have a series of 12 seq datasets with two replicates and would like to perform clustering analysis of the rlog-normalized data. However, after rlog transformation, I still have two replicates for each dataset and therefore have to combine the reps in some way before clustering.  I have two related questions:

1) What is the most appropriate way to combine rlog values for the replicates? Averaging?

2) What is the most appropriate way to rescale the data to a ~0-1 range, or some other normalized scale, before clustering? Currently the differences in mean rlog value between genes dominate the changes in relative rlog value between datasets. My solution so far has been to normalize by the row max, but I would appreciate a second opinion.

Thanks much,

-Stephen

Answer: DESeq2: combine replicates following rlog transformation for visualization, clus
4.6 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber wrote:

Stephen

1) If they are replicates and you just want to pool them, why not sum the per-gene read counts (e.g. using the function collapseReplicates in DESeq2) before computing the rlog? Averaging after rlog seems also reasonable, and it would be interesting to hear from you whether it makes any difference downstream.

2) My advice would be similar to what people have been doing 'all the time' with (log-transformed) microarray data: if genes are rows and samples columns, then subtract the row mean (but do not scale by variance or the like); depending on the use case, it can be useful to remove the rows that do not show enough variance.
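
The row centering described in point 2 can be sketched in base R as follows. This is only an illustration under stated assumptions: the matrix `rl` is simulated stand-in data rather than real rlog output, and keeping the 50 most variable rows is an arbitrary cutoff chosen for the example.

```r
# Hypothetical stand-in for a matrix of log-scale values
# (genes in rows, samples in columns); this is NOT real rlog output.
set.seed(1)
n_genes <- 200
n_samples <- 12
gene_level <- runif(n_genes, min = 2, max = 14)  # each gene has its own baseline
rl <- matrix(rnorm(n_genes * n_samples, mean = rep(gene_level, n_samples), sd = 1),
             nrow = n_genes,
             dimnames = list(paste0("gene", seq_len(n_genes)),
                             paste0("s", seq_len(n_samples))))

# Subtract each row's mean so clustering sees relative changes across
# samples rather than the overall expression level of the gene.
# No division by the row variance or standard deviation is applied.
centered <- rl - rowMeans(rl)

# Optionally drop rows that show little variance (cutoff is an assumption).
row_var <- apply(rl, 1, var)
keep <- order(row_var, decreasing = TRUE)[1:50]
centered_top <- centered[keep, ]

# Cluster the rows on the centered values.
hc <- hclust(dist(centered_top))
```

Unlike dividing by the row max, centering leaves the size of the changes in log units untouched, so genes with larger fold changes still contribute more to the distances.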

Wolfgang

Hello,

I have a follow-up question, since I am also trying to cluster normalized counts per biological state (rather than per biological replicate within each state).

In regards to:

1) Isn't collapseReplicates in DESeq2 only for technical replicates? Within DESeq2, how can I combine biological replicates and obtain normalized counts for downstream clustering analysis? Is there a function for this in DESeq2?

2) What do you mean by "subtract the row mean (but do not scale by variance or the like)"? Is this instead of averaging? Could you provide more detail, please?

Thank you!

Answer: DESeq2: combine replicates following rlog transformation for visualization, clus
4.6 years ago by
snf0
UC Berkeley
snf0 wrote:

Thanks Wolfgang.

I did try to perform the rlog transform after collapsing reps, but received an error:

ddsCollapsed <- collapseReplicates(dds, groupby=dds$fraction, run=dds$Run)
avgrld <- rlog(ddsCollapsed, blind=FALSE)
Error in estimateDispersionsGeneEst(object, quiet = TRUE) :
the number of samples and the number of model coefficients are equal,
i.e., there are no replicates to estimate the dispersion.
use an alternate design formula


As such, I assumed that the rlog transform had to be performed on data with replicates. Perhaps I should run rlog on the full data and save the fitted parameters, so that I can apply the same transform to the collapsed data? If I can get this to work, I will run both the pre- and post-collapse rlogs and post the results here.

Re: #2 - I'll try that, thanks!

Answer: DESeq2: combine replicates following rlog transformation for visualization, clus
4.6 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber wrote:

Stephen

Why are you using blind=FALSE?

(What is the intended use of the transformed data?)

Wolfgang

Answer: DESeq2: combine replicates following rlog transformation for visualization, clus
4.6 years ago by
snf0
UC Berkeley
snf0 wrote:

The immediate downstream use is for clustering by row, where each column of the row represents a different biological state.  I have strong reason to believe that there is large variability between some of these samples (e.g. some are cytoplasmic and some nuclear), so based on the ?rlog page "If many of genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis." and other posts (e.g. https://stat.ethz.ch/pipermail/bioconductor/2014-January/057299.html) I selected blind=FALSE.  This is after initial QC between reps with blind=TRUE.

Hi Stephen,

I'd go with Wolfgang's suggestion of collapsing after rlog.
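
Averaging the replicates after the transformation could look roughly like the sketch below. It uses base R only. The matrix `rl` is simulated and merely stands in for the transformed values (what `assay()` would return for the rlog object), and the layout of 6 states with 2 replicates each, along with the names `state` and `avg`, is hypothetical.

```r
# Hypothetical layout: 6 biological states, 2 replicates each.
set.seed(2)
state_names <- paste0("state", 1:6)
state <- factor(rep(state_names, each = 2), levels = state_names)

# Simulated stand-in for the matrix of transformed values
# (genes in rows, samples in columns); NOT real rlog output.
rl <- matrix(rnorm(100 * length(state), mean = 8, sd = 1),
             nrow = 100,
             dimnames = list(paste0("gene", 1:100),
                             paste0(state, "_rep", 1:2)))

# For each state, average the columns of its replicates.
# drop = FALSE keeps a matrix even if a state had a single replicate.
avg <- sapply(levels(state), function(s) {
  rowMeans(rl[, state == s, drop = FALSE])
})

dim(avg)  # one column per state
```

The resulting `avg` matrix has one column per state and can then be row-centered before clustering, as suggested earlier in the thread.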