Question

DESeq2: combine replicates following rlog transformation for visualization, clustering

1

snf ▴ 10

@snf-6981

Last seen 8.5 years ago

UC Berkeley

Hello -

I have a series of 12 sequencing datasets, each with two replicates, and would like to perform clustering analysis of the rlog-transformed data. However, after the rlog transformation I still have two replicates for each dataset, so I have to combine the replicates in some way before clustering. I have two related questions:

1) What is the most appropriate way to combine rlog values for the replicates? Averaging?

2) What is the most appropriate way to rescale the data to a ~0-1 scale, or another normalized scale, before clustering? Currently, the differences in mean rlog value between genes dominate the relative changes in rlog value between datasets. My solution thus far has been to normalize each row by its maximum, but I would appreciate a second opinion.

Thanks much,

-Stephen

deseq2 rnaseq • 6.6k views

ADD COMMENT • link 9.7 years ago snf ▴ 10

score 1 · Answer 1 · 2014-11-04

1

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 29 days ago

EMBL European Molecular Biology Laborat…

Stephen

1) If they are replicates and you just want to pool them, why not sum the per-gene read counts (e.g. using the function collapseReplicates in DESeq2) before computing the rlog? Averaging after rlog also seems reasonable, and it would be interesting to hear from you whether it makes any difference downstream.

2) My advice would be similar to what people have been doing 'all the time' with (log-transformed) microarray data: if genes are rows and samples columns, then subtract the row mean (but do not scale by variance or the like); depending on the use case, it can be useful to remove the rows that do not show enough variance.

Wolfgang

ADD COMMENT • link 9.7 years ago Wolfgang Huber ★ 13k
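
The "averaging after rlog" option from the answer above can be sketched in base R. This is only a minimal illustration, not the DESeq2 authors' recommended implementation: `rlog_mat` is a toy matrix standing in for what `assay(rld)` would return, and `state` and `average_by_group` are hypothetical names introduced here.

```r
# Toy stand-in for assay(rld): 4 genes x 6 samples
# (3 biological states, 2 replicates each). Values are simulated.
set.seed(1)
rlog_mat <- matrix(
  rnorm(24, mean = 8), nrow = 4,
  dimnames = list(paste0("gene", 1:4),
                  paste0(rep(c("A", "B", "C"), each = 2), "_rep", 1:2))
)

# Hypothetical grouping variable: which state each column belongs to.
state <- factor(rep(c("A", "B", "C"), each = 2))

# Average the replicate columns of each state, gene by gene.
# Returns a genes x states matrix.
average_by_group <- function(mat, group) {
  group <- factor(group)
  sapply(levels(group), function(g) {
    rowMeans(mat[, group == g, drop = FALSE])
  })
}

avg_mat <- average_by_group(rlog_mat, state)
print(dim(avg_mat))  # 4 genes, 3 states
```

With real data, `state` would come from the sample annotation (for example a column of `colData(rld)`), and the resulting matrix has one column per biological state.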

0

Hello,

I have a follow-up question, since I am also trying to cluster normalized counts per biological state (rather than per biological replicate within each state).

In regards to:

1) Isn't collapseReplicates in DESeq2 only for technical replicates? Using DESeq2, how can I combine biological replicates and obtain normalized counts for midstream clustering analysis? Is there a function for this in DESeq2?

2) What do you mean by "subtract the row mean (but do not scale by variance or the like)"? Is this instead of averaging? Could you provide more detail, please?

Thank you!

ADD REPLY • link 6.0 years ago nm.albader • 0
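
One reading of the row-mean suggestion, sketched in base R: centering is a separate step from averaging replicates. Averaging combines columns, while centering removes each gene's overall level so that clustering sees only the relative pattern across states. The matrix `avg_mat`, the gene names, and the variance threshold below are made up for illustration.

```r
# Toy matrix of replicate-averaged rlog values: 3 genes x 3 states.
avg_mat <- matrix(
  c(10, 11, 12,
     2,  3,  1,
     5,  5,  5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("high", "low", "flat"), c("A", "B", "C"))
)

# Subtract each row's mean. A vector of length nrow(avg_mat) is recycled
# down the columns, so element [i, j] has rowMeans(avg_mat)[i] subtracted.
centered <- avg_mat - rowMeans(avg_mat)

# No division by the row standard deviation: differences stay on the
# log2-like rlog scale instead of being forced to unit variance.

# Optionally drop rows that barely vary (the threshold is arbitrary).
row_var  <- apply(centered, 1, var)
filtered <- centered[row_var > 0.1, , drop = FALSE]
print(filtered)
```

Here the "high" and "low" genes end up on the same scale after centering although their absolute levels differ by about 8 units, and the "flat" gene is removed by the variance filter.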

score 0 · Answer 2 · 2014-11-04

Thanks Wolfgang.

I did try to perform the rlog transform after collapsing reps, but received an error:

ddsCollapsed <- collapseReplicates(dds, groupby=dds$fraction, run=dds$Run)
avgrld <- rlog(ddsCollapsed, blind=FALSE) 
Error in estimateDispersionsGeneEst(object, quiet = TRUE) : 
  the number of samples and the number of model coefficients are equal,
  i.e., there are no replicates to estimate the dispersion.
  use an alternate design formula

As such, I assumed that the rlog transform had to be performed on data with replicates. Perhaps I should run rlog on the replicated data and save the fitted parameters, then apply the same transform to the collapsed data? If I can get this to work, I will perform the pre- and post-collapse rlogs and post the results here.
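
The error message can be illustrated with base R alone. This is a sketch under the assumption that the design formula was `~ fraction` (the post does not state the design): once every level of `fraction` has a single collapsed sample, the design matrix has as many columns as rows, so no residual degrees of freedom remain for estimating dispersion. The data frames below are hypothetical stand-ins for the sample table.

```r
# After collapsing: 12 samples, one per level of `fraction`.
collapsed <- data.frame(fraction = factor(paste0("state", 1:12)))
X_collapsed <- model.matrix(~ fraction, data = collapsed)
df_collapsed <- nrow(X_collapsed) - ncol(X_collapsed)
print(df_collapsed)  # 0: samples == coefficients

# Before collapsing: 24 samples, two per level.
replicated <- data.frame(
  fraction = factor(rep(paste0("state", 1:12), each = 2))
)
X_replicated <- model.matrix(~ fraction, data = replicated)
df_replicated <- nrow(X_replicated) - ncol(X_replicated)
print(df_replicated)  # 12

# Intercept-only design on the collapsed samples
# (this is the design that blind=TRUE uses).
X_blind <- model.matrix(~ 1, data = collapsed)
df_blind <- nrow(X_blind) - ncol(X_blind)
print(df_blind)  # 11
```

Under that assumption, the collapsed data can only be transformed with an intercept-only design, whereas the replicated data leave degrees of freedom for the full design.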

Re: #2 - I'll try that, thanks!

score 0 · Answer 3 · 2014-11-04

0

Wolfgang Huber ★ 13k

Stephen

Why are you using blind=FALSE?

(What is the intended use of the transformed data?)

Wolfgang

ADD COMMENT • link 9.7 years ago Wolfgang Huber ★ 13k

score 0 · Answer 4 · 2014-11-04

0

snf ▴ 10

The immediate downstream use is clustering by row, where each column represents a different biological state. I have strong reason to believe that there is large variability between some of these samples (e.g. some are cytoplasmic and some nuclear). Based on the ?rlog page ("If many of genes have large differences in counts due to the experimental design, it is important to set blind=FALSE for downstream analysis.") and other posts (e.g. https://stat.ethz.ch/pipermail/bioconductor/2014-January/057299.html), I selected blind=FALSE. This was after initial QC between replicates with blind=TRUE.

ADD COMMENT • link 9.7 years ago snf ▴ 10

0

Hi Stephen,

I'd go with Wolfgang's suggestion of combining the replicates after rlog, i.e. averaging the rlog values.

ADD REPLY • link 9.7 years ago Michael Love 42k