This doesn't seem like a particularly useful evaluation strategy. By definition, every method will report DEGs that differ in expression between conditions, so I suspect that all methods would form "correct" clusters (i.e., samples would cluster together according to their conditions) based on the reported DEGs, unless something goes catastrophically wrong. The main differences between methods are more subtle, lying in how they model the variability and how they control the error rate; neither of these can be easily assessed with clustering - or at all, in fact, unless you have a data set with known truths, e.g., simulations.
Anyway, when I want to cluster RNA-seq data, I usually apply a log-transformation:
library(edgeR)
expr.vals <- cpm(y, log=TRUE, prior.count=2) # where 'y' is a DGEList.
... and then do my clustering on the matrix of expression values. The log-transformation provides some level of variance stabilization, especially at large counts (see the voom paper for some discussion of this). The large prior count just protects against poorly defined log-fold changes at low counts. It's better to use this for clustering instead of the E-values reported by voom, as voom only adds a continuity correction of 0.5. (Note that this small addition is fine within limma, as observations are precision-weighted anyway, but if you're doing unweighted analyses outside of limma with those E-values, then you could run into problems.)
I suppose that if you wanted to make comparisons between methods easier, you could use the same variance-stabilized expression values (either with cpm as above, or with rlog in DESeq2) to build the clusters for all methods. I don't think this would pose any problems, but then again, I don't think this is a worthwhile analysis in the first place, so make of that what you will.
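If you insist on doing it anyway, a sketch of what I mean by using the same values for all methods (here 'deg.list' is a hypothetical named list containing one character vector of DEG identifiers per method, matching the row names of 'expr.vals'):

# Same variance-stabilized matrix for every method; only the gene subset changes.
clusterings <- lapply(deg.list, function(degs) {
    sub <- expr.vals[rownames(expr.vals) %in% degs, , drop=FALSE]
    hclust(dist(t(sub)))
})

# For example, cut each tree into two clusters and compare to the known conditions.
assignments <- lapply(clusterings, cutree, k=2)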