Question

Reason for the difference in results obtained using DESeq and cufflinks


Aniket Vatsya ▴ 40

@aniket-vatsya-4237


Dear all,

Could you please tell me why there is a large difference in the number of differentially expressed genes obtained from cufflinks and DESeq? I found nearly 3000 upregulated genes at 5% FDR using cufflinks, whereas I found just 50 upregulated genes at 10% FDR using DESeq. I don't have any replicates.

Also, does DESeq overcome length bias, a general problem in RNA-Seq data analysis?

Regards,
Aniket

DESeq

updated 15.4 years ago by Simon Anders ★ 3.8k • written 15.4 years ago by Aniket Vatsya ▴ 40

score 0 · Answer 1 · 2010-08-30

Hi

On 08/30/2010 03:03 PM, Aniket Vatsya wrote:
> Could you please tell me why there is large differnce in number of
> differntially expressed genes obtained from cufflinks and DESeq. I found
> nearly 3000 upregulated genes at FDR 5% using cufflinks whereas just found
> 50 upregulated genes at 10% using DESeq. I dont have any replicates.

I suppose that by 'cufflinks' you mean the 'cuffdiff' tool that comes with cufflinks.

The reason is that DESeq and cuffdiff address two apparently similar, but actually very different, questions.

If you have two samples, cuffdiff tests, for each transcript, whether there is evidence that the concentration of this transcript is not the same in the two samples.

If you have two different experimental conditions, with replicates for each condition, DESeq tests whether, for a given gene, the change in expression strength between the two conditions is large compared to the variation within each replicate group.

This is a crucial difference. Imagine you had no replicates, just two samples: a control sample and one that was treated in some way. In the control sample, a certain gene has (after appropriate normalization) 100 counts, and in the treatment sample it has 130 counts. You might be tempted to conclude that the treatment causes this gene to be upregulated by 30%. But now imagine you do your control experiment five times and get 100 counts, 120 counts, 85 counts, 145 counts, and 129 counts. Now it becomes clear that a 30% upregulation may well mean nothing at all and could easily be caused by random differences between the samples that have nothing to do with the treatment.

This is why doing such experiments without any replicates is rather pointless. You simply need to know how much expression changes even if you try to keep the conditions constant.

cuffdiff is of course correct if it tells you that a change from 100 to 130 counts is likely due to a real difference in transcript concentration between the two samples. However, this is unlikely to be the answer to your question, which presumably should be: In which genes does expression change _due_to_ the difference in treatment? Hence, even if you had replicates, DESeq would give you far fewer hits than cufflinks.

Please read the DESeq package vignette or our paper to learn about the assumption of variance-mean dependence and about what the "blind variance estimation" does, which you seem to have used (as otherwise DESeq would have refused to process data without replicates).

Simon
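To make the replicate example above concrete, here is a minimal Python sketch that compares the single treated count (130) with the spread of the five control counts quoted in the answer. It uses only plain summary statistics from the standard library; it is not what DESeq or cuffdiff compute, and the variable names are made up for illustration.

```python
from statistics import mean, stdev

# Counts quoted in the answer: five repeats of the control experiment
# and the single treated sample.
control_counts = [100, 120, 85, 145, 129]
treated_count = 130

control_mean = mean(control_counts)  # average of the control repeats
control_sd = stdev(control_counts)   # sample standard deviation of the repeats

# How far the treated count lies from the control mean, in control SDs.
z_score = (treated_count - control_mean) / control_sd

# Whether the treated count falls inside the range seen among the controls.
within_range = min(control_counts) <= treated_count <= max(control_counts)

print(f"control mean = {control_mean:.1f}, SD = {control_sd:.1f}")
print(f"treated count lies {z_score:.2f} SD above the control mean")
print(f"treated count inside the control range: {within_range}")
```

The treated count sits well within the spread of the control repeats, which is the point of the example: without replicates there is no way to see that spread.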

score 0 · Answer 2 · 2010-08-30

Hi

On 08/30/2010 03:03 PM, Aniket Vatsya wrote:
> Also does DESeq overcome length bias, a general problem in RNA seq data
> analysis?

I don't quite agree with the term "length bias", as it is not really a bias in the differential expression analysis.

In RNA-Seq, the number of reads mapped to a gene determines the power you have to detect differential expression. See Fig. 2 of our preprint (http://precedings.nature.com/documents/4282/version/2) for an illustration. For the example data used in this figure, differential expression (at 10% FDR) can be detected if the log2 fold change is at least around 0.5, provided the count values are very high. If you have only around 100 counts, the log2 fold change needs to be at least 1, and for 10 counts, at least 2.

Hence, the power to detect differential expression depends strongly on the count, and the count in turn depends on two things, namely (i) the expression strength (say, averaged over both conditions) and (ii) the gene length (because longer genes give rise to more fragments at the same expression level).

In a subsequent analysis looking, e.g., for enrichment in gene categories, this causes bias. However, this bias should not and cannot be dealt with by the method used to test for differential expression. It should instead be taken into account by the enrichment test. When adjusting such a test, I would suggest using the count level directly as input, and not the transcript length, as the latter is only half of the story.

Simon
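As a rough illustration of why power depends on the count level, the sketch below approximates the standard error of a log2 fold change between two single samples with the same mean count. It assumes a negative-binomial-style variance (mean + dispersion × mean²) and a delta-method approximation; the function name and the dispersion value of 0.02 are hypothetical choices for illustration, not taken from DESeq or from the figure cited above, so the numbers will not reproduce the thresholds quoted in the answer.

```python
import math

def log2fc_standard_error(mean_count, dispersion=0.0):
    """Approximate SE of the log2 fold change between two single samples.

    Assumes each count has variance mean + dispersion * mean**2, so that
    Var(ln count) is roughly 1/mean + dispersion (delta method). The two
    samples are independent, hence the factor of 2.
    """
    var_ln_ratio = 2.0 * (1.0 / mean_count + dispersion)
    return math.sqrt(var_ln_ratio) / math.log(2)

# Hypothetical dispersion, chosen only for illustration.
ASSUMED_DISPERSION = 0.02

for n in (10, 100, 1000, 10000):
    se_poisson = log2fc_standard_error(n)                    # counting noise only
    se_nb = log2fc_standard_error(n, ASSUMED_DISPERSION)     # plus sample-to-sample variation
    print(f"mean count {n:>5}: SE(log2FC) = {se_poisson:.3f} (Poisson), "
          f"{se_nb:.3f} (with dispersion)")
```

Under these assumptions the uncertainty shrinks as the count grows but levels off at a floor set by the dispersion, so a smaller fold change becomes detectable at higher counts. Since the count grows with both expression strength and gene length, this is consistent with the suggestion to use the count level itself, rather than the transcript length, when adjusting an enrichment test.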