Question

Reference for differential gene expression analysis

fagire • 0

@fagire-7144

Uruguay

Hi,

I plan to use your DESeq2 package for differential expression analysis between two conditions, and I'm wondering which transcriptome(s) (a consensus assembly or individual per-sample assemblies) I should use as the reference. I don't have a genome for my species.

Some people suggest generating a single assembly from the combined reads of all samples, and then aligning each sample's reads separately back to that single ("consensus") assembly for downstream analysis of differential expression.

The other option simply consists of aligning the reads of each sample to its own assembly. I do not know to what extent the heterogeneity of individual assemblies (different numbers of genes and isoforms, differences in transcript lengths, etc.) can affect the analysis of differential gene expression.

Which would be the best option?

One last question: many people use RSEM to obtain expected counts (for non-model species), even though DESeq expects raw counts. Do you think, for example, that the Corset (http://genomebiology.com/2014/15/7/410) approach would be better?

Thanks in advance,

Facundo

deseq2 • 2.9k views

Updated 9.4 years ago by Michael Love • written 9.4 years ago by fagire

score 2 · Answer 1 · 2014-12-09

Hi

I think you have already answered your own question:

"I do not know to what extent the heterogeneity of individual assemblies (different numbers of genes and isoforms, differences in transcript lengths, etc.) can affect the analysis of differential gene expression."

Exactly. But it seems safe to say that it will affect the analysis in some way, and that you then cannot tell whether any differences you see between groups are real biological differences or simply due to differences in the quality and content of the sample-specific references.

Therefore, your first option seems to me to be the only sensible way to go: "... generating a single assembly from the combined reads of all samples, and then aligning each sample's reads separately back to that single ("consensus") assembly for downstream analysis of differential expression."

"One last question: many people use RSEM to obtain expected counts (for non-model species), even though DESeq expects raw counts. Do you think, for example, that the Corset (http://genomebiology.com/2014/15/7/410) approach would be better?"

I'm not familiar enough with RSEM, and hadn't heard of Corset until now, so I can't give a qualified answer. Judging from the Corset paper's abstract, though, it sounds like quite a useful approach. Maybe somebody else here can share some first-hand experience in using it?

Simon

score 1 · Answer 2 · 2014-12-09

Unfortunately, I don't personally have enough experience generating new transcriptomes to say which option will give better results.

Both options, a single consensus assembly or each sample aligned to its own assembly, can be accommodated by DESeq2. The single consensus method is the typical workflow.

If you align the reads of each sample to its own assembly, you would first need to make sure you've properly matched up the genes across the different samples, and you would have to remove any genes that don't have a match in all samples.

Second, the count of reads that align to a gene is proportional to the gene's average effective transcript length, where the average is weighted by the relative expression of each of the gene's transcripts. This is provided, for example, by RSEM's rsem-calculate-expression as the "effective_length" column in the *genes.results file. You can collect this column from every sample to account for the effect of transcript-length differences across samples on the count of reads that aligned uniquely to the genes. The way to do this is to supply a matrix of average effective transcript lengths (a matrix of # genes x # samples) to the normMatrix argument of estimateSizeFactors(), and then continue with DESeq(). If you go with multiple assemblies and try DESeq2, this would be my recommendation, rather than using the expected count.
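As a rough illustration of the matrix described above, here is a minimal Python sketch that collects the "effective_length" column from several RSEM *genes.results tables into a genes x samples matrix, keeping only genes present in every sample. It assumes the gene IDs have already been matched across the per-sample assemblies (that matching step is not shown), and the function names, sample names, and toy values are all made up for the example. The result would still need to be written out and passed to normMatrix on the R side.

```python
import csv
import io


def read_effective_lengths(genes_results_text):
    """Parse one RSEM *genes.results table (tab-separated text)
    into a dict {gene_id: effective_length}."""
    reader = csv.DictReader(io.StringIO(genes_results_text), delimiter="\t")
    return {row["gene_id"]: float(row["effective_length"]) for row in reader}


def effective_length_matrix(tables):
    """Build a genes x samples matrix of effective lengths.

    `tables` maps a sample name to the text of its *genes.results file.
    Assumption: gene IDs are already comparable across samples.
    Genes missing from any sample are dropped.
    Returns (gene_ids, sample_names, matrix), with matrix[i][j] holding
    the effective length of gene i in sample j.
    """
    samples = list(tables)
    per_sample = {s: read_effective_lengths(tables[s]) for s in samples}
    shared = set.intersection(*(set(d) for d in per_sample.values()))
    genes = sorted(shared)
    matrix = [[per_sample[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Toy tables with invented values, using the genes.results column names.
HEADER = "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n"
toy_tables = {
    "sample_a": HEADER
    + "geneA\tt1,t2\t1200.00\t1050.50\t30.00\t10.00\t8.00\n"
    + "geneB\tt3\t800.00\t650.25\t12.00\t5.00\t4.00\n"
    + "geneC\tt4\t500.00\t350.00\t7.00\t3.00\t2.00\n",
    "sample_b": HEADER
    + "geneA\tt1\t1100.00\t980.00\t25.00\t9.00\t7.00\n"
    + "geneB\tt3,t5\t850.00\t700.00\t15.00\t6.00\t5.00\n",
}

genes, samples, matrix = effective_length_matrix(toy_tables)
for gene, row in zip(genes, matrix):
    print(gene, row)
```

Here geneC is dropped because it is absent from sample_b, which mirrors the requirement to remove genes without a match across all samples.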

Yes, Corset looks like it is worth trying here.