Question: Reference for differential gene expression analysis
fagire (Uruguay) wrote, 2.9 years ago:

Hi,

I plan to use your DESeq2 package for differential expression analysis between two conditions, and I'm wondering which transcriptome(s) (a consensus assembly or per-sample assemblies) I should use as the reference. I don't have a genome for my species.

Some people suggest generating a single assembly from the combined reads of all samples and then aligning each sample's reads separately back to that single ("consensus") assembly for downstream differential expression analysis.

The other option simply consists of aligning the reads of each sample to its own assembly. I do not know to what extent the heterogeneity of individual assemblies (different numbers of genes and isoforms, differences in transcript lengths, etc.) can affect the analysis of differential gene expression.

Which would be the best option?

One last question: many people use RSEM to obtain expected counts (for non-model species) even though DESeq uses raw counts. Do you think, for example, that the Corset approach (http://genomebiology.com/2014/15/7/410) would be better?

Thanks in advance,

Facundo

Simon Anders (Zentrum für Molekularbiologie, Universität Heidelberg) wrote, 2.9 years ago:

Hi

I think you have already answered your own question:

"I do not know to what extent the heterogeneity of individual assemblies (different numbers of genes and isoforms, differences in transcript lengths, etc.) can affect the analysis of differential gene expression."

Exactly. But it seems safe to say that it will somehow affect the analysis, and that you then cannot say whether any differences between groups that you see are really biological differences or simply due to differences in quality and content of the sample-specific references.
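
A toy calculation may make this concern concrete. The sketch below assumes a deliberately simple model in which the expected read count for a gene is proportional to its abundance times the length of its assembled transcript. The function name, the lengths and the scaling constant are hypothetical and only serve the illustration.

```python
def expected_count(abundance, transcript_length, scale=100):
    """Toy model (assumption): expected reads ~ abundance * transcript length."""
    return abundance * transcript_length / scale


# The gene has the same true abundance in both samples, but the two
# sample-specific assemblies reconstructed its transcript at different
# lengths (hypothetical numbers).
abundance = 50.0
count_a = expected_count(abundance, transcript_length=1000)
count_b = expected_count(abundance, transcript_length=1500)

# Any ratio different from 1 here comes from the references, not from biology.
apparent_fold_change = count_b / count_a
print(count_a, count_b, apparent_fold_change)
```

Under this model the two samples differ in count purely because their references differ, which is the kind of artefact that cannot be separated from a real biological difference afterwards.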

Therefore, your first option seems to me to be the only sensible way to go: "... generating a single assembly from the combined reads of all samples and then aligning each sample's reads separately back to that single ("consensus") assembly for downstream differential expression analysis."

"One last question: many people use RSEM to obtain expected counts (for non-model species) even though DESeq uses raw counts. Do you think, for example, that the Corset approach (http://genomebiology.com/2014/15/7/410) would be better?"

I'm not familiar enough with RSEM, and hadn't heard of Corset until now, so I can't give a qualified answer. But judging from the Corset paper's abstract, this sounds like quite a useful approach. Maybe somebody else here can share some first-hand experience in using it?

  Simon
 

Michael Love (United States) wrote, 2.9 years ago:

Unfortunately I personally don't have enough experience in generation of new transcriptomes to answer which will give better results. 

Both options, a single consensus assembly or each sample aligned to its own assembly, can be accommodated by DESeq2. The single consensus method is the typical workflow.

If you align the reads of each sample to its own assembly, you would first need to make sure you have properly matched up the genes from the different samples, and you would have to remove any genes that don't have a match across all samples.

Secondly, the count of reads that align to a gene is proportional to the average effective transcript length, where the average is weighted by the proportional expression of each transcript of the gene. This is provided, for example, by RSEM's rsem-calculate-expression as the column "effective_length" in the *genes.results file. So you can use this column, gathered across samples, to account for the effect of differences in transcript length across samples on the count of reads that aligned uniquely to the genes. The way to do this is to supply a matrix of average effective transcript lengths (a matrix of # genes x # samples) to the normMatrix argument of estimateSizeFactors(), and then continue with DESeq(). If you go with multiple assemblies and try DESeq2, this would be my recommendation, rather than using the expected count.
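
To illustrate the bookkeeping described above, here is a minimal sketch in plain Python. It is not DESeq2 or RSEM code: the function names, the data layout and all numbers are hypothetical. It only shows the expression-weighted average of transcript effective lengths for a gene, and how a # genes x # samples matrix might be assembled after dropping genes that are not matched in every sample. RSEM already reports the gene-level value, so the first function is there only to show the definition; the resulting matrix would then be handed to normMatrix in R.

```python
def weighted_effective_length(transcripts):
    """Average effective length of a gene's transcripts, weighted by expression.

    transcripts: list of (effective_length, relative_expression) pairs
    for one gene in one sample.
    """
    total_expression = sum(expr for _, expr in transcripts)
    if total_expression == 0:
        # Assumption for this sketch: fall back to the unweighted mean
        # when no transcript of the gene is expressed.
        return sum(length for length, _ in transcripts) / len(transcripts)
    return sum(length * expr for length, expr in transcripts) / total_expression


def length_matrix(per_sample_lengths):
    """Build a genes x samples matrix of average effective lengths.

    per_sample_lengths: dict mapping sample name -> {gene: average length}.
    Genes without a match in every sample are removed.
    """
    samples = sorted(per_sample_lengths)
    shared = set.intersection(*(set(per_sample_lengths[s]) for s in samples))
    genes = sorted(shared)
    matrix = [[per_sample_lengths[s][g] for s in samples] for g in genes]
    return genes, samples, matrix


# Hypothetical per-sample assemblies: geneA is matched in both samples,
# geneB and geneC each appear in only one.
per_sample = {
    "sample1": {
        "geneA": weighted_effective_length([(1000, 3.0), (2000, 1.0)]),
        "geneB": weighted_effective_length([(800, 1.0)]),
    },
    "sample2": {
        "geneA": weighted_effective_length([(1200, 1.0), (1800, 1.0)]),
        "geneC": weighted_effective_length([(500, 2.0)]),
    },
}

genes, samples, matrix = length_matrix(per_sample)
print(genes, samples, matrix)
```

The count matrix would need to be restricted to the same genes, in the same row and column order, before the two are used together.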

Yes, Corset looks like it is worth trying here.
