Question

Influence of average genome size on DESeq2 results

jessawbryant • 0

Hello,

I'm hoping to get confirmation that my line of reasoning is correct.

I'm using DESeq2 to test for log2 fold differences in microbial gene abundances between two habitats sampled using metagenomics. I have reason to believe that the average genome sizes in the two habitats differ. Average genome size differences influence differential abundance tests in which gene counts have been converted to proportions of total reads. For example, universal single-copy genes will make up a greater proportion of a metagenome from small genomes than of a metagenome from large genomes, making the genes appear enriched in the former despite being present in exactly one copy per genome in both habitats. I know DESeq2 uses count data, not proportions, but average genome size still influences DESeq2 results in a similar fashion through its effect on the sample-specific size factor, correct?

I'm not interested in identifying genes whose log2 fold changes are caused by genome size differences. To exclude these, I looked at the log2 fold changes for a set of 72 prokaryotic universal single-copy genes. Their log2 fold changes between habitats ranged from -0.3 to -1.4 (more abundant in the habitat with smaller genomes). Therefore I am going to test for genes with an absolute log2 fold change greater than two between habitats. Does that sound reasonable?

code: results(dds, lfcThreshold=2, altHypothesis="greaterAbs", alpha=0.005)

Thanks in advance,

Jessica

deseq2 normalization metagenomics • 1.3k views

updated 8.5 years ago by Michael Love 41k • written 8.5 years ago by jessawbryant • 0

score 0 · Answer 1 · 2015-10-09

DESeq2 uses the median ratio method to determine the "size factor", which is used inside the model to account for systematic differences in scale across the columns of the count matrix (i.e. the samples).

So the total count can be skewed by a minority of rows, but the median still accurately captures the general differences. The median is more robust in this way than the sum or mean of each column. See the DESeq or DESeq2 paper for the formal definition of the median ratio method.
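To make the robustness point concrete, here is a minimal base R sketch of the median ratio idea on a made-up count matrix. The matrix values and the helper name median_ratio_size_factors are illustrative assumptions, and this is a hand-written approximation of the method, not DESeq2's own code.

```r
# Toy count matrix: rows are genes, columns are samples (values are made up).
counts <- matrix(
  c( 10,  20,   40,
    100, 200,  400,
     50, 100,  200,
      5,  10, 2000),  # one row that inflates the third column's total
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper: median-of-ratios size factors written out by hand.
median_ratio_size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)   # log geometric mean of each gene
  keep <- is.finite(log_geo_means)        # drop genes with a zero in any sample
  apply(log_counts[keep, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[keep]))
  })
}

size_factors <- median_ratio_size_factors(counts)
size_factors  # 0.5, 1, 2: the single extreme row does not move the median

# For comparison: factors based on column totals, scaled by their geometric mean.
totals <- colSums(counts)
total_count_factors <- totals / exp(mean(log(totals)))
total_count_factors
```

Three of the four rows differ by a factor of two from one sample to the next, and the median recovers exactly that, while the column totals are dominated by the one extreme row.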

If you have a set of rows that you know to be not duplicated (and assuming these are not all differentially expressed), you can pass their indices to the controlGenes argument of estimateSizeFactors, so that the size factors are estimated from those rows only.
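A sketch of how the controlGenes suggestion might look in practice. The data set here is simulated with makeExampleDESeqDataSet, and treating the first 72 rows as the single-copy genes is purely an assumption for illustration; in a real analysis the index would come from matching the row names against your own list of single-copy gene identifiers.

```r
library(DESeq2)

set.seed(1)
# Simulated data standing in for the real metagenomic count matrix.
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Assumption for illustration: rows 1 to 72 are the universal single-copy genes.
single_copy_idx <- 1:72

# Estimate size factors from the control rows only.
dds <- estimateSizeFactors(dds, controlGenes = single_copy_idx)
sizeFactors(dds)

# DESeq() reuses size factors that are already stored in the object,
# so the control-gene normalization carries through to the test.
dds <- DESeq(dds)
res <- results(dds, alpha = 0.05)
summary(res)
```

With size factors anchored on the single-copy genes, those genes should sit near a log2 fold change of zero, which is one way to check that the genome size effect has been absorbed by the normalization.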

Note that an adjusted p-value threshold of 0.005 is quite strict. It amounts to saying: give me a list of genes, but I can't tolerate more than 1 in 200 of them being false discoveries. I'd say 0.1, 0.05, or 0.01 are more reasonable cutoffs.