I'm hoping to get confirmation that my line of reasoning is correct.
I'm using DESeq2 to test for log2 fold differences in microbial gene abundances between two habitats sampled using metagenomics. I have reason to believe that the average genome sizes in the two habitats differ. Differences in average genome size bias differential abundance tests in which gene counts have been converted to a proportion of total reads. For example, universal single-copy genes make up a greater proportion of a metagenome from small genomes than of a metagenome from large genomes, which makes those genes appear enriched in the former even though they are present at exactly one copy per genome in both habitats. I know DESeq2 uses count data, not proportions, but average genome size still influences DESeq2 results in a similar way through its effect on the sample-specific size factors, correct?
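To make the proportion argument concrete, here is a minimal base R sketch with made-up numbers. The gene length, the two genome sizes, and the read depth are assumptions chosen for illustration, not values from this dataset, and the sketch only covers the read-proportion effect, not DESeq2's size factor estimation.

```r
# Toy numbers, assumed for illustration only.
gene_length_bp <- 1000                          # one universal single-copy gene
genome_size_bp <- c(small = 2e6, large = 4e6)   # average genome size per habitat
total_reads    <- 1e6                           # same sequencing depth in both habitats

# With one copy per genome, the fraction of sequenced DNA that comes from
# the gene is roughly gene length / average genome size.
gene_fraction <- gene_length_bp / genome_size_bp

# Expected read counts for the gene at equal depth.
expected_counts <- total_reads * gene_fraction

# Apparent log2 fold change (large vs. small genomes), even though the
# copy number per genome is identical in both habitats.
apparent_lfc <- log2(expected_counts[["large"]] / expected_counts[["small"]])

print(expected_counts)
print(apparent_lfc)
```

In this toy case a twofold difference in average genome size alone produces an apparent log2 fold change of -1 for the single-copy gene.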
I'm not interested in identifying genes whose log2 fold change is caused by genome size differences. To exclude these, I looked at the log2 fold change for a set of 72 prokaryotic universal single-copy genes. Their log2 fold changes between habitats ranged from -0.3 to -1.4 (more abundant in the habitat with smaller genomes). I therefore plan to test for genes with an absolute log2 fold change greater than 2 between habitats. Does that sound reasonable?
Code: results(dds, lfcThreshold = 2, altHypothesis = "greaterAbs", alpha = 0.005)
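For context, the single-copy gene check and the thresholded test could be run roughly as follows. This is only a sketch under stated assumptions: it uses DESeq2's simulated `makeExampleDESeqDataSet()` data in place of the real metagenomic counts, its `condition` factor stands in for habitat, and `uscg_ids` is a hypothetical stand-in for the row names of the 72 single-copy genes.

```r
library(DESeq2)

# Simulated stand-in for the real data (assumption): 1000 genes, 6 samples,
# with a two-level `condition` factor playing the role of habitat.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds <- DESeq(dds)

# Hypothetical IDs standing in for the 72 universal single-copy genes.
uscg_ids <- rownames(dds)[1:72]

# Step 1: range of log2 fold changes across the single-copy genes.
res_all <- results(dds)
uscg_lfc_range <- range(res_all[uscg_ids, ]$log2FoldChange, na.rm = TRUE)
print(uscg_lfc_range)

# Step 2: test for |log2 fold change| > 2 at alpha = 0.005.
res <- results(dds, lfcThreshold = 2, altHypothesis = "greaterAbs", alpha = 0.005)
summary(res)
```

With real data, `dds` would be the existing DESeqDataSet and `uscg_ids` the actual identifiers of the single-copy genes.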
Thanks in advance,