Question

Read normalization in a complex sample


RMRG ▴ 10

@rmrg-13708


Hello,

I am hoping for some help in understanding which read normalization approach is appropriate for the biologically complex samples that I want to use for DE analysis.

I'm interested in DE of a bacterial endosymbiont of an insect; specifically, I'm trying to establish which genes are differentially expressed in the bacterium when the host insect is infected with a eukaryotic parasite. So, in one condition (4 replicates) I have host + symbiont (plus other gut bacteria, etc.) and in the other condition (also 4 replicates), I have host + symbiont + parasite.

I've taken the approach of mapping reads to a preliminary genome sequence of the symbiont, summarizing the reads with featureCounts, and doing DE with DESeq2. I've followed the basic protocol that I've seen outlined in various vignettes. I've found a handful of DE genes (~10 upregulated, ~50 downregulated), but I'm concerned that I may not be normalizing my data in the most intelligent way.

The read depth for each replicate is fairly even, ranging from ~42-57 million 2 x 125 bp reads for each. And the number of mapping endosymbiont reads is fairly even, too, ranging from 0.6-0.9% of the reads (unfortunately low, but I have to live with this). I haven't quantified the proportion of reads that derive from the parasite, but it's not huge (maybe 10%) - most of the reads seem to come from the insect host and its gut flora.

As I understand it, DESeq2 is probably normalizing to the number of mapped endosymbiont reads/fragments for the replicates in the approach I've taken (?). But the parasite in one set of samples is going to affect the normalization in complicated ways. Is there some better method that might be applicable in this case?

Thanks!

RMRG

deseq2 differential expression normalization • 1.1k views

updated 7.7 years ago by Michael Love 43k • written 7.7 years ago by RMRG ▴ 10

score 0 · Answer 1 · 2017-08-10

If you want to specify particular genes to normalize with (which sets the horizontal line in the MA plot), you can use the controlGenes argument of estimateSizeFactors(). However, if only a small subset of the genes differs between conditions (the parasite is the concern, it sounds like), those genes are not going to be a problem for the normalization, as it is designed to be robust to such differences in the tails. By "tails" I am referring to how the method actually works: it looks at the ratio between counts for different samples across all genes, and then takes the median of these ratios. So the presence/absence of parasite gene expression won't affect the normalization, as long as the affected genes are not very substantial relative to the number of insect/symbiont genes.
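To make the median-of-ratios idea concrete, here is a minimal sketch in plain Python. It is not DESeq2's code, only an illustration of the technique under stated assumptions: the toy count matrix is invented, and `size_factors` with its `control_genes` parameter is a hypothetical helper that loosely mimics restricting the estimate to chosen genes, in the spirit of controlGenes.

```python
import math
from statistics import median


def size_factors(counts, control_genes=None):
    """Median-of-ratios size factors (illustrative helper, not DESeq2 itself).

    counts: list of rows, one per gene; each row holds per-sample counts.
    control_genes: optional row indices to restrict the estimate to
    (hypothetical parameter, loosely analogous to controlGenes).
    """
    n_samples = len(counts[0])
    rows = range(len(counts)) if control_genes is None else control_genes
    log_ratios = [[] for _ in range(n_samples)]
    for g in rows:
        row = counts[g]
        if any(c <= 0 for c in row):
            continue  # a zero count gives no finite geometric mean; skip the gene
        # log of the gene's geometric mean across samples
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # per sample: median over genes of the ratio to the geometric mean
    return [math.exp(median(r)) for r in log_ratios]


# Invented toy data: two samples, the second sequenced twice as deeply.
# The first seven genes are unchanged; the last two are strongly up in sample 2.
counts = [
    [10, 20], [20, 40], [30, 60], [40, 80], [50, 100], [60, 120], [70, 140],
    [5, 200], [8, 300],
]

sf = size_factors(counts)
sf_ratio = sf[1] / sf[0]  # stays close to the true depth ratio of 2

# Naive total-count scaling is pulled upward by the two changed genes.
totals = [sum(row[j] for row in counts) for j in range(2)]
total_ratio = totals[1] / totals[0]

# Restricting to the seven unchanged genes gives the same answer here.
sf_ctrl = size_factors(counts, control_genes=range(7))

print(sf_ratio, total_ratio, sf_ctrl[1] / sf_ctrl[0])
```

In this toy case the two strongly changed genes sit in the tail of the ratio distribution, so the median ignores them, while the total-count ratio is inflated by them. If a large fraction of genes shifted in the same direction, the median would move too, and that is the situation where restricting the estimate to control genes becomes worth considering.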