Methods like TMM normalization (in calcNormFactors()) or the median-of-ratios method (in DESeq's estimateSizeFactors()) were designed for exactly the scenario you describe. If one sample has more hemoglobin mRNAs, the coverage of all other genes will be suppressed when the total amount of sequencing resource is fixed. This is what is known as "composition bias", and it is the whole motivation for computing scaling factors (normalization factors in edgeR, size factors in DESeq - note that these are not the same thing!) that are not simply derived from the library size of each sample.
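To make that concrete, here is a minimal sketch of how those scaling factors are obtained in each framework; 'counts' stands in for your gene-by-sample matrix of raw counts and 'group' is a placeholder for your experimental factor, so adjust as needed:

```r
library(edgeR)

## 'counts' stands in for your gene-by-sample matrix of raw integer counts.
y <- DGEList(counts=counts)
y <- calcNormFactors(y)   # TMM normalization factors
y$samples                 # lib.size and norm.factors for each sample

## The analogous size factor calculation in DESeq2 (the successor to DESeq),
## where 'group' is a placeholder for your experimental factor:
# library(DESeq2)
# dds <- DESeqDataSetFromMatrix(countData=counts,
#                               colData=data.frame(group=group),
#                               design=~group)
# dds <- estimateSizeFactors(dds)
# sizeFactors(dds)
```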
In practice, the success of normalization depends on the presence of sufficiently large counts for the non-hemoglobin genes. All of these methods operate on ratios, and once the counts get too small, the ratios become unstable or undefined, requiring some ad hoc workarounds to avoid nonsensical scaling factor estimates. You can check whether this is a problem in your data by creating MA plots for each sample (e.g., with plotSmear); if you see lots of discrete lines or patterns on the left of the plot (i.e., at low abundances), your counts are probably too low.
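For example, one rough way to generate per-sample MA plots with plotSmear, continuing from the DGEList above: assigning each sample to its own group is just a device that lets plotSmear compare one sample against an arbitrarily chosen reference sample.

```r
## Give each sample its own group so plotSmear can compare any two samples.
y$samples$group <- factor(colnames(y))
ref <- colnames(y)[1]   # reference sample, chosen arbitrarily here

for (smp in setdiff(colnames(y), ref)) {
    plotSmear(y, pair=c(ref, smp), main=paste(smp, "vs", ref))
    abline(h=0, col="blue")
}
```

Discrete diagonal stripes on the left of these plots correspond to ratios of very small integers, which is the warning sign described above.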
The other consideration is that, if the proportion of hemoglobin is highly variable across samples, you will want to set robust=TRUE in the downstream edgeR functions for dispersion estimation. This ensures that the highly variable hemoglobin genes are treated as outliers and do not inflate the apparent variability of the dispersions during empirical Bayes shrinkage, which would otherwise reduce the amount of shrinkage applied to all other genes. One could also filter out the hemoglobins entirely from the analysis, though this may not be sufficient; such high variability in hemoglobins can be a symptom of an underlying source of variability that affects other genes.
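As a sketch of what that looks like in code, assuming 'design', 'group', and 'hb.genes' are placeholders for your model matrix, experimental factor, and hemoglobin gene IDs:

```r
## 'design' is your model matrix, e.g. model.matrix(~group).
y <- estimateDisp(y, design, robust=TRUE)
fit <- glmQLFit(y, design, robust=TRUE)   # robustified QL dispersion shrinkage

## Optionally, drop the hemoglobin genes before estimating dispersions,
## assuming 'hb.genes' holds their row names:
# y <- y[!rownames(y) %in% hb.genes, , keep.lib.sizes=FALSE]
```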