I am dealing with a poly-A selected RNA-seq dataset that contains samples collected from the central nervous system of mice treated either with a drug or a control vehicle. The primary goal of the analysis is the identification of differential gene expression associated with drug treatment.
During my initial QC, I noticed relatively high counts for several hemoglobin genes in the dataset, varying ca. 5-10 fold between samples. This probably indicates varying degrees of blood contamination in the original samples (which are difficult to prepare). As far as I can tell, the degree of contamination is not associated with the variable of interest, i.e. drug treatment, as it fluctuates similarly within and across treatment groups.
Would anybody have recommendations on how to minimize the effects of such contamination on the differential expression analysis? For example, I am considering an (unsupervised) surrogate variable analysis (SVA). Alternatively, given that the source of the contamination is (likely) known, I am wondering if it would be useful to include the expression of blood-specific marker genes (hemoglobin, etc.) as a covariate in the linear model.
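To make the second idea concrete, here is a minimal sketch of what I have in mind, not a recommended pipeline. It uses plain NumPy on toy simulated data (none of the numbers come from my dataset), and the names `blood_score`, `fit_gene_models` and `marker_idx` are hypothetical helpers I made up for illustration. The assumptions are that expression is already normalized and log-transformed, and that averaging a few marker genes gives a usable per-sample contamination score.

```python
import numpy as np

def blood_score(log_expr, marker_idx):
    """Per-sample contamination score: mean log-expression of the
    marker genes, centered across samples."""
    score = log_expr[marker_idx, :].mean(axis=0)
    return score - score.mean()

def fit_gene_models(log_expr, treatment, score):
    """Ordinary least squares per gene:
    log_expr ~ intercept + treatment + blood score.
    Returns an array of shape (genes, 3) with the coefficients."""
    X = np.column_stack([np.ones_like(score), treatment, score])
    coef, *_ = np.linalg.lstsq(X, log_expr.T, rcond=None)
    return coef.T

# Toy simulated data (assumption: genes in rows, samples in columns).
rng = np.random.default_rng(0)
n_genes, n_samples = 50, 12
treatment = np.array([0, 1] * 6, dtype=float)            # control / drug
contamination = rng.uniform(0.0, 3.0, size=n_samples)    # unrelated to treatment
log_expr = rng.normal(5.0, 0.1, size=(n_genes, n_samples))

marker_idx = [0, 1, 2]                      # stand-ins for hemoglobin genes
log_expr[marker_idx, :] += contamination    # markers track contamination
log_expr[3, :] += 0.8 * contamination       # non-marker gene affected by blood
log_expr[4, :] += 1.0 * treatment           # gene responding to the drug

score = blood_score(log_expr, marker_idx)
coef = fit_gene_models(log_expr, treatment, score)

# Columns of coef: intercept, treatment effect, blood-score effect.
print("gene 4 treatment effect:", coef[4, 1])
print("gene 3 blood-score effect:", coef[3, 2])
```

In a real analysis I would presumably supply the score as an extra covariate in the design of whichever differential expression tool I end up using, and exclude the marker genes themselves from testing, but I am unsure whether this is preferable to SVA.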
Perhaps some of you would be willing to share your experience and/or advice?
Thanks a lot,