Dealing with (blood) contamination in RNA-seq data
Entering edit mode
Last seen 17 months ago

Dear Bioconductors, 

I am dealing with a poly-A selected RNA-seq dataset that contains samples collected from the central nervous system of mice treated either with a drug or a control vehicle. The primary goal of the analysis is the identification of differential gene expression associated with drug treatment.

During my initial QC, I noticed relatively high counts for different hemoglobin genes in the dataset, varying ca. 5-10 fold between samples. This probably indicates the varying degrees of blood contamination in the original samples (which are difficult to prepare). As far as I can tell, the degree of contamination does not appear to be associated with the variable of interest, e.g. drug treatment, as it fluctuates similarly within and across treatment groups.

Would anybody have recommendations on how to minimize the effects of such contamination on differential expression analysis? For example, I am considering an (unsupervised) surrogate variable analysis (SVA). Alternatively, given that the source of the contamination is (likely) known, I am wondering if it would be useful to include the expression of blood-specific marker genes (hemoglobin, etc) in a linear model.

Perhaps any of you have are willing to share some experience and / or advice?

Thanks a lot,


rnaseq rna-seq linear model limma • 2.4k views
Entering edit mode
Aaron Lun ★ 27k
Last seen 1 hour ago
The city by the bay

This sounds like a fairly standard use case for either sva or RUVSeq. Just supply your counts and the known experimental design, and let these methods pull out unwanted factors/surrogate variables corresponding to contamination. Alternatively, if you want to focus the algorithms' attention on blood contamination, you could subset the count matrix to only use blood-specific markers when identifying unwanted factors. Both of these strategies are probably better than manually building a blocking covariate yourself - the expression profile of a single gene will be subject to more noise than an empirical factor of variation constructed from many genes.

Another approach is to just filter out the haemoglobin genes (and other blood-specific markers) and ignore the effect of contamination elsewhere in the transcriptome. By the sound of it, the contamination seems to be random across groups so it won't confound DE testing between groups. While the inconsistent contamination will also inflate the variances of a few blood-expressed genes, these will be ignored if you set robust=TRUE in eBayes. If this simple solution gives workable results, then you might as well do it. Of course, if there are too many (>10%) genes with inflated variances, the robustness will not be sufficient and explicit regression of the contamination will be necessary.

Entering edit mode

Little follow-up on the topic. We have RNA-seq gene and exon-level data extracted from white blood cells. We have significant hits on various hemoglobin, haemoglobin metabolic processes and gas exchange related genes from the differential expression analysis. Foldchanges are quite small and vary between 1-2. Is it possible in general that from WBC RNA-seq data we can see expression changes in such genes or this more likely to be due contamination? And if so is RUVseq a valid option for approaching this issue?

Entering edit mode

From the count data alone, it isn't possible to determine if your log-fold changes are due to contamination. Without prior biological knowledge, who am I to say that white blood cells do not genuinely express haemoglobin after drug treatment? I guess it sounds silly, but I don't know what your drug does, either.

Similarly, RUVseq will not explicitly determine whether contamination has occurred. It will only identify hidden factors of variation, some of which may (or may not) be driven by variable contamination between samples. For practical purposes, this is enough - you don't care about the contamination itself, you just want to get rid of it for the DE analysis. As long as the contamination is not confounded with your drug treatment, RUVseq should be applicable; have a look at the haemoglobin expression across samples and ensure that the high-expressing samples are not all in one condition.

Entering edit mode

Yeah, that indeed sounds quite bizarre that WBC would express haemoglobin etc gas exchange related genes. 

1) If after RUVseq we still have the same results and there isn't any single high-expression samples in these genes do you see it as plausible that then the results might be genuine?

2) Is it an valid approach to exclude these genes in question from the analysis and do it without them?


Entering edit mode

1) Putting aside biological plausibility for the moment, there are still a few reasons why RUVseq may fail to remove contamination-related factors of variation in your data. The first is if the contamination effect doesn't lie in the first k factors, i.e., there are larger effects that take priority. The second is that RUVseq will not remove factors that are completely confounded with your conditions of interest. So it is entirely possible for technical effects to remain after applying RUVseq (albeit reduced in magnitude), thus requiring some care during interpretation.

2) Yes, and this will avoid drawing biologically nonsensical conclusions. However, any contamination effect will still be present in other genes. These may actually be more misleading as they are not obviously wrong.

Entering edit mode



Thank you very much for the help. Going to examine both approaches and see if the results make sense.

We have quite complex model in the analysis and was wondering about the RUVseq argument "k". Should it be set to 1 or how many in this case below (the vignette isn't really helpful in modifying the example code) ? And is the below right way to implement the W_1 in the model of differential expression analysis?

#Ruvseq code

design <- model.matrix(~ time + condition + time:subject.nested + condition:time, data=pData(set))
y <- DGEList(counts=counts(set1), group=coldata$CASE_CONTROL)
y <- calcNormFactors(y, method="upperquartile") 
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design) 
lrt <- glmLRT(fit, coef=35:36)
top <- topTags(lrt, n=nrow(set1))$table
empirical <- rownames(set1)[which(!(rownames(set1) %in% rownames(top)[1:5000]))]

set2 <- RUVg(set, empirical, k=1) 

#Code for differential expression

full <- model.matrix(~ W_1+ time + condition + time:subject.nested + condition:time, pData(set2))

reduced <- model.matrix(~ W_1 + time + condition + time:subject.nested, pData(set2))

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = counts(set2),colData = pData(set2), design = ~ 1  )

ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
keep <- rowSums(counts(ddsFullCountTable) >= 5) >= 5
ddsFullCountTable <- ddsFullCountTable[keep,]
dds <- DESeq(ddsFullCountTable, test="LRT", full=full, reduced=reduced)

Entering edit mode

This is turning into a question about DESeq and RUVseq. If you want more relevant advice, post a new question with the appropriate tags.


Login before adding your answer.

Traffic: 176 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6