Search
Question: Dealing with (blood) contamination in RNA-seq data
0
gravatar for Thomas Sandmann
17 months ago by
USA
Thomas Sandmann60 wrote:

Dear Bioconductors, 

I am dealing with a poly-A selected RNA-seq dataset that contains samples collected from the central nervous system of mice treated either with a drug or a control vehicle. The primary goal of the analysis is the identification of differential gene expression associated with drug treatment.

During my initial QC, I noticed relatively high counts for different hemoglobin genes in the dataset, varying ca. 5-10 fold between samples. This probably indicates the varying degrees of blood contamination in the original samples (which are difficult to prepare). As far as I can tell, the degree of contamination does not appear to be associated with the variable of interest, e.g. drug treatment, as it fluctuates similarly within and across treatment groups.

Would anybody have recommendations on how to minimize the effects of such contamination on differential expression analysis? For example, I am considering an (unsupervised) surrogate variable analysis (SVA). Alternatively, given that the source of the contamination is (likely) known, I am wondering if it would be useful to include the expression of blood-specific marker genes (hemoglobin, etc) in a linear model.

Perhaps any of you have are willing to share some experience and / or advice?

Thanks a lot,

Thomas

ADD COMMENTlink modified 17 months ago by Aaron Lun20k • written 17 months ago by Thomas Sandmann60
3
gravatar for Aaron Lun
17 months ago by
Aaron Lun20k
Cambridge, United Kingdom
Aaron Lun20k wrote:

This sounds like a fairly standard use case for either sva or RUVSeq. Just supply your counts and the known experimental design, and let these methods pull out unwanted factors/surrogate variables corresponding to contamination. Alternatively, if you want to focus the algorithms' attention on blood contamination, you could subset the count matrix to only use blood-specific markers when identifying unwanted factors. Both of these strategies are probably better than manually building a blocking covariate yourself - the expression profile of a single gene will be subject to more noise than an empirical factor of variation constructed from many genes.

Another approach is to just filter out the haemoglobin genes (and other blood-specific markers) and ignore the effect of contamination elsewhere in the transcriptome. By the sound of it, the contamination seems to be random across groups so it won't confound DE testing between groups. While the inconsistent contamination will also inflate the variances of a few blood-expressed genes, these will be ignored if you set robust=TRUE in eBayes. If this simple solution gives workable results, then you might as well do it. Of course, if there are too many (>10%) genes with inflated variances, the robustness will not be sufficient and explicit regression of the contamination will be necessary.

ADD COMMENTlink modified 17 months ago • written 17 months ago by Aaron Lun20k

Little follow-up on the topic. We have RNA-seq gene and exon-level data extracted from white blood cells. We have significant hits on various hemoglobin, haemoglobin metabolic processes and gas exchange related genes from the differential expression analysis. Foldchanges are quite small and vary between 1-2. Is it possible in general that from WBC RNA-seq data we can see expression changes in such genes or this more likely to be due contamination? And if so is RUVseq a valid option for approaching this issue?

ADD REPLYlink written 12 months ago by heikki.sarin0
1

From the count data alone, it isn't possible to determine if your log-fold changes are due to contamination. Without prior biological knowledge, who am I to say that white blood cells do not genuinely express haemoglobin after drug treatment? I guess it sounds silly, but I don't know what your drug does, either.

Similarly, RUVseq will not explicitly determine whether contamination has occurred. It will only identify hidden factors of variation, some of which may (or may not) be driven by variable contamination between samples. For practical purposes, this is enough - you don't care about the contamination itself, you just want to get rid of it for the DE analysis. As long as the contamination is not confounded with your drug treatment, RUVseq should be applicable; have a look at the haemoglobin expression across samples and ensure that the high-expressing samples are not all in one condition.

ADD REPLYlink modified 12 months ago • written 12 months ago by Aaron Lun20k

Yeah, that indeed sounds quite bizarre that WBC would express haemoglobin etc gas exchange related genes. 

1) If after RUVseq we still have the same results and there isn't any single high-expression samples in these genes do you see it as plausible that then the results might be genuine?

2) Is it an valid approach to exclude these genes in question from the analysis and do it without them?

 

ADD REPLYlink written 12 months ago by heikki.sarin0
1

1) Putting aside biological plausibility for the moment, there are still a few reasons why RUVseq may fail to remove contamination-related factors of variation in your data. The first is if the contamination effect doesn't lie in the first k factors, i.e., there are larger effects that take priority. The second is that RUVseq will not remove factors that are completely confounded with your conditions of interest. So it is entirely possible for technical effects to remain after applying RUVseq (albeit reduced in magnitude), thus requiring some care during interpretation.

2) Yes, and this will avoid drawing biologically nonsensical conclusions. However, any contamination effect will still be present in other genes. These may actually be more misleading as they are not obviously wrong.

ADD REPLYlink modified 12 months ago • written 12 months ago by Aaron Lun20k

 

 

Thank you very much for the help. Going to examine both approaches and see if the results make sense.

We have quite complex model in the analysis and was wondering about the RUVseq argument "k". Should it be set to 1 or how many in this case below (the vignette isn't really helpful in modifying the example code) ? And is the below right way to implement the W_1 in the model of differential expression analysis?

#Ruvseq code

design <- model.matrix(~ time + condition + time:subject.nested + condition:time, data=pData(set))
y <- DGEList(counts=counts(set1), group=coldata$CASE_CONTROL)
y <- calcNormFactors(y, method="upperquartile") 
y <- estimateGLMCommonDisp(y, design)
y <- estimateGLMTagwiseDisp(y, design)
fit <- glmFit(y, design) 
lrt <- glmLRT(fit, coef=35:36)
top <- topTags(lrt, n=nrow(set1))$table
empirical <- rownames(set1)[which(!(rownames(set1) %in% rownames(top)[1:5000]))]

set2 <- RUVg(set, empirical, k=1) 

#Code for differential expression

full <- model.matrix(~ W_1+ time + condition + time:subject.nested + condition:time, pData(set2))

reduced <- model.matrix(~ W_1 + time + condition + time:subject.nested, pData(set2))

ddsFullCountTable <- DESeqDataSetFromMatrix(countData = counts(set2),colData = pData(set2), design = ~ 1  )

ddsFullCountTable <- ddsFullCountTable[ rowSums(counts(ddsFullCountTable)) > 1, ]
keep <- rowSums(counts(ddsFullCountTable) >= 5) >= 5
ddsFullCountTable <- ddsFullCountTable[keep,]
dds <- DESeq(ddsFullCountTable, test="LRT", full=full, reduced=reduced)

ADD REPLYlink written 12 months ago by heikki.sarin0
1

This is turning into a question about DESeq and RUVseq. If you want more relevant advice, post a new question with the appropriate tags.

ADD REPLYlink modified 12 months ago • written 12 months ago by Aaron Lun20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 127 users visited in the last hour