Hi there,
I have 36 samples of which 23 (both cases and controls) of them turned out to have 60-70% rRNA contamination. The sequencing facility re-sequenced these 23 samples on more lanes = more seq depth in the contaminated samples with the thought that the reads would after removal of rRNA sequences correspond to the amount of reads in the 13 non-contaminated samples. I know it isn't ideal to still have a large amount of rRNA sequences in there but now this is where I am at.
The issue I have now is that I see a clear distinction on the PCA plot between the rRNA contaminated samples and the non-contaminated using:
rld <- rlog(dds, blind=FALSE)
plotPCA(rld, intgroup=("condition"))
(pcaData <- plotPCA(rld, intgroup = c("condition"), returnData=TRUE))
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData$samples <- c(rep("control", 8), rep("cases", 8))
pcaData$name <- as.factor(pcaData$name)
The right group of samples are the contaminated lot. I understand that this grouping can be because of either the rRNA contamination or batch effect but I think it is more likely to be because of a batch effect as I don't see exactly the same pattern when I plotted it with the "old" contaminated samples. When I control for batch effect in my design (dds< DESeqDataSetFromMatrix(countData, colData, formula(~ batch + ~ condition)) I do get more DEGs than when not controlling for it (56 DEGs) - however if I run a differential expression analysis on only the non-contaminated samples (4 controls and 9 cases) I get 109 DEGs - so I don't know if I should really be including the second batch.
My question is now if there is anything else I can do in terms of trying to salvage these contaminated samples - can I control for this batch in any other way than including it in the design? I can try another tool I guess but first I would like to know if there anything else I can do in terms of visualising or controlling for this batch effect in DESeq2?
Thanks in advance.
You might consider looking into RUV ('RUVSeq' in bioconductor) with 1 nuisance variable since you are trying to address one batch effect. I have had some success with this, but be careful as the RUV method can sometimes over normalize your data and remove the biological variance you are interested in. You should be able to tell if this is the case by looking at RLE plots before and after applying RUV.