I have 36 samples of which 23 (both cases and controls) of them turned out to have 60-70% rRNA contamination. The sequencing facility re-sequenced these 23 samples on more lanes = more seq depth in the contaminated samples with the thought that the reads would after removal of rRNA sequences correspond to the amount of reads in the 13 non-contaminated samples. I know it isn't ideal to still have a large amount of rRNA sequences in there but now this is where I am at.
The issue I have now is that I see a clear distinction on the PCA plot between the rRNA contaminated samples and the non-contaminated using:
rld <- rlog(dds, blind=FALSE)
(pcaData <- plotPCA(rld, intgroup = c("condition"), returnData=TRUE))
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData$samples <- c(rep("control", 8), rep("cases", 8))
pcaData$name <- as.factor(pcaData$name)
The right group of samples are the contaminated lot. I understand that this grouping can be because of either the rRNA contamination or batch effect but I think it is more likely to be because of a batch effect as I don't see exactly the same pattern when I plotted it with the "old" contaminated samples. When I control for batch effect in my design (dds< DESeqDataSetFromMatrix(countData, colData, formula(~ batch + ~ condition)) I do get more DEGs than when not controlling for it (56 DEGs) - however if I run a differential expression analysis on only the non-contaminated samples (4 controls and 9 cases) I get 109 DEGs - so I don't know if I should really be including the second batch.
My question is now if there is anything else I can do in terms of trying to salvage these contaminated samples - can I control for this batch in any other way than including it in the design? I can try another tool I guess but first I would like to know if there anything else I can do in terms of visualising or controlling for this batch effect in DESeq2?
Thanks in advance.