My recent RNA-Seq experiment had a few samples with high counts of rRNA genes in the third replicate. My pipeline was HISAT2 --> featureCounts --> DESeq2.
When clustering replicates 1 and 2, the replicates of each condition clustered closely together in a PCA. With all three replicates, samples of the third replicate clustered more closely together than to the samples of the other two replicates (which still clustered together by condition). I believe the clustering of the third replicate was driven by the high counts (some in the millions) of rRNA genes (which shouldn't be present as my samples underwent polyA selection). Therefore I tried the following 'solutions':
1) set the counts of rRNA genes to 0 2) remove the rRNA genes from the annotation file completely 3) (do nothing and continue with my analysis)
In the first scenario, all samples of each replicate and condition clustered nicely together.
In the second scenario, replicates of each condition clustered, but one sample was very singled out. I took a look at the normalized counts from DESeq2, and all the counts were dramatically different than the raw counts. For example, in replicates 1-3, the counts for gene X were 2074870, 2352729, 1970923 (all around 2 million) respectively. In the normalized counts file, they were now 2299461, 827, and 2215781. This is probably the most dramatic example, but I didn't observe anything like this when I had two replicates, or when the rRNA genes were simply set to 0 counts.
Which 'solution' should I have accepted? It was previously suggested to me to remove the rRNA genes from the annotation file (i.e. 'solution' 2). Is it the number of genes in the raw counts file that is driving the drastic normalization of counts in the singled out sample?