Hello,
My recent RNA-Seq experiment had a few samples with high counts of rRNA genes in the third replicate. My pipeline was HISAT2 --> featureCounts --> DESeq2.
When clustering replicates 1 and 2, the replicates of each condition clustered closely together in a PCA. With all three replicates, samples of the third replicate clustered more closely together than to the samples of the other two replicates (which still clustered together by condition). I believe the clustering of the third replicate was driven by the high counts (some in the millions) of rRNA genes (which shouldn't be present as my samples underwent polyA selection). Therefore I tried the following 'solutions':
1) set the counts of rRNA genes to 0 2) remove the rRNA genes from the annotation file completely 3) (do nothing and continue with my analysis)
In the first scenario, all samples of each replicate and condition clustered nicely together.
In the second scenario, replicates of each condition clustered, but one sample was very singled out. I took a look at the normalized counts from DESeq2, and all the counts were dramatically different than the raw counts. For example, in replicates 1-3, the counts for gene X were 2074870, 2352729, 1970923 (all around 2 million) respectively. In the normalized counts file, they were now 2299461, 827, and 2215781. This is probably the most dramatic example, but I didn't observe anything like this when I had two replicates, or when the rRNA genes were simply set to 0 counts.
Which 'solution' should I have accepted? It was previously suggested to me to remove the rRNA genes from the annotation file (i.e. 'solution' 2). Is it the number of genes in the raw counts file that is driving the drastic normalization of counts in the singled out sample?
Hi Michael Love, many thanks for your reply. The rRNA counts from DESeq2 in the 'good' replicates were 154,212 and 102,317, total counts were 26,603,787 and 26,656,847. For the 'bad', 3,556,351 rRNA and 31,533,725 total counts. My organism has ~200 rRNA genes so the counts reported here are the sums of all of them.
I'm surprised solution 1 and 2 would be different. Make sure that you are re-estimating size factors _after_ removing those rRNA features:
The
vst()
will not re-estimate size factors if they are present.I'm performing my analyses on the Galaxy web server so this shouldn't be an issue as I've submitted the jobs separately.
In general, would you say there's any problem with setting rRNA counts to 0? i.e. should I just continue my analysis with solution 1?
Setting to zero is fine. It really should be the same as removing those features.
Thanks again not only for your input, but also for your time spent answering questions and for creating DESeq2! Hope you have a great day.