DESeq2 Normalization
1
0
Entering edit mode
Jamie • 0
@jac-24317
Last seen 5 weeks ago
United Kingdom

Hello,

My recent RNA-Seq experiment had a few samples with high counts of rRNA genes in the third replicate. My pipeline was HISAT2 --> featureCounts --> DESeq2.

When clustering replicates 1 and 2, the replicates of each condition clustered closely together in a PCA. With all three replicates, samples of the third replicate clustered more closely together than to the samples of the other two replicates (which still clustered together by condition). I believe the clustering of the third replicate was driven by the high counts (some in the millions) of rRNA genes (which shouldn't be present as my samples underwent polyA selection). Therefore I tried the following 'solutions':

1) set the counts of rRNA genes to 0 2) remove the rRNA genes from the annotation file completely 3) (do nothing and continue with my analysis)

In the first scenario, all samples of each replicate and condition clustered nicely together.

In the second scenario, replicates of each condition clustered, but one sample was very singled out. I took a look at the normalized counts from DESeq2, and all the counts were dramatically different than the raw counts. For example, in replicates 1-3, the counts for gene X were 2074870, 2352729, 1970923 (all around 2 million) respectively. In the normalized counts file, they were now 2299461, 827, and 2215781. This is probably the most dramatic example, but I didn't observe anything like this when I had two replicates, or when the rRNA genes were simply set to 0 counts.

Which 'solution' should I have accepted? It was previously suggested to me to remove the rRNA genes from the annotation file (i.e. 'solution' 2). Is it the number of genes in the raw counts file that is driving the drastic normalization of counts in the singled out sample?

DESeq2 • 196 views
0
Entering edit mode
@mikelove
Last seen 3 days ago
United States

all the counts were dramatically different than the raw counts

This is just another way of saying it had a size factor far from 1. E.g. here that sizeFactor(dds) for that sample must by very large. The size factor is estimated in such a way to center the log ratios of counts comparing samples on 0.

For this sample, what was the total count for the rRNA and the non-rRNA? Same question for a "good" sample.

0
Entering edit mode

Hi Michael Love, many thanks for your reply. The rRNA counts from DESeq2 in the 'good' replicates were 154,212 and 102,317, total counts were 26,603,787 and 26,656,847. For the 'bad', 3,556,351 rRNA and 31,533,725 total counts. My organism has ~200 rRNA genes so the counts reported here are the sums of all of them.

0
Entering edit mode

I'm surprised solution 1 and 2 would be different. Make sure that you are re-estimating size factors _after_ removing those rRNA features:

dds <- estimateSizeFactors(dds)


The vst() will not re-estimate size factors if they are present.

0
Entering edit mode

I'm performing my analyses on the Galaxy web server so this shouldn't be an issue as I've submitted the jobs separately.

In general, would you say there's any problem with setting rRNA counts to 0? i.e. should I just continue my analysis with solution 1?

1
Entering edit mode

Setting to zero is fine. It really should be the same as removing those features.

0
Entering edit mode

Thanks again not only for your input, but also for your time spent answering questions and for creating DESeq2! Hope you have a great day.