Question

DESeq2 Normalization

0

Entering edit mode

Jamie • 0

@jac-24317

Last seen 2.5 years ago

United Kingdom

Hello,

My recent RNA-Seq experiment had a few samples with high counts of rRNA genes in the third replicate. My pipeline was HISAT2 --> featureCounts --> DESeq2.

When clustering replicates 1 and 2, the replicates of each condition clustered closely together in a PCA. With all three replicates, samples of the third replicate clustered more closely together than to the samples of the other two replicates (which still clustered together by condition). I believe the clustering of the third replicate was driven by the high counts (some in the millions) of rRNA genes (which shouldn't be present as my samples underwent polyA selection). Therefore I tried the following 'solutions':

1) set the counts of rRNA genes to 0 2) remove the rRNA genes from the annotation file completely 3) (do nothing and continue with my analysis)

In the first scenario, all samples of each replicate and condition clustered nicely together.

In the second scenario, replicates of each condition clustered, but one sample was very singled out. I took a look at the normalized counts from DESeq2, and all the counts were dramatically different than the raw counts. For example, in replicates 1-3, the counts for gene X were 2074870, 2352729, 1970923 (all around 2 million) respectively. In the normalized counts file, they were now 2299461, 827, and 2215781. This is probably the most dramatic example, but I didn't observe anything like this when I had two replicates, or when the rRNA genes were simply set to 0 counts.

Which 'solution' should I have accepted? It was previously suggested to me to remove the rRNA genes from the annotation file (i.e. 'solution' 2). Is it the number of genes in the raw counts file that is driving the drastic normalization of counts in the singled out sample?

DESeq2 • 1.3k views

ADD COMMENT • link 2.5 years ago Jamie • 0

score 0 · Answer 1 · 2021-10-18

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 1 hour ago

United States

all the counts were dramatically different than the raw counts

This is just another way of saying it had a size factor far from 1. E.g. here that sizeFactor(dds) for that sample must by very large. The size factor is estimated in such a way to center the log ratios of counts comparing samples on 0.

For this sample, what was the total count for the rRNA and the non-rRNA? Same question for a "good" sample.

ADD COMMENT • link 2.5 years ago Michael Love 41k

0

Entering edit mode

Hi Michael Love, many thanks for your reply. The rRNA counts from DESeq2 in the 'good' replicates were 154,212 and 102,317, total counts were 26,603,787 and 26,656,847. For the 'bad', 3,556,351 rRNA and 31,533,725 total counts. My organism has ~200 rRNA genes so the counts reported here are the sums of all of them.

ADD REPLY • link 2.5 years ago Jamie • 0

0

Entering edit mode

I'm surprised solution 1 and 2 would be different. Make sure that you are re-estimating size factors _after_ removing those rRNA features:

dds <- estimateSizeFactors(dds)

The vst() will not re-estimate size factors if they are present.

ADD REPLY • link 2.5 years ago Michael Love 41k

0

Entering edit mode

I'm performing my analyses on the Galaxy web server so this shouldn't be an issue as I've submitted the jobs separately.

In general, would you say there's any problem with setting rRNA counts to 0? i.e. should I just continue my analysis with solution 1?

ADD REPLY • link 2.5 years ago Jamie • 0

1

Entering edit mode

Setting to zero is fine. It really should be the same as removing those features.

ADD REPLY • link 2.5 years ago Michael Love 41k

0

Entering edit mode

Thanks again not only for your input, but also for your time spent answering questions and for creating DESeq2! Hope you have a great day.

ADD REPLY • link 2.5 years ago Jamie • 0