Normalization in DESeq2
1
1
Entering edit mode
@solgakarbitechnionacil-6453
Last seen 7.1 years ago
European Union

Hello,

I have a few questions regarding the normalization performed by the DESeq2 package:

  • I understand that the normalization is working under the assumption that most of the genes are not DE and only a small subset of genes are. I understand that this assumption is realistic in most experiments, but I would appreciate your input regardless the correctness of this assumption when performing analysis across different human tissues. Meaning, does it make sense to assume that most of the genes will have the same expression levels in different human tissues? or should I perform different normalization when dealing with this sort of data?
  • In some RNA-Seq experiments we encounter situations in which some libraries are sequenced in a greater depth than others, with larger range of library sizes that we usually expect. In those cases, the size factor given to the deeply sequenced samples will be much higher than 1 and the factor assigned to the smaller samples will be much lower than 1. In those cases, is it better to subsample the larger libraries in order to prevent this wide range of size factors? 
    I am addressing this issue, since when a sample receives a very small size factor, it seems to me that we are artificially increasing the counts of all genes in this sample without any biological evidence to back that up. It seems to me less problematic to lower the counts of the bigger libraries, but the artificial addition to the small ones seems more of a problem to me. What do you think?

Thank you very much,

Olga. 

deseq2 deseq • 18k views
ADD COMMENT
0
Entering edit mode

I'm tagging onto this as I have a similar question. Hope this is the right place to ask it. I'd like to follow up with the idea of when a MA plot skew is "too much" for the DESeq2 size factors calculation to handle and what, if anything, can be done about that. I'll preface by saying that we tried to avoid such issues up front by only analyzing organisms whose possible coding sequences were >= 80% covered under both conditions analyzed. Attached are three examples:

Our more typical case, with most ratios along the x axis:

https://htcf.wustl.edu/files/zdw23YMw

A more skewed but acceptable (for us) example:

https://htcf.wustl.edu/files/zdw23YMw

An extreme example:

https://htcf.wustl.edu/files/NeAvbNX2

So, if anyone has any input on 1) if DESeq is appropriate for these various scenarios or 2) if anything can be done up front to make it so, I'd really appreciate the advice!!

Thanks,

Matt

ADD REPLY
0
Entering edit mode

DESeq2 (or any method that attempts to find scaling factors based on the observed data alone) can only work with what you give it. If all genes are up-regulated, no computational method can handle this, and I'd also question why do this experiment without having some kind of spiked-in control, where sometimes I see investigators spike-in RNA from another organism as an improvement over the ERCC spike-ins.

I'm not sure how the last plot is even possible. Did you specify certain control genes? DESeq2 (using median ratio normalization) will tend to center the LFCs on the y=0 line.

ADD REPLY
0
Entering edit mode
Thanks for responding. We were also concerned about the appearance of the "extreme" example plot, and no, we did not specify our own control genes. My suspicion is more that we have a situation where the transcriptome of this organism was not well-represented in both conditions - we'll certainly investigate more and/or exclude this comparison from our analyses. One clarification, though - did the other two plots appear acceptable to you as far as DESeq2 assumptions?
ADD REPLY
0
Entering edit mode

Yes it seems fine.

ADD REPLY
2
Entering edit mode
@mikelove
Last seen 11 hours ago
United States

hi Olga

"normalization is working under the assumption that most of the genes are not DE and only a small subset of genes are."

It's not that the subset is small, but that the median ratio captures the scaling factor from the non-DE genes. You could have many DE genes, and with balance of up- and down-regulation, the median ratio would be fine. The best way to check if the assumption makes sense is an MA plot of the results table. For many tissue comparisons, one will see a bulk of genes on the x-axis, with DE genes appearing above and/or below. For experiments where all genes increase in expression across conditions, the median ratio method will not be able to capture this difference, but this is typically not the case for a tissue comparison, as there are many "housekeeping" genes with relatively similar expression pattern across tissues.

"is it better to subsample the larger libraries in order to prevent this wide range of size factors?"

No, subsampling generally reduces power, as we are throwing away observations. The model including a size factor parameter handles this.

"it seems to me that we are artificially increasing the counts of all genes in this sample without any biological evidence to back that up."

It's not increasing the count, but decreasing the expected count from the generalized linear model to match the counts observed. It's a subtle statistical point about modeling with parameters versus transformation of data. See the formula in the vignette in Section 4.1 The DESeq2 model. We have that E(K_ij) = s_j q_ij. So if we have two samples with the same q_ij, but sample A is given size factor s_A = 0.5 and sample B is given size factor s_B = 2.0, then the expected counts E(K_ij) are lower and higher for A and B respectively to match what we know about the sequencing depth of the experiments.

ADD COMMENT
0
Entering edit mode

Thank you very much for the reply, it is very helpful!

0
Entering edit mode

Thanks for the advise on how to check if the assumption makes sense using MA plot.

I am not sure i there is a single right answer to my next question, but if I will see in the MA plot that most of the genes are either up-regulated or down-regulated, does that mean that I cannot rely on the DE testing results? should I use a different normalization method before performing the DE?

0
Entering edit mode
Most either up or down is fine, if you can convince yourself that the "cloud" of genes centered on the x axis is non-DE. You could look up housekeeping genes for example.
ADD REPLY
0
Entering edit mode

Even though this post is > 2 years old, maybe you can still comment on the following: Is there a metric or any kind of threshold that one can calculate based on the MA-Plot in order to decide if:

1) "enough" data points are shared between conditions and

2) normalization is necessary and beneficial

ADD REPLY
0
Entering edit mode

Normalization is always necessary because the sequencing depth will always be a technical artifact.

I don't understand what you mean by "data points shared"

ADD REPLY
0
Entering edit mode

Thanks for the reply. What I meant is if there is a metric that can be calculated in order to decide if "the median ratio captures the scaling factor from the non-DE genes". The MA-plot is supposed to show the "bulk of genes on the x-axis", but is there any threshold beyond looking at it by eye?

ADD REPLY
0
Entering edit mode

Oh, I see, you mean, enough genes are roughly equally expressed in conditions. I don't have any test or absolute threshold. Maybe you could post the MA plot (zoomed out to -5 to 5, or -10 to 10).

ADD REPLY
0
Entering edit mode

Here is the plot. It is not RNA-seq but a NGS-based reporter assay for certain regulatory elements. The two samples have quiet different readcounts (factor 3). Would you say I can use DESeqs estimated size factors here?

ADD REPLY
0
Entering edit mode

That looks like it wouldn't be a problem for DESeq2, but you would need to look at the MA plot produced by DESeq2 to know if the x-axis goes through the cloud. That looks unnormalized to me.

ADD REPLY

Login before adding your answer.

Traffic: 602 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6