Question: DESeq normalization seems to reverse the directionality of log2FoldChange - is this normal?
gravatar for stwestreich
3.2 years ago by
stwestreich0 wrote:

Hello all,

I'm a graduate student working on analyzing RNA-Seq data from a microbiome - many different species all found in the same environment.  I used some annotation software and Python to get a list of counts for hits to each organism, within each sample.  Imported these counts into R, created a combined data table, and then used DESeq2 for comparing the 12 wild-type (WT) samples with my 12 samples from diseased individuals.

When comparing between DESeq's normalized counts, AKA counts(dds, normalized=TRUE), versus the raw counts per million transcripts (CPM) for these organisms, I noticed something interesting.  For some of the higher-abundance organisms, the CPM is higher in my WT samples, if only slightly.  However, in DESeq2's normalized data, counts are higher in my diseased samples.

In other words, for some of the organism counts, they are being shifted down in my WT samples and up in my experimental (diseased) samples, to the point where the final calculated log2FoldChange shows these organisms as more active in my experimental samples - even though the CPM is higher in my WT samples.

I know that DESeq fits to a negative binomial distribution, but is that strong enough to lead to a reversed direction of the log2FoldChange for these entries?

Code I'm running:

> completeCondition <- data.frame(condition=factor(c(rep("control", length(control_files)), rep("experimental", length(exp_files)))))
> colnames(complete_table) <- completeCondition
> dds <- DESeqDataSetFromMatrix(complete_table, completeCondition, ~ condition)
> dds <- DESeq(dds)

> normalized_data <- counts(dds, normalized=TRUE)

Any help is greatly appreciated!


ADD COMMENTlink modified 3.2 years ago by Wolfgang Huber13k • written 3.2 years ago by stwestreich0
Answer: DESeq normalization seems to reverse the directionality of log2FoldChange - is t
gravatar for Ryan C. Thompson
3.2 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.3k wrote:

Of course it's normal for the normalization to change the relative levels between samples. This is the purpose of the normalization. In general, the normalization methods used for RNA-seq attempt to normalize such that the log2 fold change of the average or median gene (or in this case, the average organism) is zero, under the assumption that a large fraction of genes (organisms) are not changing. The assumptions of these normalization methods were chosen based on the typical features of RNA-seq gene counts, not metagenomics. If you're not sure whether the assumptions made by the normalization are appropriate, I suggest you consult the relevant paper to understand more about the normalization, so you can make an informed decision about whether it is the right choice. I will say, though, that un-normalized counts or counts per million are almost certainly inadequate.

Lastly, the use of the negative binomial distribution is irrelevant here, since computing the normalized counts does not, as far as I know, involve any kind of model fitting; it's just a direct computation from the counts.

ADD COMMENTlink written 3.2 years ago by Ryan C. Thompson7.3k
Answer: DESeq normalization seems to reverse the directionality of log2FoldChange - is t
gravatar for Wolfgang Huber
3.2 years ago by
EMBL European Molecular Biology Laboratory
Wolfgang Huber13k wrote:

Did you try scatterplots of the counts, for the different conditions, and highlight the species in question.

One more comment and a question:

- DESeq2 works off the actual counts, not CPM. The latter will lead to nonsensical results. Please see the vignette.

- You said these were RNA-seq data (in which the objects being counted are different genes / (m)RNAs), but then refer to the objects being counted as organisms. Which do you mean?



ADD COMMENTlink written 3.2 years ago by Wolfgang Huber13k

Hello Wolfgang,

Thank you for the suggestion!  I haven't yet tried looking at a scatterplot with the species in question selected, but I'll attempt that to see how much it shifts in position.

Yes, I am aware that DESeq2 works off of actual counts.  I am comparing the FPM matrix results with the normalized data (counts(dds)).  I am not using the CPM values as input.

This is RNA-seq data, yes, but I'm pooling all gene hits to the same organism to look at changes in organism abundance.  

ADD REPLYlink written 3.2 years ago by stwestreich0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 375 users visited in the last hour