Question: NA coefficients values for samples with all zeros
gravatar for vedran.franke
2.8 years ago by
vedran.franke0 wrote:

I am trying to run a differential expression analysis for proteomics data using limma. The data has been quantified and normalized using the LFQ method - the sample distributions are log normal, but with a high median 10e7, ranging from 0 to 1e10

The problem arises with genes that are present in one condition and absent another. Before putting the values into limma I need to log them, which converts 0 - Inf, and drops the genes from the analysis. I tried replacing Inf with 0 or adding pseudocounts, but both things break the model. 

Is there any way around this?


ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by vedran.franke0
gravatar for James W. MacDonald
2.8 years ago by
United States
James W. MacDonald47k wrote:

Sure, add a small constant to all the data before taking logs.

ADD COMMENTlink written 2.8 years ago by James W. MacDonald47k

This is how the vulcano plot looks like if I either add pseudocounts or replace Inf with 0

ADD REPLYlink written 2.8 years ago by vedran.franke0

That is what you should expect, no? If you have a protein that is found at high concentration in one sample, and is essentially undetectable in another, shouldn't the fold change be huge, and the p-value really small? I am not sure this constitutes a breakage of the model.

ADD REPLYlink written 2.8 years ago by James W. MacDonald47k

It might not be obvious from this plot, but problem arises in the calculated pvalues - a lot of genes that have a considerable fold change, and have been experimentally validated are not found to be statistically significant, i.e., a gene which has a lfc of 2 gets a p.value of 0.76.

ADD REPLYlink written 2.8 years ago by vedran.franke0

This seems to be a separate problem that's unrelated to the pseudo-counts. If your example with a log-fold change of 2 is being affected by a small pseudo-count, then the raw expression value would be in the single digits, which is negligible compared to the values in the millions/billions you've described in your original post. That would suggest that it's not being expressed much or at all; at the very least, there's not a lot of evidence to detect differences, and the variability of low-abundance genes tends to be higher (at least, for RNA-seq/microarray data), which will reduce power.

So, if your example gene is actually expressed at decent levels, the pseudo-count addition shouldn't affect its inferences; instead, I'd suggest you have a look at its variability across replicates. This may explain why a decent log-fold change has such a large p-value. If the gene's expression turns out to be highly variable across replicates, then it makes sense to not call it a DE gene based on this data. Which is disappointing, but that's life.

ADD REPLYlink modified 2.8 years ago • written 2.8 years ago by Aaron Lun21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 177 users visited in the last hour