Question

NA coefficients values for samples with all zeros

0

Entering edit mode

vedran.franke • 0

@vedranfranke-7218

Last seen 6.0 years ago

Germany

I am trying to run a differential expression analysis for proteomics data using limma. The data has been quantified and normalized using the LFQ method - the sample distributions are log normal, but with a high median 10e7, ranging from 0 to 1e10

The problem arises with genes that are present in one condition and absent another. Before putting the values into limma I need to log them, which converts 0 - Inf, and drops the genes from the analysis. I tried replacing Inf with 0 or adding pseudocounts, but both things break the model.

Is there any way around this?

Tnx

limmagui proteomics • 1.7k views

ADD COMMENT • link 8.4 years ago vedran.franke • 0

score 2 · Accepted Answer · 2015-11-25

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 1 hour ago

United States

Sure, add a small constant to all the data before taking logs.

ADD COMMENT • link 8.4 years ago James W. MacDonald 65k

0

Entering edit mode

This is how the vulcano plot looks like if I either add pseudocounts or replace Inf with 0

ADD REPLY • link 8.4 years ago vedran.franke • 0

0

Entering edit mode

That is what you should expect, no? If you have a protein that is found at high concentration in one sample, and is essentially undetectable in another, shouldn't the fold change be huge, and the p-value really small? I am not sure this constitutes a breakage of the model.

ADD REPLY • link 8.4 years ago James W. MacDonald 65k

0

Entering edit mode

It might not be obvious from this plot, but problem arises in the calculated pvalues - a lot of genes that have a considerable fold change, and have been experimentally validated are not found to be statistically significant, i.e., a gene which has a lfc of 2 gets a p.value of 0.76.

ADD REPLY • link 8.4 years ago vedran.franke • 0

0

Entering edit mode

This seems to be a separate problem that's unrelated to the pseudo-counts. If your example with a log-fold change of 2 is being affected by a small pseudo-count, then the raw expression value would be in the single digits, which is negligible compared to the values in the millions/billions you've described in your original post. That would suggest that it's not being expressed much or at all; at the very least, there's not a lot of evidence to detect differences, and the variability of low-abundance genes tends to be higher (at least, for RNA-seq/microarray data), which will reduce power.

So, if your example gene is actually expressed at decent levels, the pseudo-count addition shouldn't affect its inferences; instead, I'd suggest you have a look at its variability across replicates. This may explain why a decent log-fold change has such a large p-value. If the gene's expression turns out to be highly variable across replicates, then it makes sense to not call it a DE gene based on this data. Which is disappointing, but that's life.

ADD REPLY • link 8.4 years ago Aaron Lun ★ 28k