NA coefficients values for samples with all zeros
1
0
Entering edit mode
@vedranfranke-7218
Last seen 6.0 years ago
Germany

I am trying to run a differential expression analysis for proteomics data using limma. The data has been quantified and normalized using the LFQ method - the sample distributions are log normal, but with a high median 10e7, ranging from 0 to 1e10

The problem arises with genes that are present in one condition and absent another. Before putting the values into limma I need to log them, which converts 0 - Inf, and drops the genes from the analysis. I tried replacing Inf with 0 or adding pseudocounts, but both things break the model. 

Is there any way around this?

Tnx

limmagui proteomics • 1.7k views
ADD COMMENT
2
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

Sure, add a small constant to all the data before taking logs.

ADD COMMENT
0
Entering edit mode

This is how the vulcano plot looks like if I either add pseudocounts or replace Inf with 0

ADD REPLY
0
Entering edit mode

That is what you should expect, no? If you have a protein that is found at high concentration in one sample, and is essentially undetectable in another, shouldn't the fold change be huge, and the p-value really small? I am not sure this constitutes a breakage of the model.

ADD REPLY
0
Entering edit mode

It might not be obvious from this plot, but problem arises in the calculated pvalues - a lot of genes that have a considerable fold change, and have been experimentally validated are not found to be statistically significant, i.e., a gene which has a lfc of 2 gets a p.value of 0.76.

ADD REPLY
0
Entering edit mode

This seems to be a separate problem that's unrelated to the pseudo-counts. If your example with a log-fold change of 2 is being affected by a small pseudo-count, then the raw expression value would be in the single digits, which is negligible compared to the values in the millions/billions you've described in your original post. That would suggest that it's not being expressed much or at all; at the very least, there's not a lot of evidence to detect differences, and the variability of low-abundance genes tends to be higher (at least, for RNA-seq/microarray data), which will reduce power.

So, if your example gene is actually expressed at decent levels, the pseudo-count addition shouldn't affect its inferences; instead, I'd suggest you have a look at its variability across replicates. This may explain why a decent log-fold change has such a large p-value. If the gene's expression turns out to be highly variable across replicates, then it makes sense to not call it a DE gene based on this data. Which is disappointing, but that's life.

ADD REPLY

Login before adding your answer.

Traffic: 1008 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6