Handling of missing values in limma
Arend • 0
@b8a242dd
Last seen 3 months ago
Luxembourg

Hello, I would like a detailed answer on how limma handles missing values.

I am working on proteomics data and have already filtered out proteins with many missing values. However, some missing values remain in the data. I then use limma to fit a linear model and would like to know how limma treats these missing values. I cannot find a clear answer to this question online.

Best regards, Lis

limma missingValues • 417 views
@gordon-smyth
Last seen 1 hour ago
WEHI, Melbourne, Australia

limma treats missing values in the same way as other linear model functions in R, such as lm() and glm(). For each gene or protein, the cases with missing values are removed from both the data and the design matrix. In other words, the linear model is fitted to the non-missing values. If a particular regression coefficient cannot be estimated from the observed data for a particular protein, then an NA value will be returned for that coefficient.

Put another way, limma treats missing values as missing at random. Any protein with at least one non-NA value in at least two groups can receive a non-NA p-value.
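To make the comparison with lm() concrete, here is a minimal base R sketch for a single protein. It uses lm() rather than limma itself, and the toy values and the names is_case, y and y_one_group are made up for illustration.

```r
# Toy data for one protein: six samples, 0 = control, 1 = case.
# (Hypothetical values, chosen only for illustration.)
is_case <- c(0, 0, 0, 1, 1, 1)
y <- c(10.1, NA, 9.8, 11.9, 12.3, NA)

# lm() drops incomplete cases by default (na.action = na.omit),
# so the model is fitted to the four observed values only.
fit <- lm(y ~ is_case)
nobs(fit)   # number of observations actually used in the fit
coef(fit)   # intercept and group difference are both estimable

# If every value in one group is missing, the group difference
# cannot be estimated and lm() returns NA for that coefficient.
y_one_group <- c(10.1, 10.4, 9.8, NA, NA, NA)
fit_one_group <- lm(y_one_group ~ is_case)
coef(fit_one_group)
```

The second fit illustrates the case described above: the observed values all come from one group, so the coefficient comparing the groups comes back as NA.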

Arend • 0

Okay, and how do I get statistics (p-values etc.) for the proteins with missing values?


For proteome experiments missing values are rather common. You should read up on imputation methods that fit your type of experiment.


You should re-read what Gordon said. There is nothing in his response that should lead you to believe that a protein with missing data is completely removed.


limma removes NA values. It does not remove non-NA values. A protein with at least one non-NA value in at least two groups will receive a non-NA p-value.
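One way to see which proteins meet that condition is to count the non-missing values per protein and group before fitting. The following is only a sketch in base R: the matrix y (proteins in rows, samples in columns), the sample names and the factor group are hypothetical stand-ins for your own data.

```r
# Hypothetical toy matrix: proteins in rows, samples in columns.
y <- rbind(
  protA = c(10.1, NA, 9.8, 11.9, 12.3, NA),
  protB = c(NA,   NA, NA,  8.2,  8.5,  8.1),
  protC = c(7.0,  NA, NA,  NA,   7.9,  NA)
)
colnames(y) <- paste0("s", 1:6)
group <- factor(rep(c("control", "case"), each = 3))

# Number of non-missing values per protein (rows) and group (columns).
n_obs_by_group <- t(apply(!is.na(y), 1, function(obs) tapply(obs, group, sum)))
n_obs_by_group

# TRUE for proteins with at least one non-NA value in at least two groups.
observed_in_two_groups <- rowSums(n_obs_by_group > 0) >= 2
observed_in_two_groups
```

In this toy example protB has values in only one group, so it is the one protein that does not meet the condition.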


Thank you for this discussion. I am a Stata user and new to R and limma and struggling to understand, in part because I am adopting syntax that other people developed and made available in a publication, without much experience/understanding of R and limma.

I understand from a post by Gordon more than 10 years ago that lmFit should report how many observations(?) were removed due to missing data, but the syntax I'm using does not seem to show this, and I'm struggling to work out how to alter it to find out.

My data comprise a few hundred unique individuals (rows) and several proteins (columns). In this situation, what is the 'group' when you say "A protein with at least one non-NA value in at least two groups will receive a non-NA p-value"?

I am using the following syntax. Is it possible to alter it somewhere to obtain the number of observations dropped from the analysis due to missingness? The output is shown after the syntax, and it does show the number of observations (underlined, 0=491, 1=44), but that number includes the observations with missing data, so it is not what I would like to see.


Your groups are cases and controls.

In R, to identify NA values in a matrix y, use is.na(y). To count the number of NA observations, use sum(is.na(y)). That is just basic R, not specific to limma.
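As a rough sketch of how those base R calls can be applied to a data matrix, the example below counts missing values overall, per protein and per person. The matrix y, with proteins in rows and persons in columns, and all of its values are hypothetical.

```r
# Hypothetical toy matrix: proteins in rows, persons in columns.
y <- rbind(
  protA = c(NA, NA, 5.1, 4.8, 5.3),
  protB = c(NA, NA, NA,  6.0, 6.2),
  protC = c(NA, NA, 7.4, 7.1, NA)
)
colnames(y) <- paste0("person", 1:5)

# Total number of missing values in the matrix.
sum(is.na(y))

# Per protein: values available to the fit, and values missing.
n_used <- rowSums(!is.na(y))
n_missing <- rowSums(is.na(y))
n_used

# Persons with no observed value for any protein.
all_missing <- colSums(!is.na(y)) == 0
all_missing
```

Because the missing values are dropped separately for each protein, n_used can differ from protein to protein. As far as I know, the per-protein residual degrees of freedom in the df.residual component of an lmFit result reflect the same counts.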


I used sum(is.na(lmfit)) after lmFit, but somehow it indicated there were no NA observations, even when I set all proteins to missing for a number of observations. I tried to use nobs(), but it does not seem to work after lmFit.

As a reminder, what I want to know is how many observations were used in an estimation by lmFit. Either the number of removed observations or the number of observations used in the estimation would be fine. Is there something wrong in what I did?

Also, using the data below with 5 proteins, have I correctly understood that lmFit: 1) removes persons 1 and 2 from the estimation because they had missing data in all five proteins, but 2) uses person 3 in the estimation because s/he had values in some proteins?

Thank you so much for your time for my questions.


limma uses all the observations. There's nothing complicated about it.

limma does not remove any persons from the analysis.

I did not advise you to apply is.na() to a fitted model object. I advised you to apply it to your data matrix.

Your R code is problematic. You don't seem to have created the expression matrix correctly in the first place. The code t(dat[43:ncoldat)] that you have given both here and in your previous question is not syntactically correct and could not possibly run in R. I suggest that you check the expression matrix properly before worrying about what limma is doing.