Question

Imputation of mass spec MAR and MNAR missing values with proDA

1

Entering edit mode

haris.khan ▴ 10

@hariskhan-21374

Last seen 5.5 years ago

Germany/Munich/LMU

I have a 3-condition, 4-timepoint, 3-replicates model and data that is whole proteomics mass spectrometry. My data (normalized LFQ values) contains missing values that are both Missing At Random (MAR) and Missing Not At Random (MNAR). I am sure of this because a lot of proteins have missing values only in one condition and it is of biological relevance that they are not being detected in that particular condition because they are very low abundant.

My problem is I am looking for a way that can impute MAR and MNAR values differently. If I impute all of them based on normal distribution fit, it skews the data too much to low intensity values which affects the later differential expression analysis later.

Here is my design matrix:

    Cond    Time    Batch   Rep
Fed_2_B_2   Fed 2   B   2
Fed_2_D_1   Fed 2   D   1
Fed_2_D_2   Fed 2   D   2
Fed_4_A_1   Fed 4   A   1
Fed_4_B_2   Fed 4   B   2
Fed_4_C_3   Fed 4   C   3
Fed_6_A_1   Fed 6   A   1
Fed_6_D_1   Fed 6   D   1
Fed_6_D_2   Fed 6   D   2
Fed_24_A_1  Fed 24  A   1
Fed_24_C_3  Fed 24  C   3
Fed_24_D_1  Fed 24  D   1
Fasted_2_A_1    Fasted  2   A   1
Fasted_2_C_3    Fasted  2   C   3
Fasted_2_D_1    Fasted  2   D   1
Fasted_4_B_2    Fasted  4   B   2
Fasted_4_C_3    Fasted  4   C   3
Fasted_4_D_1    Fasted  4   D   1
Fasted_6_A_1    Fasted  6   A   1
Fasted_6_B_2    Fasted  6   B   2
Fasted_6_C_3    Fasted  6   C   3
Fasted_24_B_2   Fasted  24  B   2
Fasted_24_C_3   Fasted  24  C   3
Fasted_24_D_1   Fasted  24  D   1
Refed_2_A_1 Refed   2   A   1
Refed_2_C_3 Refed   2   C   3
Refed_2_D_1 Refed   2   D   1
Refed_4_B_2 Refed   4   B   2
Refed_4_D_1 Refed   4   D   1
Refed_4_D_2 Refed   4   D   2
Refed_6_B_2 Refed   6   B   2
Refed_6_C_3 Refed   6   C   3
Refed_6_D_1 Refed   6   D   1
Refed_24_A_1    Refed   24  A   1
Refed_24_C_3    Refed   24  C   3
Refed_24_D_1    Refed   24  D   1

I was trying to use proDA (https://github.com/const-ae/proDA) to see if it can do a better job to fill out these missing values. However, I am not sure if it can work with more than 2 conditions and if I do any filtering of the data before putting it in to proDA. Before I have been filtering the data to keep only those proteins which have atleast 8 out of 12 non-zero values in atleast 2 out of the 3 conditions. This filtering reduces nrow (proteins) from 4128 initially to 868 proteins.

Any help would be appreciated.

proDA mass spec mnar mar imputation • 3.4k views

ADD COMMENT • link updated 6.6 years ago by James W. MacDonald 68k • written 6.6 years ago by haris.khan ▴ 10

score 1 · Answer 1 · 2019-07-17

1

Entering edit mode

Constantin Ahlmann-Eltze ▴ 40

@constantin-ahlmann-eltze-20493

Last seen 9 months ago

Germany/Heidelberg/EMBL

Hi Haris,

the problem that you are describing is exactly for what I developed proDA.

To show how you can use the package, I will generate some synthetic data to show how to call proDA with the appropriate linear model. I assume that your design matrix dataframe is called design_matrix.

# Y is an intensity matrix with 100 proteins
Y <- proDA::generate_synthetic_data(n_proteins = 100, n_conditions = 12, 
                                n_replicates = 3,
                                return_summarized_experiment = TRUE)

# To fit the model call the proDA() function and 
# give it your design_matrix
fit <- proDA::proDA(Y, design = ~ Cond + Time + Batch, col_data = design_matrix)
fit

# To identify proteins that change with time
results <- proDA::test_diff(fit, "Time")
head(results)

# To see other available coefficients
proDA::result_names(fit)

# Or you can also do a LRT by calling test_diff() with 
# a reduced_model
results <- proDA::test_diff(fit, reduced_model = ~ Time + Batch)

In general, it should not be necessary to filter the data, only if you want to speed up the inference you can exclude proteins for which you are sure that they will not be significant anyway.

If you have any further questions, feel free to comment below.

Best Regards, Constantin

ADD COMMENT • link 6.6 years ago Constantin Ahlmann-Eltze ▴ 40

0

Entering edit mode

I think I was not able to fully comprehend the paper on proDA so the question: Is it possible to just get a 'imputed' matrix out of the proDA function object? For eg. if I want to test pairwise comparisons using other approaches?

ADD REPLY • link 6.6 years ago haris.khan ▴ 10

0

Entering edit mode

Well, yes you can get the completed matrix using

predict(fit, type = "response")

The documenation for the predict function is here https://const-ae.github.io/proDA/reference/predict-proDAFit-method.html

But, I think you should not do that and then run another statistical test on the completed data. The problem is that you get unrealistically small variance estimates, so the method might consider too many proteins significant.

I would recommend to use the build in function test_diff() to identify significantly changed proteins.

ADD REPLY • link 6.6 years ago Constantin Ahlmann-Eltze ▴ 40

0

Entering edit mode

My only concern was that I know that there is a batch effect in the dataset. That's why I wanted to take the completed matrix, correct for the batch effect and then run a statistical test. Or would you suggest another method of dealing with this that is more compatible with proDA?

Also, I just saw that based on the imputation proDA does, the data is no longer following a normal distribution (I am guessing because of a lot of imputation of "constant" value for proteins which have a lot of missing values). Does Wald test work with non-normally distributed data?

ADD REPLY • link 6.6 years ago haris.khan ▴ 10

0

Entering edit mode

Ah, okay. Thanks for the clarification :)

I wanted to take the completed matrix, correct for the batch effect and then run a statistical test

If the batch effect that you mentioned, is the one in the column of the design matrix, the best way to handle it is to include it in the linear model. That way it is automatically regressed out and the statistical test is independent of the batch effect.

If on the other hand, you have strong reason to believe that there are additional batch effects that were not recorded, the problem is much more difficult. proDA can handle some ways to correct for global intensity differences using the median_normalization() method and the sample specific dropout curve inference. To address any more complicated batch effects, I think, is still an open research topic, but usually the first two options should already get you quite far.

based on the imputation proDA does, the data is no longer following a normal distribution [...] Does Wald test work with non-normally distributed data?

After you call the predict function, you will often observe that many of the completed values are identical to each other and the data does not look normal. But, proDA actually does not work on the completed dataset internally, but directly combines the information available from the observed and the missing values. In that sense, yes the Wald test does work for that kind of data, as long as you use the one based on the probabilistic dropout model implemented in my package :)

ADD REPLY • link 6.6 years ago Constantin Ahlmann-Eltze ▴ 40

0

Entering edit mode

I would like to extend the question. Is it possible to use the imputed data for visualizations such as PCA or heatmaps with both rows and columns clustering?

ADD REPLY • link 5.6 years ago guyho ▴ 30

0

Entering edit mode

Yes I think it can be okay to use the completed dataset for some visualizations. However, for PCA and clustering in a heatmap, I think you should use the dist_approx() function.

It is specifically designed to calculate distances between samples or proteins taking into account the uncertainty from the missing values. That function will give you a more accurate estimate of the distance than working on the completed data. For an example see the Figure 5 of the proDA paper. The code to generate the heatmap is on GitHub.

To run the equivalent of a PCA, check out multidimensional-scaling (cmdscale) which does basically the same as PCA, except that it takes the pairwise distances as input (e.g. from dist_approx()). To cluster a heatmap, call hclust(dist_approx(fit)$mean).