Question

Limma repeated-measures model: Partial NA coefficients

0

Entering edit mode

Luís C. • 0

@25c2a4bf

Last seen 2.2 years ago

Finland

Hi! I'm analysing a proteomics dataset consisting on spectral counts obtained using Spectronaut (MS2 quantification, Area under the curve within integration boundaries for each targeted ion). There are 4 measurements per subject and 5 experimental groups.

I would like to find overall differently expressed proteins between every group. Only later I will look at each time point separately. However, I have been unable to estimate the coefficients for the 5th group using lmFit() function.

I have log2-transformed the protein expression values before trying to fit a model.

> head(data[, 1:6], 6)
# A tibble: 6 × 6
  `200939A_gr4_0.min` `200939A_gr4_15.min` `200939A_gr4_30.min` 200939A_…¹ 11006…² 11006…³
                <dbl>                <dbl>                <dbl>      <dbl>   <dbl>   <dbl>
1                18.6                 19.7                 19.3       19.6    19.5    20.0
2                16.9                 18.2                 17.8       18.5    18.7    19.0
3                18.3                 16.8                 17.2       16.9    20.0    20.6
4                12.2                 12.4                 NA         12.7    13.7    13.7
5                20.2                 20.2                 20.3       20.3    19.9    19.8
6                17.1                 15.4                 16.0       16.9    16.4    15.5
# … with abbreviated variable names ¹`200939A_gr4_60.min`, ²`110064A_gr4_0.min`,
#   ³`110064A_gr4_15.min`

> dim(data)
[1] 533 175

And this is subject and model info:

> head(clean.subj, 6)
   subject_id     timepoint   diab_risk  
1:    200939A         0           S2       
2:    200939A        15          S2       
3:    200939A        30          S2       
4:    200939A        60          S2        
5:    110064A         0           S2       
6:    110064A        15          S2    

> dim(clean.subj)
[1] 175   3 

> summary(clean.subj$diab_risk)
Missing   0_IAb   1_IAb      S1      S2    S3_T1D 
      0            24      28         43      48        32

According to the Limma vignette, I tried to fit the model as it follows:

library(limma)
library(splines)

models <- ns(clean.subj$timepoint, df=4)
group <- factor(clean.subj$diab_risk)
design <- model.matrix(~0+group*models)

fit <- lmFit(data, design)
Coefficients not estimable: models4 group1_IAb:models4 groupS1:models4 groupS2:models4 groupS3_T1D:models4 
Warning message:
Partial NA coefficients for 533 probe(s)

There is one subject with just 3 timepoints, but excluding it does not solve the error. I have also tried to replace NAs using Random Forest, but the error persisted. I have also tried to remove useless parameters from the design matrix, but to no avail.

Any ideas of what am I doing wrong?

Thanks for the help!

> sessionInfo( )
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Proteomics limma repeated-measures • 2.2k views

ADD COMMENT • link updated 2.2 years ago by Gordon Smyth 52k • written 2.2 years ago by Luís C. • 0

score 2 · Accepted Answer · 2023-01-02

2

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 3 hours ago

WEHI, Melbourne, Australia

The problem is that your design matrix is enormously over-parametrized. You seem to be following Section 9.6.2 of the limma User's Guide, but that section is only intended for experiments with a very large number of time points.

For your experiment you should be following Section 9.6.1 (Replicated time points) of the User's Guide in conjunction with Section 9.7 (Multi-level experiments). You experiment actually seems quite an easy one to analyse. Just make a combined factor from timepoint and diab_risk and estimate a intrablock correlation for the subjects.

ADD COMMENT • link 2.2 years ago Gordon Smyth 52k

0

Entering edit mode

Thank you very much Gordon, it works like a charm now!

I have managed to obtain the comparisons between the groups for every time point.

However, I was wondering now what is the best way to compare the overall effect of diab_risk for each variable. I have created a second model design2 <- model.matrix(~0+group) and I have created contrasts to compare different groups. Is this statistically sound? Or should I compare the area under the curve of protein expression over time?

Thanks

ADD REPLY • link 2.2 years ago Luís C. • 0

1

Entering edit mode

No it is not correct to create a second model (which misrepresents the data by ignoring the time course structure). Just form contrasts. For example, to compare S2 to S1 averaged over all times use the contrast:

(S2.t0+S2.t15+S2.t30+S2.t60-S1.t0-S1.t15-S1.t30-S1.t60)/4

ADD REPLY • link 2.2 years ago Gordon Smyth 52k