Limma repeated-measures model: Partial NA coefficients
1
0
Entering edit mode
Luís C. • 0
@25c2a4bf
Last seen 15 months ago
Finland

Hi! I'm analysing a proteomics dataset consisting on spectral counts obtained using Spectronaut (MS2 quantification, Area under the curve within integration boundaries for each targeted ion). There are 4 measurements per subject and 5 experimental groups.

I would like to find overall differently expressed proteins between every group. Only later I will look at each time point separately. However, I have been unable to estimate the coefficients for the 5th group using lmFit() function.

I have log2-transformed the protein expression values before trying to fit a model.

> head(data[, 1:6], 6)
# A tibble: 6 × 6
  `200939A_gr4_0.min` `200939A_gr4_15.min` `200939A_gr4_30.min` 200939A_…¹ 11006…² 11006…³
                <dbl>                <dbl>                <dbl>      <dbl>   <dbl>   <dbl>
1                18.6                 19.7                 19.3       19.6    19.5    20.0
2                16.9                 18.2                 17.8       18.5    18.7    19.0
3                18.3                 16.8                 17.2       16.9    20.0    20.6
4                12.2                 12.4                 NA         12.7    13.7    13.7
5                20.2                 20.2                 20.3       20.3    19.9    19.8
6                17.1                 15.4                 16.0       16.9    16.4    15.5
# … with abbreviated variable names ¹​`200939A_gr4_60.min`, ²​`110064A_gr4_0.min`,
#   ³​`110064A_gr4_15.min`

> dim(data)
[1] 533 175

And this is subject and model info:

> head(clean.subj, 6)
   subject_id     timepoint   diab_risk  
1:    200939A         0           S2       
2:    200939A        15          S2       
3:    200939A        30          S2       
4:    200939A        60          S2        
5:    110064A         0           S2       
6:    110064A        15          S2    

> dim(clean.subj)
[1] 175   3 

> summary(clean.subj$diab_risk)
Missing   0_IAb   1_IAb      S1      S2    S3_T1D 
      0            24      28         43      48        32

According to the Limma vignette, I tried to fit the model as it follows:

library(limma)
library(splines)

models <- ns(clean.subj$timepoint, df=4)
group <- factor(clean.subj$diab_risk)
design <- model.matrix(~0+group*models)

fit <- lmFit(data, design)
Coefficients not estimable: models4 group1_IAb:models4 groupS1:models4 groupS2:models4 groupS3_T1D:models4 
Warning message:
Partial NA coefficients for 533 probe(s)

There is one subject with just 3 timepoints, but excluding it does not solve the error. I have also tried to replace NAs using Random Forest, but the error persisted. I have also tried to remove useless parameters from the design matrix, but to no avail.

Any ideas of what am I doing wrong?

Thanks for the help!

> sessionInfo( )
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Proteomics limma repeated-measures • 1.2k views
ADD COMMENT
2
Entering edit mode
@gordon-smyth
Last seen 31 minutes ago
WEHI, Melbourne, Australia

The problem is that your design matrix is enormously over-parametrized. You seem to be following Section 9.6.2 of the limma User's Guide, but that section is only intended for experiments with a very large number of time points.

For your experiment you should be following Section 9.6.1 (Replicated time points) of the User's Guide in conjunction with Section 9.7 (Multi-level experiments). You experiment actually seems quite an easy one to analyse. Just make a combined factor from timepoint and diab_risk and estimate a intrablock correlation for the subjects.

ADD COMMENT
0
Entering edit mode

Thank you very much Gordon, it works like a charm now!

I have managed to obtain the comparisons between the groups for every time point.

However, I was wondering now what is the best way to compare the overall effect of diab_risk for each variable. I have created a second model design2 <- model.matrix(~0+group) and I have created contrasts to compare different groups. Is this statistically sound? Or should I compare the area under the curve of protein expression over time?

Thanks

ADD REPLY
1
Entering edit mode

No it is not correct to create a second model (which misrepresents the data by ignoring the time course structure). Just form contrasts. For example, to compare S2 to S1 averaged over all times use the contrast:

(S2.t0+S2.t15+S2.t30+S2.t60-S1.t0-S1.t15-S1.t30-S1.t60)/4
ADD REPLY

Login before adding your answer.

Traffic: 681 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6