Question

Non-sense results using a continuous variable as predictor in LIMMA

0

Entering edit mode

Sindre ▴ 70

@sindre-6693

Last seen 7.3 years ago

I have 41 samples in 3 groups (NGT, IGT and T2D). I want to find genes correlating with VO2max, adjusted for age, sex, BMI and diabetic status. Thus, I have 3 continuous variables and one factor. There are no missing values.

Age <- as.numeric(pData.cohort2$Age)
BMI <- as.numeric(pData.cohort2$BMI)
Status <- factor(pData.cohort2$DiabetesStatus, levels=c("NGT", "IGT", "T2D"))
VO2max <- as.numeric(pData.cohort2$VO2max)

design <- model.matrix(~ 0 + log2(VO2max) + log2(Age) + log2(BMI) + Status)

> design
log2(VO2max) log2(Age) log2(BMI) StatusNGT StatusIGT StatusT2D
1 4.073820 6.042207 4.969933 0 0 1
2 4.576522 6.046578 4.642124 0 0 1
3 4.704872 6.066089 4.644433 0 0 1
4 4.828835 6.089583 4.955592 0 0 1
5 3.893362 6.000000 5.116032 0 0 1
6 4.300856 6.022368 4.286142 0 0 1
7 4.390943 6.022368 4.656496 0 0 1
8 4.807355 6.024586 4.480265 0 0 1
9 4.381975 6.022368 4.811985 0 0 1
10 4.604664 6.000000 4.777157 0 0 1
11 4.891419 6.108524 4.649041 0 0 1
12 4.694880 5.930737 4.635754 0 0 1
13 5.107688 6.022368 4.720826 0 0 1
14 4.282440 6.000000 4.832890 0 0 1
15 4.682011 6.042207 4.688180 0 0 1
16 4.620000 6.046578 5.060047 0 0 1
17 4.230357 6.087463 5.058749 0 0 1
18 5.136684 6.020147 4.663345 0 0 1
19 4.506526 6.066089 4.726831 0 1 0
20 4.383359 6.062712 4.711495 0 1 0
21 4.595742 6.067381 4.965323 0 1 0
22 5.131343 6.022368 4.916954 0 1 0
23 4.961160 6.108524 4.748998 0 1 0
24 4.752749 6.044394 4.221877 0 1 0
25 4.636915 6.066089 4.568032 0 1 0
26 4.700994 6.001073 5.028569 0 1 0
27 5.416840 6.023502 4.536053 1 0 0
28 5.264536 6.046314 4.564988 1 0 0
29 4.897724 6.043493 5.057450 1 0 0
30 4.960234 6.045268 4.331275 1 0 0
31 4.914565 6.023258 4.405312 1 0 0
32 4.537296 6.022807 4.682573 1 0 0
33 4.950935 6.065006 4.598127 1 0 0
34 4.607626 6.045705 4.441616 1 0 0
35 5.142005 6.044394 4.803227 1 0 0
36 4.977738 6.109154 4.534809 1 0 0
37 5.081510 6.043513 4.239551 1 0 0
38 4.646163 6.045244 4.628774 1 0 0
39 5.282810 6.024193 4.278728 1 0 0
40 5.044394 6.065006 4.613532 1 0 0
41 5.101818 6.043137 4.541639 1 0 0

The above is correct. And then ran:

fit<-lmFit(eset,design)
fit<-eBayes(fit)

Now comes the wired part:

> summary(decideTests(fit))
log2(VO2max) log2(Age) log2(BMI) StatusNGT StatusIGT StatusT2D
-1 0 0 0 0 0 0
0 12494 12494 12494 12494 12494 12494
1 0 0 0 0 0 0

> topTable(fit, coef="log2(VO2max)", n=20)
logFC AveExpr t P.Value adj.P.Val B
MARK4 0.4286648 6.530214 4.150396 0.0001765644 0.9015993 -2.398610
KPNA2 0.4518780 4.681989 3.985930 0.0002890322 0.9015993 -2.540450
CASP7 0.3160133 4.508867 3.877274 0.0003988647 0.9015993 -2.634155
CYP2W1 -0.4906266 4.526780 -3.863370 0.0004155577 0.9015993 -2.646140
HEY2 0.3993083 5.310449 3.850137 0.0004320732 0.9015993 -2.657543
GCLC 0.4470705 5.551234 3.802603 0.0004968104 0.9015993 -2.698486
LIMS2 -0.4112356 6.907945 -3.796930 0.0005051381 0.9015993 -2.703370
SRSF1 0.7310258 6.956680 3.707732 0.0006552418 0.9318207 -2.780077
SUGP1 -0.3208302 7.659869 -3.686711 0.0006964459 0.9318207 -2.798127
PRKD3 0.4431114 4.733332 3.663051 0.0007458146 0.9318207 -2.818429
FAM49A 0.4732923 6.175163 3.606516 0.0008778253 0.9970500 -2.866869

The p-value distribution looks like this, which is very uniform.

<blockquote class="imgur-embed-pub" lang="en" data-id="aMmeLMz"><a href="//imgur.com/aMmeLMz">View post on imgur.com</a></blockquote><script async src="//s.imgur.com/min/embed.js" charset="utf-8"></script>

I also tried to test non-linear relationships, but no difference in the results.

require(splines)
X <- ns(VO2max, df=5)
design <- model.matrix(~X)

The problem is that this really does not make any sense! Is biological impossible that this is true. Can anyone shed light on whats going on here?

I might add that the same is observed using only VO2max, only Age or only BMI as well.

limma • 984 views

ADD COMMENT • link updated 8.2 years ago by Aaron Lun ★ 28k • written 8.2 years ago by Sindre ▴ 70

score 2 · Answer 1 · 2016-02-27

One possible reason is that the continuous covariates are highly consistent across the samples. You could almost treat these covariates as intercept columns that are identical across samples, which means that dropping any single coefficient (as done by running decideTests on the fit object) will have no effect - the null and full models would be nearly the same, such that any difference in the fit would not be significant. Simplifying the model (e.g., by using only the diabetic status) should solve this. In general, unless there's a good reason for why a covariate or factor should be in the model, then it's best to leave it out.

The other potential cause is more obvious, and it's that you just don't have any DE in your data set. You can check this out by making some MDS plots and seeing if samples cluster by any of the covariates or factors of interest. If they do, you should be able to detect DE. If they don't... well, that's the nature of your data. If it doesn't make biological sense, then you should go complain to the people who generated it.