Non-sense results using a continuous variable as predictor in LIMMA
1
0
Entering edit mode
Sindre ▴ 70
@sindre-6693
Last seen 7.9 years ago

I have 41 samples in 3 groups (NGT, IGT and T2D). I want to find genes correlating with VO2max, adjusted for age, sex, BMI and diabetic status. Thus, I have 3 continuous variables and one factor. There are no missing values.

        Age <- as.numeric(pData.cohort2$Age)
        BMI <- as.numeric(pData.cohort2$BMI)
        Status <- factor(pData.cohort2$DiabetesStatus, levels=c("NGT", "IGT", "T2D"))
        VO2max <- as.numeric(pData.cohort2$VO2max)

        design <- model.matrix(~ 0 + log2(VO2max) + log2(Age) + log2(BMI) + Status)

> design
   log2(VO2max) log2(Age) log2(BMI) StatusNGT StatusIGT StatusT2D
1      4.073820  6.042207  4.969933         0         0         1
2      4.576522  6.046578  4.642124         0         0         1
3      4.704872  6.066089  4.644433         0         0         1
4      4.828835  6.089583  4.955592         0         0         1
5      3.893362  6.000000  5.116032         0         0         1
6      4.300856  6.022368  4.286142         0         0         1
7      4.390943  6.022368  4.656496         0         0         1
8      4.807355  6.024586  4.480265         0         0         1
9      4.381975  6.022368  4.811985         0         0         1
10     4.604664  6.000000  4.777157         0         0         1
11     4.891419  6.108524  4.649041         0         0         1
12     4.694880  5.930737  4.635754         0         0         1
13     5.107688  6.022368  4.720826         0         0         1
14     4.282440  6.000000  4.832890         0         0         1
15     4.682011  6.042207  4.688180         0         0         1
16     4.620000  6.046578  5.060047         0         0         1
17     4.230357  6.087463  5.058749         0         0         1
18     5.136684  6.020147  4.663345         0         0         1
19     4.506526  6.066089  4.726831         0         1         0
20     4.383359  6.062712  4.711495         0         1         0
21     4.595742  6.067381  4.965323         0         1         0
22     5.131343  6.022368  4.916954         0         1         0
23     4.961160  6.108524  4.748998         0         1         0
24     4.752749  6.044394  4.221877         0         1         0
25     4.636915  6.066089  4.568032         0         1         0
26     4.700994  6.001073  5.028569         0         1         0
27     5.416840  6.023502  4.536053         1         0         0
28     5.264536  6.046314  4.564988         1         0         0
29     4.897724  6.043493  5.057450         1         0         0
30     4.960234  6.045268  4.331275         1         0         0
31     4.914565  6.023258  4.405312         1         0         0
32     4.537296  6.022807  4.682573         1         0         0
33     4.950935  6.065006  4.598127         1         0         0
34     4.607626  6.045705  4.441616         1         0         0
35     5.142005  6.044394  4.803227         1         0         0
36     4.977738  6.109154  4.534809         1         0         0
37     5.081510  6.043513  4.239551         1         0         0
38     4.646163  6.045244  4.628774         1         0         0
39     5.282810  6.024193  4.278728         1         0         0
40     5.044394  6.065006  4.613532         1         0         0
41     5.101818  6.043137  4.541639         1         0         0

The above is correct. And then ran:

fit<-lmFit(eset,design)
fit<-eBayes(fit)

Now comes the wired part: 

> summary(decideTests(fit))
   log2(VO2max) log2(Age) log2(BMI) StatusNGT StatusIGT StatusT2D
-1            0         0         0         0         0         0
0         12494     12494     12494     12494     12494     12494
1             0         0         0         0         0         0

> topTable(fit, coef="log2(VO2max)", n=20)
             logFC  AveExpr         t      P.Value adj.P.Val         B
MARK4    0.4286648 6.530214  4.150396 0.0001765644 0.9015993 -2.398610
KPNA2    0.4518780 4.681989  3.985930 0.0002890322 0.9015993 -2.540450
CASP7    0.3160133 4.508867  3.877274 0.0003988647 0.9015993 -2.634155
CYP2W1  -0.4906266 4.526780 -3.863370 0.0004155577 0.9015993 -2.646140
HEY2     0.3993083 5.310449  3.850137 0.0004320732 0.9015993 -2.657543
GCLC     0.4470705 5.551234  3.802603 0.0004968104 0.9015993 -2.698486
LIMS2   -0.4112356 6.907945 -3.796930 0.0005051381 0.9015993 -2.703370
SRSF1    0.7310258 6.956680  3.707732 0.0006552418 0.9318207 -2.780077
SUGP1   -0.3208302 7.659869 -3.686711 0.0006964459 0.9318207 -2.798127
PRKD3    0.4431114 4.733332  3.663051 0.0007458146 0.9318207 -2.818429
FAM49A   0.4732923 6.175163  3.606516 0.0008778253 0.9970500 -2.866869

The p-value distribution looks like this, which is very uniform. 

<blockquote class="imgur-embed-pub" lang="en" data-id="aMmeLMz"><a href="//imgur.com/aMmeLMz">View post on imgur.com</a></blockquote><script async src="//s.imgur.com/min/embed.js" charset="utf-8"></script>

I also tried to test non-linear relationships, but no difference in the results.

require(splines)
X <- ns(VO2max, df=5)
design <- model.matrix(~X)

The problem is that this really does not make any sense! Is biological impossible that this is true. Can anyone shed light on whats going on here?

I might add that the same is observed using only VO2max, only Age or only BMI as well.

 

 

limma • 1.1k views
ADD COMMENT
2
Entering edit mode
Aaron Lun ★ 28k
@alun
Last seen 9 hours ago
The city by the bay

One possible reason is that the continuous covariates are highly consistent across the samples. You could almost treat these covariates as intercept columns that are identical across samples, which means that dropping any single coefficient (as done by running decideTests on the fit object) will have no effect - the null and full models would be nearly the same, such that any difference in the fit would not be significant. Simplifying the model (e.g., by using only the diabetic status) should solve this. In general, unless there's a good reason for why a covariate or factor should be in the model, then it's best to leave it out.

The other potential cause is more obvious, and it's that you just don't have any DE in your data set. You can check this out by making some MDS plots and seeing if samples cluster by any of the covariates or factors of interest. If they do, you should be able to detect DE. If they don't... well, that's the nature of your data. If it doesn't make biological sense, then you should go complain to the people who generated it.

ADD COMMENT

Login before adding your answer.

Traffic: 875 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6