Question

limma: missing values in covariate file

mheydarpour ▴ 10

@mheydarpour-9430

Last seen 5.3 years ago

I have 1000 candidate genes with their expression values (FPKM). I would like to test for differences in gene expression between two groups of individuals: heart failure and normal. The linear model includes the covariates "age", "sex", "BMI" and "Hypertension", but the covariate file contains some missing values (NA). I created the design matrix as follows:

design <- model.matrix(~ age + sex + BMI + Hypertension + group)

Note: group is coded as No = normal, Yes = heart_Failure.

When I then fit the linear model with fit <- lmFit(expression, design), I get an error because of the missing values in the model. How can I fix this problem?

All values in "expression" and "group" are complete (nothing is missing); only the covariates have missing values. Could you please advise, ideally with an example? Thank you.

limma • 2.8k views

score 1 · Answer 1 · 2016-01-07

Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

I'm trying to read between the lines here, but there might be one of two things happening:

The design matrix you have generated is likely not full rank. Do you see warnings about non-estimable coefficients? Is the rank of your matrix equal to the number of columns, e.g. is Matrix::rankMatrix(design) == ncol(design) TRUE?
Or, perhaps, you have levels in your age/sex/BMI/whatever factors that you need to drop? Are any columns in your design all 0's, e.g. is any(colSums(design) == 0) equal to TRUE?
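The two checks can be sketched in base R as follows. This is only an illustration: the covariate values are made up, the "Unknown" level is a hypothetical unused factor level, and qr() from base R stands in for Matrix::rankMatrix so that no extra package is needed.

```r
# Toy covariates (hypothetical values, not the poster's data)
pheno <- data.frame(
  age   = c(79, 71, 81, 78, 90, 81),
  sex   = factor(c("Female", "Female", "Male", "Female", "Male", "Female"),
                 levels = c("Female", "Male", "Unknown")),  # "Unknown" is never observed
  group = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

design <- model.matrix(~ age + sex + group, data = pheno)

# Check 1: is the design of full column rank?
full_rank <- qr(design)$rank == ncol(design)

# Check 2: which columns are all zero (here caused by the unused level)?
zero_cols <- colnames(design)[colSums(design != 0) == 0]

# Dropping unused factor levels removes the empty column
pheno$sex <- droplevels(pheno$sex)
design2 <- model.matrix(~ age + sex + group, data = pheno)
full_rank2 <- qr(design2)$rank == ncol(design2)

print(full_rank)
print(zero_cols)
print(full_rank2)
```

In this toy case the first design fails the rank check because of the all-zero column for the unused level, and the design rebuilt after droplevels() passes it.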

In any case, it would be helpful if you showed us more of your data, e.g. your design, the pData of your ExpressionSet, etc.

Indeed, if you have NA values in the covariates you supply to model.matrix, the function will automatically remove the corresponding samples. This can result in your design matrix not being of full rank.
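A small sketch of this behaviour, using made-up covariates and a random expression matrix (sample names S1 to S6 and gene names are hypothetical). It assumes R's default na.action setting (na.omit). The resulting mismatch between the rows of the design and the columns of the expression matrix is one plausible reason for lmFit to stop with an error.

```r
# Hypothetical covariates with two incomplete samples (S2 and S6)
pheno <- data.frame(
  age   = c(79, NA, 86, 71, 78, 65),
  BMI   = c(32.8, 21.3, 24.3, 29.4, 32.3, NA),
  group = factor(c("No", "No", "No", "Yes", "Yes", "Yes")),
  row.names = paste0("S", 1:6)
)

# Hypothetical expression matrix: genes in rows, samples in columns
set.seed(1)
expression <- matrix(rnorm(3 * 6), nrow = 3,
                     dimnames = list(paste0("gene", 1:3), rownames(pheno)))

# model.matrix drops every row that has an NA in any covariate
design <- model.matrix(~ age + BMI + group, data = pheno)

# The design now has fewer rows than the expression matrix has columns
n_design  <- nrow(design)
n_samples <- ncol(expression)

# Samples that were dropped from the design
dropped <- setdiff(colnames(expression), rownames(design))

# Subsetting the expression matrix to the retained samples restores the match
expression_matched <- expression[, rownames(design), drop = FALSE]

print(c(design_rows = n_design, samples = n_samples))
print(dropped)
```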

Reply by Aaron Lun ★ 28k, 8.3 years ago

score 0 · Answer 2 · 2016-01-07

These are the first 20 rows of my data (missing values are shown as NA). My data has no "0" values.

ID   sex   age   Hy-T   BMI   group
AB0116   Female   79   Yes   32.84   No
AB0032   Female   NA   No   21.33   No
AB0099   Male   86   NA   24.34   No
AB0016   Female   71   No   29.35   No
AB0069   Female   NA   No   34.38   No
AB0081   NA   49   Yes   69.25   No
AB0090   Female   81   No   26.56   No
AB0163   Female   NA   No   21.19   Yes
AB0151   Female   78   No   32.27   Yes
AB0164   Female   65   No   NA   Yes
AB0009   NA   69   No   33.80   No
AB0031   Female   90   No   27.81   No
AB0017   Male   89   NA   34.05   Yes
AB0018   Female   81   No   27.35   No
AB0055   Female   85   Yes   38.63   No
AB0113   Female   85   Yes   38.52   No
AB0008   Female   NA   Yes   NA   No
AB0050   Female   75   Yes   36.69   Yes
AB0089   Female   64   NA   34.95   No
AB0093   Female   83   Yes   28.72   No

Note: my expression data is normalized and does not contain any "0" values.

score 0 · Answer 3 · 2016-01-07

mheydarpour ▴ 10

@mheydarpour-9430

Last seen 5.3 years ago

Hi James,

I found the following explanation from one of the users in the limma archive. Do you think it could be used in my case too?

If so, how?

"Why do you want to include weight in the design matrix? It may be more reasonable to include weight in the linear model, i.e. expression ~ condition + weight and then your design matrix will have two columns (if there are two conditions): the first column will contain 0's and 1's depending on the condition (group) of the patient while the second one will be identically 1. And then the weights of the patients will be in the values (i.e. the second argument to lmFit) which I believe allows missing values. But this implicitly assumes that the weight (or some function of it - you can use any transformation you like) contributes additively (and linearly) to the expression (or it's logarithm). Moshe."

That quote isn't of use here. The problem is that you have missing data in your covariates, and any NA will automatically remove that subject from your analysis. That's just how it is.

Think about it this way. If I asked you to compute the average weight of 10 people, by sex, and I neglected to tell you the sex of 3 of those people, what would you do with the data for those 3 people? You don't know what sex they are, so you have to simply ignore them and compute the average on the remaining 7 subjects for which you know the sex. This is exactly what R is doing when fitting a linear model - if there are missing covariate data for any subject, it silently removes those subjects from consideration, because that's all it can do.

The only thing you can do is to try to get the missing phenotype data. As Steve mentioned, you could easily come up with a classifier based on something random like the sum of all probesets on the Y chromosome - males should be far higher than females - and infer the sex for those samples that have missing sex designation. As for the age, Hy-T and BMI, your choices are to get the missing data somehow, exclude subjects with missing data, or drop some covariates.
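As a rough sketch of the last two options (excluding subjects or dropping covariates), the 20 rows posted earlier in the thread can be used to see what each choice costs. The column "Hy-T" is renamed to HyT here so that it is a valid R name, and the object names are only illustrative. The lmFit call is left as a comment because the expression matrix was not posted.

```r
# First 20 rows as posted in the thread ("Hy-T" renamed to HyT)
pheno <- read.table(header = TRUE, row.names = 1, stringsAsFactors = TRUE, text = "
ID sex age HyT BMI group
AB0116 Female 79 Yes 32.84 No
AB0032 Female NA No 21.33 No
AB0099 Male 86 NA 24.34 No
AB0016 Female 71 No 29.35 No
AB0069 Female NA No 34.38 No
AB0081 NA 49 Yes 69.25 No
AB0090 Female 81 No 26.56 No
AB0163 Female NA No 21.19 Yes
AB0151 Female 78 No 32.27 Yes
AB0164 Female 65 No NA Yes
AB0009 NA 69 No 33.80 No
AB0031 Female 90 No 27.81 No
AB0017 Male 89 NA 34.05 Yes
AB0018 Female 81 No 27.35 No
AB0055 Female 85 Yes 38.63 No
AB0113 Female 85 Yes 38.52 No
AB0008 Female NA Yes NA No
AB0050 Female 75 Yes 36.69 Yes
AB0089 Female 64 NA 34.95 No
AB0093 Female 83 Yes 28.72 No
")

# Number of missing values per covariate
n_missing <- colSums(is.na(pheno))

# Option 1: exclude every subject with a missing covariate
keep <- complete.cases(pheno)
pheno_cc <- droplevels(pheno[keep, ])

# Option 2: drop a covariate (age, as an example) to retain more subjects
keep_no_age <- complete.cases(pheno[, c("sex", "HyT", "BMI", "group")])

# In these 20 rows every complete case is Female, so sex cannot be
# estimated in that subset and is left out of the formula here
design <- model.matrix(~ age + HyT + BMI + group, data = pheno_cc)

# The expression matrix must be subset to the same samples before fitting:
# fit <- limma::lmFit(expression[, rownames(design)], design)

print(n_missing)
print(c(complete = sum(keep), without_age = sum(keep_no_age)))
```

In these 20 rows, 10 subjects are complete for all covariates, and leaving age out would retain 13. Whether the same pattern holds in the full data set is not something these rows can show.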

Reply by James W. MacDonald 65k, 8.3 years ago

score 0 · Answer 4 · 2016-01-07

mheydarpour ▴ 10

@mheydarpour-9430

Last seen 5.3 years ago

Thanks James, Steve, and Aaron for your time and responses!
