Question: Limma- missing values in covariate file
1
2.9 years ago by
mheydarpour10
mheydarpour10 wrote:

I have 1000 candidate genes with their expression values (FPKM). I would like to test for differential expression between two groups of individuals, heart failure and normal. The linear model includes the covariates "age", "sex", "BMI" and "Hypertension"; however, there are some missing values (NA) in the covariate file. I created the design matrix as follows:

design <- model.matrix(~ age + sex + BMI + Hypertension + group)

Note: group (No = normal, Yes = heart_Failure)

When I then fit the linear model with fit <- lmFit(expression, design), I get an error because of the missing values in the model. How can I fix this problem?

All values in "expression" and "group" are complete (no missing values); only the covariates have missing values. Could you please advise, ideally with an example? Thank you.

1
2.9 years ago by
Denali
Steve Lianoglou12k wrote:

I'm trying to read between the lines here, but there might be one of two things happening:

1. The design matrix you have generated is likely not full rank. Do you see warnings about non-estimable coefficients? Is the rank of your matrix equal to the number of columns, e.g. is Matrix::rankMatrix(design) == ncol(design)?
2. Or, perhaps, you have levels in your age/sex/BMI/whatever factors that you need to drop? Are any columns in your design all 0's, e.g. is any(colSums(design) == 0) equal to TRUE?
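
For what it's worth, here is a rough sketch of both checks on a made-up covariate table (the values and the unused "Unknown" level are invented for illustration, not taken from the OP's data). It uses base R's qr() for the rank instead of Matrix::rankMatrix, just to avoid an extra package:

```r
# Toy covariate table (made-up values, no missing data).
pheno <- data.frame(
  age   = c(79, 71, 81, 78, 90, 81, 85, 75),
  sex   = factor(c("Female", "Female", "Male", "Female",
                   "Male", "Female", "Female", "Male"),
                 levels = c("Female", "Male", "Unknown")),  # "Unknown" is never used
  group = factor(c("No", "No", "No", "Yes", "No", "Yes", "No", "Yes"))
)

design <- model.matrix(~ age + sex + group, data = pheno)

# Check 1: is the design of full column rank?
full_rank <- qr(design)$rank == ncol(design)

# Check 2: which columns are entirely zero (e.g. from an unused factor level)?
zero_cols <- colnames(design)[colSums(design != 0) == 0]

# Dropping unused levels removes the empty column and restores full rank.
pheno$sex <- droplevels(pheno$sex)
design_fixed <- model.matrix(~ age + sex + group, data = pheno)
full_rank_fixed <- qr(design_fixed)$rank == ncol(design_fixed)
```

In this toy case the unused level produces an all-zero sexUnknown column, and droplevels() is enough to fix it. Whether that is what is going on in the OP's data is a guess.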

In any case, it would be helpful if you showed us more of your data, eg. your design, the pData of your ExpressionSet, etc.

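Indeed, if you have NA values in the covariates you supply to model.matrix, the function will automatically remove the corresponding samples. This can result in your design matrix not being of full rank. It also leaves the design with fewer rows than your expression matrix has columns, which lmFit will reject.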
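
A small sketch of that row-dropping behaviour, using six rows copied from the table posted in this thread (the column name HyT stands in for "Hy-T", since a hyphen is awkward in an R formula):

```r
# Six samples from the posted covariate table; two have a missing covariate.
pheno <- data.frame(
  sex   = factor(c("Female", "Female", "Male", "Female", "Female", "Female")),
  age   = c(79, NA, 86, 71, 78, 75),
  HyT   = factor(c("Yes", "No", NA, "No", "No", "Yes")),
  BMI   = c(32.84, 21.33, 24.34, 29.35, 32.27, 36.69),
  group = factor(c("No", "No", "No", "No", "Yes", "Yes")),
  row.names = c("AB0116", "AB0032", "AB0099", "AB0016", "AB0151", "AB0050")
)

# model.matrix() builds a model frame first, and the default na.action
# (na.omit) drops every row that has an NA in any term of the formula.
design <- model.matrix(~ age + sex + BMI + HyT + group, data = pheno)

n_samples <- nrow(pheno)    # samples in the covariate table
n_design  <- nrow(design)   # rows that survived in the design

# Samples that were dropped without any warning.
dropped <- setdiff(rownames(pheno), rownames(design))

# TRUE here means the design no longer lines up with the expression matrix.
mismatch <- n_design != n_samples
```

Comparing nrow(design) with ncol(expression) before calling lmFit is a cheap way to catch this.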

0
2.9 years ago by
mheydarpour10
mheydarpour10 wrote:

These are the first 20 rows of my data (missing values are shown as NA); my data has no "0" values.

ID    sex    age    Hy-T    BMI    group
AB0116    Female    79    Yes    32.84    No
AB0032    Female    NA    No    21.33    No
AB0099    Male    86    NA    24.34    No
AB0016    Female    71    No    29.35    No
AB0069    Female    NA    No    34.38    No
AB0081    NA    49    Yes    69.25    No
AB0090    Female    81    No    26.56    No
AB0163    Female    NA    No    21.19    Yes
AB0151    Female    78    No    32.27    Yes
AB0164    Female    65    No    NA    Yes
AB0009    NA    69    No    33.80    No
AB0031    Female    90    No    27.81    No
AB0017    Male    89    NA    34.05    Yes
AB0018    Female    81    No    27.35    No
AB0055    Female    85    Yes    38.63    No
AB0113    Female    85    Yes    38.52    No
AB0008    Female    NA    Yes    NA    No
AB0050    Female    75    Yes    36.69    Yes
AB0089    Female    64    NA    34.95    No
AB0093    Female    83    Yes    28.72    No

Note: my expression data is normalized and does not contain any "0" values.

2

As Steve already noted, you cannot fit a covariate for which you have missing data. In other words, you can't fit sex, age, Hy-T or BMI as independent variables in your model unless you are willing to eliminate any subject with missing data for any of those covariates.
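
If eliminating those subjects is acceptable, a minimal sketch of the complete-case approach could look like the following. Everything here is simulated: the sample names, the covariate values and the expression matrix are placeholders, and the expression values are assumed to already be on a log scale. The limma calls only run if the package is installed:

```r
set.seed(1)
n <- 12

# Simulated covariates with one missing age and one missing sex.
pheno <- data.frame(
  age   = c(79, NA, 86, 71, 65, 49, 81, 70, 78, 65, 69, 90),
  sex   = factor(c("Female", "Female", "Male", "Female", "Male", NA,
                   "Female", "Male", "Female", "Male", "Male", "Female")),
  group = factor(rep(c("No", "Yes"), each = 6)),
  row.names = paste0("S", seq_len(n))
)

# Simulated log-scale expression: 5 genes (rows) by 12 samples (columns).
expression <- matrix(rnorm(5 * n, mean = 5), nrow = 5,
                     dimnames = list(paste0("gene", 1:5), rownames(pheno)))

# Keep only samples with complete data for every covariate in the model.
keep     <- complete.cases(pheno[, c("age", "sex", "group")])
pheno_cc <- droplevels(pheno[keep, ])
expr_cc  <- expression[, rownames(pheno_cc)]

# Design and expression now describe the same samples, in the same order.
design <- model.matrix(~ age + sex + group, data = pheno_cc)
stopifnot(nrow(design) == ncol(expr_cc))

if (requireNamespace("limma", quietly = TRUE)) {
  fit <- limma::eBayes(limma::lmFit(expr_cc, design))
  top <- limma::topTable(fit, coef = "groupYes")  # heart failure vs normal
}
```

The important part is subsetting the expression matrix with the same sample names used for the design, so the two cannot drift out of sync.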

The OP could likely build a pretty reliable classifier of sex based on a ratio of expression between some genes found on the X vs Y chromosomes within each sample, though. At least some samples could be rescued that way ...

Also, Aaron put a finer point on problems w/ NA in covariates from the pData which I didn't even think to point out

0
2.9 years ago by
mheydarpour10
mheydarpour10 wrote:

Hi James,

I found the following explanation from one of the users in the limma archive. Do you think it could be used in my case too?

If so, how?

"Why do you want to include weight in the design matrix? It may be more reasonable to include weight in the linear model, i.e. expression ~ condition + weight and then your design matrix will have two columns (if there are two conditions): the first column will contain 0's and 1's depending on the condition (group) of the patient while the second one will be identically 1. And then the weights of the patients will be in the values (i.e. the second argument to lmFit) which I believe allows missing values. But this implicitly assumes that the weight (or some function of it - you can use any transformation you like) contributes additively (and linearly) to the expression (or its logarithm). Moshe."

1

That quote isn't of use here. The problem is that you have missing data in your covariates, and any NA will automatically remove that subject from your analysis. That's just how it is.

Think about it this way. If I asked you to compute the average weight of 10 people, by sex, and I neglected to tell you the sex of 3 of those people, what would you do with the data for those 3 people? You don't know what sex they are, so you have to simply ignore them and compute the average on the remaining 7 subjects for which you know the sex. This is exactly what R is doing when fitting a linear model - if there are missing covariate data for any subject, it silently removes those subjects from consideration, because that's all it can do.
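
The same thought experiment in a few lines of base R, with ten invented weights and three unknown sexes:

```r
# Ten people; the sex of three of them was never recorded.
weight <- c(70, 82, 65, 90, 58, 77, 85, 62, 73, 95)
sex    <- factor(c("F", "M", "F", "M", "F", NA, "M", NA, NA, "M"))

# Group means: people with an NA sex belong to no group and are ignored.
means <- tapply(weight, sex, mean)

# A linear model does the same thing via the default na.action (na.omit).
fit <- lm(weight ~ sex)

n_used    <- nobs(fit)               # observations actually used in the fit
n_dropped <- length(fit$na.action)   # observations removed because of NA
```

No warning is printed; the only trace is in nobs(fit) and fit$na.action.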

The only thing you can do is to try to get the missing phenotype data. As Steve mentioned, you could easily come up with a classifier based on something random like the sum of all probesets on the Y chromosome - males should be far higher than females -  and infer the sex for those samples that have missing sex designation. As for the age, Hy-T and BMI, your choices are to get the missing data somehow, exclude subjects with missing data, or drop some covariates.
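
A rough sketch of such a sex classifier on simulated data. The two Y-linked genes named here (RPS4Y1, DDX3Y) are just an illustrative choice, the expression values are simulated, and the midpoint cutoff is an assumption that should be checked against the samples with known sex before trusting it:

```r
set.seed(42)
n <- 20
true_sex <- rep(c("Female", "Male"), each = 10)

# Recorded sex with two missing entries, as in the covariate file.
recorded_sex <- true_sex
recorded_sex[c(3, 14)] <- NA

# Simulated log-scale expression of two Y-linked genes:
# high in males, close to zero in females.
is_male  <- true_sex == "Male"
log_expr <- rbind(
  RPS4Y1 = ifelse(is_male, rnorm(n, 6, 0.5), rnorm(n, 0.2, 0.1)),
  DDX3Y  = ifelse(is_male, rnorm(n, 6, 0.5), rnorm(n, 0.2, 0.1))
)

# Per-sample score: summed expression over the Y-linked genes.
y_score <- colSums(log_expr)

# Cutoff halfway between the mean scores of the known males and females.
cutoff <- mean(c(mean(y_score[recorded_sex %in% "Male"]),
                 mean(y_score[recorded_sex %in% "Female"])))

inferred <- ifelse(y_score > cutoff, "Male", "Female")

# Sanity check: how often does the inference agree with the recorded labels?
known     <- !is.na(recorded_sex)
agreement <- mean(inferred[known] == recorded_sex[known])

# Fill in only the missing entries.
filled_sex <- ifelse(known, recorded_sex, inferred)
```

On real data the two groups should separate about as cleanly as this; if they don't, the inferred labels are not worth using.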

0
2.9 years ago by
mheydarpour10
mheydarpour10 wrote:

Thanks James, Steve, and Aaron for your time and responses!