Question: Linear model after imputation (impute.knn function)
3.0 years ago by Biorunner8810 (Spain) wrote:

Hi.

I'm using this function to impute missing values in my arrays. After running it, I fit a linear model for each gene, take the t-values from each model, and plot them in a boxplot. Some of the outliers in the original data (without imputation) are genes with just two data points. I expected those genes to be "removed" as outliers after using impute.knn, but the boxplots look exactly the same with and without imputation. That's not what I think should happen: I thought that, as those genes borrowed information from similar genes, the t-values would shrink.

For some genes with no data points at all I do get values after imputation, but I think that is because it takes the average of the column.

Just in case it helps: I just normalized and then imputed, and I tried both k = 10 and k = 100.

modified 3.0 years ago by Gordon Smyth33k • written 3.0 years ago by Biorunner8810
To my knowledge, impute.knn just replaces NA values with real values that have been calculated using KNN.

Well, I understand that, but if no k neighbours are found (for example, because a gene has no data at all), the average of the column is used to fill in the empty points. What I don't understand is why the average is also used when, out of seven points, a gene has two values and five NAs. Can it not find k neighbours?

Example

Before Knn

R0020C    NA    NA    NA    NA    NA    NA    NA

RPR1    NA    NA    NA    NA    NA    0.0505659923    0.0247487194

snR19    0.1554507910    0.2897188341    2.1134728832    2.568502e+00    NA    NA    NA

After Knn

R0020C    0.0844759030    0.0847822470    0.0848669393    8.463285e-02    8.445348e-02    0.0832525170    0.0849027964

RPR1    0.0844759030    0.0847822470    0.0848669393    8.463285e-02    8.445348e-02    0.0505659923    0.0247487194

snR19    0.1554507910    0.2897188341    2.1134728832    2.568502e+00    8.445348e-02    0.0832525170    0.0849027964

The average for each column is around 0.08574 (as the data are normalized), which is close to what you get for some of the NA values.
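What the example shows is consistent with impute.knn's documented fallback: when a row has more than a rowmax fraction of its values missing (50% by default), that row is imputed with column means rather than by nearest neighbours, which is why RPR1 (five NAs out of seven) gets the same filled values as the all-NA row R0020C. A minimal base-R sketch of that fallback logic (fallback_impute is a toy illustration, not the actual code from the impute package):

```r
# Toy sketch of the column-mean fallback used when a row has too many
# missing values for KNN to be applied (function name is hypothetical).
fallback_impute <- function(x, rowmax = 0.5) {
  col_means <- colMeans(x, na.rm = TRUE)
  for (i in seq_len(nrow(x))) {
    frac_missing <- mean(is.na(x[i, ]))
    if (frac_missing > rowmax) {
      # Too few observed values: fill every NA with its column mean
      na_cols <- which(is.na(x[i, ]))
      x[i, na_cols] <- col_means[na_cols]
    }
    # (Rows at or below the threshold would be handled by KNN instead.)
  }
  x
}

m <- rbind(
  R0020C = c(NA, NA, NA),        # 100% missing -> column means
  RPR1   = c(NA, NA, 0.05),      # 67% missing  -> column means
  snR19  = c(0.16, 0.29, 2.11)   # complete     -> untouched
)
filled <- fallback_impute(m)
```

With these toy numbers, both R0020C and RPR1 end up with identical column-mean fill-ins, mirroring the pattern in the table above.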

3.0 years ago by Gordon Smyth33k (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

Linear models (for example as implemented in the limma package) can account for missing values precisely. Hence imputation isn't generally required or recommended as a preliminary step.

Imputation is traditionally used when you are planning to use statistical methods such as clustering that rely on complete data.
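As a concrete illustration of the first point, here is a base-R sketch with lm(), which drops observations with missing values by default (na.action = na.omit), so each fit uses only the observed points; limma's lmFit fits each gene in a similar observed-data fashion. The numbers and groups are made up:

```r
# One gene's values across 5 samples, two of them missing
y <- c(0.16, 0.29, NA, NA, 2.11)
group <- factor(c("A", "A", "B", "B", "B"))

# lm() silently drops the NA observations and fits on the rest
fit <- lm(y ~ group)
nobs(fit)          # 3 observations entered the fit
fit$df.residual    # 3 observations - 2 coefficients = 1 residual df
```

No imputation step is needed: the missing samples simply contribute nothing to the fit.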

How does it handle missing data when, for example, just two points are available for the lm?

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say, as it does for eBayes, but after checking, both show the same t-values.

thanks

How does it handle missing data when, for example, just two points are available for the lm?

The same way that lm() does. What this means when there are just two observed points depends on how many coefficients there were to be estimated in the linear model.
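To make that concrete, a base-R sketch with lm() on a gene that has only two observed points (the values are invented):

```r
y <- c(0.05, 0.02)          # only two observed values for this gene

# One coefficient (intercept only): 1 residual df remains,
# so a variance estimate and an ordinary t-statistic exist
fit1 <- lm(y ~ 1)
fit1$df.residual            # 1

# Two coefficients (e.g. a two-group comparison): 0 residual df,
# so the residual variance cannot be estimated from this gene alone
g <- factor(c("A", "B"))
fit2 <- lm(y ~ g)
fit2$df.residual            # 0
</imports>
```

So whether two points are "enough" depends entirely on how many coefficients the design asks for.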

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say, as it does for eBayes,

Why not read the help page help("lmFit"), which links to help("MArrayLM-class")? That tells you exactly what lmFit produces.

but after checking, both show the same t-values.

No they don't. But eBayes() is always used as a followup to lmFit(), so the two outputs are cumulative.
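A minimal limma sketch of that distinction, assuming the Bioconductor limma package is installed (the toy expression matrix and design are invented for illustration):

```r
library(limma)

set.seed(1)
y <- matrix(rnorm(40), nrow = 10, ncol = 4)   # 10 genes x 4 samples (toy data)
design <- cbind(Intercept = 1, Group = c(0, 0, 1, 1))

fit <- lmFit(y, design)
is.null(fit$t)     # TRUE: lmFit stores coefficients, sigma and
                   # stdev.unscaled, but no t-statistics at all

fit <- eBayes(fit)
dim(fit$t)         # 10 x 2: moderated t-statistics added by eBayes()
```

So the "same t-values" observation can't come from lmFit alone: the t slot only exists after eBayes() has been run on the fit.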