Question: Linear model after imputation (impute.knn function)
3.0 years ago by Biorunner8810 (Spain) wrote:

Hi.

I'm using this function to impute missing values in my arrays. After running it, I fit a linear model for each gene, take the t-values from each model, and plot them in a boxplot. Some of the outliers in the original data (without imputation) are genes with just two data points. I expected those genes to be "removed" as outliers after using impute.knn, but the boxplots look exactly the same with and without imputation. That's not what I think should happen: I thought that, as those genes borrowed information from similar genes, the t-values would shrink.

For some genes with no data points at all I do get values after imputation, but I think that is because it takes the average of the column.

Just in case it helps: I just normalized and then imputed, and I tried both k = 10 and k = 100.

modified 3.0 years ago by Gordon Smyth33k • written 3.0 years ago by Biorunner8810
To my knowledge, impute.knn just replaces NA values with real values that have been calculated using KNN.

Well, I understand that, but if no k neighbours are found (for example, because a gene has no data at all), the average of the column is used to fill in the empty points. What I don't understand is why the average is also used when, out of seven points, a gene has two values and five NAs. Can it not find k neighbours?

Example

Before Knn

R0020C    NA    NA    NA    NA    NA    NA    NA

RPR1    NA    NA    NA    NA    NA    0.0505659923    0.0247487194

snR19    0.1554507910    0.2897188341    2.1134728832    2.568502e+00    NA    NA    NA

After Knn

R0020C    0.0844759030    0.0847822470    0.0848669393    8.463285e-02    8.445348e-02    0.0832525170    0.0849027964

RPR1    0.0844759030    0.0847822470    0.0848669393    8.463285e-02    8.445348e-02    0.0505659923    0.0247487194

snR19    0.1554507910    0.2897188341    2.1134728832    2.568502e+00    8.445348e-02    0.0832525170    0.0849027964

The average for each column is around 0.08574 (as the data are normalized), which is close to what you get for some of the NA values.
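What the example shows is consistent with impute.knn's documented fallback: when a row has more than a rowmax fraction of its values missing (50% by default), that row is imputed with column means rather than by nearest neighbours, which is why RPR1 (five NAs out of seven) gets the same filled values as the all-NA row R0020C. A minimal base-R sketch of that fallback logic (fallback_impute is a toy illustration, not the actual code from the impute package):

```r
# Toy sketch of the column-mean fallback used when a row has too many
# missing values for KNN to be applied (function name is hypothetical).
fallback_impute <- function(x, rowmax = 0.5) {
  col_means <- colMeans(x, na.rm = TRUE)
  for (i in seq_len(nrow(x))) {
    frac_missing <- mean(is.na(x[i, ]))
    if (frac_missing > rowmax) {
      # Too few observed values: fill every NA with its column mean
      na_cols <- which(is.na(x[i, ]))
      x[i, na_cols] <- col_means[na_cols]
    }
    # (Rows at or below the threshold would be handled by KNN instead.)
  }
  x
}

m <- rbind(
  R0020C = c(NA, NA, NA),        # 100% missing -> column means
  RPR1   = c(NA, NA, 0.05),      # 67% missing  -> column means
  snR19  = c(0.16, 0.29, 2.11)   # complete     -> untouched
)
filled <- fallback_impute(m)
```

With these toy numbers, both R0020C and RPR1 end up with identical column-mean fill-ins, mirroring the pattern in the table above.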

3.0 years ago by Gordon Smyth33k (Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia) wrote:

Linear models (for example as implemented in the limma package) can account for missing values precisely. Hence imputation isn't generally required or recommended as a preliminary step.

Imputation is traditionally used when you are planning to use statistical methods such as clustering that rely on complete data.
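As a concrete illustration of the first point, here is a base-R sketch with lm(), which drops observations with missing values by default (na.action = na.omit), so each fit uses only the observed points; limma's lmFit fits each gene in a similar observed-data fashion. The numbers and groups are made up:

```r
# One gene's values across 5 samples, two of them missing
y <- c(0.16, 0.29, NA, NA, 2.11)
group <- factor(c("A", "A", "B", "B", "B"))

# lm() silently drops the NA observations and fits on the rest
fit <- lm(y ~ group)
nobs(fit)          # 3 observations entered the fit
fit$df.residual    # 3 observations - 2 coefficients = 1 residual df
```

No imputation step is needed: the missing samples simply contribute nothing to the fit.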

How does it handle missing data when, for example, just two points are available for the lm?

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say, as it does for eBayes, but after checking, both show the same t-values.

thanks

How does it handle missing data when, for example, just two points are available for the lm?

The same way that lm() does. What this means when there are just two observed points depends on how many coefficients there were to be estimated in the linear model.
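To make that concrete, a base-R sketch with lm() on a gene that has only two observed points (the values are invented):

```r
y <- c(0.05, 0.02)          # only two observed values for this gene

# One coefficient (intercept only): 1 residual df remains,
# so a variance estimate and an ordinary t-statistic exist
fit1 <- lm(y ~ 1)
fit1$df.residual            # 1

# Two coefficients (e.g. a two-group comparison): 0 residual df,
# so the residual variance cannot be estimated from this gene alone
g <- factor(c("A", "B"))
fit2 <- lm(y ~ g)
fit2$df.residual            # 0
</imports>
```

So whether two points are "enough" depends entirely on how many coefficients the design asks for.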

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say, as it does for eBayes,

Why not read the help page help("lmFit"), which links to help("MArrayLM-class")? That tells you exactly what lmFit produces.

but after checking, both show the same t-values.

No they don't. But eBayes() is always used as a followup to lmFit(), so the two outputs are cumulative.
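A minimal limma sketch of that distinction, assuming the Bioconductor limma package is installed (the toy expression matrix and design are invented for illustration):

```r
library(limma)

set.seed(1)
y <- matrix(rnorm(40), nrow = 10, ncol = 4)   # 10 genes x 4 samples (toy data)
design <- cbind(Intercept = 1, Group = c(0, 0, 1, 1))

fit <- lmFit(y, design)
is.null(fit$t)     # TRUE: lmFit stores coefficients, sigma and
                   # stdev.unscaled, but no t-statistics at all

fit <- eBayes(fit)
dim(fit$t)         # 10 x 2: moderated t-statistics added by eBayes()
```

So the "same t-values" observation can't come from lmFit alone: the t slot only exists after eBayes() has been run on the fit.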