Linear model after imputation (impute.knn function)
@noncodinggene-7018

Hi.

I'm using this function to impute missing values in my arrays. After running it, I fit a linear model for each gene, take the t-values from each model, and plot them in a boxplot. Some of the outliers in the original data (without imputation) are genes with just two data points. I expected those genes to be "removed" as outliers after using impute.knn, but the boxplots look exactly the same with and without imputation. That's not what I think should happen; I thought that, as those genes borrowed information from similar genes, their t-values would shrink.

For some genes with no data points at all, I do get values after imputation, but I think that is because the average of the column is used.

Just in case it helps: I normalized first and then imputed, and I tried both k = 10 and k = 100.
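For reference, the workflow described above might look like the following sketch (the names `expr` and `design` are my assumptions, not from the post: a genes-by-arrays numeric matrix and a model matrix):

```r
library(impute)   # Bioconductor package providing impute.knn()
library(limma)    # lmFit() / eBayes() for the gene-wise linear models

# expr: numeric matrix, genes in rows, arrays in columns, NAs allowed
imputed <- impute.knn(expr, k = 10)$data   # k = 100 was also tried

# Gene-wise linear models on the imputed data
fit <- eBayes(lmFit(imputed, design))

# Compare t-statistic distributions with and without imputation
fit0 <- eBayes(lmFit(expr, design))        # lmFit handles NAs itself
boxplot(list(original = as.vector(fit0$t),
             imputed  = as.vector(fit$t)))
```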

Thanks in advance.

To my knowledge, impute.knn just replaces NA values with real values calculated using KNN.

Well, I understand that, but if no k neighbors are found (because, for example, a gene has no data at all), the column average is used to fill the empty points. What I don't understand is why the average is also used when, out of seven points, a gene has two values and five NAs. Can't it find k neighbors?

Example

Before impute.knn:

R0020C    NA              NA              NA              NA              NA              NA              NA
RPR1      NA              NA              NA              NA              NA              0.0505659923    0.0247487194
snR19     0.1554507910    0.2897188341    2.1134728832    2.5685020000    NA              NA              NA

After impute.knn:

R0020C    0.0844759030    0.0847822470    0.0848669393    0.0846328500    0.0844534800    0.0832525170    0.0849027964
RPR1      0.0844759030    0.0847822470    0.0848669393    0.0846328500    0.0844534800    0.0505659923    0.0247487194
snR19     0.1554507910    0.2897188341    2.1134728832    2.5685020000    0.0844534800    0.0832525170    0.0849027964

The average of each column is around 0.08574 (the data are normalized), which matches the values filled in for the NAs.
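One thing worth checking (an assumption on my part; see `?impute.knn`): the function has a `rowmax` argument with a default of 0.5, and rows with more than 50% missing values are not imputed by KNN at all but filled with the overall mean per sample. That would explain why a gene with 5 NAs out of 7 columns (about 71% missing) gets column-mean values. A small sketch with toy data:

```r
library(impute)

set.seed(1)
x <- matrix(rnorm(7 * 100), nrow = 100)   # toy data: 100 genes, 7 arrays
x[1, ]    <- NA                            # gene with no data at all
x[2, 1:5] <- NA                            # 5/7 missing: above rowmax = 0.5

# Default rowmax = 0.5: both rows fall back to per-column means
out1 <- impute.knn(x, k = 10)$data

# Raising rowmax lets row 2 be imputed by KNN from its 2 observed values
out2 <- impute.knn(x, k = 10, rowmax = 0.9)$data
```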

@gordon-smyth
WEHI, Melbourne, Australia

Linear models (for example as implemented in the limma package) can account for missing values precisely. Hence imputation isn't generally required or recommended as a preliminary step.

Imputation is traditionally used when you are planning to use statistical methods such as clustering that rely on complete data.
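A minimal sketch of the no-imputation route (the names `expr`, `group`, and the two-group design are my assumptions for illustration):

```r
library(limma)

# expr: genes-by-arrays matrix containing NAs; two-group comparison
group  <- factor(c("A", "A", "A", "B", "B", "B", "B"))
design <- model.matrix(~ group)

# lmFit fits each gene using only its non-missing observations,
# so no prior imputation step is needed
fit <- eBayes(lmFit(expr, design))
topTable(fit, coef = 2)
```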


How does it handle missing data when, for example, just two points are available for a linear model?

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say (as it does for eBayes), but after checking, both show the same t-values.

Thanks.


How does it handle missing data when, for example, just two points are available for a linear model?

The same way that lm() does. What this means when there are just two observed points depends on how many coefficients there were to be estimated in the linear model.
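As a concrete base-R illustration of that point: with two observed values and a two-coefficient model there are zero residual degrees of freedom, so no t-statistic can be computed, whereas a one-coefficient (intercept-only) model still leaves one. The data here are made up:

```r
y     <- c(0.05, 0.02, NA, NA, NA, NA, NA)   # gene with 2 observed values
group <- factor(c("A", "B", "B", "A", "A", "B", "A"))

# Two coefficients, two observations: residual df = 0, SE and t undefined
summary(lm(y ~ group))

# One coefficient (intercept only): residual df = 1, a t-value exists
summary(lm(y ~ 1))
```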

And just another question: I've been discussing with a mate whether lmFit actually gives you a moderated t-statistic or not. The vignette doesn't say (as it does for eBayes),

Why not read the help page help("lmFit"), which links to help("MArrayLM-class")? That tells you exactly what lmFit produces.

but after checking, both show the same t-values.

No they don't. But eBayes() is always used as a follow-up to lmFit(), so the two outputs are cumulative.
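This is easy to check: the object returned by lmFit() contains coefficients and standard-error components but no t-statistics; the `t`, `p.value`, and `lods` components only appear after eBayes(). A sketch, assuming `expr` and `design` as above:

```r
library(limma)

fit <- lmFit(expr, design)
"t" %in% names(fit)      # lmFit alone produces no t-statistics

fit2 <- eBayes(fit)
"t" %in% names(fit2)     # moderated t-statistics added by eBayes
head(fit2$t)
```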
