impute.knn produces different results based on matrix order
Tamara • 0
@1c25a735
United States

Hi,

I'm working with the impute::impute.knn function and noticed that the imputation result changes depending on the row order of the input matrix. I would expect that fixing rng.seed would produce the same result regardless of order. Despite the different results, the Spearman and Pearson correlations between samples are still very high. Is this a bug? If not, I'd appreciate any help understanding this.

Thanks!


#Load the impute library
> library(impute)

#Load the documentation sample data
> data(khanmiss)
> khan.expr <- khanmiss[-1, -(1:2)]

#Add row names for easier sorting later
> rownames(khan.expr) <- paste0("Gene", 1:2308)

#Check that no random seed exists prior to running imputation on the data set as is
> if(exists(".Random.seed")) rm(.Random.seed)

#Run imputation
> Result1_OriginalOrder <- impute.knn(as.matrix(khan.expr), rng.seed = 500)
Cluster size 2308 broken into 1509 799 
Cluster size 1509 broken into 401 1108 
Done cluster 401 
Done cluster 1108 
Done cluster 1509 
Done cluster 799 

#Create a new row order for the expression matrix
> khan.expr_neworder <- khan.expr[c(2308:1),]

#Clear out any random number state that may be stored
> if(exists(".Random.seed")) rm(.Random.seed)

#Run imputation on the new matrix order
> Result2_ChangedOrder <- impute.knn(as.matrix(khan.expr_neworder), rng.seed = 500)
Cluster size 2308 broken into 1458 850 
Done cluster 1458 
Done cluster 850 

#Extract the imputed results and match row-order between the two data frames
> result1_data <- Result1_OriginalOrder$data
> result2 <- Result2_ChangedOrder$data
> result2_data <- result2[rownames(result1_data),]

#Confirm that these results are different; otherwise all.equal would be TRUE
> all.equal(result1_data, result2_data)
[1] "Mean relative difference: 0.1456079"

#Restore the original row order by reversing the reordered matrix again, then re-impute
> if(exists(".Random.seed")) rm(.Random.seed)
> khan.expr_originalorder <- khan.expr_neworder[c(2308:1),]
> all.equal(khan.expr, khan.expr_originalorder)
[1] TRUE

#Run imputation on the restored original order
> Result3_OriginalOrder <- impute.knn(as.matrix(khan.expr_originalorder), rng.seed = 500)
Cluster size 2308 broken into 1509 799 
Cluster size 1509 broken into 401 1108 
Done cluster 401 
Done cluster 1108 
Done cluster 1509 
Done cluster 799 
> result3_data <- Result3_OriginalOrder$data

# Confirm that the original matrix order produces the same result
> all.equal(result3_data, result1_data)
[1] TRUE
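
The correlation check mentioned at the top of the post is not shown in the transcript. A minimal sketch of one way to do it follows, using small toy matrices as stand-ins for result1_data and result2_data (the values and the names orig, miss, res_a and res_b are hypothetical, not taken from the khanmiss data). It also checks that the two results differ only in cells that were originally missing, which is where any difference between imputation runs should be confined.

```r
# Toy stand-ins: 'orig' plays the role of the input matrix with missing values,
# 'res_a' and 'res_b' play the role of two imputation results (hypothetical values).
set.seed(1)
orig <- matrix(rnorm(60), nrow = 12,
               dimnames = list(paste0("Gene", 1:12), paste0("S", 1:5)))
miss <- matrix(FALSE, nrow(orig), ncol(orig))
miss[cbind(c(2, 5, 9), c(1, 3, 4))] <- TRUE   # three missing cells
orig[miss] <- NA

res_a <- orig; res_a[miss] <- c(0.10, -0.20, 0.30)
res_b <- orig; res_b[miss] <- c(0.12, -0.25, 0.33)

# Observed (non-missing) cells should be untouched by imputation
observed_identical <- identical(res_a[!miss], res_b[!miss])

# Largest absolute disagreement among the imputed cells
max_imputed_diff <- max(abs(res_a[miss] - res_b[miss]))

# Per-sample (column-wise) correlations between the two results
pearson  <- diag(cor(res_a, res_b))
spearman <- diag(cor(res_a, res_b, method = "spearman"))

print(observed_identical)
print(max_imputed_diff)
print(round(pearson, 4))
print(round(spearman, 4))
```

With the real objects, the logical mask would be is.na(as.matrix(khan.expr)), with result2_data already matched to the row order of result1_data as above.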

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] impute_1.76.0

loaded via a namespace (and not attached):
 [1] digest_0.6.33     fastmap_1.1.1     xfun_0.41         lattice_0.22-5    knitr_1.45        parallel_4.3.1   
 [7] htmltools_0.5.7   rmarkdown_2.25    cli_3.6.1         ape_5.7-1         grid_4.3.1        compiler_4.3.1   
[13] rstudioapi_0.15.0 tools_4.3.1       nlme_3.1-164      phylotools_0.2.2  evaluate_0.23     yaml_2.3.8       
[19] Rcpp_1.0.11       rlang_1.1.2
@james-w-macdonald-5106
United States

Does this help you to understand?

> Result1_OriginalOrder <- impute.knn(as.matrix(khan.expr), rng.seed = 500, maxp = 2308)
> Result2_ChangedOrder <- impute.knn(as.matrix(khan.expr_neworder), rng.seed = 500, maxp = 2308)
> result1_data <- Result1_OriginalOrder$data
> result2 <- Result2_ChangedOrder$data
> result2_data <- result2[rownames(result1_data),]
> all.equal(result1_data, result2_data)
[1] TRUE

Thanks! I'm still confused about why maxp has this effect. From the documentation, imputation happens gene-wise, but if all neighbors are missing for a gene, then the overall column mean for that block of genes (which would be influenced by the recursive two-means clustering into blocks of at most maxp genes) is used as the imputed value.

In what case would all neighbors be missing, and how does that relate to rowmax? And should maxp be set as high as possible for the best reproducibility?


From ?impute.knn

 maxp: The largest block of genes imputed using the knn algorithm
          inside 'impute.knn' (default 1500); larger blocks are divided
          by two-means clustering (recursively) prior to imputation. If
          'maxp=p', only knn imputation is done.

When you run with the first ordering it says

Cluster size 2308 broken into 1509 799 
Cluster size 1509 broken into 401 1108 
Done cluster 401 
Done cluster 1108 
Done cluster 1509 
Done cluster 799

and when you reorder it says

Cluster size 2308 broken into 1458 850 
Done cluster 1458 
Done cluster 850

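So in the first case you are breaking your data into three subsets (401, 1108, and 799 genes; the 1509-gene cluster is itself split because it exceeds the default maxp of 1500) and then imputing using the 10 nearest neighbors that also happen to be in that subset of genes. When you reorder, the two-means step happens to split the data into only two clusters (1458 and 850 genes), both below maxp, so no further splitting occurs. If you are imputing using a different subset of genes as candidate neighbors, your expectation should be that you will get different results. If you set maxp = p (here, 2308), then you don't subcluster, so the row order no longer matters.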
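
To make the effect of subclustering concrete, here is a toy sketch in base R. It is not the impute package's implementation: knn_impute_cell is a hypothetical helper that fills a single missing cell with the mean of its k nearest rows, and the data and the choice of block are invented for illustration. It shows that searching all rows gives the same value regardless of row order, while restricting the search to a sub-block can give a different value.

```r
# Hypothetical helper: impute x[row, col] as the mean of that column over the
# k nearest candidate rows (Euclidean-type distance on the other observed columns).
knn_impute_cell <- function(x, row, col, k) {
  candidates <- setdiff(seq_len(nrow(x)), row)
  candidates <- candidates[!is.na(x[candidates, col])]
  obs_cols <- setdiff(which(!is.na(x[row, ])), col)
  d <- sapply(candidates, function(i) {
    sqrt(mean((x[row, obs_cols] - x[i, obs_cols])^2, na.rm = TRUE))
  })
  nearest <- candidates[order(d)[seq_len(min(k, length(candidates)))]]
  mean(x[nearest, col])
}

# Toy data: Gene1 is missing its first value
x <- rbind(
  Gene1 = c(NA, 1.0, 1.0),
  Gene2 = c(10, 1.0, 1.1),
  Gene3 = c(12, 1.2, 1.0),
  Gene4 = c(50, 5.0, 5.0),
  Gene5 = c(11, 0.7, 1.0),
  Gene6 = c(60, 6.0, 6.0)
)

# All rows are candidates (analogous to maxp = p): neighbors are Gene2 and Gene3
full <- knn_impute_cell(x, row = 1, col = 1, k = 2)

# Same search after reversing the row order: same neighbors, same value
x_rev <- x[6:1, ]
full_rev <- knn_impute_cell(x_rev, row = match("Gene1", rownames(x_rev)), col = 1, k = 2)

# Search restricted to an invented sub-block that excludes Gene3
x_block <- x[c("Gene1", "Gene2", "Gene5"), ]
restricted <- knn_impute_cell(x_block, row = 1, col = 1, k = 2)

print(c(full = full, full_rev = full_rev, restricted = restricted))
# full = 11, full_rev = 11, restricted = 10.5
```

In impute.knn the blocks come from the two-means clustering rather than being chosen by hand, but the consequence is the same: once the clustering assigns a gene to a different block, its pool of candidate neighbors changes, and so can its imputed value.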
