Hello,
Within SNPRelate, I have been trying to compute a dissimilarity matrix from input VCF data using the snpgdsDiss function. The resulting matrix, though, has NaN values for a small number of the 80 or so input samples, so I cannot proceed to hierarchical clustering (snpgdsHCluster). The samples carry between 1 and 219 variants each, but the samples with the fewest variants are not the only ones affected. Other than removing the affected samples from the study, is there anything else I can do to create a complete dissimilarity matrix?
library(gdata)      # provides read.xls()
library(SNPRelate)

vcf_data <- file.path("VCFSorts", "multisample.vcf")
gds_data <- file.path("VCFSorts", "multisample.gds")
if (file.exists(gds_data)) {
  file.remove(gds_data)
}
snpgdsVCF2GDS(vcf_data, gds_data, method = "biallelic.only")
snpgdsSummary(gds_data)
geno_data <- snpgdsOpen(gds_data)
pop_data <- read.xls("Sample Sheet.xlsx", sheet = 1, header = TRUE)
pop_code <- pop_data[["Group"]]
pop_list <- read.gdsn(index.gdsn(geno_data, path = "sample.id"))
# show that the sample order is the same as the population order
print(cbind(pop_data, pop_code, pop_list))
# run PCA - THIS WORKS FINE
pca <- snpgdsPCA(geno_data, num.thread = 8)
pc.percent <- pca$varprop * 100
head(round(pc.percent, 2))
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, pop_list)],
    EV1 = pca$eigenvect[, 1],    # the first eigenvector
    EV2 = pca$eigenvect[, 2],    # the second eigenvector
    stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, pch = 16, cex = 2, col = as.integer(tab$pop),
    xlab = "eigenvector 2", ylab = "eigenvector 1")
legend("topright", legend = levels(tab$pop), pch = 15, cex = 1.5,
    col = 1:nlevels(tab$pop))
# Hierarchical Clustering - FAIL
diss <- snpgdsDiss(geno_data, sample.id = NULL, snp.id = NULL,
    autosome.only = TRUE, remove.monosnp = TRUE, maf = NaN,
    missing.rate = NaN, num.thread = 6, verbose = TRUE)
hc <- snpgdsHCluster(diss, sample.id = NULL, need.mat = TRUE, hang = 0.25)
Error in hclust(as.dist(dist), method = "average") : NA/NaN/Inf in foreign function call (arg 11)
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_IE.UTF-8
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] SNPRelate_1.4.0 gdsfmt_1.6.2    gdata_2.17.0

loaded via a namespace (and not attached):
[1] tools_3.2.2  gtools_3.5.0
Hello zhengx,
I ran the snpgdsIBS function on the geno_data object, above. Just like snpgdsDiss, the function ran to completion, and yes, there are NaNs in the output. These NaNs are in the same positions in both matrices.
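Since the NaNs sit in the same positions in both matrices, it may help to list exactly which sample pairs are undefined before deciding what to drop. Below is a minimal base-R sketch on a made-up 4 x 4 matrix; the sample names and values are purely illustrative. On the real data the matrix would presumably be diss$diss with IDs in diss$sample.id (that is my understanding of what snpgdsDiss returns, so please check with str(diss)). The greedy loop is only one possible heuristic for keeping as many samples as possible, not something SNPRelate provides.

```r
# Toy dissimilarity matrix standing in for diss$diss (values are made up).
ids <- c("S1", "S2", "S3", "S4")
d <- matrix(c(0.0, 0.2, NaN, 0.4,
              0.2, 0.0, NaN, 0.3,
              NaN, NaN, 0.0, NaN,
              0.4, 0.3, NaN, 0.0),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# How many undefined entries does each sample have?
nan_per_sample <- rowSums(is.nan(d))

# List each undefined pair once (upper triangle only).
bad <- which(is.nan(d) & upper.tri(d), arr.ind = TRUE)
bad_pairs <- data.frame(sample1 = ids[bad[, "row"]],
                        sample2 = ids[bad[, "col"]])

# Greedy heuristic: repeatedly drop the sample with the most NaNs
# until the remaining sub-matrix is complete.
keep <- ids
while (any(is.nan(d[keep, keep, drop = FALSE]))) {
  counts <- rowSums(is.nan(d[keep, keep, drop = FALSE]))
  keep <- keep[-which.max(counts)]
}

# The complete sub-matrix can now be clustered.
hc_toy <- hclust(as.dist(d[keep, keep]), method = "average")
```

With the real objects, the surviving IDs in keep could then be passed back through the sample.id argument of snpgdsDiss so that snpgdsHCluster receives a complete matrix.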
Can you run "snpgdsSampMissRate" to calculate the missing rate per sample? Then you could identify which samples cause the trouble.
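To make the missing-rate check concrete, here is a small base-R sketch on a made-up genotype matrix (rows are SNPs, columns are samples, NA is a missing call). It computes the per-sample missing rate by hand and also counts how many SNPs are called in both samples of each pair. One plausible explanation for the NaNs, which I have not confirmed against the SNPRelate source, is that a pair with no SNP called in both samples leaves nothing to average over. That would also fit the observation that the smallest samples are not the only ones affected, because what matters is the overlap between two samples rather than the size of either one. On the real data the per-sample rate should come from snpgdsSampMissRate(geno_data), as shown in the comments.

```r
# On the real data (not run here):
#   miss <- snpgdsSampMissRate(geno_data)
#   names(miss) <- read.gdsn(index.gdsn(geno_data, "sample.id"))
#   sort(miss, decreasing = TRUE)

# Toy genotype matrix, values made up for illustration only.
g <- matrix(c( 0, NA,  2,
               1, NA, NA,
              NA,  2, NA,
               2,  1, NA),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("snp", 1:4), c("S1", "S2", "S3")))

# Per-sample missing rate: fraction of SNPs without a call.
miss_toy <- colMeans(is.na(g))

# Number of SNPs called in BOTH samples of each pair.
called <- 1 * (!is.na(g))        # 1 = called, 0 = missing
shared <- crossprod(called)      # t(called) %*% called

# Pairs that share no called SNP at all.
no_overlap <- which(shared == 0 & upper.tri(shared), arr.ind = TRUE)
print(miss_toy)
print(shared)
print(no_overlap)
```

If the zero-overlap pairs line up with the NaN positions in the real matrices, that would support the explanation above.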