NA values in snpgdsDiss dissimilarity matrix
blackgore ▴ 10
@blackgore-3871
Last seen 8.6 years ago
Ireland

Hello,

Within SNPRelate, I have been trying to compute a dissimilarity matrix from input VCF data using the snpgdsDiss function. The resulting matrix, though, has NaN values for a small number of the 80 or so input samples, so I cannot proceed to the clustering step (snpgdsHCluster). The VCF data range from 1 to 219 variants per sample, but the samples with the fewest variants are not the only ones affected. Other than removing the affected samples from the study, is there anything else I can do to create a complete dissimilarity matrix?

vcf_data<- file.path("VCFSorts","multisample.vcf")

gds_data <- file.path("VCFSorts","multisample.gds")
if(file.exists(gds_data)){file.remove(gds_data)}
snpgdsVCF2GDS(vcf_data, gds_data, method="biallelic.only")
snpgdsSummary(gds_data)
geno_data <- snpgdsOpen(gds_data)

pop_code <- pop_data[["Group"]]

# show that the sample order is the same as the population order
print(cbind(pop_data, pop_code, pop_list))


# run PCA - THIS WORKS FINE
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))

# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
    pop = factor(pop_code)[match(pca$sample.id, pop_list)],
    EV1 = pca$eigenvect[,1],    # the first eigenvector
    EV2 = pca$eigenvect[,2],    # the second eigenvector
    stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, pch=16, cex=2, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topright", legend=levels(tab$pop), pch=15, cex=1.5 , col=1:nlevels(tab$pop))

# Hierarchical Clustering  - FAIL
hc<-snpgdsHCluster(diss, sample.id=NULL,need.mat=TRUE,hang=0.25)

Error in hclust(as.dist(dist), method = "average") :
NA/NaN/Inf in foreign function call (arg 11)

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
[1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_IE.UTF-8
[6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] SNPRelate_1.4.0 gdsfmt_1.6.2    gdata_2.17.0

loaded via a namespace (and not attached):
[1] tools_3.2.2  gtools_3.5.0


zhengx ▴ 30
@zhengx-7950
Last seen 4.9 years ago
United States

Are you able to run the snpgdsIBS identity-by-state analysis? Are there missing values in the IBS result as well?


Hello zhengx,

I ran the snpgdsIBS function on the geno_data object, above. Just like snpgdsDiss, the function ran to completion, and yes, there are NaNs in the output. These NaNs are in the same positions in both matrices.
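
For anyone wanting to pin down exactly which samples are involved, here is a minimal base-R sketch. It uses a small made-up matrix `d` as a stand-in for the pairwise matrix (`diss$diss` from snpgdsDiss, or `ibs$ibs` from snpgdsIBS); the sample names and values are hypothetical. One plausible cause, not confirmed in this thread, is that a NaN appears for a pair of samples with no SNP genotyped in both. The greedy loop at the end is only an illustration of how to find a small set of samples to drop so that the remaining matrix is complete.

```r
# Toy stand-in for diss$diss; sample IDs and values are hypothetical.
ids <- c("S1", "S2", "S3", "S4")
d <- matrix(c(0,   0.2, NaN, 0.4,
              0.2, 0,   0.3, 0.1,
              NaN, 0.3, 0,   NaN,
              0.4, 0.1, NaN, 0),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Sample pairs (upper triangle only) whose dissimilarity is NaN.
bad_pairs <- which(is.na(d) & upper.tri(d), arr.ind = TRUE)
print(bad_pairs)

# Number of NaN entries per sample.
na_count <- rowSums(is.na(d))
print(na_count)

# Greedy sketch: repeatedly drop the sample with the most NaNs
# until the remaining sub-matrix has none.
keep <- ids
while (anyNA(d[keep, keep])) {
  counts <- rowSums(is.na(d[keep, keep, drop = FALSE]))
  keep <- setdiff(keep, names(which.max(counts)))
}
print(keep)
```

With real output, `d` would be replaced by the matrix from the result object with `dimnames` set from its `sample.id` component, and `keep` could then be passed as the `sample.id` argument when recomputing the matrix.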


Can you run "snpgdsSampMissRate" to calculate the missing rate per sample? That should let you identify which samples are causing the trouble.
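
A rough sketch of that check, run here on the example GDS file that ships with SNPRelate (via `snpgdsExampleFileName()`) rather than on the poster's data. The 0.95 cutoff is a hypothetical value chosen only for illustration, and whether high missingness fully explains the NaNs in this case is an assumption.

```r
library(SNPRelate)

# Example GDS file bundled with SNPRelate; substitute your own file here.
genofile <- snpgdsOpen(snpgdsExampleFileName())

# Missing genotype rate per sample, labelled with the sample IDs.
miss_rate <- snpgdsSampMissRate(genofile)
names(miss_rate) <- read.gdsn(index.gdsn(genofile, "sample.id"))

# Samples with the highest missing rates come first.
print(head(sort(miss_rate, decreasing = TRUE)))

# Hypothetical cutoff: keep samples below 95% missing genotypes.
threshold <- 0.95
keep_ids <- names(miss_rate)[miss_rate < threshold]

# Recompute the dissimilarity matrix on the retained samples only.
diss <- snpgdsDiss(genofile, sample.id = keep_ids)

snpgdsClose(genofile)
```

Comparing the samples at the top of the sorted list with the rows that contain NaN in the dissimilarity matrix should show whether the two sets coincide.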
