Question

NA values in snpgdsDiss dissimilarity matrix

blackgore ▴ 10

@blackgore-3871

Last seen 8.4 years ago

Ireland

Hello,

Within SNPRelate, I have been trying to compute a dissimilarity matrix from input VCF data using the snpgdsDiss function. The resulting matrix, though, contains NaN values for a small number of the 80 or so input samples, so I cannot proceed to the clustering step (snpgdsHCluster). Each sample has between 1 and 219 variants, and the affected samples are not only those with the fewest variants. Other than removing the affected samples from the study, is there anything else I can do to create a complete dissimilarity matrix?

library(SNPRelate)
library(gdata)   # provides read.xls

vcf_data <- file.path("VCFSorts","multisample.vcf")

gds_data <- file.path("VCFSorts","multisample.gds")
if(file.exists(gds_data)){file.remove(gds_data)}
snpgdsVCF2GDS(vcf_data, gds_data, method="biallelic.only")
snpgdsSummary(gds_data)
geno_data <- snpgdsOpen(gds_data)

pop_data <- read.xls("Sample Sheet.xlsx", sheet=1,header=TRUE)
pop_code <- pop_data[["Group"]]
pop_list <- read.gdsn(index.gdsn(geno_data, path="sample.id")) 

# show that the sample order is the same as the population order
print(cbind(pop_data, pop_code, pop_list))


# run PCA - THIS WORKS FINE
pca <- snpgdsPCA(geno_data, num.thread=8)
pc.percent <- pca$varprop*100
head(round(pc.percent, 2))
 
# make a data.frame
tab <- data.frame(sample.id = pca$sample.id,
                 pop = factor(pop_code)[match(pca$sample.id, pop_list)],
                 EV1 = pca$eigenvect[,1],    # the first eigenvector
                 EV2 = pca$eigenvect[,2],    # the second eigenvector
                 stringsAsFactors = FALSE)
plot(tab$EV2, tab$EV1, pch=16, cex=2, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topright", legend=levels(tab$pop), pch=15, cex=1.5 , col=1:nlevels(tab$pop))



# Hierarchical Clustering  - FAIL
diss <- snpgdsDiss(geno_data, sample.id=NULL, snp.id=NULL, autosome.only=TRUE,
                   remove.monosnp=TRUE, maf=NaN, missing.rate=NaN,
                   num.thread=6, verbose=TRUE)
hc <- snpgdsHCluster(diss, sample.id=NULL, need.mat=TRUE, hang=0.25)

Error in hclust(as.dist(dist), method = "average") : 
  NA/NaN/Inf in foreign function call (arg 11)

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 15.10

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_IE.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_IE.UTF-8   
 [6] LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_IE.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_IE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SNPRelate_1.4.0 gdsfmt_1.6.2    gdata_2.17.0   

loaded via a namespace (and not attached):
[1] tools_3.2.2  gtools_3.5.0

snprelate • 2.3k views

ADD COMMENT • link updated 8.4 years ago by zhengx ▴ 30 • written 8.4 years ago by blackgore ▴ 10

score 0 · Answer 1 · 2015-11-25

zhengx ▴ 30

@zhengx-7950

Last seen 4.7 years ago

United States

Are you able to run snpgdsIBS Identity-By-State analysis? Is there any missing value in the result of IBS analysis also?

ADD COMMENT • link 8.4 years ago zhengx ▴ 30

Hello zhengx,

I ran the snpgdsIBS function on the geno_data object, above. Just like snpgdsDiss, the function ran to completion, and yes, there are NaNs in the output. These NaNs are in the same positions in both matrices.
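
A minimal sketch of how the affected samples might be located once a matrix contains NaNs. It uses a toy symmetric matrix as a stand-in for the real result. With SNPRelate the matrix would presumably be `diss$diss`, labelled by `diss$sample.id`, but that is an assumption to check against the package documentation. The helper `drop_until_complete` is hypothetical and only illustrates the "remove the affected samples" fallback.

```r
# Toy stand-in for a dissimilarity matrix with undefined (NaN) pairs.
ids <- c("S1", "S2", "S3", "S4")
d <- matrix(c(0.0, 0.2, 0.3, NaN,
              0.2, 0.0, 0.4, NaN,
              0.3, 0.4, 0.0, 0.5,
              NaN, NaN, 0.5, 0.0),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Count undefined entries per sample; the samples with the most NaNs
# are the likeliest source of the problem.
nan_per_sample <- rowSums(is.na(d))
print(sort(nan_per_sample, decreasing = TRUE))

# Hypothetical helper: repeatedly drop the sample with the most
# undefined pairs until the remaining matrix is complete.
drop_until_complete <- function(m) {
  while (anyNA(m)) {
    worst <- which.max(rowSums(is.na(m)))
    m <- m[-worst, -worst, drop = FALSE]
  }
  m
}

d_complete <- drop_until_complete(d)

# hclust() accepts the pruned matrix because no NA/NaN entries remain.
hc <- hclust(as.dist(d_complete), method = "average")
```

The greedy pruning tends to remove few samples when a handful of them account for most of the NaNs, but it is not guaranteed to find the smallest possible set.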

ADD REPLY • link 8.4 years ago blackgore ▴ 10

Can you run "snpgdsSampMissRate" to calculate the missing rate per sample? Then you could identify which samples are causing the trouble.
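
As a rough illustration of what such a check looks for, the sketch below computes a per-sample missing rate and the number of SNPs genotyped in both samples of each pair, using a toy genotype matrix in base R. One plausible cause of NaN entries (an assumption, not confirmed in this thread) is a pair of samples with no SNP genotyped in both, so that the pairwise average is 0/0. On real data the rate would come from `snpgdsSampMissRate(geno_data)`; the toy matrix and all variable names here are illustrative only.

```r
# Toy genotype matrix: rows = samples, columns = SNPs, NA = missing call.
g <- matrix(c( 0,  1, NA, NA,
               2, NA, NA, NA,
              NA, NA,  1,  2,
               0,  1,  2,  1),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("S", 1:4), paste0("snp", 1:4)))

# Per-sample missing rate: fraction of SNPs without a genotype call.
miss_rate <- rowMeans(is.na(g))
print(miss_rate)

# Number of SNPs genotyped in BOTH samples of every pair.
observed <- (!is.na(g)) * 1          # 1 = called, 0 = missing
shared <- tcrossprod(observed)       # samples x samples count matrix

# Pairs with zero shared SNPs cannot yield a defined pairwise value.
no_overlap <- which(shared == 0 & upper.tri(shared), arr.ind = TRUE)
print(no_overlap)
```

A sample can have a moderate missing rate and still share no SNPs with another particular sample, which would be consistent with the NaNs not being confined to the samples with the fewest variants.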

ADD REPLY • link 8.4 years ago zhengx ▴ 30
