Hi,
I want to compute the linkage disequilibrium between the SNPs of some predifined regions of the genome of around Mb using snpStats. I have downloaded 1000 Genomes VCF files but I am experiencing some errors.
I start with a list of target genomic regions in a GRanges. I have created some functions to read the SNPs of the region, transform the VCF object into a SNP matrix and finally compute the LD. Most of these steps are performed using existing BioC functions. The problems appear when I want to compute the LD for a big numbers of SNPs with a big depth.
I tried to compute the LD using the ld function of snpStats and the following problem raised:
> computeLD(GRcomp[2], EUR)) *** caught segfault *** address 0x7fe323878000, cause 'invalid permissions' Traceback: 1: .Call("ld", x, y, as.integer(depth), cstats, symmetric, PACKAGE = "snpStats") 2: ld(snpsVCF$genotypes, depth = 30000, stats = "D.prime") 3: computeLD(GRcomp[2], EUR) Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace
This region is quite big (14Mb) and contains many snps (379737)
> as.character(GRcomp[2]) merge2 "2:99698199-114173850"
This is an example of a very big region but I have had problems with smaller regions too. I have tested the wrappers with smaller regions and it works. Is there any limitation in the maximum of SNPs and depth that can be passed to ld? I haven't been able to find anything about it on the web.
I also send you the code of my wrappers. The functions use VCF files of 1000 Genomes:
computeLD <- function(range, samples){ message("Loading VCF") snpsVCF <- getVCFmatrix(range, samples) message("Computing LD") res <- ld(snpsVCF$genotypes, depth = 30000, stats = "D.prime") res }
getVCFmatrix <- function(range, samples = vcfsamples){ vcf <- loadVCFrange(range, samples) snpsVCF <- genotypeToSnpMatrix(vcf) ## Conversion from VCF to SNP matrix produces some SNPs to be NA (multiallelic or bigger than 1) snpsVCF$map$position <- start(rowRanges(vcf)) snpsVCF$map$chromosome <- rep(as.character(seqnames(range)), nrow(snpsVCF$map)) snpsVCF$genotypes <- snpsVCF$genotypes[, !snpsVCF$map$ignore] snpsVCF$map <- snpsVCF$map[!snpsVCF$map$ignore, ] snpsVCF$map$allele.2 <- unlist(snpsVCF$map$allele.2) snpsVCF }
loadVCFrange <- function(range, samples){ vcffile <- paste0("ALL.chr", as.character(seqnames(range)), ".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz") vcfsamples <- samples(scanVcfHeader(vcffile)) samples <- vcfsamples[vcfsamples %in% samples] param <- ScanVcfParam(samples = samples, which = range) vcf <- readVcf(vcffile, "hg19", param) vcf }
I also send you my session info:
R version 3.3.0 (2016-05-03) Platform: x86_64-pc-linux-gnu (64-bit) Running under: CentOS release 6.7 (Final) locale: [1] LC_CTYPE=es_ES.UTF-8 LC_NUMERIC=C [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=es_ES.UTF-8 [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=es_ES.UTF-8 [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] snpStats_1.22.0 Matrix_1.2-6 [3] survival_2.39-4 VariantAnnotation_1.18.1 [5] SummarizedExperiment_1.2.2 Biobase_2.32.0 [7] ggplot2_2.1.0 Rsamtools_1.24.0 [9] Biostrings_2.40.1 XVector_0.12.0 [11] GenomicRanges_1.24.1 GenomeInfoDb_1.8.1 [13] IRanges_2.6.0 S4Vectors_0.10.1 [15] BiocGenerics_0.18.0 loaded via a namespace (and not attached): [1] Rcpp_0.12.5 AnnotationDbi_1.34.3 splines_3.3.0 [4] GenomicAlignments_1.8.1 zlibbioc_1.18.0 munsell_0.4.3 [7] BiocParallel_1.6.2 lattice_0.20-33 BSgenome_1.40.0 [10] colorspace_1.2-6 plyr_1.8.3 tools_3.3.0 [13] grid_3.3.0 gtable_0.2.0 DBI_0.4-1 [16] rtracklayer_1.32.0 bitops_1.0-6 biomaRt_2.28.0 [19] RCurl_1.95-4.8 RSQLite_1.0.0 GenomicFeatures_1.24.2 [22] scales_0.4.0 XML_3.98-1.4
Bests,
Carlos Ruiz