Search
Question: Error when computing ld with snpStats in big samples
1
gravatar for carlos.ruiz
17 months ago by
Spain/Barcelona/CREAL
carlos.ruiz0 wrote:

Hi,

I want to compute the linkage disequilibrium between the SNPs of some predifined regions of the genome of around Mb using snpStats. I have downloaded 1000 Genomes VCF files but I am experiencing some errors.

I start with a list of target genomic regions in a GRanges. I have created some functions to read the SNPs of the region, transform the VCF object into a SNP matrix and finally compute the LD. Most of these steps are performed using existing BioC functions. The problems appear when I want to compute the LD for a big numbers of SNPs with a big depth.

I tried to compute the LD using the ld function of snpStats and the following problem raised:

> computeLD(GRcomp[2], EUR))

*** caught segfault ***
address 0x7fe323878000, cause 'invalid permissions'

Traceback:
 1: .Call("ld", x, y, as.integer(depth), cstats, symmetric, PACKAGE = "snpStats")
 2: ld(snpsVCF$genotypes, depth = 30000, stats = "D.prime")

3: computeLD(GRcomp[2], EUR)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace

This region is quite big (14Mb) and contains many snps (379737)

>  as.character(GRcomp[2])
                merge2
"2:99698199-114173850"

This is an example of a very big region but I have had problems with smaller regions too. I have tested the wrappers with smaller regions and it works. Is there any limitation in the maximum of SNPs and depth that can be passed to ld? I haven't been able to find anything about it on the web.

I also send you the code of my wrappers. The functions use VCF files of 1000 Genomes:

computeLD <- function(range, samples){
  message("Loading VCF")
  snpsVCF <- getVCFmatrix(range, samples)
  message("Computing LD")
  res <- ld(snpsVCF$genotypes, depth = 30000, stats = "D.prime")
  res
}

getVCFmatrix <- function(range, samples = vcfsamples){
  vcf <- loadVCFrange(range, samples)
  snpsVCF <- genotypeToSnpMatrix(vcf)
  ## Conversion from VCF to SNP matrix produces some SNPs to be NA (multiallelic or bigger than 1)
  snpsVCF$map$position <- start(rowRanges(vcf))
  snpsVCF$map$chromosome <- rep(as.character(seqnames(range)), nrow(snpsVCF$map))
  snpsVCF$genotypes <- snpsVCF$genotypes[, !snpsVCF$map$ignore]
  snpsVCF$map <- snpsVCF$map[!snpsVCF$map$ignore, ]
  snpsVCF$map$allele.2 <- unlist(snpsVCF$map$allele.2)
  snpsVCF
}

loadVCFrange <- function(range, samples){
  vcffile <- paste0("ALL.chr", as.character(seqnames(range)), ".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz")
  vcfsamples <- samples(scanVcfHeader(vcffile))
  samples <- vcfsamples[vcfsamples %in% samples]
  param <- ScanVcfParam(samples = samples, which = range)
  vcf <- readVcf(vcffile, "hg19", param)
  vcf
}

I also send you my session info:

R version 3.3.0 (2016-05-03)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS release 6.7 (Final)

locale:
 [1] LC_CTYPE=es_ES.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=es_ES.UTF-8
 [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=es_ES.UTF-8
 [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] snpStats_1.22.0            Matrix_1.2-6
 [3] survival_2.39-4            VariantAnnotation_1.18.1
 [5] SummarizedExperiment_1.2.2 Biobase_2.32.0
 [7] ggplot2_2.1.0              Rsamtools_1.24.0
 [9] Biostrings_2.40.1          XVector_0.12.0
[11] GenomicRanges_1.24.1       GenomeInfoDb_1.8.1
[13] IRanges_2.6.0              S4Vectors_0.10.1
[15] BiocGenerics_0.18.0

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5             AnnotationDbi_1.34.3    splines_3.3.0
 [4] GenomicAlignments_1.8.1 zlibbioc_1.18.0         munsell_0.4.3
 [7] BiocParallel_1.6.2      lattice_0.20-33         BSgenome_1.40.0
[10] colorspace_1.2-6        plyr_1.8.3              tools_3.3.0
[13] grid_3.3.0              gtable_0.2.0            DBI_0.4-1
[16] rtracklayer_1.32.0      bitops_1.0-6            biomaRt_2.28.0
[19] RCurl_1.95-4.8          RSQLite_1.0.0           GenomicFeatures_1.24.2
[22] scales_0.4.0            XML_3.98-1.4

Bests,

Carlos Ruiz

ADD COMMENTlink modified 17 months ago • written 17 months ago by carlos.ruiz0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 312 users visited in the last hour