VariantAnnotation::readVcf not reading variants
1
0
Entering edit mode
peiwen.li • 0
@3b918f2c
Last seen 16 months ago
Canada

Hi, I am trying to use VariantAnnotation::readVcf to read in a VCF file in chunks. However the resulting VCF file does not have any variants:

Code should be placed in three backticks as shown below

> fl<-"mac3.recode.vcf.bgz"
> tab <- VcfFile(fl, yieldSize=4000)
> open(tab)
> while (nrow(vcf_yield <- readVcf(tab, "Salvelinus")))
  cat("vcf dim:", dim(vcf_yield), "\n")
> close(tab)

> dim(vcf_yield)
[1]   0 138

> geno(vcf_yield)
List of length 11
names(11): GT AD DP GQ MIN_DP PGT PID PL PS RGQ SB

> sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.34.0    Rsamtools_2.4.0             Biostrings_2.56.0          
 [4] XVector_0.28.0              SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [7] matrixStats_0.56.0          GenomicFeatures_1.40.1      AnnotationDbi_1.50.3       
[10] Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[13] IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] progress_1.2.2           tidyselect_1.1.0         purrr_0.3.4              lattice_0.20-41         
 [5] vctrs_0.3.6              generics_0.1.0           BiocFileCache_1.12.1     rtracklayer_1.48.0      
 [9] yaml_2.2.1               blob_1.2.1               XML_3.99-0.5             rlang_0.4.10            
[13] pillar_1.4.6             glue_1.4.1               DBI_1.1.0                BiocParallel_1.22.0     
[17] rappdirs_0.3.1           bit64_4.0.5              dbplyr_2.1.0             GenomeInfoDbData_1.2.3  
[21] lifecycle_0.2.0          stringr_1.4.0            zlibbioc_1.34.0          memoise_1.1.0           
[25] biomaRt_2.44.4           curl_4.3                 Rcpp_1.0.5               BSgenome_1.56.0         
[29] openssl_1.4.2            bit_4.0.4                hms_0.5.3                askpass_1.1             
[33] digest_0.6.27            stringi_1.4.6            dplyr_1.0.4              grid_4.0.2              
[37] tools_4.0.2              bitops_1.0-6             magrittr_1.5             RCurl_1.98-1.2          
[41] RSQLite_2.2.3            tibble_3.0.3             crayon_1.3.4             pkgconfig_2.0.3         
[45] Matrix_1.2-18            ellipsis_0.3.1           xml2_1.3.2               prettyunits_1.1.1       
[49] assertthat_0.2.1         httr_1.4.2               rstudioapi_0.11          R6_2.4.1                
[53] GenomicAlignments_1.24.0 compiler_4.0.2

My VCF input is an output from GATK's VariantFiltration, which was then put to VCFtools for a series of filtering.

Any idea why this would happen? Thank you!

VariantAnnotation • 476 views
ADD COMMENT
0
Entering edit mode

Hi again. It seems it is a yieldSize problem. I tried with the example dataset and it shows the same error:

> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
> tab <- VcfFile(fl, yieldSize=4000)
> open(tab)
> while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
+     cat("vcf dim:", dim(vcf_yield), "\n")
vcf dim: 4000 5 
vcf dim: 4000 5 
vcf dim: 2376 5 
> close(tab)
> dim(vcf_yield)
[1] 0 5

If I don't iterate through the VCF with yieldSize, the function works well:

> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> dim(vcf)
[1] 10376     5

My VCF file is too large to be imported as a whole (requires too much memory). Any ideas on how I can get around with this issue? Thanks!

ADD REPLY
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

You are reading the file in chunks, each time overwriting the previous chunk, until the number of rows of the output is equal to zero. You then check the dimensions of the last chunk you read, which by definition now has zero rows because of how you set your while loop to work. Your expectation should be that the row dimension is zero! Do note that 4000 + 4000 + 2376 = 10376

It's not clear exactly what you are trying to do here. If you want to be able to read the entire VCF into R, and you cannot do it all at once, you will not be able to read it in chunks and then combine. The latter is more memory/computationally expensive than the former. The idea behind reading data in chunks is that you can read in a chunk, do something with it, and then go on to the next chunk so you never have to have the whole VCF in memory at once.

ADD COMMENT

Login before adding your answer.

Traffic: 491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6