Question

VariantAnnotation::readVcf not reading variants

0

Entering edit mode

peiwen.li • 0

@3b918f2c

Last seen 3.0 years ago

Canada

Hi, I am trying to use VariantAnnotation::readVcf to read in a VCF file in chunks. However the resulting VCF file does not have any variants:

Code should be placed in three backticks as shown below

> fl<-"mac3.recode.vcf.bgz"
> tab <- VcfFile(fl, yieldSize=4000)
> open(tab)
> while (nrow(vcf_yield <- readVcf(tab, "Salvelinus")))
  cat("vcf dim:", dim(vcf_yield), "\n")
> close(tab)

> dim(vcf_yield)
[1]   0 138

> geno(vcf_yield)
List of length 11
names(11): GT AD DP GQ MIN_DP PGT PID PL PS RGQ SB

> sessionInfo( )
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.34.0    Rsamtools_2.4.0             Biostrings_2.56.0          
 [4] XVector_0.28.0              SummarizedExperiment_1.18.2 DelayedArray_0.14.1        
 [7] matrixStats_0.56.0          GenomicFeatures_1.40.1      AnnotationDbi_1.50.3       
[10] Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2        
[13] IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.34.0        

loaded via a namespace (and not attached):
 [1] progress_1.2.2           tidyselect_1.1.0         purrr_0.3.4              lattice_0.20-41         
 [5] vctrs_0.3.6              generics_0.1.0           BiocFileCache_1.12.1     rtracklayer_1.48.0      
 [9] yaml_2.2.1               blob_1.2.1               XML_3.99-0.5             rlang_0.4.10            
[13] pillar_1.4.6             glue_1.4.1               DBI_1.1.0                BiocParallel_1.22.0     
[17] rappdirs_0.3.1           bit64_4.0.5              dbplyr_2.1.0             GenomeInfoDbData_1.2.3  
[21] lifecycle_0.2.0          stringr_1.4.0            zlibbioc_1.34.0          memoise_1.1.0           
[25] biomaRt_2.44.4           curl_4.3                 Rcpp_1.0.5               BSgenome_1.56.0         
[29] openssl_1.4.2            bit_4.0.4                hms_0.5.3                askpass_1.1             
[33] digest_0.6.27            stringi_1.4.6            dplyr_1.0.4              grid_4.0.2              
[37] tools_4.0.2              bitops_1.0-6             magrittr_1.5             RCurl_1.98-1.2          
[41] RSQLite_2.2.3            tibble_3.0.3             crayon_1.3.4             pkgconfig_2.0.3         
[45] Matrix_1.2-18            ellipsis_0.3.1           xml2_1.3.2               prettyunits_1.1.1       
[49] assertthat_0.2.1         httr_1.4.2               rstudioapi_0.11          R6_2.4.1                
[53] GenomicAlignments_1.24.0 compiler_4.0.2

My VCF input is an output from GATK's VariantFiltration, which was then put to VCFtools for a series of filtering.

Any idea why this would happen? Thank you!

VariantAnnotation • 1.2k views

ADD COMMENT • link updated 3.0 years ago by James W. MacDonald 65k • written 3.0 years ago by peiwen.li • 0

0

Entering edit mode

Hi again. It seems it is a yieldSize problem. I tried with the example dataset and it shows the same error:

> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
> tab <- VcfFile(fl, yieldSize=4000)
> open(tab)
> while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
+     cat("vcf dim:", dim(vcf_yield), "\n")
vcf dim: 4000 5 
vcf dim: 4000 5 
vcf dim: 2376 5 
> close(tab)
> dim(vcf_yield)
[1] 0 5

If I don't iterate through the VCF with yieldSize, the function works well:

> fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
> vcf <- readVcf(fl, "hg19")
> dim(vcf)
[1] 10376     5

My VCF file is too large to be imported as a whole (requires too much memory). Any ideas on how I can get around with this issue? Thanks!

ADD REPLY • link 3.0 years ago peiwen.li • 0

score 0 · Answer 1 · 2021-05-20

You are reading the file in chunks, each time overwriting the previous chunk, until the number of rows of the output is equal to zero. You then check the dimensions of the last chunk you read, which by definition now has zero rows because of how you set your while loop to work. Your expectation should be that the row dimension is zero! Do note that 4000 + 4000 + 2376 = 10376

It's not clear exactly what you are trying to do here. If you want to be able to read the entire VCF into R, and you cannot do it all at once, you will not be able to read it in chunks and then combine. The latter is more memory/computationally expensive than the former. The idea behind reading data in chunks is that you can read in a chunk, do something with it, and then go on to the next chunk so you never have to have the whole VCF in memory at once.