Question

Error of read in large vcf file using readVcf

0

Entering edit mode

xiw588 • 0

@d4773354

Last seen 3.0 years ago

United States

I tried to read in a large compressed multi-vcf file (20GB) into r and then apply somaticsignatures on the dataset in r. I have already tried to read in chunk and selected the columns that I need to use.I requested 80G on campus cluster but still failed to read in the vcf file at the repeat command stage.

Can anyone look into the code and any suggestion is appreciated!


library(VariantAnnotation)
library(BSgenome.Hsapiens.UCSC.hg19)
library(SomaticSignatures)
library(ggplot2)

setwd('~/VCF')
fl="all.vcf.gz"
vcffile = open(VcfFile(fl, yieldSize=5000))
repeat {
  vcf = readVcf(vcffile, "BSgenome.Hsapiens.UCSC.hg19",param=ScanVcfParam(info=c("RAF","AF","INFO"), geno=c("GT","DS")))
  if (nrow(vcf) == 0)
    break
}


seqlevels(vcf)<-c('chr1','chr2','chr3','chr4','chr5','chr6','chr7','chr8','chr9','chr10','chr11','chr12','chr13','chr14','chr15','chr16','chr17','chr18','chr19','chr20','chr21','chr22','chrX')
vr<- as(vcf, "VRanges")
mc<-mutationContext(vr,BSgenome.Hsapiens.UCSC.hg19)
mm<-motifMatrix(mc, normalize=TRUE,group = "KRAS")
mm[mm == 0]<-0.000001
sigs_nmf = identifySignatures(mm, 6, nmfDecomposition)

pdf(file="plotSignatures.pdf", width=10, height=8)
plotObservedSpectrum(sigs_nmf)
dev.off()

sessionInfo( )
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SomaticSignatures_2.26.0          BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.58.0                  
 [4] rtracklayer_1.50.0                survminer_0.4.9                   ggpubr_0.4.0                     
 [7] ggplot2_3.3.5                     survivalROC_1.0.3                 survival_3.2-11                  
[10] readxl_1.3.1                      MutationalPatterns_3.0.1          NMF_0.23.0                       
[13] cluster_2.1.2                     rngtools_1.5                      pkgmaker_0.32.2                  
[16] registry_0.5-1                    VariantAnnotation_1.36.0          Rsamtools_2.6.0                  
[19] Biostrings_2.58.0                 XVector_0.30.0                    SummarizedExperiment_1.20.0      
[22] Biobase_2.50.0                    GenomicRanges_1.42.0              GenomeInfoDb_1.26.7              
[25] IRanges_2.24.1                    S4Vectors_0.28.1                  MatrixGenerics_1.2.1             
[28] matrixStats_0.59.0                BiocGenerics_0.36.1               maftools_2.6.05                  

loaded via a namespace (and not attached):
  [1] rappdirs_0.3.3           GGally_2.1.2             tidyr_1.1.3              bit64_4.0.5             
  [5] knitr_1.33               DelayedArray_0.16.3      data.table_1.14.0        rpart_4.1-15            
  [9] RCurl_1.98-1.3           AnnotationFilter_1.14.0  doParallel_1.0.16        generics_0.1.0          
 [13] GenomicFeatures_1.42.3   cowplot_1.1.1            RSQLite_2.2.7            shadowtext_0.0.8        
 [17] proxy_0.4-26             bit_4.0.4                enrichplot_1.10.2        xml2_1.3.2              
 [21] assertthat_0.2.1         viridis_0.6.1            xfun_0.24                hms_1.1.0               
 [25] fansi_0.5.0              progress_1.2.2           dbplyr_2.1.1             km.ci_0.5-2             
 [29] igraph_1.2.6             DBI_1.1.1                geneplotter_1.68.0       htmlwidgets_1.5.3       
 [33] reshape_0.8.8            purrr_0.3.4              ellipsis_0.3.2           dplyr_1.0.7             
 [37] backports_1.2.1          markdown_1.1             annotate_1.68.0          gridBase_0.4-7          
 [41] biomaRt_2.46.3           vctrs_0.3.8              ggalluvial_0.12.3        ensembldb_2.14.1        
 [45] abind_1.4-5              cachem_1.0.5             withr_2.4.2              ggforce_0.3.3           
 [49] checkmate_2.0.0          GenomicAlignments_1.26.0 prettyunits_1.1.1        DOSE_3.16.0             
 [53] lazyeval_0.2.2           crayon_1.4.1             genefilter_1.72.1        pkgconfig_2.0.3         
 [57] labeling_0.4.2           tweenr_1.0.2             ProtGenerics_1.22.0      nnet_7.3-16             
 [61] rlang_0.4.11             lifecycle_1.0.0          table1_1.4.2             downloader_0.4          
 [65] BiocFileCache_1.14.0     dichromat_2.0-0          cellranger_1.1.0         polyclip_1.10-0         
 [69] graph_1.68.0             Matrix_1.3-4             KMsurv_0.1-5             carData_3.0-4           
 [73] zoo_1.8-9                base64enc_0.1-3          pheatmap_1.0.12          png_0.1-7               
 [77] viridisLite_0.4.0        bitops_1.0-7             pROC_1.17.0.1            blob_1.2.1              
 [81] stringr_1.4.0            qvalue_2.22.0            jpeg_0.1-8.1             rstatix_0.7.0           
 [85] ggsignif_0.6.2           scales_1.1.1             memoise_2.0.0            magrittr_2.0.1          
 [89] plyr_1.8.6               zlibbioc_1.36.0          compiler_4.0.3           scatterpie_0.1.6        
 [93] tinytex_0.32             RColorBrewer_1.1-2       pcaMethods_1.82.0        DESeq2_1.30.1           
 [97] cli_2.5.0                htmlTable_2.2.1          Formula_1.2-4            MASS_7.3-54             
[101] tidyselect_1.1.1         stringi_1.6.2            forcats_0.5.1            GOSemSim_2.16.1         
[105] askpass_1.1              locfit_1.5-9.4           latticeExtra_0.6-29      ggrepel_0.9.1           
[109] survMisc_0.5.5           grid_4.0.3               fastmatch_1.1-0          tools_4.0.3             
[113] rio_0.5.27               rstudioapi_0.13          foreach_1.5.1            foreign_0.8-81          
[117] gridExtra_2.3            farver_2.1.0             ggraph_2.0.5             digest_0.6.27           
[121] rvcheck_0.1.8            BiocManager_1.30.16      pracma_2.3.3             ggtext_0.1.1            
[125] Rcpp_1.0.6               gridtext_0.1.4           car_3.0-10               broom_0.7.8             
[129] OrganismDbi_1.32.0       httr_1.4.2               AnnotationDbi_1.52.0     ggbio_1.38.0            
[133] biovizBase_1.38.0        colorspace_2.0-2         XML_3.99-0.6             splines_4.0.3           
[137] RBGL_1.66.0              graphlayouts_0.7.1       xtable_1.8-4             tidygraph_1.2.0         
[141] R6_2.5.0                 Hmisc_4.5-0              pillar_1.6.1             htmltools_0.5.1.1       
[145] glue_1.4.2               fastmap_1.1.0            clusterProfiler_3.18.1   BiocParallel_1.24.1     
[149] codetools_0.2-18         fgsea_1.16.0             utf8_1.2.1               lattice_0.20-44         
[153] tibble_3.1.2             curl_4.3.2               zip_2.2.0                GO.db_3.12.1            
[157] openxlsx_4.2.4           openssl_1.4.4            munsell_0.5.0            DO.db_2.9               
[161] GenomeInfoDbData_1.2.4   iterators_1.0.13         haven_2.4.1              reshape2_1.4.4          
[165] gtable_0.3.0

SomaticSignatures VariantAnnotation • 1.8k views

ADD COMMENT • link updated 3.1 years ago by James W. MacDonald 66k • written 3.1 years ago by xiw588 • 0

score 0 · Answer 1 · 2021-06-30

0

Entering edit mode

James W. MacDonald 66k

@james-w-macdonald-5106

Last seen 2 hours ago

United States

This part

vcffile = open(VcfFile(fl, yieldSize=5000))
repeat {
  vcf = readVcf(vcffile, "BSgenome.Hsapiens.UCSC.hg19",param=ScanVcfParam(info=c("RAF","AF","INFO"), geno=c("GT","DS")))
  if (nrow(vcf) == 0)
    break
}

Goes through the whole VCF file, reads in 5000 rows at a time, successively overwriting what you read in the last time, until you hit the end nrow(vcf) == 0), at which time it exits, leaving you with nothing. Which presumably isn't what you wanted to do. Have you tried simply reading in what you want? Like

vcf <- readVcf("all.vcf.gz", "BSgenome.Hsapiens.UCSC.hg19",param=ScanVcfParam(info=c("RAF","AF","INFO"), geno=c("GT","DS")))

I don't know anything about SomaticSignatures, but maybe it doesn't require the whole VCF at once? In which case you could subset and run it by chromosome or something like that. So instead of using yieldSize you could pass in a GRanges object to define some genomic region (like each chromosome, or chunks thereof, etc).

ADD COMMENT • link 3.1 years ago James W. MacDonald 66k

0

Entering edit mode

Hi James, Thank you very much for your prompt response and correcting my mistake. I tried the way you mentioned here but got job killed due to the lack of memory. So I am wondering if there is any other way to overcome this... Thank you!

ADD REPLY • link 3.1 years ago xiw588 • 0

0

Entering edit mode

Also, Can you explain a bit in detail of pass in a GRanges object and maybe give a toy example of reading in chunk? I didn't quite get this part. Thank you.

ADD REPLY • link 3.1 years ago xiw588 • 0

0

Entering edit mode

Here's some pseudo-code. Like I said, I know nothing about the SomaticSignatures package, other than it wants a VRanges object. Your problem is that you can't read in all the VCF at one shot, so what you want to do is iterate through the VCF and get out what you need, which presumably will be more compact. But how you do that is sort of tricky. As an example, consider this:

> thething <- NULL
> system.time(for(i in 1:10000) thething <- rbind(thething, 1:50))
   user  system elapsed 
  10.75    0.10   10.92 
> thething <- matrix(NA, 10000, 50)
> system.time(for(i in 1:1000) thething[i,] <- 1:50)
   user  system elapsed 
      0       0       0

Iterating through your VCF and then concatenating the new data to the existing is absolutely brutal because you have to create a new object each time, which is why my first example is so slow. But if you pre-allocate the object that you are going to use and then just put the data into it, things go way faster. The trick is that you don't know a priori how big the object needs to be. Well, you could use awk and wc to get the number of rows, which is probably the way to go. So I'd do that, and then you can do something like

vcfdataIWant <- matrix(NA, nrow = <number of required rows>, ncol = <number of required columns>)
## now use your code, with adjustments
iteration <- 1
fl="all.vcf.gz"
vcffile = open(VcfFile(fl, yieldSize=5000))
repeat {
  vcf = readVcf(vcffile, "BSgenome.Hsapiens.UCSC.hg19",param=ScanVcfParam(info=c("RAF","AF","INFO"), geno=c("GT","DS")))
  submatrix <- <here you extract the data you need for your VRanges object>
  vcfdataIWant[iteration:(iteration+nrow(vcf) - 1)] <- submatrix
  if (nrow(vcf) == 0)
    break
}

In order to make this work you will have to write a function that takes the CollapsedVCF object you read in at each iteration and extracts all the information you need into a matrix that you can then dump into your pre-formed matrix. Then when you are done, you can convert the matrix to a VRanges object.

There might be something in SomaticSignatures that will get the data you need, so maybe you don't need to write a function. But that's the general idea.

ADD REPLY • link 3.1 years ago James W. MacDonald 66k