readVcf(remoteFile, "hg19", "region") works on macos, fails on linux and in docker
3
1
Entering edit mode
Paul Shannon ▴ 460
@paul-shannon-5944
Last seen 8 days ago
United States

I can read an indexed, tar.gz vcf file remotely from macos, but not on linux nor from a current docker devel image.

The file is served by a Flask CORS range-enabled webserver. Problem can be reproduced with the 5 lines shown below, which succeed on macos, fails in docker with

[E::hts_open_format] Failed to open file https://igv-data.systemsbiology.net/static/ampad/test/chr2.vcf.gz
roi <- GRanges(seqnames="2", IRanges(start=127084188, end=127084203))
tabixFile <-  TabixFile("https://igv-data.systemsbiology.net/static/ampad/test/chr2.vcf.gz")
x <- readVcf(tabixFile, "hg19", roi)
dim(geno(x)$GT)  # 2 1894

works fine with macos (first sessionInfo()) fails with docker & linux (second session Info. error message in the latter:

*[E::hts_open_format] Failed to open file https://igv-data.systemsbiology.net/static/ampad/test/chr2.vcf.gz
Error in open.BcfFile(BcfFile(file, character(0), ...)) : 
  'open' VCF/BCF failed
  filename: https://igv-data.systemsbiology.net/static/ampad/test/chr2.vcf.gz*

sessionInfo( ) # macos. which runs without error

R version 4.1.0 (2021-05-18)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.38.0    Rsamtools_2.8.0            
 [3] Biostrings_2.60.1           XVector_0.32.0             
 [5] SummarizedExperiment_1.22.0 Biobase_2.52.0             
 [7] GenomicRanges_1.44.0        GenomeInfoDb_1.28.1        
 [9] IRanges_2.26.0              S4Vectors_0.30.0           
[11] MatrixGenerics_1.4.0        matrixStats_0.59.0         
[13] BiocGenerics_0.38.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6               lattice_0.20-44          prettyunits_1.1.1       
 [4] png_0.1-7                assertthat_0.2.1         digest_0.6.27           
 [7] utf8_1.2.1               BiocFileCache_2.0.0      R6_2.5.0                
[10] RSQLite_2.2.7            httr_1.4.2               pillar_1.6.1            
[13] zlibbioc_1.38.0          rlang_0.4.11             GenomicFeatures_1.44.0  
[16] progress_1.2.2           curl_4.3.2               rstudioapi_0.13         
[19] blob_1.2.1               Matrix_1.3-4             BiocParallel_1.26.1     
[22] stringr_1.4.0            RCurl_1.98-1.3           bit_4.0.4               
[25] biomaRt_2.48.2           DelayedArray_0.18.0      rtracklayer_1.52.0      
[28] compiler_4.1.0           pkgconfig_2.0.3          tidyselect_1.1.1        
[31] KEGGREST_1.32.0          tibble_3.1.2             GenomeInfoDbData_1.2.6  
[34] XML_3.99-0.6             fansi_0.5.0              crayon_1.4.1            
[37] dplyr_1.0.7              dbplyr_2.1.1             GenomicAlignments_1.28.0
[40] bitops_1.0-7             rappdirs_0.3.3           grid_4.1.0              
[43] lifecycle_1.0.0          DBI_1.1.1                magrittr_2.0.1          
[46] stringi_1.6.2            cachem_1.0.5             xml2_1.3.2              
[49] ellipsis_0.3.2           filelock_1.0.2           vctrs_0.3.8             
[52] generics_0.1.0           rjson_0.2.20             restfulr_0.0.13         
[55] tools_4.1.0              bit64_4.0.5              BSgenome_1.60.0         
[58] glue_1.4.2               purrr_0.3.4              hms_1.1.0               
[61] yaml_2.2.1               fastmap_1.1.0            AnnotationDbi_1.54.1    
[64] memoise_2.0.0            BiocIO_1.2.0    

sessionInfo()  from today's docker devel, runs with openBcfFile error shown above 

R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.8.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] VariantAnnotation_1.39.0    Rsamtools_2.9.1            
 [3] Biostrings_2.61.1           XVector_0.33.0             
 [5] SummarizedExperiment_1.23.1 Biobase_2.53.0             
 [7] GenomicRanges_1.45.0        GenomeInfoDb_1.29.3        
 [9] IRanges_2.27.0              S4Vectors_0.31.0           
[11] MatrixGenerics_1.5.1        matrixStats_0.59.0         
[13] BiocGenerics_0.39.1        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7               lattice_0.20-44          prettyunits_1.1.1       
 [4] png_0.1-7                assertthat_0.2.1         digest_0.6.27           
 [7] utf8_1.2.1               BiocFileCache_2.1.1      R6_2.5.0                
[10] RSQLite_2.2.7            httr_1.4.2               pillar_1.6.1            
[13] zlibbioc_1.39.0          rlang_0.4.11             GenomicFeatures_1.45.0  
[16] progress_1.2.2           curl_4.3.2               rstudioapi_0.13         
[19] blob_1.2.1               Matrix_1.3-4             BiocParallel_1.27.1     
[22] stringr_1.4.0            RCurl_1.98-1.3           bit_4.0.4               
[25] biomaRt_2.49.2           DelayedArray_0.19.1      rtracklayer_1.53.0      
[28] compiler_4.1.0           pkgconfig_2.0.3          tidyselect_1.1.1        
[31] KEGGREST_1.33.0          tibble_3.1.2             GenomeInfoDbData_1.2.6  
[34] XML_3.99-0.6             fansi_0.5.0              crayon_1.4.1            
[37] dplyr_1.0.7              dbplyr_2.1.1             GenomicAlignments_1.29.0
[40] bitops_1.0-7             rappdirs_0.3.3           grid_4.1.0              
[43] lifecycle_1.0.0          DBI_1.1.1                magrittr_2.0.1          
[46] stringi_1.6.2            cachem_1.0.5             xml2_1.3.2              
[49] ellipsis_0.3.2           filelock_1.0.2           vctrs_0.3.8             
[52] generics_0.1.0           rjson_0.2.20             restfulr_0.0.13         
[55] tools_4.1.0              bit64_4.0.5              BSgenome_1.61.0         
[58] glue_1.4.2               purrr_0.3.4              hms_1.1.0               
[61] yaml_2.2.1               parallel_4.1.0           fastmap_1.1.0           
[64] AnnotationDbi_1.55.1     memoise_2.0.0            BiocIO_1.3.0
open.BcfFile VariantAnnotation • 250 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

Seems OK to me.

> roi <- GRanges(seqnames="2", IRanges(start=127084188, end=127084203))
> tabixFile <-  TabixFile("https://igv-data.systemsbiology.net/static/ampad/test/chr2.vcf.gz")
> x <- readVcf(tabixFile, "hg19", roi)
> x
class: CollapsedVCF 
dim: 2 1894 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 24 columns: AC, AF, AN, BaseQRankSum, DP, DS, END, ExcessHe...
info(header(vcf)):
                       Number Type    Description                              
   AC                  A      Integer Allele count in genotypes, for each AL...
   AF                  A      Float   Allele Frequency, for each ALT allele,...
   AN                  1      Integer Total number of alleles in called geno...
   BaseQRankSum        1      Float   Z-score from Wilcoxon rank sum test of...
   DP                  1      Integer Approximate read depth; some reads may...
   DS                  0      Flag    Were any of the samples downsampled?     
   END                 1      Integer Stop position of the interval            
   ExcessHet           1      Float   Phred-scaled p-value for exact test of...
   FS                  1      Float   Phred-scaled p-value using Fisher's ex...
   HaplotypeScore      1      Float   Consistency of the site with at most t...
   InbreedingCoeff     1      Float   Inbreeding coefficient as estimated fr...
   MLEAC               A      Integer Maximum likelihood expectation (MLE) f...
   MLEAF               A      Float   Maximum likelihood expectation (MLE) f...
   MQ                  1      Float   RMS Mapping Quality                      
   MQ0                 1      Integer Total Mapping Quality Zero Reads         
   MQRankSum           1      Float   Z-score From Wilcoxon rank sum test of...
   NEGATIVE_TRAIN_SITE 0      Flag    This variant was used to build the neg...
   POSITIVE_TRAIN_SITE 0      Flag    This variant was used to build the pos...
   QD                  1      Float   Variant Confidence/Quality by Depth      
   ReadPosRankSum      1      Float   Z-score from Wilcoxon rank sum test of...
   SOR                 1      Float   Symmetric Odds Ratio of 2x2 contingenc...
   VQSLOD              1      Float   Log odds of being a true variant versu...
   VariantType         1      String  Variant type description                 
   culprit             1      String  The annotation which was the worst per...
geno(vcf):
  List of length 12: GT, AB, AD, DP, GQ, MIN_DP, MQ0, PGT, PID, PL, ...
geno(header(vcf)):
          Number Type    Description                                           
   GT     1      String  Genotype                                              
   AB     1      Float   Allele balance for each het genotype                  
   AD     .      Integer Allelic depths for the ref and alt alleles in the o...
   DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
   GQ     1      Integer Genotype Quality                                      
   MIN_DP 1      Integer Minimum DP observed within the GVCF block             
   MQ0    1      Integer Number of Mapping Quality Zero Reads per sample       
   PGT    1      String  Physical phasing haplotype information, describing ...
   PID    1      String  Physical phasing ID information, where each unique ...
   PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
   RGQ    1      Integer Unconditional reference genotype confidence, encode...
   SB     4      Integer Per-sample component statistics which comprise the ...
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /share/apps/MKL/mkl-2019.3/compilers_and_libraries_2019.3.199/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] VariantAnnotation_1.38.0    Rsamtools_2.8.0            
 [3] Biostrings_2.60.1           XVector_0.32.0             
 [5] SummarizedExperiment_1.22.0 Biobase_2.52.0             
 [7] GenomicRanges_1.44.0        GenomeInfoDb_1.28.0        
 [9] IRanges_2.26.0              S4Vectors_0.30.0           
[11] MatrixGenerics_1.4.0        matrixStats_0.59.0         
[13] BiocGenerics_0.38.0         openxlsx_4.2.3             
[15] HardyWeinberg_1.7.2         Rsolnp_1.16                
[17] mice_3.13.0                

loaded via a namespace (and not attached):
 [1] httr_1.4.2               tidyr_1.1.3              bit64_4.0.5             
 [4] assertthat_0.2.1         BiocFileCache_2.0.0      blob_1.2.1              
 [7] BSgenome_1.60.0          GenomeInfoDbData_1.2.6   yaml_2.2.1              
[10] progress_1.2.2           pillar_1.6.1             RSQLite_2.2.7           
[13] backports_1.2.1          lattice_0.20-44          glue_1.4.2              
[16] digest_0.6.27            Matrix_1.3-3             XML_3.99-0.6            
[19] pkgconfig_2.0.3          broom_0.7.7              biomaRt_2.48.0          
[22] zlibbioc_1.38.0          purrr_0.3.4              BiocParallel_1.26.0     
[25] tibble_3.1.2             KEGGREST_1.32.0          generics_0.1.0          
[28] ellipsis_0.3.2           cachem_1.0.5             GenomicFeatures_1.44.0  
[31] magrittr_2.0.1           crayon_1.4.1             memoise_2.0.0           
[34] fansi_0.5.0              truncnorm_1.0-8          prettyunits_1.1.1       
[37] tools_4.1.0              hms_1.1.0                BiocIO_1.2.0            
[40] lifecycle_1.0.0          stringr_1.4.0            zip_2.2.0               
[43] DelayedArray_0.18.0      AnnotationDbi_1.54.1     compiler_4.1.0          
[46] rlang_0.4.11             grid_4.1.0               RCurl_1.98-1.3          
[49] rstudioapi_0.13          rjson_0.2.20             rappdirs_0.3.3          
[52] bitops_1.0-7             restfulr_0.0.13          curl_4.3.1              
[55] DBI_1.1.1                R6_2.5.0                 GenomicAlignments_1.28.0
[58] rtracklayer_1.52.0       dplyr_1.0.6              fastmap_1.1.0           
[61] bit_4.0.4                utf8_1.2.1               filelock_1.0.2          
[64] stringi_1.6.2            Rcpp_1.0.6               vctrs_0.3.8             
[67] png_0.1-7                dbplyr_2.1.1             tidyselect_1.1.1        
>

Perhaps it's something in the devel version of VariantAnnotation that needs to be ironed out. But doesn't seem to be Linux specific

0
Entering edit mode
Paul Shannon ▴ 460
@paul-shannon-5944
Last seen 8 days ago
United States

On bioc-devel Martin correctly surmised that readVcf(remoteFile) intermittently failed due to something wrong on our web server.

It turns out that we had some expired ssl certificates on the nginx server which dispatched to Flask.

The irregular behavior we saw - where readVcf worked in some settings, and failed in others - is apparently due to different ssl stringencies in different operating systems, including across different versions of linux.

I should have known better than to pin this on VariantAnnotation and Rsamtools.

Thanks to Jim and Martin for giving this some thought, nudging us in the right direction to find a solution.

  • Paul
ADD COMMENT
0
Entering edit mode
Alex • 0
@393bcb60
Last seen 3 months ago
India

thanks for the awesome information.

ADD COMMENT

Login before adding your answer.

Traffic: 139 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6