Incomplete GWAS Catalog Data from makeCurrentGwascat()
Note: this post is also on Biostars. The suggestion I got there was that my internet connection was failing, but no error message indicates this, and I consistently get exactly 6427 records, so I am not convinced that is the reason. If it is, does anyone have advice on a fix or alternative other than "get better internet"?

I want to query the GWAS Catalog using the gwascat package in R. I was surprised to see that makeCurrentGwascat() returns only 6,427 associations when there are many more in the GWAS Catalog. Is this what I should expect, or is something going wrong here?

> cat1 <- makeCurrentGwascat()
running read.delim on
formatting gwaswloc instance...
NOTE: input data had non-ASCII characters replaced by '*'.
Warning message:
In gwdf2GRanges(tab, extractDate = as.character(Sys.Date())) :
  NAs introduced by coercion
> cat1
gwasloc instance with 6427 records and 38 attributes per record.
Extracted:  2021-01-12 
Genome:  GRCh38 
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |                 DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |                   <character> <character> <numeric>
  [1]       22  41151150      * | General risk tolerance (MTAG)  rs75843224     6e-14
  [2]        1 207861610      * | General risk tolerance (MTAG)    rs984983     6e-14
  [3]        2  59787624      * | General risk tolerance (MTAG)   rs6732097     6e-14
  [4]       12 102069362      * | General risk tolerance (MTAG)  rs17437668     9e-14
  [5]        6  26173250      * | General risk tolerance (MTAG)  rs34661691     9e-14
  seqinfo: 23 sequences from GRCh38 genome

Contrast this with the 2016 snapshot that ships with the package, which has more associations:

gwasloc instance with 22714 records and 36 attributes per record.
Extracted:  2016-01-18 
Genome:  GRCh38 
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |                  DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |                    <character> <character> <numeric>
  [1]       11  41798900      * | Post-traumatic stress disorder  rs10768747     5e-06
  [2]       15  34768262      * | Post-traumatic stress disorder  rs12232346     2e-06
  [3]        8  96500749      * | Post-traumatic stress disorder   rs2437772     6e-06
  [4]        9  98221544      * | Post-traumatic stress disorder   rs7866350     1e-06
  [5]       15  54423444      * | Post-traumatic stress disorder  rs73419609     6e-06
  seqinfo: 23 sequences from GRCh38 genome

My session info:

> sessionInfo()
R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)

Matrix products: default

[1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] gwascat_2.18.0                          Homo.sapiens_1.3.1                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2                    
 [5] GO.db_3.10.0                            OrganismDbi_1.28.0                      GenomicFeatures_1.38.2                  GenomicRanges_1.38.0                   
 [9] GenomeInfoDb_1.22.1                     AnnotationDbi_1.48.0                    IRanges_2.20.2                          S4Vectors_0.24.4                       
[13] Biobase_2.46.0                          BiocGenerics_0.32.0                    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                  lattice_0.20-41             prettyunits_1.1.1           Rsamtools_2.2.3             Biostrings_2.54.0           assertthat_0.2.1           
 [7] digest_0.6.27               asreml_4.1.0.110            BiocFileCache_1.10.2        R6_2.5.0                    RSQLite_2.2.2               httr_1.4.2                 
[13] ggplot2_3.3.3               pillar_1.4.7                zlibbioc_1.32.0             rlang_0.4.10                progress_1.2.2              curl_4.3                   
[19] rstudioapi_0.13             data.table_1.13.6           blob_1.2.1                  Matrix_1.2-18               BiocParallel_1.20.1         stringr_1.4.0              
[25] RCurl_1.98-1.2              bit_4.0.4                   biomaRt_2.42.1              munsell_0.5.0               DelayedArray_0.12.3         compiler_3.6.2             
[31] rtracklayer_1.46.0          pkgconfig_2.0.3             askpass_1.1                 openssl_1.4.3               tidyselect_1.1.0            SummarizedExperiment_1.16.1
[37] tibble_3.0.4                GenomeInfoDbData_1.2.2      matrixStats_0.57.0          XML_3.99-0.3                crayon_1.3.4                dplyr_1.0.2                
[43] dbplyr_2.0.0                GenomicAlignments_1.22.1    bitops_1.0-6                rappdirs_0.3.1              RBGL_1.62.1                 grid_3.6.2                 
[49] gtable_0.3.0                lifecycle_0.2.0             DBI_1.1.0                   magrittr_2.0.1              scales_1.1.1                graph_1.64.0               
[55] stringi_1.5.3               XVector_0.26.0              ellipsis_0.3.1              generics_0.1.0              vctrs_0.3.6                 tools_3.6.2                
[61] bit64_4.0.5                 glue_1.4.2                  purrr_0.3.4                 hms_0.5.3                   colorspace_2.0-0            BiocManager_1.30.10        
[67] memoise_1.1.0

Thanks all.

You are right that 6,427 records is far too few. I would advise you to use a current version of R (at least 4.0). This is what a correct result looks like:

> library(gwascat)
1/70 packages newly attached/loaded, see sessionInfo() for details.
> options(timeout=360)
> cur = makeCurrentGwascat()
trying URL ''
downloaded 142.6 MB

|==================================================================| 100% 142 MB
Warning: 5260 parsing failures.
  row            col               expected                actual                                file
   72 SNP_ID_CURRENT no trailing characters 2162231-C             '/tmp/Rtmpm8XCTA/file413987a497c74'
 4542 SNP_ID_CURRENT no trailing characters 7769879-?             '/tmp/Rtmpm8XCTA/file413987a497c74'
19088 CHR_POS        no trailing characters 24486138 x 29201690   '/tmp/Rtmpm8XCTA/file413987a497c74'
19089 CHR_POS        no trailing characters 138645814 x 118244643 '/tmp/Rtmpm8XCTA/file413987a497c74'
19090 CHR_POS        no trailing characters 118661955 x 170402454 '/tmp/Rtmpm8XCTA/file413987a497c74'
..... .............. ...................... ..................... ...................................
See problems(...) for more details.

formatting gwaswloc instance...
NOTE: input data had non-ASCII characters replaced by '*'.
> cur
gwasloc instance with 216521 records and 38 attributes per record.
Extracted:  2021-01-13 
metadata()$badpos includes records for which no unique locus was given.
Genome:  GRCh38 
GRanges object with 5 ranges and 3 metadata columns:
      seqnames    ranges strand |          DISEASE/TRAIT        SNPS   P-VALUE
         <Rle> <IRanges>  <Rle> |            <character> <character> <numeric>
  [1]       22  41151150      * | General risk toleran..  rs75843224     6e-14
  [2]        1 207861610      * | General risk toleran..    rs984983     6e-14
  [3]        2  59787624      * | General risk toleran..   rs6732097     6e-14
  [4]       12 102069362      * | General risk toleran..  rs17437668     9e-14
  [5]        6  26173250      * | General risk toleran..  rs34661691     9e-14
  seqinfo: 24 sequences from GRCh38 genome

I cannot guarantee that the options(timeout=360) setting shown above will help you, but it is worth a try. My sessionInfo() result, which corresponds to a valid installation of all packages according to BiocManager::valid(), is

> sessionInfo()
R version 4.0.2 Patched (2020-07-19 r78892)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS (fossa-melisa X20)

Matrix products: default
BLAS:   /home/stvjc/R-4-0-dist/lib/R/lib/
LAPACK: /home/stvjc/R-4-0-dist/lib/R/lib/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] gwascat_2.22.0 rmarkdown_2.6 

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5                  lattice_0.20-41            
 [3] prettyunits_1.1.1           Rsamtools_2.6.0            
 [5] Biostrings_2.58.0           assertthat_0.2.1           
 [7] digest_0.6.27               BiocFileCache_1.14.0       
 [9] R6_2.5.0                    GenomeInfoDb_1.26.2        
[11] stats4_4.0.2                RSQLite_2.2.2              
[13] evaluate_0.14               httr_1.4.2                 
[15] pillar_1.4.7                zlibbioc_1.36.0            
[17] rlang_0.4.10                GenomicFeatures_1.42.1     
[19] progress_1.2.2              curl_4.3                   
[21] blob_1.2.1                  S4Vectors_0.28.1           
[23] Matrix_1.3-2                startup_0.15.0             
[25] splines_4.0.2               BiocParallel_1.24.1        
[27] readr_1.4.0                 stringr_1.4.0              
[29] RCurl_1.98-1.2              bit_4.0.4                  
[31] biomaRt_2.46.0              DelayedArray_0.16.0        
[33] rtracklayer_1.50.0          compiler_4.0.2             
[35] xfun_0.20                   askpass_1.1                
[37] pkgconfig_2.0.3             BiocGenerics_0.36.0        
[39] htmltools_0.5.1             openssl_1.4.3              
[41] tidyselect_1.1.0            SummarizedExperiment_1.20.0
[43] tibble_3.0.4                GenomeInfoDbData_1.2.4     
[45] IRanges_2.24.1              matrixStats_0.57.0         
[47] XML_3.99-0.5                crayon_1.3.4               
[49] dplyr_1.0.2                 dbplyr_2.0.0               
[51] GenomicAlignments_1.26.0    bitops_1.0-6               
[53] rappdirs_0.3.1              grid_4.0.2                 
[55] lifecycle_0.2.0             DBI_1.1.0                  
[57] magrittr_2.0.1              stringi_1.5.3              
[59] XVector_0.30.0              xml2_1.3.2                 
[61] snpStats_1.40.0             ellipsis_0.3.1             
[63] generics_0.1.0              vctrs_0.3.6                
[65] tools_4.0.2                 bit64_4.0.5                
[67] BSgenome_1.58.0             Biobase_2.50.0             
[69] glue_1.4.2                  purrr_0.3.4                
[71] hms_0.5.3                   MatrixGenerics_1.2.0       
[73] survival_3.2-7              parallel_4.0.2             
[75] AnnotationDbi_1.52.0        BiocManager_1.30.10        
[77] GenomicRanges_1.42.0        memoise_1.1.0              
[79] knitr_1.30                  VariantAnnotation_1.36.0
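
On the timeout setting used above: base R's default for options("timeout") is 60 seconds, which may be too short for a file of this size on a slow connection. Below is a minimal sketch of raising it only for the duration of the download and then sanity-checking the record count. The helper names with_timeout and looks_complete are made up for illustration and are not part of gwascat. The 22714 threshold is simply the size of the 2016 snapshot quoted in the question, used here as an assumed lower bound.

```r
# Temporarily raise the download timeout, run a download step, then restore
# the previous setting. `fetch` is a placeholder for whatever performs the
# download, e.g. function() gwascat::makeCurrentGwascat().
with_timeout <- function(fetch, seconds = 360) {
  old <- options(timeout = max(seconds, getOption("timeout")))
  on.exit(options(old), add = TRUE)
  fetch()
}

# Hypothetical sanity check: a current catalog should not be smaller than
# the 2016 snapshot quoted in the question (22714 records).
looks_complete <- function(n_records, min_expected = 22714) {
  is.numeric(n_records) && length(n_records) == 1 && n_records >= min_expected
}

# Usage (needs network access and the gwascat package):
# cur <- with_timeout(function() gwascat::makeCurrentGwascat())
# looks_complete(length(cur))
```

With the counts reported in this thread, looks_complete(6427) is FALSE and looks_complete(216521) is TRUE, so a check like this would have flagged the short result straight away.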

Updating R and redownloading all of the packages solved my problem. Less of an internet issue, more of a me being lazy issue :) Thanks.
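
For anyone landing here with the same symptom, here is a rough sketch of the update step, assuming R itself has already been upgraded to 4.0 or later. The function names r_is_current and refresh_gwascat are invented for this example. BiocManager::install() and BiocManager::valid() are the standard Bioconductor calls, the latter being the check mentioned in the answer.

```r
# Does the running R meet the minimum version recommended in the answer?
r_is_current <- function(minimum = "4.0.0") {
  getRversion() >= minimum
}

# Reinstall gwascat from the Bioconductor release that matches the running R,
# then check that the installed packages are mutually consistent.
# Not run automatically: it needs network access and may take a while.
refresh_gwascat <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("gwascat")
  BiocManager::valid()
}

# Usage:
# if (r_is_current()) refresh_gwascat()
```

If r_is_current() returns FALSE, upgrade R first. Reinstalling packages under R 3.6 would presumably keep you on the older gwascat release shown in the question's session info.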


