I believe MafH5.gnomAD.v3.1.2.GRCh38 is supposed to include indels. The example in the manual gscores(mafh5, GRanges("3:46373452-46373484"), type="nonsnrs")
does output rs333. However, I've got an example of indels not reported and I can't figure why.
and the missed indels : 17-80112300-CTTGG-C and 17-80112305-AGAGGAAGGACCCTGGATGC-A
MRE
library("BiocManager")
library("GenomeInfoDb")
library("MafH5.gnomAD.v3.1.2.GRCh38")
test_score <- GRanges(seqnames=17, IRanges(start=80112300:80112310, width=1))
test_result <- gscores(mafdb_hg38,test_score)
as.data.frame(test_result)
output
seqnames start end width strand AF
1 chr17 80112300 80112300 1 * NA
2 chr17 80112301 80112301 1 * NA
3 chr17 80112302 80112302 1 * NA
4 chr17 80112303 80112303 1 * NA
5 chr17 80112304 80112304 1 * NA
6 chr17 80112305 80112305 1 * NA
7 chr17 80112306 80112306 1 * NA
8 chr17 80112307 80112307 1 * NA
9 chr17 80112308 80112308 1 * 7e-06
10 chr17 80112309 80112309 1 * NA
11 chr17 80112310 80112310 1 * NA
I found type="nonsnrs"
but that didn't help.
test_result <- gscores(mafdb_hg38,test_score,type="nonsnrs")
as.data.frame(test_result)
output
seqnames start end width strand AF
1 chr17 80112300 80112300 1 * NA
2 chr17 80112301 80112301 1 * NA
3 chr17 80112302 80112302 1 * NA
4 chr17 80112303 80112303 1 * NA
5 chr17 80112304 80112304 1 * NA
6 chr17 80112305 80112305 1 * NA
7 chr17 80112306 80112306 1 * NA
8 chr17 80112307 80112307 1 * NA
9 chr17 80112308 80112308 1 * NA
10 chr17 80112309 80112309 1 * NA
11 chr17 80112310 80112310 1 * NA
please also include the results of running the following in an R session
sessionInfo( ) sessionInfo( ) R version 4.2.0 (2022-04-22 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
1 LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: 1 stats4 stats graphics grDevices utils datasets [7] methods base
other attached packages:
1 BSgenome.Hsapiens.UCSC.hg38_1.4.4
[2] BSgenome.Hsapiens.UCSC.hg19_1.4.3
[3] MafH5.gnomAD.v3.1.2.GRCh38_3.15.0
[4] MafDb.gnomAD.r2.1.hs37d5_3.10.0
[5] GenomicScores_2.8.2
[6] BSgenome_1.64.0
[7] rtracklayer_1.56.1
[8] Biostrings_2.64.0
[9] XVector_0.36.0
[10] GenomicRanges_1.48.0
[11] GenomeInfoDb_1.32.2
[12] IRanges_2.30.0
[13] S4Vectors_0.34.0
[14] BiocGenerics_0.42.0
[15] BiocManager_1.30.18
loaded via a namespace (and not attached):
1 MatrixGenerics_1.8.1 Biobase_2.56.0
[3] httr_1.4.3 bit64_4.0.5
[5] AnnotationHub_3.4.0 shiny_1.7.1
[7] assertthat_0.2.1 interactiveDisplayBase_1.34.0
[9] BiocFileCache_2.4.0 blob_1.2.3
[11] GenomeInfoDbData_1.2.8 Rsamtools_2.12.0
[13] yaml_2.3.5 BiocVersion_3.15.2
[15] pillar_1.7.0 RSQLite_2.2.14
[17] lattice_0.20-45 glue_1.6.2
[19] digest_0.6.29 promises_1.2.0.1
[21] htmltools_0.5.2 httpuv_1.6.5
[23] Matrix_1.4-1 XML_3.99-0.10
[25] pkgconfig_2.0.3 zlibbioc_1.42.0
[27] purrr_0.3.4 xtable_1.8-4
[29] HDF5Array_1.24.1 later_1.3.0
[31] BiocParallel_1.30.3 tibble_3.1.7
[33] KEGGREST_1.36.2 generics_0.1.3
[35] ellipsis_0.3.2 cachem_1.0.6
[37] SummarizedExperiment_1.26.1 cli_3.3.0
[39] magrittr_2.0.3 crayon_1.5.1
[41] mime_0.12 memoise_2.0.1
[43] fansi_1.0.3 tools_4.2.0
[45] BiocIO_1.6.0 lifecycle_1.0.1
[47] matrixStats_0.62.0 Rhdf5lib_1.18.2
[49] DelayedArray_0.22.0 AnnotationDbi_1.58.0
[51] compiler_4.2.0 rlang_1.0.2
[53] rhdf5_2.40.0 grid_4.2.0
[55] RCurl_1.98-1.7 rhdf5filters_1.8.0
[57] rstudioapi_0.13 rjson_0.2.21
[59] rappdirs_0.3.3 bitops_1.0-7
[61] restfulr_0.0.15 codetools_0.2-18
[63] DBI_1.1.3 curl_4.3.2
[65] R6_2.5.1 GenomicAlignments_1.32.0
[67] dplyr_1.0.9 fastmap_1.1.0
[69] bit_4.0.4 utf8_1.2.2
[71] filelock_1.0.2 parallel_4.2.0
[73] Rcpp_1.0.8.3 png_0.1-7
[75] vctrs_0.4.1 dbplyr_2.2.1
[77] tidyselect_1.1.2
```
ah... that's why the example in the manual worked! Is there a way to get all scores (in this case indels) that span a specific range (or single bp)? I tried
minoverlap=0
in gscores but I'm still not picking the indel up. thanks!The
gscores()
function is designed to return the scores in the form of a base numeric vector (by setting the argumentscores.only=TRUE
) or a GRanges vector (default) that is _parallel_ to the input GRanges vector. This is because the original motivation for developing GenomicScores was to be able to query genomic scores for a given set of positions, such as when you have a variant set in a VCF file, but not to query what positions have genomic scores.The behavior you are asking would imply to alter that design decision of the function
gscores()
, so I think we need another function for that purpose that you would call first, and then use the result to callgscores()
. Could you please open an issue at the GenomicScores GitHub repo and I'll see what I can do.Ok, that makes sense. I'll open an issue there. Thanks for your help.
For other users interested in this thread, the link to the issue with this feature request is here, which has been recently updated with the addition of this feature in devel version 2.9.5 of GenomicScores through a new function called
wgscores()
. Anyone with comments or suggestions on this new feature, please add them to the conversation in the GitHub issue.