Are subpopulation MAFs available for gnomADv.3.1.1 with any package, like they are in MafDb.gnomAD.r2.1.hs37d5? I'm trying to use Genomic Scores to obtain all variants in a genomic range with MAF in any subpopulation >= cutoff.
I tried MafH5.gnomAD.v3.1.1.GRCh38, but it only seems to have one AF column.
mafdb <- MafDb.gnomAD.r2.1.hs37d5
populations(mafdb)
[1] "AF" "AF_afr" "AF_amr" "AF_asj" "AF_eas"
[6] "AF_fin" "AF_nfe" "AF_nfe_est" "AF_nfe_nwe" "AF_nfe_onf"
[11] "AF_nfe_seu" "AF_oth" "AF_popmax"
mafdb38 <- MafH5.gnomAD.v3.1.1.GRCh38
populations(mafdb38)
[1] "AF"
sessionInfo( )
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] MafH5.gnomAD.v3.1.1.GRCh38_3.13.0
[2] MafDb.gnomAD.r2.1.hs37d5_3.10.0
[3] GenomicScores_2.6.1
[4] XML_3.99-0.9
[5] BSgenome_1.62.0
[6] Biostrings_2.62.0
[7] XVector_0.34.0
[8] liftOver_1.18.0
[9] Homo.sapiens_1.3.1
[10] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[11] org.Hs.eg.db_3.14.0
[12] GO.db_3.14.0
[13] OrganismDbi_1.36.0
[14] GenomicFeatures_1.46.5
[15] AnnotationDbi_1.56.2
[16] Biobase_2.54.0
[17] rtracklayer_1.54.0
[18] GenomicRanges_1.46.1
[19] GenomeInfoDb_1.30.1
[20] IRanges_2.28.0
[21] S4Vectors_0.32.3
[22] BiocGenerics_0.40.0
[23] gwascat_2.26.0
loaded via a namespace (and not attached):
[1] bitops_1.0-7 matrixStats_0.61.0
[3] bit64_4.0.5 filelock_1.0.2
[5] progress_1.2.2 httr_1.4.2
[7] tools_4.1.3 utf8_1.2.2
[9] R6_2.5.1 HDF5Array_1.22.1
[11] DBI_1.1.2 rhdf5filters_1.6.0
[13] tidyselect_1.1.2 prettyunits_1.1.1
[15] bit_4.0.4 curl_4.3.2
[17] compiler_4.1.3 graph_1.72.0
[19] cli_3.2.0 xml2_1.3.3
[21] DelayedArray_0.20.0 readr_2.1.2
[23] RBGL_1.70.0 rappdirs_0.3.3
[25] stringr_1.4.0 digest_0.6.29
[27] Rsamtools_2.10.0 htmltools_0.5.2
[29] pkgconfig_2.0.3 MatrixGenerics_1.6.0
[31] dbplyr_2.1.1 fastmap_1.1.0
[33] rlang_1.0.2 rstudioapi_0.13
[35] RSQLite_2.2.11 shiny_1.7.1
[37] BiocIO_1.4.0 generics_0.1.2
[39] BiocParallel_1.28.3 dplyr_1.0.8
[41] VariantAnnotation_1.40.0 RCurl_1.98-1.6
[43] magrittr_2.0.2 GenomeInfoDbData_1.2.7
[45] Matrix_1.4-1 Rhdf5lib_1.16.0
[47] Rcpp_1.0.8.3 fansi_1.0.2
[49] lifecycle_1.0.1 stringi_1.7.6
[51] yaml_2.3.5 SummarizedExperiment_1.24.0
[53] zlibbioc_1.40.0 AnnotationHub_3.2.2
[55] rhdf5_2.38.1 BiocFileCache_2.2.1
[57] grid_4.1.3 blob_1.2.2
[59] promises_1.2.0.1 parallel_4.1.3
[61] snpStats_1.44.0 crayon_1.5.1
[63] lattice_0.20-45 splines_4.1.3
[65] hms_1.1.1 KEGGREST_1.34.0
[67] pillar_1.7.0 rjson_0.2.21
[69] biomaRt_2.50.3 BiocVersion_3.14.0
[71] glue_1.6.2 BiocManager_1.30.16
[73] httpuv_1.6.5 png_0.1-7
[75] vctrs_0.3.8 tzdb_0.2.0
[77] purrr_0.3.4 assertthat_0.2.1
[79] cachem_1.0.6 mime_0.12
[81] xtable_1.8-4 restfulr_0.0.13
[83] later_1.3.0 survival_3.3-1
[85] tibble_3.1.6 GenomicAlignments_1.30.0
[87] memoise_2.0.1 interactiveDisplayBase_1.32.0
[89] ellipsis_0.3.2
Ah, that makes sense. AF_popmax would be be better than global AF for my use case, but still not ideal; I believe AF_popmax excludes samples of Ashkenazi, Finnish, and indeterminate ancestry (unless you are calculating popmax manually using AC for each subpopulation during creation). I'm looking for variants under primers that could cause allelic-dropout and some of our samples come from these population groups. So my use case might be too specific for broad use, but we really need to make sure we aren't biasing our designs at the detriment to these populations. Could different packages be created for each subpopulation? Not sure if it would work, but I think I could just annotate regions with each population MAF and then keep the max (basically create AF_popmax that doesn't exclude any group). Thanks for creating this and for the help!
While it's true that a Google Search for "gnomAD AF_popmax" leads to this hit from the GATK team where they say that AF_popmax excludes samples of Ashkenazi, Finnish, and indeterminate ancestry, the only documentation I could find about AF_popmax in the gnomAD website here for the version 3 data doesn't say anything about excluding populations, which would be weird to neglect. Could it be that in this last version 3 they do not exclude anymore those populations from AF_popmax?
Checking one of their example variants at:
https://gnomad.broadinstitute.org/variant/1-55051215-G-GA?dataset=gnomad_r3
The help text about "Popmax filtering AF" doesn't say anything about excluding populations. If the column AF_popmax would be what you're looking for, this would greatly simplify the job, because it would be just adding one more column to the data.
Hm, I'm not sure if v3 popmax excluded these or no. It does look like they changed how the popmax_AF was calculated, but I don't see a description on the gnomad site that says that these populations were excluded. The UCSC table schema for v2 and v3 both say "Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)" , but I think this might be an error for v3.
an example here with high Ashkenazi MAF. where the gnomad "Popmax Filtering AF" lists Latino/Admix America instead of Ashkenazi (but I'm not sure if if this filtering AF is the same as the popmax_AF.
The help page describes this "Popmax Filtering AF" as "this is the highest disease-specific maximum credible population AF for which the observed AC is not compatible with pathogenicity", but in the example you give the difference between Ashkenazi and Latino/Admix is of an order of magnitude, so looks like Ashkenazi was excluded. I have opened an issue at the gnomAD browser GH repo to see if we can clear up this.
Assuming gnomAD excludes some populations, something we could also do is to generate our own "popmax" column (I'm always trying to see whether we can minimize the number of packages and storage) at the time processing the gnomAD VCF files, taking the maximum AF among all the populations (African / African American, Amish, Latino / Admixed American, Ashkenazi Jewish, East Asian, European (Finnish), Middle Eastern, European (non-Finnish), South Asian and Other) and then derive the MAF from that maximum AF. Would this fit your use case?
yes, calculating the popmax at the time of processing with all populations would be exactly what I am looking for. Thanks!
I've dug into this a little more. Looking at the gnomAD browser for this variant, AF for Ashkenazi Jewish (asj) = 0.02078 I downloaded the vcf. In the vcf this variant has,
popmax=nfe
,AF_popmax=0.0114985
,AF_asj=0.0207792
Good, this example confirms that gnomAD excludes at least the Ashkenazi Jewish from the popmax calculation. I've already downloaded the latest version 3.1.2 and I'm starting processing it to have the package ready with this updated version and the new popmax column for the next release. I'll write here once it becomes available.
great, thank you
hi, the new annotation package MafH5.gnomAD.v3.1.2.GRCh38 was released yesterday with Bioconductor 3.15. This package includes MAF values for the global population, denoted as "AF" and for a new population denoted as "AF_allpopmax" and which contains the largest MAF value among all populations, see the following example on the variant examined above in this thread:
as you can see, the maximum value is now the one from the Ashkenazi Jewish population. Because, it includes now this additional population, the package size has doubled to about 1.5Gb. I recommend to increase the download timeout to avoid errors, by doing:
prior to installing the package via
BiocManager
.thanks for updating this! I'll grab this one.
Hi Robert, The update works perfectly for my use case.
I've got a shiny app that works great locally, but any idea why it wouldn't behave when I try to deploy the shiny app? I set
options(repos = BiocManager::repositories())
in the console before deploying, triedoptions(rsconnect.max.bundle.size=3145728000)
andoptions(timeout=1500)
, but when I try to deploy I getError building image: Error fetching MafH5.gnomAD.v3.1.2.GRCh38 (3.15.0) source. Error downloading package source: HTTP 599: Connection closed
MRE below. This will deploy if I comment out
library("MafH5.gnomAD.v3.1.2.GRCh38")
hi, I'm not sure if I understand the problem, I can run your MRE on my computer running the latest release of R and Bioconductor without any error. I don't get why, if the package is already installed, your get an error about downloading anything, and particularly the package source. The GenomicScores has a shiny app that you can start, calling the function
igscores()
, and which in principle installs and loads packages like this without any problem, although you often need to increase the timeout to more than the default 300 seconds, as you did.Yes, I get the shiny app to run locally without any errors. It's only when I try to deploy to shinapps.io (
rsconnect::deployApp()
) that I get the error about downloading package source. I don't understand it enough to explain why deployment requires downloading the source vs using the packages that you have locally. I will keep troubleshooting and comment here incase anyone runs into this in the future. Thanks againIf the problem seems to occur only during the deployment at shinyapps.io, you may have more chances getting a good hint about the solution at the RStudio Shinyappsio Community Support forum.