Question

subpopulations available in MafH5.gnomAD.v3.1.1.GRCh38

0

Entering edit mode

Robert • 0

@b14a6f0d

Last seen 21 months ago

United States

Are subpopulation MAFs available for gnomADv.3.1.1 with any package, like they are in MafDb.gnomAD.r2.1.hs37d5? I'm trying to use Genomic Scores to obtain all variants in a genomic range with MAF in any subpopulation >= cutoff.

I tried MafH5.gnomAD.v3.1.1.GRCh38, but it only seems to have one AF column.

mafdb <- MafDb.gnomAD.r2.1.hs37d5
populations(mafdb)

[1] "AF"         "AF_afr"     "AF_amr"     "AF_asj"     "AF_eas"    
 [6] "AF_fin"     "AF_nfe"     "AF_nfe_est" "AF_nfe_nwe" "AF_nfe_onf"
[11] "AF_nfe_seu" "AF_oth"     "AF_popmax" 

mafdb38 <- MafH5.gnomAD.v3.1.1.GRCh38
populations(mafdb38)

[1] "AF"


sessionInfo( )

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
 [1] MafH5.gnomAD.v3.1.1.GRCh38_3.13.0      
 [2] MafDb.gnomAD.r2.1.hs37d5_3.10.0        
 [3] GenomicScores_2.6.1                    
 [4] XML_3.99-0.9                           
 [5] BSgenome_1.62.0                        
 [6] Biostrings_2.62.0                      
 [7] XVector_0.34.0                         
 [8] liftOver_1.18.0                        
 [9] Homo.sapiens_1.3.1                     
[10] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
[11] org.Hs.eg.db_3.14.0                    
[12] GO.db_3.14.0                           
[13] OrganismDbi_1.36.0                     
[14] GenomicFeatures_1.46.5                 
[15] AnnotationDbi_1.56.2                   
[16] Biobase_2.54.0                         
[17] rtracklayer_1.54.0                     
[18] GenomicRanges_1.46.1                   
[19] GenomeInfoDb_1.30.1                    
[20] IRanges_2.28.0                         
[21] S4Vectors_0.32.3                       
[22] BiocGenerics_0.40.0                    
[23] gwascat_2.26.0                         

loaded via a namespace (and not attached):
 [1] bitops_1.0-7                  matrixStats_0.61.0           
 [3] bit64_4.0.5                   filelock_1.0.2               
 [5] progress_1.2.2                httr_1.4.2                   
 [7] tools_4.1.3                   utf8_1.2.2                   
 [9] R6_2.5.1                      HDF5Array_1.22.1             
[11] DBI_1.1.2                     rhdf5filters_1.6.0           
[13] tidyselect_1.1.2              prettyunits_1.1.1            
[15] bit_4.0.4                     curl_4.3.2                   
[17] compiler_4.1.3                graph_1.72.0                 
[19] cli_3.2.0                     xml2_1.3.3                   
[21] DelayedArray_0.20.0           readr_2.1.2                  
[23] RBGL_1.70.0                   rappdirs_0.3.3               
[25] stringr_1.4.0                 digest_0.6.29                
[27] Rsamtools_2.10.0              htmltools_0.5.2              
[29] pkgconfig_2.0.3               MatrixGenerics_1.6.0         
[31] dbplyr_2.1.1                  fastmap_1.1.0                
[33] rlang_1.0.2                   rstudioapi_0.13              
[35] RSQLite_2.2.11                shiny_1.7.1                  
[37] BiocIO_1.4.0                  generics_0.1.2               
[39] BiocParallel_1.28.3           dplyr_1.0.8                  
[41] VariantAnnotation_1.40.0      RCurl_1.98-1.6               
[43] magrittr_2.0.2                GenomeInfoDbData_1.2.7       
[45] Matrix_1.4-1                  Rhdf5lib_1.16.0              
[47] Rcpp_1.0.8.3                  fansi_1.0.2                  
[49] lifecycle_1.0.1               stringi_1.7.6                
[51] yaml_2.3.5                    SummarizedExperiment_1.24.0  
[53] zlibbioc_1.40.0               AnnotationHub_3.2.2          
[55] rhdf5_2.38.1                  BiocFileCache_2.2.1          
[57] grid_4.1.3                    blob_1.2.2                   
[59] promises_1.2.0.1              parallel_4.1.3               
[61] snpStats_1.44.0               crayon_1.5.1                 
[63] lattice_0.20-45               splines_4.1.3                
[65] hms_1.1.1                     KEGGREST_1.34.0              
[67] pillar_1.7.0                  rjson_0.2.21                 
[69] biomaRt_2.50.3                BiocVersion_3.14.0           
[71] glue_1.6.2                    BiocManager_1.30.16          
[73] httpuv_1.6.5                  png_0.1-7                    
[75] vctrs_0.3.8                   tzdb_0.2.0                   
[77] purrr_0.3.4                   assertthat_0.2.1             
[79] cachem_1.0.6                  mime_0.12                    
[81] xtable_1.8-4                  restfulr_0.0.13              
[83] later_1.3.0                   survival_3.3-1               
[85] tibble_3.1.6                  GenomicAlignments_1.30.0     
[87] memoise_2.0.1                 interactiveDisplayBase_1.32.0
[89] ellipsis_0.3.2

GenomicScores MafH5.gnomAD.v3.1.1.GRCh38 • 2.3k views

ADD COMMENT • link updated 22 months ago by Robert Castelo ★ 3.3k • written 2.1 years ago by Robert • 0

score 2 · Accepted Answer · 2022-03-30

2

Entering edit mode

Robert Castelo ★ 3.3k

@rcastelo

Last seen 4 days ago

Barcelona/Universitat Pompeu Fabra

hi,

Due to the steady increase of variants and subpopulations in the gnomAD data, we decided to put only the global AF population in the package that stores the last version of the gnomAD data, lowering its size from the 5.7Gb of the previous gnomAD version in MafDb.gnomAD.r2.1.hs37d5 to the 750Mb of the current last gnomAD version MafH5.gnomAD.v3.1.1.GRCh38 (otherwise would have surpassed the 6Gb), which jointly with the use of an HDF5 backend to access the MAF values for SNVs, it reduces substantially the memory footprint, without loosing much performance. We took this decision to see also whether users, such as you, complain :) and request for other subpopulation data.

From what you say you need, you could use the 'AF_popmax' subpopulation of MAF values, right?

We could either add the 'AF_popmax' subpopulation to the version of the package that will appear in the next release of Bioconductor by April 27th, increasing the size of the package to about 1.3Gb, or add a new MafH5 package of a similar size (750Mb) for this 'AF_popmax' subpopulation. Would any of this meet your needs?

ADD COMMENT • link 2.1 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Ah, that makes sense. AF_popmax would be be better than global AF for my use case, but still not ideal; I believe AF_popmax excludes samples of Ashkenazi, Finnish, and indeterminate ancestry (unless you are calculating popmax manually using AC for each subpopulation during creation). I'm looking for variants under primers that could cause allelic-dropout and some of our samples come from these population groups. So my use case might be too specific for broad use, but we really need to make sure we aren't biasing our designs at the detriment to these populations. Could different packages be created for each subpopulation? Not sure if it would work, but I think I could just annotate regions with each population MAF and then keep the max (basically create AF_popmax that doesn't exclude any group). Thanks for creating this and for the help!

ADD REPLY • link 2.1 years ago Robert • 0

0

Entering edit mode

While it's true that a Google Search for "gnomAD AF_popmax" leads to this hit from the GATK team where they say that AF_popmax excludes samples of Ashkenazi, Finnish, and indeterminate ancestry, the only documentation I could find about AF_popmax in the gnomAD website here for the version 3 data doesn't say anything about excluding populations, which would be weird to neglect. Could it be that in this last version 3 they do not exclude anymore those populations from AF_popmax?

Checking one of their example variants at:

https://gnomad.broadinstitute.org/variant/1-55051215-G-GA?dataset=gnomad_r3

The help text about "Popmax filtering AF" doesn't say anything about excluding populations. If the column AF_popmax would be what you're looking for, this would greatly simplify the job, because it would be just adding one more column to the data.

ADD REPLY • link 2.1 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Hm, I'm not sure if v3 popmax excluded these or no. It does look like they changed how the popmax_AF was calculated, but I don't see a description on the gnomad site that says that these populations were excluded. The UCSC table schema for v2 and v3 both say "Maximum allele frequency across populations (excluding samples of Ashkenazi, Finnish, and indeterminate ancestry)" , but I think this might be an error for v3.

an example here with high Ashkenazi MAF. where the gnomad "Popmax Filtering AF" lists Latino/Admix America instead of Ashkenazi (but I'm not sure if if this filtering AF is the same as the popmax_AF.

ADD REPLY • link 2.1 years ago Robert • 0

0

Entering edit mode

The help page describes this "Popmax Filtering AF" as "this is the highest disease-specific maximum credible population AF for which the observed AC is not compatible with pathogenicity", but in the example you give the difference between Ashkenazi and Latino/Admix is of an order of magnitude, so looks like Ashkenazi was excluded. I have opened an issue at the gnomAD browser GH repo to see if we can clear up this.

Assuming gnomAD excludes some populations, something we could also do is to generate our own "popmax" column (I'm always trying to see whether we can minimize the number of packages and storage) at the time processing the gnomAD VCF files, taking the maximum AF among all the populations (African / African American, Amish, Latino / Admixed American, Ashkenazi Jewish, East Asian, European (Finnish), Middle Eastern, European (non-Finnish), South Asian and Other) and then derive the MAF from that maximum AF. Would this fit your use case?

ADD REPLY • link 2.1 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

yes, calculating the popmax at the time of processing with all populations would be exactly what I am looking for. Thanks!

ADD REPLY • link 2.1 years ago Robert • 0

0

Entering edit mode

I've dug into this a little more. Looking at the gnomAD browser for this variant, AF for Ashkenazi Jewish (asj) = 0.02078 I downloaded the vcf. In the vcf this variant has, popmax=nfe,AF_popmax=0.0114985, AF_asj=0.0207792

ADD REPLY • link 2.1 years ago Robert • 0

1

Entering edit mode

Good, this example confirms that gnomAD excludes at least the Ashkenazi Jewish from the popmax calculation. I've already downloaded the latest version 3.1.2 and I'm starting processing it to have the package ready with this updated version and the new popmax column for the next release. I'll write here once it becomes available.

ADD REPLY • link 2.1 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

great, thank you

ADD REPLY • link 2.1 years ago Robert • 0

0

Entering edit mode

hi, the new annotation package MafH5.gnomAD.v3.1.2.GRCh38 was released yesterday with Bioconductor 3.15. This package includes MAF values for the global population, denoted as "AF" and for a new population denoted as "AF_allpopmax" and which contains the largest MAF value among all populations, see the following example on the variant examined above in this thread:

library(MafH5.gnomAD.v3.1.2.GRCh38)
mafh5 <- MafH5.gnomAD.v3.1.2.GRCh38
populations(mafh5)
[1] "AF"           "AF_allpopmax"
gscores(mafh5, GRanges("Y:2916248"), pop="AF_allpopmax")
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand | AF_allpopmax
         <Rle> <IRanges>  <Rle> |    <numeric>
  [1]     chrY   2916248      * |         0.02
  -------
  seqinfo: 25 sequences (1 circular) from hg38 genome

as you can see, the maximum value is now the one from the Ashkenazi Jewish population. Because, it includes now this additional population, the package size has doubled to about 1.5Gb. I recommend to increase the download timeout to avoid errors, by doing:

options(timeout=600)

prior to installing the package via BiocManager.

ADD REPLY • link 24 months ago Robert Castelo ★ 3.3k

0

Entering edit mode

thanks for updating this! I'll grab this one.

ADD REPLY • link 24 months ago Robert • 0

0

Entering edit mode

Hi Robert, The update works perfectly for my use case.

I've got a shiny app that works great locally, but any idea why it wouldn't behave when I try to deploy the shiny app? I set options(repos = BiocManager::repositories()) in the console before deploying, tried options(rsconnect.max.bundle.size=3145728000) and options(timeout=1500), but when I try to deploy I get

Error building image: Error fetching MafH5.gnomAD.v3.1.2.GRCh38 (3.15.0) source. Error downloading package source: HTTP 599: Connection closed

MRE below. This will deploy if I comment out library("MafH5.gnomAD.v3.1.2.GRCh38")

library(shiny)
library(BiocManager)
options(repos = BiocManager::repositories())
library(GenomeInfoDb)
library(BSgenome)
#library("MafDb.gnomAD.r2.1.hs37d5")
library("MafH5.gnomAD.v3.1.2.GRCh38")
#library(BSgenome.Hsapiens.UCSC.hg19)
#library(BSgenome.Hsapiens.UCSC.hg38)
#mafdb_hg19 <- MafDb.gnomAD.r2.1.hs37d5
#mafdb_hg38 <- MafH5.gnomAD.v3.1.2.GRCh38
#hg19 <- BSgenome.Hsapiens.UCSC.hg19
#hg38 <- BSgenome.Hsapiens.UCSC.hg38


ui <- fluidPage(
    # Application title
    titlePanel("test"),
    )
server <- function(input, output) {
}
# Run the application 
shinyApp(ui = ui, server = server)

ADD REPLY • link 22 months ago Robert • 0

0

Entering edit mode

hi, I'm not sure if I understand the problem, I can run your MRE on my computer running the latest release of R and Bioconductor without any error. I don't get why, if the package is already installed, your get an error about downloading anything, and particularly the package source. The GenomicScores has a shiny app that you can start, calling the function igscores(), and which in principle installs and loads packages like this without any problem, although you often need to increase the timeout to more than the default 300 seconds, as you did.

ADD REPLY • link 22 months ago Robert Castelo ★ 3.3k

0

Entering edit mode

Yes, I get the shiny app to run locally without any errors. It's only when I try to deploy to shinapps.io (rsconnect::deployApp()) that I get the error about downloading package source. I don't understand it enough to explain why deployment requires downloading the source vs using the packages that you have locally. I will keep troubleshooting and comment here incase anyone runs into this in the future. Thanks again

ADD REPLY • link 22 months ago Robert • 0

0

Entering edit mode

If the problem seems to occur only during the deployment at shinyapps.io, you may have more chances getting a good hint about the solution at the RStudio Shinyappsio Community Support forum.

ADD REPLY • link 22 months ago Robert Castelo ★ 3.3k