Unable to load mm10 genome for use with GenomicDistributions package
rashford • 0

Hello! I am trying to use one of the built-in reference genomes, mm10, from the GenomicDistributionsData package to make plots with the GenomicDistributions package for my own analysis. When I tested the example in the vignette (https://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicDistributions/inst/doc/intro.html#loading-genomic-range-data), everything worked fine with the hg19 genome. The problem came when I changed the genome to "mm10", like so:

x = calcChromBinsRef(query, "mm10")

# instead of x = calcChromBinsRef(query, "hg19") which works fine and can plot the correct image

Following the GenomicDistributionsData Bioconductor documentation and example HTML, I ran the following lines to see whether mm10 was present in the list of genomes/genome parts, but the result was "character(0)":

> datasetListIQR = utils::data(package="GenomicDistributionsData")
> datasetList = datasetListIQR$results[,"Item"]
> datasetList
character(0)

Then I just did things one by one and got the return that there were "no data sets found" when I ran the first line:

utils::data(package="GenomicDistributionsData")
no data sets found

I'm not sure if I'm doing something incorrectly or if the mm10 genome isn't loaded yet in the GenomicDistributionsData package?

Any help is appreciated!!

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] ExperimentHub_2.0.0            AnnotationHub_3.0.2            BiocFileCache_2.0.0           
 [4] dbplyr_2.2.0                   GenomicDistributionsData_1.0.0 ggplot2_3.3.6                 
 [7] biomaRt_2.48.3                 dplyr_1.0.9                    GenomicDistributions_1.0.0    
[10] GenomicRanges_1.44.0           GenomeInfoDb_1.28.4            IRanges_2.26.0                
[13] S4Vectors_0.30.2               BiocGenerics_0.38.0           

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.24.0           bitops_1.0-7                  matrixStats_0.62.0           
 [4] bit64_4.0.5                   filelock_1.0.2                progress_1.2.2               
 [7] httr_1.4.3                    tools_4.1.0                   utf8_1.2.2                   
[10] R6_2.5.1                      lazyeval_0.2.2                DBI_1.1.3                    
[13] colorspace_2.0-3              withr_2.5.0                   tidyselect_1.1.2             
[16] prettyunits_1.1.1             bit_4.0.4                     curl_4.3.2                   
[19] compiler_4.1.0                cli_3.3.0                     Biobase_2.52.0               
[22] xml2_1.3.3                    DelayedArray_0.18.0           rtracklayer_1.52.1           
[25] labeling_0.4.2                scales_1.2.0                  rappdirs_0.3.3               
[28] stringr_1.4.0                 digest_0.6.29                 Rsamtools_2.8.0              
[31] rmarkdown_2.14                XVector_0.32.0                pkgconfig_2.0.3              
[34] htmltools_0.5.2               MatrixGenerics_1.4.3          ensembldb_2.16.4             
[37] BSgenome_1.60.0               fastmap_1.1.0                 rlang_1.0.2                  
[40] rstudioapi_0.13               RSQLite_2.2.14                shiny_1.7.1                  
[43] BiocIO_1.2.0                  farver_2.1.0                  generics_0.1.2               
[46] BiocParallel_1.26.2           RCurl_1.98-1.7                magrittr_2.0.3               
[49] GenomeInfoDbData_1.2.6        Matrix_1.4-1                  Rcpp_1.0.8.3                 
[52] munsell_0.5.0                 fansi_1.0.3                   lifecycle_1.0.1              
[55] stringi_1.7.6                 yaml_2.3.5                    SummarizedExperiment_1.22.0  
[58] zlibbioc_1.38.0               plyr_1.8.7                    grid_4.1.0                   
[61] blob_1.2.3                    promises_1.2.0.1              crayon_1.5.1                 
[64] lattice_0.20-45               Biostrings_2.60.2             GenomicFeatures_1.44.2       
[67] hms_1.1.1                     KEGGREST_1.32.0               knitr_1.39                   
[70] pillar_1.7.0                  rjson_0.2.21                  reshape2_1.4.4               
[73] BiocVersion_3.13.1            XML_3.99-0.10                 glue_1.6.2                   
[76] evaluate_0.15                 data.table_1.14.2             BiocManager_1.30.18          
[79] httpuv_1.6.5                  vctrs_0.4.1                   png_0.1-7                    
[82] gtable_0.3.0                  purrr_0.3.4                   assertthat_0.2.1             
[85] cachem_1.0.6                  xfun_0.31                     mime_0.12                    
[88] xtable_1.8-6                  AnnotationFilter_1.16.0       restfulr_0.0.15              
[91] later_1.3.0                   tibble_3.1.7                  GenomicAlignments_1.28.0     
[94] AnnotationDbi_1.54.1          memoise_2.0.1                 interactiveDisplayBase_1.30.0
[97] ellipsis_0.3.2

You need to give more information about what happened when you ran calcChromBinsRef, rather than what you did to try to diagnose it (for example, there are no datasets in the GenomicDistributionsData package; it just helps find data on the ExperimentHub). Anyway, calcChromBinsRef will call getChromSizes, which needs a BSgenome package for mm10 to work. R probably told you something after you tried to run calcChromBinsRef, but you don't say what that was. Without that information we can only guess. You want to provide enough information so we don't have to do that.
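
To see what the package does offer, something like the following sketch may help. It assumes the package exports one accessor function per resource and that the ExperimentHub records can be found by searching for the package name; `listAccessors` and `listHubRecords` are hypothetical helper names, not part of any package.

```r
# List the exported objects of a package whose names match a pattern (base R only).
listAccessors <- function(pkg, pattern = "") {
    exports <- getNamespaceExports(pkg)
    sort(grep(pattern, exports, value = TRUE))
}

# Search ExperimentHub for records associated with a package.
# Needs the ExperimentHub and AnnotationHub packages and a network connection.
listHubRecords <- function(pkg) {
    eh <- ExperimentHub::ExperimentHub()
    AnnotationHub::query(eh, pkg)
}
```

For example, listAccessors("GenomicDistributionsData", "mm10") should show any mm10 accessors the installed version exports, and listHubRecords("GenomicDistributionsData") should show what is hosted on the hub.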


Thank you. Apologies for not providing more information; I'm pretty new at this! Below is what happens when I run calcChromBinsRef.

> ## load necessary packages
> library("GenomeInfoDb")
> library("GenomicDistributions")
> library("GenomicDistributionsData")
> library("ExperimentHub")
> library("BSgenome")
> library("GenomicRanges")
> 
> 
> ## from Nate Sheffield's example (https://www.bioconductor.org/packages/devel/bioc/vignettes/GenomicDistributions/inst/doc/intro.html#custom-features-partitions)
> queryFile = system.file("extdata", "vistaEnhancers.bed.gz", package="GenomicDistributions")
> query = rtracklayer::import(queryFile)
> 
> # calculate the distribution:
> x = calcChromBinsRef(query, "mm10")
Error in getReferenceData(refAssembly, tagline = "chromSizes_") : 
  chromSizes_mm10 not available in GenomicDistributions and GenomicDistributionsData packages

session info:

sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] BSgenome_1.60.0                rtracklayer_1.52.1             Biostrings_2.60.2             
 [4] XVector_0.32.0                 ExperimentHub_2.0.0            AnnotationHub_3.0.2           
 [7] BiocFileCache_2.0.0            dbplyr_2.2.0                   GenomicDistributionsData_1.0.0
[10] GenomicDistributions_1.0.0     GenomicRanges_1.44.0           GenomeInfoDb_1.28.4           
[13] IRanges_2.26.0                 S4Vectors_0.30.2               BiocGenerics_0.38.0           

loaded via a namespace (and not attached):
 [1] ProtGenerics_1.24.0           bitops_1.0-7                  matrixStats_0.62.0           
 [4] bit64_4.0.5                   progress_1.2.2                filelock_1.0.2               
 [7] httr_1.4.3                    tools_4.1.0                   utf8_1.2.2                   
[10] R6_2.5.1                      lazyeval_0.2.2                DBI_1.1.3                    
[13] colorspace_2.0-3              prettyunits_1.1.1             tidyselect_1.1.2             
[16] bit_4.0.4                     curl_4.3.2                    compiler_4.1.0               
[19] cli_3.3.0                     Biobase_2.52.0                xml2_1.3.3                   
[22] DelayedArray_0.18.0           scales_1.2.0                  rappdirs_0.3.3               
[25] stringr_1.4.0                 digest_0.6.29                 Rsamtools_2.8.0              
[28] rmarkdown_2.14                pkgconfig_2.0.3               htmltools_0.5.2              
[31] MatrixGenerics_1.4.3          ensembldb_2.16.4              fastmap_1.1.0                
[34] rlang_1.0.2                   rstudioapi_0.13               RSQLite_2.2.14               
[37] shiny_1.7.1                   farver_2.1.0                  BiocIO_1.2.0                 
[40] generics_0.1.2                BiocParallel_1.26.2           dplyr_1.0.9                  
[43] RCurl_1.98-1.7                magrittr_2.0.3                GenomeInfoDbData_1.2.6       
[46] Matrix_1.4-1                  Rcpp_1.0.8.3                  munsell_0.5.0                
[49] fansi_1.0.3                   lifecycle_1.0.1               stringi_1.7.6                
[52] yaml_2.3.5                    SummarizedExperiment_1.22.0   zlibbioc_1.38.0              
[55] plyr_1.8.7                    grid_4.1.0                    blob_1.2.3                   
[58] promises_1.2.0.1              crayon_1.5.1                  lattice_0.20-45              
[61] GenomicFeatures_1.44.2        hms_1.1.1                     KEGGREST_1.32.0              
[64] knitr_1.39                    pillar_1.7.0                  rjson_0.2.21                 
[67] biomaRt_2.48.3                reshape2_1.4.4                XML_3.99-0.10                
[70] glue_1.6.2                    BiocVersion_3.13.1            evaluate_0.15                
[73] data.table_1.14.2             BiocManager_1.30.18           png_0.1-7                    
[76] vctrs_0.4.1                   httpuv_1.6.5                  gtable_0.3.0                 
[79] purrr_0.3.4                   assertthat_0.2.1              cachem_1.0.6                 
[82] ggplot2_3.3.6                 xfun_0.31                     mime_0.12                    
[85] xtable_1.8-6                  AnnotationFilter_1.16.0       restfulr_0.0.15              
[88] later_1.3.0                   tibble_3.1.7                  AnnotationDbi_1.54.1         
[91] GenomicAlignments_1.28.0      memoise_2.0.1                 ellipsis_0.3.2               
[94] interactiveDisplayBase_1.30.0

Weird. It looks like the maintainer has been changing from providing the data as part of the package (which is a bad idea) to hosting the data on the ExperimentHub (good idea!). But the code hasn't been updated to actually use the ExperimentHub. As you are apparently aware, the code currently looks for data in the GenomicDistributions package, and if not there, it looks in GenomicDistributionsData. But what is in GenomicDistributionsData are functions to get the data from the ExperimentHub:

> chromSizes_mm10
function (metadata = FALSE) 
{
    eh <- .get_ExperimentHub()
    if (metadata) {
        eh[ehid]
    }
    else eh[[ehid]]
}
<bytecode: 0x0000027c6f18b510>
<environment: 0x0000027c6f1ab7a8>

## That just gets data from ExperimentHub. Let's run it

> z <- chromSizes_mm10()
see ?GenomicDistributionsData and browseVignettes('GenomicDistributionsData') for documentation
downloading 1 resources
retrieving 1 resource
  |======================================================================| 100%

loading from cache
> z
                chr1                 chr2                 chr3 
           195471971            182113224            160039680 
                chr4                 chr5                 chr6 
           156508116            151834684            149736546 
                chr7                 chr8                 chr9 
           145441459            129401213            124595110 
               chr10                chr11                chr12 
           130694993            122082543            120129022 
               chr13                chr14                chr15 
           120421639            124902244            104043685 
               chr16                chr17                chr18 
            98207768             94987271             90702639 
               chr19                 chrX                 chrY 
            61431566            171031299             91744698 
                chrM chr1_GL456210_random chr1_GL456211_random 
               16299               169725               241735 
chr1_GL456212_random chr1_GL456213_random chr1_GL456221_random 
              153618                39340               206961 
chr4_GL456216_random chr4_GL456350_random chr4_JH584292_random 
               66673               227966                14945 
chr4_JH584293_random chr4_JH584294_random chr4_JH584295_random 
              207968               191905                 1976 
chr5_GL456354_random chr5_JH584296_random chr5_JH584297_random 
              195993               199368               205776 
chr5_JH584298_random chr5_JH584299_random chr7_GL456219_random 
              184189               953012               175968 
chrX_GL456233_random chrY_JH584300_random chrY_JH584301_random 
              336933               182347               259875 
chrY_JH584302_random chrY_JH584303_random       chrUn_GL456239 
              155838               158099                40056 
      chrUn_GL456359       chrUn_GL456360       chrUn_GL456366 
               22974                31704                47073 
      chrUn_GL456367       chrUn_GL456368       chrUn_GL456370 
               42057                20208                26764 
      chrUn_GL456372       chrUn_GL456378       chrUn_GL456379 
               28664                31602                72385 
      chrUn_GL456381       chrUn_GL456382       chrUn_GL456383 
               25871                23158                38659 
      chrUn_GL456385       chrUn_GL456387       chrUn_GL456389 
               35240                24685                28772 
      chrUn_GL456390       chrUn_GL456392       chrUn_GL456393 
               24668                23629                55711 
      chrUn_GL456394       chrUn_GL456396       chrUn_JH584304 
               24323                21240               114452

This is all helpfully explained in the vignette. But so far as I can tell you can't pass that information in. I would contact the maintainer directly and complain. In the intervening period, here's some hack action.

## first get the chromosome sizes
> z <- chromSizes_mm10()
see ?GenomicDistributionsData and browseVignettes('GenomicDistributionsData') for documentation
loading from cache

## now debug getReferenceData so we can intervene
> debug(GenomicDistributions:::getReferenceData)

## Now run calcChromBinsRef

> x = calcChromBinsRef(query, "mm10")
debugging in: getReferenceData(refAssembly, tagline = "chromSizes_")

## We have entered the debugger, and can keep hitting Enter until we get to the step I point out below

debug: {
    datasetId = paste0(tagline, refAssembly)
    dataset = .getDataFromPkg(id = datasetId, "GenomicDistributions")
    if (!is.null(dataset)) 
        return(dataset)
    if (!"GenomicDistributionsData" %in% utils::installed.packages()) 
        stop(paste(datasetId, "not available in GenomicDistributions package", 
            "and GenomicDistributionsData package is not installed"))
    dataset = .getDataFromPkg(id = datasetId, "GenomicDistributionsData")
    if (!is.null(dataset)) 
        return(dataset)
    stop(paste(datasetId, "not available in GenomicDistributions and", 
        "GenomicDistributionsData packages"))
}
## Each empty Browse[2]> prompt below is just the debugger pausing; I hit Enter to continue to the next step
Browse[2]> 
debug: datasetId = paste0(tagline, refAssembly)
Browse[2]> 
debug: dataset = .getDataFromPkg(id = datasetId, "GenomicDistributions")
Browse[2]> 
debug: if (!is.null(dataset)) return(dataset)

## At this point the dataset is NULL because it's not actually available in the package.
## We want the function to return, but with the chromosome length data,
## so we just replace it with the 'z' object from above. It's now not NULL and will be returned.

Browse[2]> dataset <- z

## Now just hit 'c' to finish up.

Browse[2]> c
exiting from: getReferenceData(refAssembly, tagline = "chromSizes_")

## it worked, and now we have our 'x' object

> x
      chr     start       end regionID withinGroupID N
  1: chr1   2727517   3636688        4             4 1
  2: chr1   7273377   8182548        9             9 1
  3: chr1  10000892  10910063       12            12 5
  4: chr1  10910064  11819235       13            13 2
  5: chr1  26365988  27275159       30            30 1
 ---                                                  
483: chrX 136461143 137370883     2856           151 1
484: chrX 138280625 139190365     2858           153 1
485: chrX 139190366 140100106     2859           154 4
486: chrX 147378034 148287774     2868           163 1
487: chrX 150107257 151016997     2871           166 1
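
An alternative to stepping through the debugger as above may be to skip calcChromBinsRef and do its two steps by hand. This is only a sketch: it assumes GenomicDistributions exports getGenomeBins(chromSizes, binCount) and calcChromBins(query, bins), which I believe is what calcChromBinsRef calls internally, and that chromSizes_mm10() returns a named numeric vector as shown above. `calcChromBinsFromSizes` is a hypothetical helper name, not part of either package.

```r
# Hypothetical helper: bin the genome from a named vector of chromosome sizes,
# then count query regions per bin. Assumes GenomicDistributions is installed.
# binCount = 3000 is an assumption about calcChromBinsRef's default.
calcChromBinsFromSizes <- function(query, chromSizes, binCount = 3000) {
    stopifnot(is.numeric(chromSizes), !is.null(names(chromSizes)))
    # split each chromosome into bins (assumed signature: chromSizes, binCount)
    bins <- GenomicDistributions::getGenomeBins(chromSizes, binCount = binCount)
    # count how many query regions fall into each bin
    GenomicDistributions::calcChromBins(query, bins)
}

# Usage, with 'query' and chromSizes_mm10() as in the session above:
# x <- calcChromBinsFromSizes(query, chromSizes_mm10())
```

If those assumptions hold, this avoids touching getReferenceData at all, so nothing needs to be patched interactively.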

Oh, also. You need to upgrade to the current version of R/Bioconductor. You are two releases behind, and technically we don't support old versions.
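
A quick way to check which release you are on is sketched below. It assumes BiocManager is the installer in use; `biocRelease` is a hypothetical helper name.

```r
# Report the Bioconductor release in use, if BiocManager is installed.
biocRelease <- function() {
    if (!requireNamespace("BiocManager", quietly = TRUE)) {
        message("BiocManager is not installed; install.packages('BiocManager') adds it")
        return(invisible(NULL))
    }
    BiocManager::version()
}

biocRelease()
# BiocManager::valid()                     # lists out-of-date packages (needs network)
# BiocManager::install(version = "3.15")   # "3.15" is inferred from 3.13 being two releases
#                                          # behind; each release needs a matching R version
```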


Thank you so much!! Using the debugging approach is working great for me in the meantime while I wait on the author's response to upload the mm10 genome into ExperimentHub.

Also, thank you for letting me know to update R/Bioconductor. Did that before trying all of this.
