Add mask to BSgenome
1
0
Entering edit mode
@shraddhapai-10725
Last seen 16 days ago
Canada

Hi everyone,

I'm trying to add a new mask to a BSgenome MaskedDNAString (specifically, Bismap mappability mask to chr1 of BSgenome.UCSC.hsapiens.hg38).

When I try to add the mask to the sequence I get an error I don't know how to overcome. I've tried to look at documentation for MaskedBSGenome , BSgenome, MaskedXString but cannot figure out how to fix this.

It looks like the problem comes with trying to assign the MaskedDNAString to the BSgenome object.

I do need to have the new mask added to the BSgenome object though, because I'm then using third-party code which requires a masked BSgenome object as one of its inputs.

Would appreciate guidance, thanks! Shraddha

Browse[1]> masks(genome$chr1) <- curmsk
Error in `$<-`(`*tmp*`, chr1, value = new("MaskedDNAString", unmasked = new("DNAString",  :
  no method for assigning subsets of this S4 class
Browse[1]> class(genome)
[1] "MaskedBSgenome"
attr(,"package")
[1] "BSgenome"
Browse[1]> curmsk
MaskCollection of length 5 and width 248956422
masks:
  maskedwidth  maskedratio active    names                               desc
1    18470101 7.419010e-02   TRUE    AGAPS                      assembly gaps
2        5309 2.132502e-05   TRUE      AMB           intra-contig ambiguities
3   119060341 4.782377e-01  FALSE       RM                       RepeatMasker
4     1647959 6.619468e-03  FALSE      TRF Tandem Repeats Finder [period<=12]
5    30883473 1.240517e-01   TRUE BSMAP-UN             Bismap S100 unmappable
all masks together:
  maskedwidth maskedratio
    141695090   0.5691562
all active masks together:
  maskedwidth maskedratio
     30886323   0.1240632

Browse[1]> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] BSgenome.Hsapiens.UCSC.hg38.masked_1.4.5
 [2] BSgenome.Hsapiens.UCSC.hg38_1.4.5
 [3] BSgenome_1.72.0
 [4] BiocIO_1.14.0
 [5] Biostrings_2.72.0
 [6] XVector_0.44.0
 [7] doParallel_1.0.17
 [8] iterators_1.0.14
 [9] foreach_1.5.2
[10] ggpubr_0.6.0
[11] ggplot2_3.5.1
[12] regioneR_1.36.0
[13] rtracklayer_1.64.0
[14] GenomicRanges_1.56.0
[15] GenomeInfoDb_1.40.0
[16] IRanges_2.38.0
[17] S4Vectors_0.42.0
[18] BiocGenerics_0.50.0

loaded via a namespace (and not attached):
 [1] SummarizedExperiment_1.34.0 gtable_0.3.5
 [3] rjson_0.2.21                rstatix_0.7.2
 [5] Biobase_2.64.0              lattice_0.22-6
 [7] vctrs_0.6.5                 tools_4.4.0
 [9] bitops_1.0-7                generics_0.1.3
[11] curl_5.2.1                  tibble_3.2.1
[13] fansi_1.0.6                 pkgconfig_2.0.3
[15] Matrix_1.7-0                lifecycle_1.0.4
[17] GenomeInfoDbData_1.2.12     compiler_4.4.0
[19] Rsamtools_2.20.0            munsell_0.5.1
[21] codetools_0.2-20            carData_3.0-5
[23] RCurl_1.98-1.14             yaml_2.3.8
[25] car_3.1-2                   tidyr_1.3.1
[27] pillar_1.9.0                crayon_1.5.2
[29] BiocParallel_1.38.0         DelayedArray_0.30.1
[31] cachem_1.0.8                abind_1.4-5
[33] tidyselect_1.2.1            purrr_1.0.2
[35] dplyr_1.1.4                 restfulr_0.0.15
[37] fastmap_1.1.1               grid_4.4.0
[39] colorspace_2.1-0            cli_3.6.2
[41] SparseArray_1.4.3           magrittr_2.0.3
[43] S4Arrays_1.4.0              XML_3.99-0.16.1
[45] utf8_1.2.4                  broom_1.0.6
[47] withr_3.0.0                 backports_1.4.1
[49] scales_1.3.0                UCSC.utils_1.0.0
[51] httr_1.4.7                  matrixStats_1.3.0
[53] ggsignif_0.6.4              memoise_2.0.1
[55] rlang_1.1.3                 glue_1.7.0
[57] jsonlite_1.8.8              R6_2.5.1
[59] MatrixGenerics_1.16.0       GenomicAlignments_1.40.0
[61] zlibbioc_1.50.0
Mask BSgenome.Hsapiens.UCSC.hg38.masked • 206 views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

This operation:

masks(genome$chr1) <- curmsk

is equivalent to:

chr1 <- genome$chr1      # step 1
masks(chr1) <- curmsk    # step 2
genome$chr1 <- chr1      # step 3

You can do step 1 and step 2, but you won't be able to do step 3. That's because you are not allowed to replace the sequences in a BSgenome or MaskedBSgenome object.

You have basically 2 options to work around this:

  1. Do step 1 and step 2 only, that is, load the sequence and put your masks on it. But don't try to alter the MaskedBSgenome object (i.e. don't do step 3).

  2. Forge your own masked BSgenome package using your own masks. This process is documented in the "Advanced BSgenomeForge usage" vignette from the BSgenomeForge package.

Hope this helps,

H.

ADD COMMENT
0
Entering edit mode

Hi Hervé, Thank you for that. OK so I followed the instructions in the BSgenomeForge vignette but don't think my new custom mask can be included.

I got the seed file example from the BSgenome.Hsapiens.UCSC.hg38 genome and tried to add my mask. I get an error:

> pathToSeed <- "hg38.masked.seed.txt"
> forgeMaskedBSgenomeDataPkg("hg38.masked.seed.txt")
Error in makeS4FromList("MaskedBSgenomeDataPkgSeed", x) :
  some names in 'x' are not valid MaskedBSgenomeDataPkgSeed slots (BISMAPfiles_name)

Here is what my seed file looks like:

Package: BSgenome.Hsapiens.UCSC.hg38.BisMap.masked
Title: Full masked genomic sequences for Homo sapiens (UCSC version hg38)
Description: Full genomic sequences for Homo sapiens as provided by UCSC (genome hg38, based on assembly GRCh38.p14 since 2023/01/31). The sequences are the same as in BSgenome.Hsapiens.UCSC.hg38, except that each of them has the 5 following masks on top: (1) the mask of assembly gaps (AGAPS mask), (2) the mask of intra-contig ambiguities (AMB mask), (3) the mask of repeats from RepeatMasker (RM mask),  (4) the mask of repeats from Tandem Repeats Finder (TRF mask), and (5) Masks of Bismap unmappable regions (union of S100+ and S100-) (BISMAP). AGAPS, AMB, and BISMAP masks are "active" by default. The sequences are stored in MaskedDNAString objects.
Version: 1.4.5
RefPkgname: BSgenome.Hsapiens.UCSC.hg38
source_url: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p14/
organism_biocview: Homo_sapiens
nmask_per_seq: 5
SrcDataFiles: AGAPS masks: gap.txt.gz, downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/ on Feb. 7, 2023
    RM masks: hg38.p14.fa.out.gz, downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p14/ on Feb. 7, 2023
    TRF masks: hg38.p14.trf.bed.gz, downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p14/ on Feb. 7, 2023
    BISMAP masks: Created by Shraddha Pai on June 4 2024 using makBISMAPmask.R
PkgExamples: mbsg$chr1  # a MaskedDNAString object!
    ## To get rid of the masks altogether:
    unmasked(mbsg$chr1)  # same as BSgenome.Hsapiens.UCSC.hg38$chr1
masks_srcdir: /home/rstudio/isilon/src/ucsc-goldenpath/hg38/masks
AGAPSfiles_name: gap.txt.gz
RMfiles_name: rmsk.txt.gz
TRFfiles_name: simpleRepeat.txt.gz
BISMAPfiles_name: BISMAP_mask.txt.gz

Since I'm writing, here is a head command on my new mask .txt.gz file:

$ zcat BISMAP_mask.txt.gz | head
chr1    0   10008   chr1    0   .
chr1    10912   13226   chr1    0   .
chr1    13424   14598   chr1    0   .
chr1    14702   15143   chr1    0   .
chr1    15302   15719   chr1    0   .
chr1    16194   17392   chr1    0   .
chr1    17589   18748   chr1    0   .
chr1    18946   28488   chr1    0   .
chr1    28683   29296   chr1    0   .
chr1    29410   30792   chr1    0   .
ADD REPLY
0
Entering edit mode

Oops, I forgot that masked BSgenome packages don't support masks other than the 4 predefined types of masks (AGAPS, AMB, RM, TRF). Sorry for that!

So in your case option 1 would seem more appropriate, that is, simply use BSgenome.Hsapiens.UCSC.hg38 or BSgenome.Hsapiens.UCSC.hg38.masked and add the masks on the sequences right after loading them from disk in memory (with $ or [[). The masks won't persist i.e. you'll need to add them again each time you load a given sequence but this is a cheap operation so that should not be a problem.

ADD REPLY
0
Entering edit mode

OK, I'll take that route and change the third-party code. Thanks Hervé, appreciate it.

ADD REPLY

Login before adding your answer.

Traffic: 429 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6