Issues with BSgenome.Mmusculus.UCSC.mm39 and generating motif matrix in Signac ("trying to load regions beyond the boundaries of non-circular sequence "chr1")
1
0
Entering edit mode
Paul • 0
@b9ff1a03
Last seen 18 hours ago
United States

Hello!

I am trying to generate a motif matrix for my ATAC-Seq data using the mm39 genome for annotations in the Signac R package. My code is below, and using this code, I consistently run into this error message: trying to load regions beyond the boundaries of non-circular sequence "chr1"


annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Mmusculus.v79)
seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- 'mm39'

RNA_integrated <- AddMotifs(
  object = RNA_integrated,
  genome = BSgenome.Mmusculus.UCSC.mm39,
  pfm = pfm
)

The strange thing is that when I revert back to the mm10 mouse genome, I can generate a motif matrix without a single hitch, so I am wondering if there is something I need to change in my code to accommodate for the mm39 genome? Any and all help would be greatly appreciated!

genomes BSgenome.Mmusculus.UCSC.mm39 • 348 views
ADD COMMENT
1
Entering edit mode
@james-w-macdonald-5106
Last seen 56 minutes ago
United States

The EnsDb.Mmusculus.v79 package is mm38, not mm39. If you want a current EnsDb, use AnnotationHub

> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2024-10-28
> z <- query(hub, c("mus musculus","ensdb"))
> z
AnnotationHub with 87 records
# snapshotDate(): 2024-10-28
# $dataprovider: Ensembl
# $species: Mus musculus, Mus musculus musculus, Mus musculus domesticus, Mus musculus castaneus
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer, rdatadateadded,
#   preparerclass, tags, rdatapath, sourceurl, sourcetype 
# retrieve records with, e.g., 'object[["AH53222"]]' 

             title                                        
  AH53222  | Ensembl 87 EnsDb for Mus Musculus            
  AH53726  | Ensembl 88 EnsDb for Mus Musculus            
  AH56691  | Ensembl 89 EnsDb for Mus Musculus            
  AH57770  | Ensembl 90 EnsDb for Mus Musculus            
  AH60788  | Ensembl 91 EnsDb for Mus Musculus            
  ...        ...                                          
  AH116906 | Ensembl 112 EnsDb for Mus musculus           
  AH116907 | Ensembl 112 EnsDb for Mus musculus musculus  
  AH116908 | Ensembl 112 EnsDb for Mus musculus domesticus
  AH116909 | Ensembl 112 EnsDb for Mus musculus           
  AH119358 | Ensembl 113 EnsDb for Mus musculus           

> subset(as(mcols(z), "data.frame")[,c("title","genome")], genome %in% paste0("GRCm", 38:39))
                                      title genome
AH53222   Ensembl 87 EnsDb for Mus Musculus GRCm38
AH53726   Ensembl 88 EnsDb for Mus Musculus GRCm38
AH56691   Ensembl 89 EnsDb for Mus Musculus GRCm38
AH57770   Ensembl 90 EnsDb for Mus Musculus GRCm38
AH60788   Ensembl 91 EnsDb for Mus Musculus GRCm38
AH60992   Ensembl 92 EnsDb for Mus Musculus GRCm38
AH64461   Ensembl 93 EnsDb for Mus Musculus GRCm38
AH64944   Ensembl 94 EnsDb for Mus musculus GRCm38
AH67971   Ensembl 95 EnsDb for Mus musculus GRCm38
AH69210   Ensembl 96 EnsDb for Mus musculus GRCm38
AH73905   Ensembl 97 EnsDb for Mus musculus GRCm38
AH75036   Ensembl 98 EnsDb for Mus musculus GRCm38
AH78811   Ensembl 99 EnsDb for Mus musculus GRCm38
AH79718  Ensembl 100 EnsDb for Mus musculus GRCm38
AH83247  Ensembl 101 EnsDb for Mus musculus GRCm38
AH89211  Ensembl 102 EnsDb for Mus musculus GRCm38
AH89457  Ensembl 103 EnsDb for Mus musculus GRCm39
AH95775  Ensembl 104 EnsDb for Mus musculus GRCm39
AH98078  Ensembl 105 EnsDb for Mus musculus GRCm39
AH100674 Ensembl 106 EnsDb for Mus musculus GRCm39
AH104895 Ensembl 107 EnsDb for Mus musculus GRCm39
AH109367 Ensembl 108 EnsDb for Mus musculus GRCm39
AH109655 Ensembl 109 EnsDb for Mus musculus GRCm39
AH113713 Ensembl 110 EnsDb for Mus musculus GRCm39
AH116340 Ensembl 111 EnsDb for Mus musculus GRCm39
AH116909 Ensembl 112 EnsDb for Mus musculus GRCm39
AH119358 Ensembl 113 EnsDb for Mus musculus GRCm39

You can then use one of those GRCm39 versions by doing ensdb <- hub[["AH119358"]], if e.g., you want the most recent one. There are other strain specific ones as well, but for brevity I am just showing the 'regular' ones.

0
Entering edit mode

Unfortunately I've updated my code with the newest GRCm39 genome, and yet I am still getting the same error.... Is there anything else I can try to fix this error other than just trying all of the GRCm39 versions one by one?

ADD REPLY
1
Entering edit mode

You will probably have to ask the Signac people. I don't know exactly what is going on under the hood for AddMotif, but the EnsDb seems OK to me.

> ensdb <- hub[["AH119358"]]
> z <- GetGRangesFromEnsDb(ensdb)
> seqlevelsStyle(z) <- "UCSC"
> library(BSgenome.Mmusculus.UCSC.mm39)
> zz <- getSeq(Mmusculus, subset(z, seqnames(z) %in% "chr1" & strand(z) != "*"))
> zz
DNAStringSet object of length 92162:
        width seq                                                                              names               
    [1]  1417 AAACATTGGAACTTACCAAATTTTACTGGAGATGAATTG...TAACAAACTAGACACGTAAATTAATCTCTGCCACTCCA ENSMUSE00000866652
    [2]   795 AAACATTGGAACTTACCAAATTTTACTGGAGATGAATTG...ATTTCTATTTTTCAAATAAAGGTGAAAATGAAGTTGTC ENSMUSE00000867897
    [3]  2194 TTAGTTAAGAGCACTGACTGCTCTTGCAAAGGACCCAGG...GCCAAACAATCCAACAGGATCCTCCACTCTTCTTTCAT ENSMUSE00000863980
    [4]  2736 GCACACTACGGTCCATCTCCAACAACCGCAGTGTTGCCA...TGAAGAGAATCACAATAATTTGGGCAGATACTTTGCAG ENSMUSE00000858910
    [5]  2487 GTTTCACAGCAGCAGCCTCCCTTGTGTCCTTGGCTTGGG...GGTGCTTTTTGTTGCATTAAAAGTGCCGGTCAAACTTT ENSMUSE00000448840
    ...   ... ...
[92158]   162 AGTTACCTGAGTGGAACTTTGATGGCTCTAGTACCTTTC...GCTATGTGAAGTTTTCAAGTATAACCGGAAACCTGCAG ENSMUST00000381748
[92159]   147 AGACCAACTTGAGGCACATCTGTAAACGGATAATGGACA...ATTTGGTTGGCCTTCCAATGGCTTCCCTGGACCCCAAG ENSMUST00000381748
[92160]   128 GCCCGTATTACTGCGGTGTGGGAGCAGACAAGGCCTACG...AGATTACGGGGACAAATGCGGAGGTTATGCCTGCCCAG ENSMUST00000381748
[92161]   200 TGGGAATTCCAGATAGGACCCTGTGAGGGGATCCGAATG...TTCAGCACCAAGGCCATGCGGGAGGAGAATGGTCTGAA ENSMUST00000381748
[92162]   319 GTGCATTGAGGAGGCCATTGACAAACTGAGCAAGAGGCA...ACGAAACAGGCGACGAACCCTTCCAATACAAGAACTAA ENSMUST00000381748

Presumably what they are doing is extracting the sequences, which appears to work (there's nothing off the end of chr1), so I don't think it's an issue with any Bioconductor packages/functions.

0
Entering edit mode

This is the traceback I got after running AddMotifs if this is informative?

23: stop(wmsg("trying to load regions beyond the boundaries ", "of non-circular sequence \"", 
        seqname, "\""))
22: loadFUN(x, seqname, ranges)
21: loadFUN(x, seqname, ranges)
20: loadSubseqsFromStrandedSequence(x@single_sequences, seqname, 
        ranges(gr), strand(gr), is_circular)
19: FUN(X[[i]], ...)
18: lapply(seq_len(length(grl)), function(i) {
        gr <- grl[[i]]
        if (length(gr) == 0L) 
            return(DNAStringSet())
        seqlevel <- grl_seqlevels[i]
        is_circular <- isCircular(x)[[seqlevel]]
        idx <- match(seqlevel, x@user_seqnames)
        if (is.na(idx)) 
            stop("invalid sequence name: ", seqlevel)
        seqname <- names(x@user_seqnames)[[idx]]
        if (is.null(snplocs(x, seqname))) {
            subject <- try(get(seqname, envir = x@.seqs_cache, inherits = FALSE), 
                silent = TRUE)
            if (is(subject, "try-error")) {
                ans <- loadSubseqsFromStrandedSequence(x@single_sequences, 
                    seqname, ranges(gr), strand(gr), is_circular)
                return(ans)
            }
            .linkToCachedObject(subject) <- .newLinkToCachedObject(seqname, 
                x@.seqs_cache, x@.link_counts)
     ...
17: lapply(seq_len(length(grl)), function(i) {
        gr <- grl[[i]]
        if (length(gr) == 0L) 
            return(DNAStringSet())
        seqlevel <- grl_seqlevels[i]
        is_circular <- isCircular(x)[[seqlevel]]
        idx <- match(seqlevel, x@user_seqnames)
        if (is.na(idx)) 
            stop("invalid sequence name: ", seqlevel)
        seqname <- names(x@user_seqnames)[[idx]]
        if (is.null(snplocs(x, seqname))) {
            subject <- try(get(seqname, envir = x@.seqs_cache, inherits = FALSE), 
                silent = TRUE)
            if (is(subject, "try-error")) {
                ans <- loadSubseqsFromStrandedSequence(x@single_sequences, 
                    seqname, ranges(gr), strand(gr), is_circular)
                return(ans)
            }
            .linkToCachedObject(subject) <- .newLinkToCachedObject(seqname, 
                x@.seqs_cache, x@.link_counts)
     ...
16: .extractFromBSgenomeSingleSequences(x, sseq_args$names, sseq_args$start, 
        sseq_args$end, sseq_args$width, sseq_args$strand)
15: .local(x, ...)
14: getSeq(genome, subject)
13: getSeq(genome, subject)
12: .local(pwms, subject, ...)
11: matchMotifs(pwms_list, subject, ...)
10: matchMotifs(pwms_list, subject, ...)
9: motifmatchr::matchMotifs(pwms = pwm, subject = features, genome = genome, 
       out = "scores", ...)
8: motifmatchr::matchMotifs(pwms = pwm, subject = features, genome = genome, 
       out = "scores", ...)
7: CreateMotifMatrix(features = object, pwm = pfm, genome = genome, 
       use.counts = FALSE)
6: AddMotifs.default(object = granges(x = object), genome = genome, 
       pfm = pfm, verbose = verbose)
5: AddMotifs(object = granges(x = object), genome = genome, pfm = pfm, 
       verbose = verbose)
4: AddMotifs.ChromatinAssay(object = object[[assay]], genome = genome, 
       pfm = pfm, verbose = verbose)
3: AddMotifs(object = object[[assay]], genome = genome, pfm = pfm, 
       verbose = verbose)
2: AddMotifs.Seurat(object = RNA_integrated, genome = BSgenome.Mmusculus.UCSC.mm39, 
       pfm = pfm)
1: AddMotifs(object = RNA_integrated, genome = BSgenome.Mmusculus.UCSC.mm39, 
       pfm = pfm)
ADD REPLY
0
Entering edit mode

It's not informative except for me to say I was right that it's getSeq that is providing the error. It appears that AddMotifs is passing data off to matchMotifs in the motifmatchr package (both not Bioconductor, so we are off topic to a certain extent), and then matchMotifs does stuff and then calls getSeq. And getSeq says 'not so fast my friend!'

So, things you can do.

1.) set options(error = recover) and when it blows up you can check to see what objects are being passed into matchMotifs and maybe figure out what the problem is. This works sometimes IMO, but sometimes it's just more confusing. 2.) use trace(matchMotifs, browser, signature = c(pwms = "PWMatrixList", subject = "GenomicRanges")) and then run AddMotifs again. This will drop into a browser if I am correct that you are dispatching on that signature. You can then step through until you get to getSeq and then inspect what is being used as the genome and the subject. Note that you can use the <<- function to dump things out to your .GlobalEnv (your workspace), so you could just step through until you get to the relevant step and then do pwms <<- pwms and subject <<- subject, then quit the debugger (using Q), and then you can inspect those two objects to figure out what's wrong.

And if what's wrong makes sense, and provides you with a way to get things to work, then good on ya! Or maybe that will just give you information that you can provide to either the Sajita or Greenleaf lab if it's an obvious bug. But at this point I think it's up to you or the authors of the code that's not working to figure out.

Login before adding your answer.

Traffic: 980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6