Match all DNAString Occurences in a BSgenome
2
0
Entering edit mode
ewjwallace ▴ 10
@ewjwallace-11328
Last seen 7.7 years ago

I needed to find the (single) occurrences of primers in a genome in a GenomicRanges or data frame format. This is similar to the example given in BSgenome's GenomeSearching vignette, section on "Finding an arbitrary nucleotide pattern in an entire genome", but I wanted a flexible function with a nice output format, and I'm working with a genome I can load into memory. I just wrote this function below which seems to work - any suggestions? If it's helpful to others could it be incorporated into the package?

Edward

 

# matchInGenome
# Use Bioconductor to find all matches of a DNA string in a genome
# Edward Wallace, 20 August 2016

library(Biostrings)
library(GenomicRanges)
library(BSgenome.Scerevisiae.UCSC.sacCer3)

matchInGenome <- function(pattern,genome,...) {
    # find all matches of pattern in genome on both strands
    # output as GRanges
    
    # coerce pattern to DNAString
    pattern <- DNAString(pattern)
    
    if(class(genome) != "BSgenome") stop("genome must be a BSgenome")
    
    # forward matches to genome 
    fwdmatch <- GenomicRangesList(lapply(seqnames(genome), function(chr) {
        irmatch <- matchPattern(pattern,genome[[chr]],...)@ranges
        if(length(irmatch) > 0) {
            return(GRanges(seqnames=chr,ranges=irmatch,strand="+"))
        } else {
            return(GRanges())
        }
    }))
    
    # reverse complement matches to genome
    rcpattern <- reverseComplement(pattern)
    revmatch <- GenomicRangesList(lapply(seqnames(genome), function(chr) {
        irmatch <- matchPattern(rcpattern,genome[[chr]],...)@ranges
        if(length(irmatch) > 0) {
            return(GRanges(seqnames=chr,ranges=irmatch,strand="-"))
        } else {
            return(GRanges())
        }
    }))
    
    # collapse fwd/rev GRangesLists into single GRanges
    allmatches <- unlist(c(fwdmatch,revmatch))
    
    # set seqinfo as in genome
    seqlevels(allmatches) <- seqlevels(genome)
    seqinfo(allmatches) <- seqinfo(genome)
    
    return(allmatches)
}

# test with single match (from YAL003W gene on +strand/chrI)
testpattern1 <- "GACAAGTCATACATTGAAGG"
matchInGenome(testpattern1,BSgenome.Scerevisiae.UCSC.sacCer3)

# test with single match NOT from chrI (from YFL039C gene on -strand/chrVI)
testpattern2 <- "ATTTACTGAATTAACAATGGATTCTG"
matchInGenome(testpattern2,BSgenome.Scerevisiae.UCSC.sacCer3)


# test with many matches (Puf3 consensus binding)
testpatternmany <- "TGTAAATA"
matchInGenome(testpatternmany,BSgenome.Scerevisiae.UCSC.sacCer3)
bsgenome genomicranges function creation • 1.1k views
ADD COMMENT
3
Entering edit mode
@michael-lawrence-3846
Last seen 2.4 years ago
United States

There is a vmatchPattern() method for BSgenome objects that does something very similar. See help("BSgenome-utils").

ADD COMMENT
0
Entering edit mode
ewjwallace ▴ 10
@ewjwallace-11328
Last seen 7.7 years ago

Thank you. I had not grasped from the documentation that vmatchPattern could take a BSgenome as 2nd argument. Mostly because vmatchPattern is not mentioned in BSgenome's GenomeSearching.pdf vignette.

Thanks for taking the time to respond.

Is there any way I could help update the vignette to account for the functionality? 

ADD COMMENT

Login before adding your answer.

Traffic: 812 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6