motif searching with variable length gaps
@houseman-heather-2469
[The original message was sent as an attachment and scrubbed by the list archive. Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071101/c220565e/attachment.pl]
@herve-pages-1542
Seattle, WA, United States
Hi Heather,

Can you please give some examples of your motifs? It would also be useful to see the code that you use with cosmo.

Even though the matchPattern() function in Biostrings doesn't let you control the number of gaps, there might be workarounds; it all depends on what your motifs really look like. And we need use cases anyway so we know where to put our efforts. Thanks!

H.

Houseman, Heather wrote:
> Dear Bioconductor mailing list:
>
> I've been using cosmo to look for motifs. I'd like to search for motifs that have a variable length of gaps in the middle. If I specify a range of motif widths with the cosmo function, it uses the width with the lowest BIC value and searches for motifs of only that width. My dilemma is that the motifs I'm looking for are of variable width.
>
> Thanks in advance for any help!
>
> Heather
Herve,

My ultimate goal is to find motifs in different sequences that are similar to the ones below.

    TACGTGCTGTCTCACACAG
    GACGTGACTCGGACCACAT
    TACGTGGGT--TTCCACAG
    TACGTGAC----CACACAC
    TACGTGC-------CACAG
    CACGTGC-------CACAC
    GGCGTGAGC-----CACCG
    GGCGTGGGAGCG--CACAG
    TACGTG------CACACAG

To start off, I'm inserting the motifs above into random sequences to see if I can get cosmo to return those motifs. Once I get that procedure to work, I'd like to apply it to "real" sequences and hopefully return motifs that look similar to the ones above.

Here's the cosmo code I'm using:

    res = cosmo(seqs = seqs, minW = 12, maxW = 20, models = "OOPS")

Is this more along the lines of multiple sequence alignment and not something that I can use cosmo for?

Thanks!

Heather
Hi Heather,

cosmo could find motifs like these as long as the total length of the motif (2 outer parts + gap) is the same in each motif. If that's true, your code should work.

Oliver
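The condition Oliver describes can be checked directly on the aligned motifs from Heather's post. The following is only a minimal base R sketch: the variable names are illustrative, and it does not call cosmo. It compares the width of each motif with the gap characters counted against the width with the gaps removed.

```r
# Aligned motifs copied from the post above; "-" marks a gap position.
motifs <- c(
  "TACGTGCTGTCTCACACAG",
  "GACGTGACTCGGACCACAT",
  "TACGTGGGT--TTCCACAG",
  "TACGTGAC----CACACAC",
  "TACGTGC-------CACAG",
  "CACGTGC-------CACAC",
  "GGCGTGAGC-----CACCG",
  "GGCGTGGGAGCG--CACAG",
  "TACGTG------CACACAG"
)

# Width of each motif with the gap characters counted
aligned_width <- nchar(motifs)

# Width of each motif once the gap characters are removed
ungapped_width <- nchar(gsub("-", "", motifs, fixed = TRUE))

# TRUE when every aligned motif (outer parts + gap) has the same total width
same_total_width <- length(unique(aligned_width)) == 1

print(data.frame(motif = motifs, aligned_width, ungapped_width))
print(same_total_width)
```

If same_total_width is TRUE, the motifs satisfy the fixed-total-width condition once every gap position is occupied by some letter; the ungapped widths show how much the motifs differ when the gaps are simply deleted.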
Hi Heather,

Would this function work for you?

    matchLRPatterns <- function(Lpattern, Rpattern, maxngaps, subject,
                                Lmismatch=0, Rmismatch=0)
    {
        # Find all matches of the left and right parts independently
        Lmatches <- matchPattern(Lpattern, subject, mismatch=Lmismatch)
        Rmatches <- matchPattern(Rpattern, subject, mismatch=Rmismatch)
        ans_start <- ans_end <- integer(0)
        for (i in seq_len(length(Lmatches))) {
            # Number of letters between the i-th left match and each right match
            ngaps <- start(Rmatches) - end(Lmatches)[i] - 1
            # Keep the right matches that start within maxngaps of the left match
            jj <- which(0 <= ngaps & ngaps <= maxngaps)
            ans_start <- c(ans_start, rep(start(Lmatches)[i], length(jj)))
            ans_end <- c(ans_end, end(Rmatches)[jj])
        }
        views(subject, ans_start, ans_end)
    }

Arguments:

  o Lpattern, Rpattern: the left and right parts of your motif (for example TACGTGGGT and TTCCACAG for the 3rd motif you gave us: TACGTGGGT--TTCCACAG)

  o maxngaps: the max number of gaps in the middle, i.e. the distance between the left and right parts of your motif

  o subject: a BString (or derived) object containing the subject string (in your case it needs to be a DNAString object)

  o Lmismatch, Rmismatch: additionally you can choose to allow a given number of mismatches for the left and right parts of the motif

So for example if I want to search motif TACGTGGGT--TTCCACAG in chromosome 1 of the mouse:

    > library(BSgenome.Mmusculus.UCSC.mm9)
    > chr1 <- Mmusculus$chr1
    > matchLRPatterns("TACGTGGGT", "TTCCACAG", maxngaps=2, chr1)
      Views on a 197195432-letter DNAString subject
    Subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAGAATTTGGTATTAAACTTAAAACTGGAATTC
    Views: NONE

I don't find any match. But if I allow the number of gaps to be <= 150 instead of 2:

    > matchLRPatterns("TACGTGGGT", "TTCCACAG", maxngaps=150, chr1)
      Views on a 197195432-letter DNAString subject
    Subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...AAGAATTTGGTATTAAACTTAAAACTGGAATTC
    Views:
            start       end width
    [1] 193252084 193252245   162 [TACGTGGGTTCCTGACGATGGT...ATGTGAACTCTTTCTTCCACAG]

then I find one match.

Note that matchLRPatterns() will return all matches, even overlapping matches:

    > library(Biostrings)
    > subject <- DNAString("AAATTAACCCTT")
    > matchLRPatterns("AA", "TT", 0, subject)
      Views on a 12-letter DNAString subject
    Subject: AAATTAACCCTT
    Views:
        start end width
    [1]     2   5     4 [AATT]
    > matchLRPatterns("AA", "TT", 1, subject)
      Views on a 12-letter DNAString subject
    Subject: AAATTAACCCTT
    Views:
        start end width
    [1]     1   5     5 [AAATT]
    [2]     2   5     4 [AATT]
    > matchLRPatterns("AA", "TT", 3, subject)
      Views on a 12-letter DNAString subject
    Subject: AAATTAACCCTT
    Views:
        start end width
    [1]     1   5     5 [AAATT]
    [2]     2   5     4 [AATT]
    [3]     6  12     7 [AACCCTT]
    > matchLRPatterns("AA", "TT", 7, subject)
      Views on a 12-letter DNAString subject
    Subject: AAATTAACCCTT
    Views:
        start end width
    [1]     1   5     5 [AAATT]
    [2]     2   5     4 [AATT]
    [3]     2  12    11 [AATTAACCCTT]
    [4]     6  12     7 [AACCCTT]

Also note that matches will always be ordered from left to right.

Please let me know if this is not what you want.

Cheers,
H.
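For trying out the pairing logic of matchLRPatterns() on short strings without Biostrings installed, the same idea can be sketched in base R. This is not the Biostrings implementation: the helper names find_all() and match_lr_base() are hypothetical, only exact matching is supported (there is no equivalent of Lmismatch or Rmismatch), the subject is a plain character string, and the sliding-window search is far too slow for a whole chromosome.

```r
# Start positions of all (possibly overlapping) exact occurrences of
# `pattern` in the single string `subject`.
find_all <- function(pattern, subject) {
  w <- nchar(pattern)
  n <- nchar(subject)
  if (w > n) return(integer(0))
  starts <- seq_len(n - w + 1)
  starts[substring(subject, starts, starts + w - 1) == pattern]
}

# Pair every left match with every right match that begins between 0 and
# `maxngaps` letters after the left match ends.
match_lr_base <- function(Lpattern, Rpattern, maxngaps, subject) {
  Lstart <- find_all(Lpattern, subject)
  Rstart <- find_all(Rpattern, subject)
  Lend <- Lstart + nchar(Lpattern) - 1
  Rend <- Rstart + nchar(Rpattern) - 1
  out <- data.frame(start = integer(0), end = integer(0))
  for (i in seq_along(Lstart)) {
    ngaps <- Rstart - Lend[i] - 1
    jj <- which(ngaps >= 0 & ngaps <= maxngaps)
    out <- rbind(out, data.frame(start = rep(Lstart[i], length(jj)),
                                 end = Rend[jj]))
  }
  out$width <- out$end - out$start + 1
  out
}

# Same toy subject as in the post above
res0 <- match_lr_base("AA", "TT", 0, "AAATTAACCCTT")
res7 <- match_lr_base("AA", "TT", 7, "AAATTAACCCTT")
print(res0)
print(res7)
```

On the toy subject the start and end columns should agree with the Views shown in the post for maxngaps 0 and 7, which makes the sketch a convenient cross-check of the pairing rule.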
@oliver-bembom-2470
Hi Heather,

Unfortunately, the cosmo package doesn't have functionality for this kind of search. It always assumes that all occurrences of the shared motif have the same length.

Oliver
@oliver-bembom-2470
Hi Heather,

If your goal at this point is to insert motifs like the ones above into random sequences to see if cosmo can find them, you should be able to use any character you like for the gaps. Since a gap in a motif just says that the specific nucleotide is not important, you could in particular just fill in random A, C, G, T letters. If you don't want to do that, something like "X" or "N" should also work. It sounds like you already got it to work with "N". I hope this helps.

Oliver

On Nov 1, 2007 11:38 AM, Houseman, Heather <heather.houseman at vai.org> wrote:
> Hello Oliver,
>
> What symbol should I use for gaps? I tried a "-" for each gap, but cosmo returned an error. I then tried a "0" for each gap and cosmo didn't return an error, but the motifs it returned didn't contain any 0's, so I assumed it didn't "like" them. I tried an "N" for each gap and it didn't return an error and the motifs contained N's. I realize that an "N" is not the same as a gap. I was just seeing if it would work.
>
> What strategy do you suggest? Should I use some other function to align the sequences, which would most likely put gaps in them, and then use those sequences with gaps with cosmo to return motifs? But then, I'm back to the question of what symbol to use for the gaps.
>
> Thanks for your help!
>
> Heather
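The suggestion to fill the gap positions with random A, C, G, T letters before planting the motifs in random sequences could look roughly like this in base R. It is only a sketch under stated assumptions: fill_gaps() and plant_motif() are hypothetical helper names, the background length of 100 and the seed are arbitrary, and the conversion of the resulting strings into whatever input format cosmo expects is deliberately left out.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
bases <- c("A", "C", "G", "T")

# Replace each "-" in an aligned motif with a randomly chosen base.
fill_gaps <- function(motif) {
  chars <- strsplit(motif, "")[[1]]
  is_gap <- chars == "-"
  chars[is_gap] <- sample(bases, sum(is_gap), replace = TRUE)
  paste(chars, collapse = "")
}

# Plant a motif at a random position inside a random background sequence.
plant_motif <- function(motif, seq_length = 100) {
  background <- sample(bases, seq_length, replace = TRUE)
  w <- nchar(motif)
  pos <- sample.int(seq_length - w + 1, 1)
  background[pos:(pos + w - 1)] <- strsplit(motif, "")[[1]]
  paste(background, collapse = "")
}

# Aligned motifs copied from the quoted post; "-" marks a gap position.
motifs <- c(
  "TACGTGCTGTCTCACACAG",
  "GACGTGACTCGGACCACAT",
  "TACGTGGGT--TTCCACAG",
  "TACGTGAC----CACACAC",
  "TACGTGC-------CACAG",
  "CACGTGC-------CACAC",
  "GGCGTGAGC-----CACCG",
  "GGCGTGGGAGCG--CACAG",
  "TACGTG------CACACAG"
)

filled <- vapply(motifs, fill_gaps, character(1), USE.NAMES = FALSE)
seqs <- vapply(filled, plant_motif, character(1), USE.NAMES = FALSE)
print(filled)
```

Because every gap position is replaced by exactly one letter, each filled motif keeps the same total width as its aligned form, which is the situation in which Oliver says cosmo should be able to recover the motif.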