Motif search
1
0
Entering edit mode
nooshin ▴ 300
@nooshin-5239
Last seen 5.4 years ago
Dear mailing list, I'm so new in doing sequence analysis and I have a small project to do, which I need help for. I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. Thanks, Nooshin
• 2.0k views
ADD COMMENT
0
Entering edit mode
Paul Shannon ▴ 750
@paul-shannon-5161
Last seen 9.6 years ago
> > > I'm so new in doing sequence analysis and I have a small project to do, which I need help for. > I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. > Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: 1) In what form is the binding site expressed? Simple sequence, position weight matrix? 2) Do you need only exact matches to this motif? Or approximate matches also? 3) I presume you will be looking for matches in the promoters of your set of genes. True? - Paul > Thanks, > Nooshin > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
Thanks for reply. Answer for the questions: 1) for one TF I have position weight matrix and for the second one I have only the sequence. 2) I need both, exact and approximate match 3) Yes, exactly :) -Nooshin On 04/20/2012 04:19 PM, Paul Shannon wrote: >> >> I'm so new in doing sequence analysis and I have a small project to do, which I need help for. >> I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. >> > Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: > > 1) In what form is the binding site expressed? Simple sequence, position weight matrix? > 2) Do you need only exact matches to this motif? Or approximate matches also? > 3) I presume you will be looking for matches in the promoters of your set of genes. True? > > - Paul > >> Thanks, >> Nooshin >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD REPLY
0
Entering edit mode
Hi Nooshin, Here is a script which demonstrates the basics of 1) how to obtain the promoter region of a named gene (paramaterized by upstream & downstream lengths, default 1000 and 100) 2) searching in that sequence for a motif, specified either as a PWM or a simple character string. If you want to do this in bulk, Herve' has some lovely code to make that efficient. But if (as you say) you have just a few genes to look at, this example should be enough to get you going, acquaint you with a few data packages (TxDb, BSgenome.At) and functions (matchPattern and matchPWM). I include a convenience function ('get.promoter') and its test ('test.get.promoter'). This is "strand aware" and uses coordinates from the TxDb object. I think a quick glance will allow you to make sense of this. Execute 'test.get.promoter ()' just to make sure everything is installed properly for you. It should return TRUE. I use a programming idiom here, well-suited to incremental exploration of a programming task, which you may like (or may not). Each step executes, one step at a time, by successive calls to 'run', like this: run (0) # gets the TxDb data into a form where gene coordinates are handy run (1) # sets up the two motifs run (2) # gets the promoter for AT1G01020 run (3) # prints the match to the simple motif sequence run (4) # prints the match to the PWM motif Global variables (yikes!) are used in the run function, so that the values created at each call live on to be used by the next one. Let me know what I am leaving out! We can dig into Herve's techniques for fast whole-genome sequence extraction and search if that's needed. - Paul # sequence and PWM match demo #--------------------------------------------------------------------- --------------------------------------------------- library (RUnit) library (BSgenome.Athaliana.TAIR.TAIR9) library (TxDb.Athaliana.BioMart.plantsmart12) #--------------------------------------------------------------------- --------------------------------------------------- run = function (levels) { if (0 %in% levels) { txdb <<- TxDb.Athaliana.BioMart.plantsmart12 all.gene.info <<- transcriptsBy (txdb, by = "gene") } # 0 if (1 %in% levels) { # a binding # http://jaspar.genereg.net/cgi- bin/jaspar_db.pl?ID=MA0005.1&rm=present&collection=CORE ag.jaspar.pwm <<- rbind (A = c (0, 2, 38, 61, 73, 31, 23, 20, 22, 1, 27), C = c (87, 86, 9, 5, 1, 3, 16, 13, 0, 0, 25), G = c (2, 0, 7, 2, 1, 4, 17, 26, 64, 82, 9), T = c (1, 2, 36, 22, 15, 52, 34, 31, 4, 7, 29)) madeup.motif.seq <<- 'GGTTCTT' } # 1 if (2 %in% levels) { promoter <<- get.promoter all.gene.info, 'AT1G01020', 500, 100) } # 2 if (3 %in% levels) { print (matchPattern (madeup.motif.seq, promoter)) } # 3 if (4 %in% levels) { print (matchPWM (ag.jaspar.pwm, promoter, min.score='90%')) } # 4 } # run #--------------------------------------------------------------------- --------------------------------------------------- # returns a char string from proper strand, reverse complemented if appropriated. get.promoter = function all.gene.info, gene.id, upstream=1000, downstream=100) { if (!gene.id %in% names all.gene.info)) return (NA) gene.info = all.gene.info [[gene.id]] start.loc <<- min (start gene.info)) end.loc <<- max (end gene.info)) strand.name <<- unique (as.character (strand gene.info))) chromosome.name <<- unique (as.character (seqnames gene.info)))[1] has.proper.chromosome.name = length (grep ('Chr', chromosome.name)) == 1 # needs to be of form 'Chr5' if (!has.proper.chromosome.name) chromosome.name <<- paste ('Chr', chromosome.name, sep='') stopifnot chromosome.name %in% names (Athaliana)) chr.seq = Athaliana [[chromosome.name]] if strand.name == '+') { promoter.start = start.loc - upstream promoter.end = start.loc + downstream promoter.seq = subseq (chr.seq, promoter.start, promoter.end) } # + strand if strand.name == '-') { promoter.start = end.loc - upstream promoter.end = end.loc + downstream -1 start.xx <<- promoter.start end.xx <<- promoter.end promoter.seq = reverseComplement (subseq (chr.seq, promoter.start, promoter.end)) } # else, - strand invisible (toString (promoter.seq)) } # get.promoter #--------------------------------------------------------------------- --------------------------------------------------- test.get.promoter = function () { promoter.at1g01020.10 <<- get.promoter all.gene.info, 'AT1G01020', 10, 0) checkEquals (nchar (promoter.at1g01020.10), 10) checkEquals (promoter.at1g01020.10, "GACCCGGACT") promoter.at1g01020.40 <<- get.promoter all.gene.info, 'AT1G01020', 20, 20) checkEquals (nchar (promoter.at1g01020.40), 40) } # test.get.promoter #--------------------------------------------------------------------- --------------------------------------------------- On Apr 20, 2012, at 7:28 AM, nooshin wrote: > > Thanks for reply. > > Answer for the questions: > 1) for one TF I have position weight matrix and for the second one I have only the sequence. > 2) I need both, exact and approximate match > 3) Yes, exactly :) > > -Nooshin > > > On 04/20/2012 04:19 PM, Paul Shannon wrote: >>> >>> I'm so new in doing sequence analysis and I have a small project to do, which I need help for. >>> I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. >>> >> Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: >> >> 1) In what form is the binding site expressed? Simple sequence, position weight matrix? >> 2) Do you need only exact matches to this motif? Or approximate matches also? >> 3) I presume you will be looking for matches in the promoters of your set of genes. True? >> >> - Paul >> >>> Thanks, >>> Nooshin >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD REPLY

Login before adding your answer.

Traffic: 704 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6