Motif search

0

Entering edit mode

nooshin ▴ 300

@nooshin-5239

Last seen 5.4 years ago

Dear mailing list, I'm so new in doing sequence analysis and I have a small project to do, which I need help for. I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. Thanks, Nooshin

• 2.0k views

ADD COMMENT • link updated 12.0 years ago by Paul Shannon ▴ 750 • written 12.0 years ago by nooshin ▴ 300

0

Entering edit mode

Paul Shannon ▴ 750

@paul-shannon-5161

Last seen 9.6 years ago

> > > I'm so new in doing sequence analysis and I have a small project to do, which I need help for. > I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. > Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: 1) In what form is the binding site expressed? Simple sequence, position weight matrix? 2) Do you need only exact matches to this motif? Or approximate matches also? 3) I presume you will be looking for matches in the promoters of your set of genes. True? - Paul > Thanks, > Nooshin > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD COMMENT • link 12.0 years ago Paul Shannon ▴ 750

0

Entering edit mode

Thanks for reply. Answer for the questions: 1) for one TF I have position weight matrix and for the second one I have only the sequence. 2) I need both, exact and approximate match 3) Yes, exactly :) -Nooshin On 04/20/2012 04:19 PM, Paul Shannon wrote: >> >> I'm so new in doing sequence analysis and I have a small project to do, which I need help for. >> I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. >> > Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: > > 1) In what form is the binding site expressed? Simple sequence, position weight matrix? > 2) Do you need only exact matches to this motif? Or approximate matches also? > 3) I presume you will be looking for matches in the promoters of your set of genes. True? > > - Paul > >> Thanks, >> Nooshin >> >> _______________________________________________ >> Bioconductor mailing list >> Bioconductor at r-project.org >> https://stat.ethz.ch/mailman/listinfo/bioconductor >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 12.0 years ago nooshin ▴ 300

0

Entering edit mode

Hi Nooshin, Here is a script which demonstrates the basics of 1) how to obtain the promoter region of a named gene (paramaterized by upstream & downstream lengths, default 1000 and 100) 2) searching in that sequence for a motif, specified either as a PWM or a simple character string. If you want to do this in bulk, Herve' has some lovely code to make that efficient. But if (as you say) you have just a few genes to look at, this example should be enough to get you going, acquaint you with a few data packages (TxDb, BSgenome.At) and functions (matchPattern and matchPWM). I include a convenience function ('get.promoter') and its test ('test.get.promoter'). This is "strand aware" and uses coordinates from the TxDb object. I think a quick glance will allow you to make sense of this. Execute 'test.get.promoter ()' just to make sure everything is installed properly for you. It should return TRUE. I use a programming idiom here, well-suited to incremental exploration of a programming task, which you may like (or may not). Each step executes, one step at a time, by successive calls to 'run', like this: run (0) # gets the TxDb data into a form where gene coordinates are handy run (1) # sets up the two motifs run (2) # gets the promoter for AT1G01020 run (3) # prints the match to the simple motif sequence run (4) # prints the match to the PWM motif Global variables (yikes!) are used in the run function, so that the values created at each call live on to be used by the next one. Let me know what I am leaving out! We can dig into Herve's techniques for fast whole-genome sequence extraction and search if that's needed. - Paul # sequence and PWM match demo #--------------------------------------------------------------------- --------------------------------------------------- library (RUnit) library (BSgenome.Athaliana.TAIR.TAIR9) library (TxDb.Athaliana.BioMart.plantsmart12) #--------------------------------------------------------------------- --------------------------------------------------- run = function (levels) { if (0 %in% levels) { txdb <<- TxDb.Athaliana.BioMart.plantsmart12 all.gene.info <<- transcriptsBy (txdb, by = "gene") } # 0 if (1 %in% levels) { # a binding # http://jaspar.genereg.net/cgi- bin/jaspar_db.pl?ID=MA0005.1&rm=present&collection=CORE ag.jaspar.pwm <<- rbind (A = c (0, 2, 38, 61, 73, 31, 23, 20, 22, 1, 27), C = c (87, 86, 9, 5, 1, 3, 16, 13, 0, 0, 25), G = c (2, 0, 7, 2, 1, 4, 17, 26, 64, 82, 9), T = c (1, 2, 36, 22, 15, 52, 34, 31, 4, 7, 29)) madeup.motif.seq <<- 'GGTTCTT' } # 1 if (2 %in% levels) { promoter <<- get.promoter all.gene.info, 'AT1G01020', 500, 100) } # 2 if (3 %in% levels) { print (matchPattern (madeup.motif.seq, promoter)) } # 3 if (4 %in% levels) { print (matchPWM (ag.jaspar.pwm, promoter, min.score='90%')) } # 4 } # run #--------------------------------------------------------------------- --------------------------------------------------- # returns a char string from proper strand, reverse complemented if appropriated. get.promoter = function all.gene.info, gene.id, upstream=1000, downstream=100) { if (!gene.id %in% names all.gene.info)) return (NA) gene.info = all.gene.info [[gene.id]] start.loc <<- min (start gene.info)) end.loc <<- max (end gene.info)) strand.name <<- unique (as.character (strand gene.info))) chromosome.name <<- unique (as.character (seqnames gene.info)))[1] has.proper.chromosome.name = length (grep ('Chr', chromosome.name)) == 1 # needs to be of form 'Chr5' if (!has.proper.chromosome.name) chromosome.name <<- paste ('Chr', chromosome.name, sep='') stopifnot chromosome.name %in% names (Athaliana)) chr.seq = Athaliana [[chromosome.name]] if strand.name == '+') { promoter.start = start.loc - upstream promoter.end = start.loc + downstream promoter.seq = subseq (chr.seq, promoter.start, promoter.end) } # + strand if strand.name == '-') { promoter.start = end.loc - upstream promoter.end = end.loc + downstream -1 start.xx <<- promoter.start end.xx <<- promoter.end promoter.seq = reverseComplement (subseq (chr.seq, promoter.start, promoter.end)) } # else, - strand invisible (toString (promoter.seq)) } # get.promoter #--------------------------------------------------------------------- --------------------------------------------------- test.get.promoter = function () { promoter.at1g01020.10 <<- get.promoter all.gene.info, 'AT1G01020', 10, 0) checkEquals (nchar (promoter.at1g01020.10), 10) checkEquals (promoter.at1g01020.10, "GACCCGGACT") promoter.at1g01020.40 <<- get.promoter all.gene.info, 'AT1G01020', 20, 20) checkEquals (nchar (promoter.at1g01020.40), 40) } # test.get.promoter #--------------------------------------------------------------------- --------------------------------------------------- On Apr 20, 2012, at 7:28 AM, nooshin wrote: > > Thanks for reply. > > Answer for the questions: > 1) for one TF I have position weight matrix and for the second one I have only the sequence. > 2) I need both, exact and approximate match > 3) Yes, exactly :) > > -Nooshin > > > On 04/20/2012 04:19 PM, Paul Shannon wrote: >>> >>> I'm so new in doing sequence analysis and I have a small project to do, which I need help for. >>> I have a known motif of binding site for one TF in Arabidopsis thaliana and I have to find out if this motif is enriched in 500-bps sequences from a set of genes. I don't know any package or function for this. I would be so thankful if anybody could help me. >>> >> Having recently learned how to do something very similar, I'd be glad to help. A few clarifying questions: >> >> 1) In what form is the binding site expressed? Simple sequence, position weight matrix? >> 2) Do you need only exact matches to this motif? Or approximate matches also? >> 3) I presume you will be looking for matches in the promoters of your set of genes. True? >> >> - Paul >> >>> Thanks, >>> Nooshin >>> >>> _______________________________________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >

ADD REPLY • link 12.0 years ago Paul Shannon ▴ 750

Login before adding your answer.