from TF DNA binding motif to downstream genes?


Paul Shannon ★ 1.1k

@paul-shannon-578

Last seen 9.6 years ago

Can I get advice on good ways to find genes -- perhaps with a high false-positive rate -- whose promoters contain known DNA binding motifs?

Hu et al., "Profiling the Human Protein-DNA Interactome Reveals ERK2 as a Transcriptional Repressor of Interferon Signaling", identifies 17,718 PDIs [protein-DNA interactions] between 460 DNA motifs predicted to regulate transcription and 4,191 human proteins of various functional classes.

I wish to take those 460 motifs -- many of them only 7 bases long -- and find the genes whose transcription they control.

I suspect the answer lies in some artful use of Biostrings and BSgenome (which together provide efficient genome search), along with annotation to find the transcription start sites of known genes. But before I start, I think it prudent to get the advice of those who may know more than me.

Thanks!

- Paul

Transcription Biostrings • 1.4k views

ADD COMMENT • link updated 14.3 years ago by Patrick Aboyoun ★ 1.6k • written 14.3 years ago by Paul Shannon ★ 1.1k


Patrick Aboyoun ★ 1.6k

@patrick-aboyoun-6734

Last seen 9.6 years ago

United States

Paul,

I have made a first attempt at solving the first part of your problem (mapping the motifs to the genome) and plan on making this easier to perform by adding a vmatchPDict method to the BSgenome package in BioC 2.6. For now, here is some code that creates a RangedData object identifying the locations on the genome where the motifs match. You can then use findOverlaps against a RangedData object that contains the annotations that are of interest to you. Feedback is welcome.

- Patrick

## load the base libraries
library(Biostrings)
library(BSgenome)

## load the genome
library(BSgenome.Celegans.UCSC.ce2)

## create the motifs
data(HNF4alpha)

## -------------------------------------------------------------
## method for finding motif locations on genome
## the motifId column is an element identifier
## that relates back to the original motif set
matchFUN <- function(strings, chr) {
    posPDict <- strings
    negPDict <- reverseComplement(strings)
    posMatches <- matchPDict(pdict = posPDict, subject = chr)
    posCounts <- elementLengths(posMatches)
    negMatches <- matchPDict(pdict = negPDict, subject = chr)
    negCounts <- elementLengths(negMatches)
    strand <-
      strand(rep(c("+", "-"), c(sum(posCounts), sum(negCounts))))
    motifId <-
      c(rep(seq_len(length(posMatches)), posCounts),
        rep(seq_len(length(negMatches)), negCounts))
    RangedData(c(unlist(posMatches), unlist(negMatches)),
               strand = strand, motifId = motifId)
}
bsParams <-
  new("BSParams", X = Celegans, FUN = matchFUN, simplify = TRUE)
matches <- bsapply(bsParams, strings = HNF4alpha)
nms <- names(matches)
matches <- do.call(c, unname(matches))
names(matches) <- nms
## -------------------------------------------------------------

> matches
RangedData with 183 rows and 2 value columns across 7 spaces
         space               ranges |   strand   motifId
   <character>            <iranges> | <factor> <integer>
1         chrI [10714238, 10714250] |        +         1
2         chrI [ 1746247,  1746259] |        +        33
3         chrI [11509260, 11509272] |        +        39
4         chrI [ 5249651,  5249663] |        +        48
5         chrI [ 5442409,  5442421] |        +        64
6         chrI [ 7949495,  7949507] |        +        64
7         chrI [ 2788492,  2788504] |        +        71
8         chrI [ 3853105,  3853117] |        +        71
9         chrI [ 6952606,  6952618] |        +        71
10        chrI [10242063, 10242075] |        -         1
...
<173 more rows>
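The second half of the problem, going from motif locations to genes, comes down to an interval overlap between the hits and promoter windows around each transcription start site. The following is a minimal base-R sketch of that logic only, not the findOverlaps implementation. The hit and TSS coordinates, the gene names, the helper names `promoterWindows` and `hitsInPromoters`, and the 1000/100 bp window sizes are all assumptions made up for illustration.

```r
## Hypothetical motif hits (coordinates invented for illustration)
hits <- data.frame(chr = c("chrI", "chrI", "chrII"),
                   start = c(1000, 5200, 300),
                   end = c(1012, 5212, 312),
                   motifId = c(1L, 2L, 1L),
                   stringsAsFactors = FALSE)

## Hypothetical annotation: one transcription start site per gene
genes <- data.frame(gene = c("geneA", "geneB", "geneC"),
                    chr = c("chrI", "chrI", "chrII"),
                    tss = c(1500, 5000, 9000),
                    strand = c("+", "-", "+"),
                    stringsAsFactors = FALSE)

## Strand-aware promoter window: "upstream" means lower coordinates
## for "+" genes and higher coordinates for "-" genes.
promoterWindows <- function(genes, upstream = 1000, downstream = 100) {
    plus <- genes$strand == "+"
    genes$pStart <- ifelse(plus, genes$tss - upstream, genes$tss - downstream)
    genes$pEnd <- ifelse(plus, genes$tss + downstream, genes$tss + upstream)
    genes
}

## Pair every hit with every gene on the same chromosome, then keep
## the pairs whose intervals overlap.
hitsInPromoters <- function(hits, genes) {
    pairs <- merge(hits, genes, by = "chr")
    keep <- pairs$start <= pairs$pEnd & pairs$end >= pairs$pStart
    pairs[keep, c("gene", "chr", "start", "end", "motifId")]
}

prom <- promoterWindows(genes)
result <- hitsInPromoters(hits, prom)
print(result)
```

The all-pairs merge is fine for a toy table but scales poorly; on real annotation the interval-tree based findOverlaps mentioned above is the appropriate tool.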

ADD COMMENT • link 14.3 years ago Patrick Aboyoun ★ 1.6k


Hi Patrick,

Thanks very much!

Running your code just now, I get this:

Error in function (classes, fdef, mtable) :
  unable to find an inherited method for function "dups", for signature "DNAStringSet"

traceback() and sessionInfo() are pasted in below.
Is dups perhaps defined in the devel version of Biostrings?

- Paul

> traceback ()
13: stop("unable to find an inherited method for function \"", fdef@generic,
        "\", for signature ", cnames)
12: function (classes, fdef, mtable)
    {
        methods <- .findInheritedMethods(classes, fdef, mtable)
        if (length(methods) == 1L)
            return(methods[[1L]])
        else if (length(methods) == 0L) {
            cnames <- paste("\"", sapply(classes, as.character),
                "\"", sep = "", collapse = ", ")
            stop("unable to find an inherited method for function \"",
                fdef@generic, "\", for signature ", cnames)
        }
        else stop("Internal error in finding inherited methods; didn't return a unique method")
    }(list("DNAStringSet"), function (x)
    standardGeneric("dups"), <environment>)
11: dups(pdict)
10: .matchPDict(pdict, subject, algorithm, max.mismatch, min.mismatch,
        fixed, verbose) at go.R#27
9: matchPDict(pdict = posPDict, subject = chr) at go.R#27
8: matchPDict(pdict = posPDict, subject = chr) at go.R#27
7: BSParams@FUN(seq, ...)
6: FUN(c("chrI", "chrII", "chrIII", "chrIV", "chrV", "chrX", "chrM"
   )[[1L]], ...)
5: lapply(X, FUN, ...)
4: sapply(seqnames, processSeqname, ...)
3: sapply(seqnames, processSeqname, ...)
2: bsapply(bsParams, strings = HNF4alpha) at go.R#14
1: run(0)

> sessionInfo ()
R version 2.10.0 (2009-10-26)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.utf-8/en_US.utf-8/C/C/en_US.utf-8/en_US.utf-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.14.2 Biostrings_2.14.8 IRanges_1.4.8

loaded via a namespace (and not attached):
[1] Biobase_2.6.0 tools_2.10.0

ADD REPLY • link 14.3 years ago Paul Shannon ★ 1.1k


Paul,

The issue was that the code depended on R-devel as well as the latest versions of the BioC packages IRanges, Biostrings, and BSgenome. Rather than trying to retrofit something into R 2.10 and BioC 2.5, I went ahead and added a new vmatchPDict method for BSgenome objects to BioC 2.6 for use with R-devel. I just checked the code in, so it won't be available from bioconductor.org until Thursday morning at the earliest. If you want it earlier, you will need to get the latest versions of IRanges, Biostrings, and BSgenome from the trunk of Bioconductor's software svn.

Below is an example of this new functionality. I am trying to grow the use of RangedData objects as containers for match output and am looking for any feedback on their usability. In particular, I am looking for useful methods that are missing from the packages referenced above so I can fill in the gaps.

> suppressMessages(library(BSgenome))
> library(BSgenome.Celegans.UCSC.ce2)
> data(HNF4alpha) # a DNAStringSet object
> vmatchPDict(HNF4alpha[1:10], Celegans)
RangedData with 14 rows and 2 value columns across 7 spaces
         space               ranges | strand index
   <character>            <iranges> |  <rle> <rle>
1         chrI [10714238, 10714250] |      +     1
2         chrI [10242063, 10242075] |      -     1
3         chrI [  995608,   995620] |      -     3
4       chrIII [  360758,   360770] |      +     1
5       chrIII [ 9996856,  9996868] |      -     1
6        chrIV [16177061, 16177073] |      +     3
7        chrIV [17014321, 17014333] |      -     4
8        chrIV [ 6364368,  6364380] |      -    10
9         chrV [11914362, 11914374] |      +     1
10        chrV [19656881, 19656893] |      +     2
...
<4 more rows>

> sessionInfo()
R version 2.11.0 Under development (unstable) (2010-01-18 r50995)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] BSgenome.Celegans.UCSC.ce2_1.3.16 BSgenome_1.15.4
[3] Biostrings_2.15.18 IRanges_1.5.29

loaded via a namespace (and not attached):
[1] Biobase_2.7.3 tools_2.11.0

Patrick
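For readers who want to see what the two-strand search amounts to, here is a small base-R sketch of the idea used in matchFUN: search for each motif and for its reverse complement, then label the hits "+" and "-" and record which motif they came from. It is an illustration only and does not reproduce the Biostrings algorithms. The helper names `revComp`, `findExact`, and `matchBothStrands` and the toy sequence are assumptions, and the sketch handles only plain uppercase A/C/G/T motifs (no IUPAC codes, no mismatches).

```r
## Reverse complement of an uppercase A/C/G/T string
revComp <- function(x) {
    paste(rev(strsplit(chartr("ACGT", "TGCA", x), "")[[1]]), collapse = "")
}

## Start positions of all (possibly overlapping) exact matches;
## the lookahead makes each match zero-width so overlaps are kept.
findExact <- function(pattern, subject) {
    pos <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
    if (pos[1] == -1L) integer(0) else as.integer(pos)
}

## Search every motif on both strands of one sequence.
## "index" points back at the position of the motif in the input vector.
matchBothStrands <- function(motifs, subject) {
    out <- lapply(seq_along(motifs), function(i) {
        m <- motifs[i]
        plus <- findExact(m, subject)
        minus <- findExact(revComp(m), subject)
        start <- c(plus, minus)
        data.frame(start = start,
                   end = start + nchar(m) - 1L,
                   strand = rep(c("+", "-"), c(length(plus), length(minus))),
                   index = rep(i, length(start)),
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, out)
}

subject <- "TTGACCAAAGGTCATT"   # toy sequence, invented for illustration
motifs <- c("TGACC", "AAAG")
res <- matchBothStrands(motifs, subject)
print(res)
```

Note that a palindromic motif equals its own reverse complement, so this sketch reports each of its occurrences once per strand.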

ADD REPLY • link 14.3 years ago Patrick Aboyoun ★ 1.6k


Depending on what information you have about the TF motifs, you might find matchPWM useful. As the name implies, it uses a position weight matrix.

Kasper
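To make the position-weight-matrix idea concrete, the sketch below builds a log-odds matrix from a few aligned sites and slides it along a sequence, keeping windows that score at least 80% of the best possible score. This is a generic textbook scheme written in base R and is not claimed to match how matchPWM computes or thresholds its scores. The binding sites, the pseudocount of 1, the uniform 0.25 background, the 80% cutoff, and the helper names `buildPWM` and `scoreWindows` are all assumptions for illustration, and the subject is assumed to contain only A/C/G/T.

```r
## Hypothetical aligned binding sites (invented for illustration)
sites <- c("TGACCT", "TGACCC", "TGATCT", "TGACCT")

## Log2-odds matrix: rows are bases, columns are motif positions
buildPWM <- function(sites, pseudo = 1, bg = 0.25) {
    bases <- c("A", "C", "G", "T")
    chars <- do.call(rbind, strsplit(sites, ""))
    counts <- sapply(seq_len(ncol(chars)), function(j)
        table(factor(chars[, j], levels = bases)))
    freqs <- (counts + pseudo) / (length(sites) + 4 * pseudo)
    pwm <- log2(freqs / bg)
    rownames(pwm) <- bases
    pwm
}

## Score of every window of motif width along the subject
scoreWindows <- function(pwm, subject) {
    s <- strsplit(subject, "")[[1]]
    w <- ncol(pwm)
    nWin <- length(s) - w + 1L
    if (nWin < 1L) return(numeric(0))
    sapply(seq_len(nWin), function(i) {
        win <- s[i:(i + w - 1L)]
        ## pick one matrix cell per position: (row of base, column)
        sum(pwm[cbind(match(win, rownames(pwm)), seq_len(w))])
    })
}

pwm <- buildPWM(sites)
subject <- "AATGACCTGG"          # toy sequence, invented for illustration
scores <- scoreWindows(pwm, subject)
maxScore <- sum(apply(pwm, 2, max))
hitsAt <- which(scores >= 0.8 * maxScore)
print(hitsAt)
```

Compared with exact matching of 7-base strings, a scored match lets near-consensus sites through, and the cutoff gives a direct handle on the false-positive rate Paul mentions.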

ADD REPLY • link 14.3 years ago Kasper Daniel Hansen ★ 6.5k


And there is a matchPWM method for BSgenome objects. See below:

> library(BSgenome)
> help("matchPWM,BSgenome-method")

Patrick

ADD REPLY • link 14.3 years ago Patrick Aboyoun ★ 1.6k
