matching transcription factor binding sites
1
0
Entering edit mode
@hans-ulrich-klein-1945
Last seen 7 months ago
United States
Dear all, I want to locate transcription factor binding sites (tfbs) within a given sequence. The tfbs are derived from databases like transfac or jaspar and are described by matrices. Are there algorithms for locating tfbs matches (e.g. "matinspector") implemented in bioconductor? I could not find one. Regards, Hans-Ulrich
Transcription Transcription • 1.5k views
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 16 hours ago
Seattle, WA, United States
Hi Hans, Hans-Ulrich Klein wrote: > Dear all, > > I want to locate transcription factor binding sites (tfbs) within a > given sequence. The tfbs are derived from databases like transfac or > jaspar and are described by matrices. Are there algorithms for locating > tfbs matches (e.g. "matinspector") implemented in bioconductor? I could > not find one. I assume that your matrices are Position Weight Matrices? There is no facility in the Biostrings package for matching PWM to a DNA sequence but that would be easy to add. In fact, I've already fully described how to implement such facility in a separate package and on top of Biostrings basic containers (i.e. DNAString objects) during the lab I gave for the "Advanced R for Bioinformatics" course back in February this year: http://bioconductor.org/workshops/2008 Follow "Advanced R for Bioinformatics" -> "Interfaces to C (Lab)" The simpleMatchPWM_0.99.0.tar.gz package contains the matchPWM() function for finding all matches of a PWM in a given sequence. Unfortunately, the package was depending on a devel version of Biostrings that has changed since then, and those changes broke simpleMatchPWM 0.99.0. Let me know if this is what you are looking for and I'll fix the package (this should be straightforward). Cheers, H. > > Regards, > Hans-Ulrich > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >
ADD COMMENT
0
Entering edit mode
Hi Herve, Herve Pages wrote: > Hans-Ulrich Klein wrote: >> I want to locate transcription factor binding sites (tfbs) within a >> given sequence. The tfbs are derived from databases like transfac or >> jaspar and are described by matrices. Are there algorithms for >> locating tfbs matches (e.g. "matinspector") implemented in >> bioconductor? I could not find one. > > I assume that your matrices are Position Weight Matrices? yes. I meant position weight matrices. > There is no facility in the Biostrings package for matching PWM to > a DNA sequence but that would be easy to add. In fact, I've > already fully described how to implement such facility > in a separate package and on top of Biostrings basic containers (i.e. > DNAString objects) during the lab I gave for the "Advanced R > for Bioinformatics" course back in February this year: > > http://bioconductor.org/workshops/2008 > > Follow "Advanced R for Bioinformatics" -> "Interfaces to C (Lab)" > > The simpleMatchPWM_0.99.0.tar.gz package contains the matchPWM() > function for finding all matches of a PWM in a given sequence. > Unfortunately, the package was depending on a devel version > of Biostrings that has changed since then, and > those changes broke simpleMatchPWM 0.99.0. Let me know if this is what > you are looking for and I'll fix the package (this should > be straightforward). It is quite close to what I am looking for. I have access to the transfac database including a web based tool for finding PWM matches. I am looking for an alternative to the web tool in R for two reasons: 1. I have done preceding analysis in R and will do follow-up analysis in R. It would be nice to avoid the effort for data export and import. 2. I have not found a detailed description of the algorithm used by the web tool. So simpleMatchPWM is at least a good starting point, as it does all the basic score computations. Why not integrate the matchPWM function in the Biostring package? I would appreciate it. However, most algorithms (like MatInspector or the transfac-tool) implement some heuristics to improve results. E.g., they suggest individual cut-off values depending on the length of the pwms. I am not sure whether I have enough time and knowledge to add such functionalities. Best wishes, Hans-Ulrich PS: Has someone experiences with the bioperl package "TFBS"?
ADD REPLY
0
Entering edit mode
Hi Hans-Ulrich, matchPWM() is now part of Biostrings 2.7.47 (in BioC devel, so you need R-2.7). Load Biostrings: library(Biostrings) Suppose 'pwm' contains a Position Weight Matrix, let's say: pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7), C=c( 1, 0, 1, 0, 1, 18, 0, 2), G=c(17, 0, 0, 0, 1, 0, 0, 3), T=c( 1, 20, 0, 0, 0, 1, 0, 8)) Note that this is just a standard integer matrix with the 4 DNA base letters as row names (having these row names is mandatory). Some low-level utility functions are available for manipulating this kind of matrix: > maxWeights(pwm) # the max weight in each column [1] 17 20 19 20 18 18 20 8 > maxScore(pwm) # the max possible score [1] 140 Let's match 'pwm' against Human chr1: library(BSgenome.Hsapiens.UCSC.hg18) chr1 <- Hsapiens$chr1 Number of "best" matches: > countPWM(pwm, chr1, min.score="100%") # takes about 5 seconds on my system [1] 5152 With a lower cut-off value: m <- matchPWM(pwm, chr1, min.score="90%") See the 10 first matches ("first" means "smallest chromosome location", NOT "best" matches): > m[1:10] Views on a 247249719-letter DNAString subject subject: TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...NNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNN views: start end width [1] 31931 31938 8 [GTAAACAA] [2] 33324 33331 8 [GTAAACAT] [3] 38425 38432 8 [GTAAACAG] [4] 39177 39184 8 [GTAAACAC] [5] 46971 46978 8 [GTAAACAT] [6] 49952 49959 8 [GTAAACAT] [7] 70381 70388 8 [GTAAACAG] [8] 74359 74366 8 [GTAAACAC] [9] 90714 90721 8 [GTAAACAT] [10] 96544 96551 8 [GTAAACAC] The speed could be improved, maybe by a factor 2 (or more, for longest PWMs). Also maybe an additional argument could be added to let the user control how the returned matches should be sorted ("left-to-right" or "best- first")? Unlike MatInspector or the transfac-tool, there is no facility yet to suggest individual cut-off values depending on the length of the PWMs. See '?matchPWM' for more information (e.g. how to search the minus strand of the chromosome). Please let me know how it works for you. Cheers, H. Hans-Ulrich Klein wrote: > Hi Herve, > > Herve Pages wrote: >> Hans-Ulrich Klein wrote: >>> I want to locate transcription factor binding sites (tfbs) within a >>> given sequence. The tfbs are derived from databases like transfac or >>> jaspar and are described by matrices. Are there algorithms for >>> locating tfbs matches (e.g. "matinspector") implemented in >>> bioconductor? I could not find one. >> >> I assume that your matrices are Position Weight Matrices? > > yes. I meant position weight matrices. > >> There is no facility in the Biostrings package for matching PWM to > > a DNA sequence but that would be easy to add. In fact, I've >> already fully described how to implement such facility >> in a separate package and on top of Biostrings basic containers (i.e. >> DNAString objects) during the lab I gave for the "Advanced R >> for Bioinformatics" course back in February this year: >> >> http://bioconductor.org/workshops/2008 >> >> Follow "Advanced R for Bioinformatics" -> "Interfaces to C (Lab)" >> >> The simpleMatchPWM_0.99.0.tar.gz package contains the matchPWM() >> function for finding all matches of a PWM in a given sequence. >> Unfortunately, the package was depending on a devel version >> of Biostrings that has changed since then, and >> those changes broke simpleMatchPWM 0.99.0. Let me know if this is what >> you are looking for and I'll fix the package (this should >> be straightforward). > > It is quite close to what I am looking for. I have access to the > transfac database including a web based tool for finding PWM matches. I > am looking for an alternative to the web tool in R for two reasons: > > 1. I have done preceding analysis in R and will do follow-up analysis in > R. It would be nice to avoid the effort for data export and import. > 2. I have not found a detailed description of the algorithm used by the > web tool. > > So simpleMatchPWM is at least a good starting point, as it does all the > basic score computations. Why not integrate the matchPWM function in the > Biostring package? I would appreciate it. > However, most algorithms (like MatInspector or the transfac-tool) > implement some heuristics to improve results. E.g., they suggest > individual cut-off values depending on the length of the pwms. I am not > sure whether I have enough time and knowledge to add such functionalities. > > Best wishes, > Hans-Ulrich > > PS: Has someone experiences with the bioperl package "TFBS"?
ADD REPLY

Login before adding your answer.

Traffic: 582 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6