How does one subset an XStringViews or PDict object?
Noah Dowell ▴ 410
@noah-dowell-3791
Hello to all,

I am using the excellent BSgenome and Biostrings packages to look for the variety and number of a transcription factor DNA binding motif across the E. coli genome. From biochemistry and molecular biology experiments we know our favorite transcription factor binds a fairly degenerate motif. I want to look at the number of times a particular motif occurs in the E. coli genome and see if specific motifs map to specific genome locations.

Here is a working example of what I have done:

    library(BSgenome.Ecoli.NCBI.20080805)

    # create an object to work with one genome: E. coli str. K-12 substr. MG1655
    genome12 <- Ecoli$NC_000913

    consensus <- "TGTTCAAAAAATAAGCA"
    TFmotifDict = DNAStringSet(consensus)

    # find every site within 7 mismatches of the consensus, then pull out the matched sequences
    ConsMatch = matchPDict(TFmotifDict, genome12, max.mismatch=7)
    z = extractAllMatches(genome12, ConsMatch)
    x = PDict(z)

    table(patternFrequency(x))
    #     1     2     3     4     5
    # 17088   128    60    52    80

So this is working great and providing some interesting results, but in reading through the archives and vignettes I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once. See the output of the table function above. I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.

I can do the following to get one:

    x[[61]]

Or I can do this:

    freq = patternFrequency(x)
    getit = which(freq != 1)

But this only tells me which ones they are.

This could be a pretty basic R task or something specific to these types of objects, but I seem to be stuck with my newbie R skills. Thank you in advance for any help.
Best,
Noah

    > sessionInfo()
    R version 2.12.1 (2010-12-16)
    Platform: x86_64-pc-linux-gnu (64-bit)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.16.5
    [3] Biostrings_2.16.9                   GenomicRanges_1.0.7
    [5] IRanges_1.6.11

    loaded via a namespace (and not attached):
    [1] Biobase_2.8.0 tools_2.12.1
@martin-morgan-1513
On 02/04/2011 06:40 PM, Noah Dowell wrote:
> I have not figured out how to subset my motif dictionary into the small class of motifs that occur more than once. [...] I want to get the start and end genome locations and the sequence info for the 128 + 60 + 52 + 80 patterns.

Hi Noah,

I ended up at

    unique(tb(x)[patternFrequency(x)==5])

This was mostly from looking at the help page for patternFrequency, guided by a little discovery of the methods that might be relevant to 'x' with

    showMethods(class=class(x), where=getNamespace("Biostrings"))

(this last is definitely obscure).

Martin

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
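The one-liner in the answer above returns the distinct sequences for a single frequency class (patterns seen exactly five times). A minimal sketch of how the same indexing might be extended to every pattern seen more than once, and to the start and end coordinates asked about in the question, follows. It assumes that PDict() keeps patterns in the same order as the views it was built from, so the logical vector derived from patternFrequency() can index the views directly. The toy `subject` and its five hand-placed views are made up for illustration; they stand in for `genome12` and `z` from the question.

```r
library(Biostrings)

# Toy stand-in for the genome (hypothetical sequence, not E. coli).
subject <- DNAString("ACGTAACGTATTGCAGGGCCTTGCA")

# Toy stand-in for 'z': five 5-mers laid end to end on the subject.
z <- Views(subject, start = c(1, 6, 11, 16, 21), width = 5)

x <- PDict(z)
freq <- patternFrequency(x)   # one count per pattern, in input order

# Distinct sequences that occur more than once.
repeated_seqs <- unique(tb(x)[freq > 1])

# The same logical index applied to the views keeps their coordinates.
repeated_hits <- z[freq > 1]
locations <- data.frame(
  start    = start(repeated_hits),
  end      = end(repeated_hits),
  sequence = as.character(repeated_hits),
  count    = freq[freq > 1]
)
print(locations)
```

With the objects from the question, the analogous call would presumably be `z[patternFrequency(x) > 1]`, which should keep the views (and hence their genome positions) for the 128 + 60 + 52 + 80 repeated patterns, though this has not been checked against the Biostrings version listed in the sessionInfo() output.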
Thank you Martin! That should work nicely; the patternFrequency man page was one I missed. The showMethods call is a good general tip that I can put to use.

Best,
Noah

On Feb 4, 2011, at 7:52 PM, Martin Morgan wrote:
> I ended up at
>
>     unique(tb(x)[patternFrequency(x)==5])