Newcomer's question on subsetting IRangesList
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Hi,

I'm new to R and Bioconductor, so this is probably a trivial question, but I cannot find a solution for it anywhere.

In my workflow, I now use a temporary version of vmatchPattern (found on the net) that allows for indels. This works great, but it outputs an IRangesList object that I have trouble subsetting. Here is an example of the output:

IRangesList of length 96979
[[1]]
IRanges of length 2
    start end width
[1]     1   7     7
[2]   278 283     6

[[2]]
IRanges of length 2
    start end width
[1]     1   7     7
[2]   281 286     6

[[3]]
IRanges of length 2
    start end width
[1]     1   7     7
[2]   256 261     6

...
<96976 more elements>

In this case, the same sequence is found twice in each read. What I would like to extract is the "end" of the first occurrence of the string in each read, i.e., 7 in the cases above.

Say that matchList is the IRangesList object. If I use end(matchList), I get a list with the ends of both the first and the second occurrence of the string. Every way I have tried to subset it gives me errors. I can get it to work by going through as.data.frame, but this is very slow when you have millions of matches, as in my case.

I hope that this was reasonably clear.

Thank you all for your help

All the best

Tomas

-- output of sessionInfo():

R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
 [1] xlsx_0.5.5 muscle_3.8.31-2 Rlibstree_0.3-2 xlsxjars_0.6.0 rJava_0.9-6 ShortRead_1.22.0
 [7] GenomicAlignments_1.0.1 BSgenome_1.32.0 Rsamtools_1.16.0 GenomicRanges_1.16.3 GenomeInfoDb_1.0.2 Biostrings_2.32.0
[13] XVector_0.4.0 IRanges_1.22.6 BiocParallel_0.6.0 BiocGenerics_0.10.0

loaded via a namespace (and not attached):
 [1] BatchJobs_1.2 BBmisc_1.6 Biobase_2.24.0 bitops_1.0-6 brew_1.0-6 codetools_0.2-8 DBI_0.2-7 digest_0.6.4
 [9] fail_1.2 foreach_1.4.2 grid_3.1.0 hwriter_1.3 iterators_1.0.7 lattice_0.20-29 latticeExtra_0.6-26 plyr_1.8.1
[17] RColorBrewer_1.0-5 Rcpp_0.11.1 RSQLite_0.11.4 sendmailR_1.1-2 stats4_3.1.0 stringr_0.6.2 tools_3.1.0 zlibbioc_1.10.0

--
Sent via the guest posting facility at bioconductor.org.
@michael-lawrence-3846
Last seen 2.4 years ago
United States
You can do this with end(unlist(phead(x, 1))), where x is your IRangesList (matchList in your post). Looking at the source code for phead() may enlighten you as to how to subset efficiently in these types of situations.

In the future, it would be helpful to see the code of your failed attempts, because they would likely be instructive.

Michael

On Mon, May 12, 2014 at 10:32 AM, Tomas Bjorklund [guest] <guest@bioconductor.org> wrote:

> In this case, the same sequence is found twice in each read. What I would
> like to extract is the "end" of each first occurrence of the string i.e., 7
> in the cases above.
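For readers who want to see the general idea behind this kind of subsetting, here is a minimal sketch. It is not the source of phead(), just one way to get the same result under stated assumptions: it works on the flattened ranges plus the list's partitioning instead of looping over list elements. The toy matchList below is made up from the values in the question (with one empty element added to show the no-match case), and the variable names flat_end, part, has_match and first_end are only illustrative.

```r
library(IRanges)

# Toy stand-in for the poster's matchList; values copied from the printed
# output in the question, plus one empty element (a read with no match).
matchList <- IRangesList(
  IRanges(start = c(1, 278), end = c(7, 283)),
  IRanges(start = c(1, 281), end = c(7, 286)),
  IRanges(),
  IRanges(start = c(1, 256), end = c(7, 261))
)

# All "end" values as one flat integer vector, in list order.
flat_end <- end(unlist(matchList, use.names = FALSE))

# The partitioning records where each list element starts and ends
# within the flattened vector.
part <- PartitioningByEnd(matchList)

# Take the first entry of every non-empty element; reads without a match
# stay NA so the result lines up one-to-one with the reads.
has_match <- width(part) > 0L
first_end <- rep(NA_integer_, length(matchList))
first_end[has_match] <- flat_end[start(part)[has_match]]

first_end
# 7 7 NA 7
```

Everything here is a vectorized operation on plain integer vectors, which is why this style should scale to millions of matches much better than a round trip through as.data.frame.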