Sorting a GAlignments object by QNAME
2
0
Entering edit mode
Guest User ★ 13k
@guest-user-4897
Last seen 9.6 years ago
Is there a way to sort the records in a GAlignments object by the QNAME, as this object is created with the readGAlignmentsFromBam function where the bam file and its corresponding index file must be sorted by RNAME and POS. Unless I'm missing something the only way I see how can this be done is read the bam into a data.table and sort that. -- output of sessionInfo(): R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] LC_TIME=English_United States.1252 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] doParallel_1.0.3 iterators_1.0.6 foreach_1.4.1 data.table_1.8.10 Rsamtools_1.13.44 Biostrings_2.29.19 GenomicRanges_1.13.45 XVector_0.1.4 [9] IRanges_1.19.38 BiocGenerics_0.7.5 loaded via a namespace (and not attached): [1] bitops_1.0-6 codetools_0.2-8 stats4_3.0.2 tools_3.0.2 zlibbioc_1.7.0 -- Sent via the guest posting facility at bioconductor.org.
• 1.1k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 4 days ago
United States
On 09/29/2013 12:32 PM, rubi [guest] wrote: > > Is there a way to sort the records in a GAlignments object by the QNAME, as > this object is created with the readGAlignmentsFromBam function where the bam > file and its corresponding index file must be sorted by RNAME and POS. > > Unless I'm missing something the only way I see how can this be done is read > the bam into a data.table and sort that. Unsorted / sorted by qname files can be read in; likely the part that is tripping you up is the need to specify character() for index, perhaps with yieldSize and obeyQname bf = open(BamFile(fl, character(), yieldSize=1000000, obeyQname=TRUE)) If fl were sorted by qname (?sortBam, byQname=TRUE) then this would guarantee 1000000 qnames per chunk repeat { aln = readGAlignmentsFromBam(bf) if (length(aln) == 0) break ## do work } Since you've got the devel version, see also ?readGAlignmentsListFromBam which will read in mated reads from an RNAME,POS supported file in an iteration like above, with more modest memory requirements than reading in the entire file. Martin > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United > States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C [5] > LC_TIME=English_United States.1252 > > attached base packages: [1] parallel stats graphics grDevices utils > datasets methods base > > other attached packages: [1] doParallel_1.0.3 iterators_1.0.6 > foreach_1.4.1 data.table_1.8.10 Rsamtools_1.13.44 > Biostrings_2.29.19 GenomicRanges_1.13.45 XVector_0.1.4 [9] IRanges_1.19.38 > BiocGenerics_0.7.5 > > loaded via a namespace (and not attached): [1] bitops_1.0-6 > codetools_0.2-8 stats4_3.0.2 tools_3.0.2 zlibbioc_1.7.0 > > -- Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT
0
Entering edit mode
@herve-pages-1542
Last seen 11 hours ago
Seattle, WA, United States
Hi Rubi, 'gal[order(seqnames(gal))]' will sort GAlignments object 'gal' by QNAME. H. On 09/29/2013 12:32 PM, rubi [guest] wrote: > > Is there a way to sort the records in a GAlignments object by the QNAME, as this object is created with the readGAlignmentsFromBam function where the bam file and its corresponding index file must be sorted by RNAME and POS. > > Unless I'm missing something the only way I see how can this be done is read the bam into a data.table and sort that. > > -- output of sessionInfo(): > > R version 3.0.2 (2013-09-25) > Platform: x86_64-w64-mingw32/x64 (64-bit) > > locale: > [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C > [5] LC_TIME=English_United States.1252 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets methods base > > other attached packages: > [1] doParallel_1.0.3 iterators_1.0.6 foreach_1.4.1 data.table_1.8.10 Rsamtools_1.13.44 Biostrings_2.29.19 GenomicRanges_1.13.45 XVector_0.1.4 > [9] IRanges_1.19.38 BiocGenerics_0.7.5 > > loaded via a namespace (and not attached): > [1] bitops_1.0-6 codetools_0.2-8 stats4_3.0.2 tools_3.0.2 zlibbioc_1.7.0 > > -- > Sent via the guest posting facility at bioconductor.org. > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD COMMENT
0
Entering edit mode
Hi Hervé and thanks for the response. I assume you meant 'gal[order(names(gal))]' rather than 'gal[order(seqnames(gal))]' in order to sort by QNAME (i.e. read ID)? Using this, however, doesn't seem to sort gal by the same sorting rules that sortBam or samtools sort -n does. Am I missing something? Thanks, rubi On Mon, Sep 30, 2013 at 12:51 AM, Hervé Pagès <hpages@fhcrc.org> wrote: > Hi Rubi, > > 'gal[order(seqnames(gal))]' will sort GAlignments object 'gal' by QNAME. > > H. > > > > On 09/29/2013 12:32 PM, rubi [guest] wrote: > >> >> Is there a way to sort the records in a GAlignments object by the QNAME, >> as this object is created with the readGAlignmentsFromBam function where >> the bam file and its corresponding index file must be sorted by RNAME and >> POS. >> >> Unless I'm missing something the only way I see how can this be done is >> read the bam into a data.table and sort that. >> >> -- output of sessionInfo(): >> >> R version 3.0.2 (2013-09-25) >> Platform: x86_64-w64-mingw32/x64 (64-bit) >> >> locale: >> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >> [5] LC_TIME=English_United States.1252 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets methods >> base >> >> other attached packages: >> [1] doParallel_1.0.3 iterators_1.0.6 foreach_1.4.1 >> data.table_1.8.10 Rsamtools_1.13.44 Biostrings_2.29.19 >> GenomicRanges_1.13.45 XVector_0.1.4 >> [9] IRanges_1.19.38 BiocGenerics_0.7.5 >> >> loaded via a namespace (and not attached): >> [1] bitops_1.0-6 codetools_0.2-8 stats4_3.0.2 tools_3.0.2 >> zlibbioc_1.7.0 >> >> -- >> Sent via the guest posting facility at bioconductor.org. >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> >> > -- > Hervé Pagès > > Program in Computational Biology > Division of Public Health Sciences > > Fred Hutchinson Cancer Research Center > 1100 Fairview Ave. N, M1-B514 > P.O. Box 19024 > Seattle, WA 98109-1024 > > E-mail: hpages@fhcrc.org > Phone: (206) 667-5791 > Fax: (206) 667-1319 > [[alternative HTML version deleted]]
ADD REPLY
0
Entering edit mode
Hi Rubi, On 09/30/2013 07:33 AM, nimrod rubinstein wrote: > Hi Herv? and thanks for the response. > > I assume you meant 'gal[order(names(gal))]' rather than > 'gal[order(seqnames(gal))]' > in order to sort by QNAME (i.e. read ID)? My bad. You're right: seqnames is mapped to the RNAME field, not to the QNAME field. > > Using this, however, doesn't seem to sort gal by the same sorting rules > that sortBam or samtools sort -n does. Am I missing something? 2 reasons for that: (1) The result of 'order(names(gal))' depends on the collating sequence of the locale in use (see ?order and ?Comparison). So different users can get different results when doing 'gal[order(names(gal))]'. I don't know what collating sequence 'samtools sort -n' is using but even users using the same collating sequence might get different results, because of (2). (2) We don't know whether order() uses a "stable" sorting algorithm or not (AFAIK ?order doesn't tell). When a tie occurs (and when sorting by QNAME you can bet many ties will occur), a non- stable algo is free to break the tie in any way, whereas a stable algo is guaranteed to order the elements involved in the tie the same way they show up in the input. To summarize, 'gal[order(names(gal))]' and 'samtools sort -n' would be guaranteed to give the same result if both were using the same collating sequence and a stable sorting algo. I guess it's not the case. HTH, H. > > Thanks, > rubi > > > > On Mon, Sep 30, 2013 at 12:51 AM, Hervé Pagès <hpages at="" fhcrc.org=""> wrote: > >> Hi Rubi, >> >> 'gal[order(seqnames(gal))]' will sort GAlignments object 'gal' by QNAME. >> >> H. >> >> >> >> On 09/29/2013 12:32 PM, rubi [guest] wrote: >> >>> >>> Is there a way to sort the records in a GAlignments object by the QNAME, >>> as this object is created with the readGAlignmentsFromBam function where >>> the bam file and its corresponding index file must be sorted by RNAME and >>> POS. >>> >>> Unless I'm missing something the only way I see how can this be done is >>> read the bam into a data.table and sort that. >>> >>> -- output of sessionInfo(): >>> >>> R version 3.0.2 (2013-09-25) >>> Platform: x86_64-w64-mingw32/x64 (64-bit) >>> >>> locale: >>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United >>> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C >>> [5] LC_TIME=English_United States.1252 >>> >>> attached base packages: >>> [1] parallel stats graphics grDevices utils datasets methods >>> base >>> >>> other attached packages: >>> [1] doParallel_1.0.3 iterators_1.0.6 foreach_1.4.1 >>> data.table_1.8.10 Rsamtools_1.13.44 Biostrings_2.29.19 >>> GenomicRanges_1.13.45 XVector_0.1.4 >>> [9] IRanges_1.19.38 BiocGenerics_0.7.5 >>> >>> loaded via a namespace (and not attached): >>> [1] bitops_1.0-6 codetools_0.2-8 stats4_3.0.2 tools_3.0.2 >>> zlibbioc_1.7.0 >>> >>> -- >>> Sent via the guest posting facility at bioconductor.org. >>> >>> ______________________________**_________________ >>> Bioconductor mailing list >>> Bioconductor at r-project.org >>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.="" ethz.ch="" mailman="" listinfo="" bioconductor=""> >>> Search the archives: http://news.gmane.org/gmane.** >>> science.biology.informatics.**conductor<http: news.gmane.org="" gman="" e.science.biology.informatics.conductor=""> >>> >>> >> -- >> Hervé Pagès >> >> Program in Computational Biology >> Division of Public Health Sciences >> >> Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N, M1-B514 >> P.O. Box 19024 >> Seattle, WA 98109-1024 >> >> E-mail: hpages at fhcrc.org >> Phone: (206) 667-5791 >> Fax: (206) 667-1319 >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fhcrc.org Phone: (206) 667-5791 Fax: (206) 667-1319
ADD REPLY

Login before adding your answer.

Traffic: 949 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6