Rsamtools: Read a bam line by line
1
0
Entering edit mode
@alex-gutteridge-2935
Last seen 10.3 years ago
United States
Hi, I have an application where I have mapped reads to two different reference sequences and I have these two mappings as bam files. I now want to see for each read whether it mapped to one, both or neither of the two references (and where the mapping is to both see if one was better than the other). All my reading of the samtools docs suggested that there is no facility for indexing by read name as opposed to genomic coordinates, so the first way that occured to me to do this was to sort each bam by read name (which is possible using samtools) and then step through the two bams in parallel checking the alignment information for each read. There are ~120M reads, so reading the two bams into memory all in one go is not practical, but from what I can see Rsamtools doesn't have functions for reading bams either line by line or by chunks of lines. All the ScanBamParam arguments seem to refer to reading chunks by genomic ranges - this isn't appropriate for my application as the genomic ranges between the two references are not necessarily comparable. So my question is really two-fold: 1) Am I right in my reading of the Rsamtools docs? Are there any lower level functions for reading bams I am ignoring that could do what I want? 2) My thought now is to just make sams from the sorted bams and step through using readLines() and my own sam parsing code. Is there an alternative strategy I have overlooked? -- Alex Gutteridge
GO Rsamtools GO Rsamtools • 1.2k views
ADD COMMENT
0
Entering edit mode
@martin-morgan-1513
Last seen 5 months ago
United States
On 07/07/2012 10:10 AM, Alex Gutteridge wrote: > Hi, > > I have an application where I have mapped reads to two different > reference sequences and I have these two mappings as bam files. I now > want to see for each read whether it mapped to one, both or neither of > the two references (and where the mapping is to both see if one was > better than the other). All my reading of the samtools docs suggested > that there is no facility for indexing by read name as opposed to > genomic coordinates, so the first way that occured to me to do this was > to sort each bam by read name (which is possible using samtools) and > then step through the two bams in parallel checking the alignment > information for each read. There are ~120M reads, so reading the two > bams into memory all in one go is not practical, but from what I can see > Rsamtools doesn't have functions for reading bams either line by line or > by chunks of lines. All the ScanBamParam arguments seem to refer to see the development version and the yieldSize argument to BamFile. Martin > reading chunks by genomic ranges - this isn't appropriate for my > application as the genomic ranges between the two references are not > necessarily comparable. > > So my question is really two-fold: > > 1) Am I right in my reading of the Rsamtools docs? Are there any lower > level functions for reading bams I am ignoring that could do what I want? > 2) My thought now is to just make sams from the sorted bams and step > through using readLines() and my own sam parsing code. Is there an > alternative strategy I have overlooked? > -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
ADD COMMENT

Login before adding your answer.

Traffic: 653 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6