I am trying to haplotype long reads after aligning to the genome. To do this I would need to find out what bases are at annotated variant sites, which should be possible using a combination of the alignment position, the sequence and the CIGAR string. I am wondering if there are already facilities in Bioconductor for doing this type of querying before I go writing my own.
Is this somehow related to this: https://support.bioconductor.org/p/125825/ ?
Thanks Herve, this is very close to what I want to do. It does a great job mapping from the reference coordinates to the query coordinate. The only issue now is that GAlignments, as far as I can tell, doesn't have the ability to read in the sequence and quality scores, so to query the actual base I think I need to use scanBam from RSamtools as well.
Also the mapToAlignments interface demands the GAlignments object be named. Is there a reason for this?
readGAlignments()
accepts a ScanBamParam viaparam=
. If the sequence and qualities are in the "what" they will be included in themcols()
.The names are required since the mapped return value needs to use the read identifiers as its seqnames.
What is the meaning of seqnames here? I thought seqname generally referred to the molecule name (chromosome/transcript).
The "molecule" in this case is the sequence read.