How to query a base along an aligned read by genomic position?
1
0
Entering edit mode
Shian Su ▴ 40
@shian-su-9869
Last seen 14 months ago
Walter and Eliza Hall Institute of Medi…

I am trying to haplotype long reads after aligning to the genome. To do this I would need to find out what bases are at annotated variant sites, which should be possible using a combination of the alignment position, the sequence and the CIGAR string. I am wondering if there are already facilities in Bioconductor for doing this type of querying before I go writing my own.

GenomicAlignments BAM CIGAR • 496 views
1
Entering edit mode

Is this somehow related to this: https://support.bioconductor.org/p/125825/ ?

0
Entering edit mode

Thanks Herve, this is very close to what I want to do. It does a great job mapping from the reference coordinates to the query coordinate. The only issue now is that GAlignments, as far as I can tell, doesn't have the ability to read in the sequence and quality scores, so to query the actual base I think I need to use scanBam from RSamtools as well.

Also the mapToAlignments interface demands the GAlignments object be named. Is there a reason for this?

galign <- GAlignments("chr1", pos = 10L, cigar = "2M2D3M2I4M", strand = factor("+", levels=c("+", "-", "*")))
names(galign) <- "read1" # Code will not work without this

q_range <- IRanges(start=12, width=1)
mapToAlignments(q_range, galign)

1
Entering edit mode

readGAlignments() accepts a ScanBamParam via param=. If the sequence and qualities are in the "what" they will be included in the mcols().

The names are required since the mapped return value needs to use the read identifiers as its seqnames.

0
Entering edit mode

The names are required since the mapped return value needs to use the read identifiers as its seqnames.

What is the meaning of seqnames here? I thought seqname generally referred to the molecule name (chromosome/transcript).

1
Entering edit mode

The "molecule" in this case is the sequence read.

1
Entering edit mode
Janet Young ▴ 740
@janet-young-2360
Last seen 2.1 years ago
Fred Hutchinson Cancer Research Center,…

hi Shian,

I did something similar a few months ago. Herve made a useful function called stackStringsFromGAlignments - I've been using that on each SNP individually (I'm looking at a small genome with not many SNPs, so haven't cared much about efficiency). The details of what I did are a little hazy, but it was something like this:

I read in the bam using

aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq"))


and then get genotypes using something like this (SNP is a GRanges object):

snpAln  <- stackStringsFromGAlignments(aln, region=SNP)


then I use this to manipulate the genotypes further:

as.character(snpAln)


I'm about to revisit this project so will keep an eye on this thread for suggestions. There's probably a better way to do it!

Janet