Question

How to query a base along an aligned read by genomic position?

0

Entering edit mode

Shian Su ▴ 40

@shian-su-9869

Last seen 2.1 years ago

Walter and Eliza Hall Institute of Medi…

I am trying to haplotype long reads after aligning to the genome. To do this I would need to find out what bases are at annotated variant sites, which should be possible using a combination of the alignment position, the sequence and the CIGAR string. I am wondering if there are already facilities in Bioconductor for doing this type of querying before I go writing my own.

GenomicAlignments BAM CIGAR • 1.6k views

ADD COMMENT • link updated 4.5 years ago by Janet Young ▴ 740 • written 4.5 years ago by Shian Su ▴ 40

1

Entering edit mode

Is this somehow related to this: https://support.bioconductor.org/p/125825/ ?

ADD REPLY • link 4.5 years ago Hervé Pagès 16k

0

Entering edit mode

Thanks Herve, this is very close to what I want to do. It does a great job mapping from the reference coordinates to the query coordinate. The only issue now is that GAlignments, as far as I can tell, doesn't have the ability to read in the sequence and quality scores, so to query the actual base I think I need to use scanBam from RSamtools as well.

Also the mapToAlignments interface demands the GAlignments object be named. Is there a reason for this?

galign <- GAlignments("chr1", pos = 10L, cigar = "2M2D3M2I4M", strand = factor("+", levels=c("+", "-", "*")))
names(galign) <- "read1" # Code will not work without this

q_range <- IRanges(start=12, width=1)
mapToAlignments(q_range, galign)

ADD REPLY • link 4.5 years ago Shian Su ▴ 40

1

Entering edit mode

readGAlignments() accepts a ScanBamParam via param=. If the sequence and qualities are in the "what" they will be included in the mcols().

The names are required since the mapped return value needs to use the read identifiers as its seqnames.

ADD REPLY • link 4.5 years ago Michael Lawrence ★ 11k

0

Entering edit mode

The names are required since the mapped return value needs to use the read identifiers as its seqnames.

What is the meaning of seqnames here? I thought seqname generally referred to the molecule name (chromosome/transcript).

ADD REPLY • link 4.5 years ago Shian Su ▴ 40

1

Entering edit mode

The "molecule" in this case is the sequence read.

ADD REPLY • link 4.5 years ago Michael Lawrence ★ 11k

score 1 · Answer 1 · 2019-11-04

hi Shian,

I did something similar a few months ago. Herve made a useful function called stackStringsFromGAlignments - I've been using that on each SNP individually (I'm looking at a small genome with not many SNPs, so haven't cared much about efficiency). The details of what I did are a little hazy, but it was something like this:

I read in the bam using

aln <- readGAlignments(bamfile, use.names=TRUE, param=ScanBamParam(what="seq"))

and then get genotypes using something like this (SNP is a GRanges object):

snpAln  <- stackStringsFromGAlignments(aln, region=SNP)

then I use this to manipulate the genotypes further:

as.character(snpAln)

I'm about to revisit this project so will keep an eye on this thread for suggestions. There's probably a better way to do it!

Janet