Question

GenomicAlignments/access correct bp positions near indels

1

Entering edit mode

Daniel.Berner@unibas.ch ▴ 90

@danielbernerunibasch-4268

Last seen 4.1 years ago

Hi all!

My goal is to access the nucleotide content at specific base positions in Illumina reads derived from whole-genome poolseq data, for de novo SNP detection. Alignment to a reference genome is done, so we start with a BAM file. Now one approach I implemented involves reading the BAM into R using ScanBam by retaining "rname", "pos", and "seq". The former two elements together then allow me to filter for those reads overlapping a focal base position in the genome, and the seq element provides, after adequate narrowing (or after substr() the corresponding character object – seems way faster), access to the nucleotides observed at the focal base position. With that, I can discover SNPs.

However, a problem with the above is indels: if the target base position to inspect is downstream of a nearby indel polymorphism, simply narrowing to the focal position will actually capture non-homologous sites, thus generating pseudo-SNPs. An example:

Read1-TGAGAGCCTGGA...
Read2-TGACCTGGA...

Here the positions 4 to 6 of Read1 are missing in Read2, hence it appears that starting with position 4 there are numerous SNPs, although this is just caused by this 3 bp indel.

So I am wondering if this issue could be resolved by considering gaps. Specifically, is it possible to gain, across a short genomic range, access to reads aligned locally by considering gaps? In analogy to the above example this would look like:

Read1-TGAGAGCCTGGA...
Read2-TGA---CCTGGA...

Recognizing that there is a gap (indel), one would not call any SNP in this range…

I imagine a potential solution lies within the GenomicAlignments package (readGappedReads?), but I could not make full sense of the documentation. Any suggestion to deal with this indel issue is appreciated.

danielBerner, U Basel

genomicranges genomicalignments snp • 1.5k views

ADD COMMENT • link updated 7.9 years ago by Hervé Pagès 16k • written 7.9 years ago by Daniel.Berner@unibas.ch ▴ 90

score 1 · Answer 1 · 2017-05-08

Hi Daniel,

This is a pretty common problem and Rsamtools::pileup() does exactly that. The h5vc and VariantTools packages are also worth mentioning here as they provide comprehensive frameworks for variant calling. The VariantDetection view should list all packages that deal with this:

https://bioconductor.org/packages/release/BiocViews.html#___VariantDetection

but, surprisingly, h5vc and VariantTools are not listed :-/

Also FWIW the GenomicAlignments package provides stackStringsFromBam() and pileLettersAt(). They are simpler than Rsamtools::pileup() but the latter is a lot more efficient/powerful/flexible.

Hope this helps,

H.