GenomicAlignments/access correct bp positions near indels
1
1
Entering edit mode
@danielbernerunibasch-4268
Last seen 3.2 years ago

Hi all!

My goal is to access the nucleotide content at specific base positions in Illumina reads derived from whole-genome poolseq data, for de novo SNP detection. Alignment to a reference genome is done, so we start with a BAM file. Now one approach I implemented involves reading the BAM into R using ScanBam by retaining "rname", "pos", and "seq". The former two elements together then allow me to filter for those reads overlapping a focal base position in the genome, and the seq element provides, after adequate narrowing (or after substr() the corresponding character object – seems way faster), access to the nucleotides observed at the focal base position. With that, I can discover SNPs.

However, a problem with the above is indels: if the target base position to inspect is downstream of a nearby indel polymorphism, simply narrowing to the focal position will actually capture non-homologous sites, thus generating pseudo-SNPs. An example:

Read1-TGAGAGCCTGGA...
Read2-TGACCTGGA...

Here the positions 4 to 6 of Read1 are missing in Read2, hence it appears that starting with position 4 there are numerous SNPs, although this is just caused by this 3 bp indel.

So I am wondering if this issue could be resolved by considering gaps. Specifically, is it possible to gain, across a short genomic range, access to reads aligned locally by considering gaps? In analogy to the above example this would look like:

Read1-TGAGAGCCTGGA...
Read2-TGA---CCTGGA...

Recognizing that there is a gap (indel), one would not call any SNP in this range…

I imagine a potential solution lies within the GenomicAlignments package (readGappedReads?), but I could not make full sense of the documentation. Any suggestion to deal with this indel issue is appreciated.

danielBerner, U Basel

genomicranges genomicalignments snp • 1.2k views
ADD COMMENT
1
Entering edit mode
@herve-pages-1542
Last seen 10 hours ago
Seattle, WA, United States

Hi Daniel,

This is a pretty common problem and Rsamtools::pileup() does exactly that. The h5vc and VariantTools packages are also worth mentioning here as they provide comprehensive frameworks for variant calling. The VariantDetection view should list all packages that deal with this:

https://bioconductor.org/packages/release/BiocViews.html#___VariantDetection

but, surprisingly, h5vc and VariantTools are not listed :-/

Also FWIW the GenomicAlignments package provides stackStringsFromBam() and pileLettersAt(). They are simpler than Rsamtools::pileup() but the latter is a lot more efficient/powerful/flexible.

Hope this helps,

H.

ADD COMMENT

Login before adding your answer.

Traffic: 953 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6