Hi all!
My goal is to access the nucleotide content at specific base positions in Illumina reads derived from whole-genome poolseq data, for de novo SNP detection. Alignment to a reference genome is done, so we start with a BAM file. Now one approach I implemented involves reading the BAM into R using ScanBam by retaining "rname", "pos", and "seq". The former two elements together then allow me to filter for those reads overlapping a focal base position in the genome, and the seq element provides, after adequate narrowing (or after substr() the corresponding character object – seems way faster), access to the nucleotides observed at the focal base position. With that, I can discover SNPs.
However, a problem with the above is indels: if the target base position to inspect is downstream of a nearby indel polymorphism, simply narrowing to the focal position will actually capture non-homologous sites, thus generating pseudo-SNPs. An example:
Read1-TGAGAGCCTGGA... Read2-TGACCTGGA...
Here the positions 4 to 6 of Read1 are missing in Read2, hence it appears that starting with position 4 there are numerous SNPs, although this is just caused by this 3 bp indel.
So I am wondering if this issue could be resolved by considering gaps. Specifically, is it possible to gain, across a short genomic range, access to reads aligned locally by considering gaps? In analogy to the above example this would look like:
Read1-TGAGAGCCTGGA... Read2-TGA---CCTGGA...
Recognizing that there is a gap (indel), one would not call any SNP in this range…
I imagine a potential solution lies within the GenomicAlignments package (readGappedReads?), but I could not make full sense of the documentation. Any suggestion to deal with this indel issue is appreciated.
danielBerner, U Basel