Question

inconsistent results from applyPileups and pileLettersAt

1

Entering edit mode

jesper.gadin ▴ 10

@jespergadin-6883

Last seen 5.3 years ago

Hello everyone,

I find inconsistent results from applyPileups and pileLettersAt. I have a placed a full example on github, so you can try the code with data that will reproduce what I am suspecting is a "bug".

#Repo with reproducible example

https://github.com/pappewaio/ExampleData

#Here a short summary of what is happening

One version requires the user to use readGAlignments first and then use pileLettersAt, here I can read in 450 reads into R, but after using pileLettersAt I am left with only 413 counts. What could possibly explain this difference? Is it because some reads are spanning the position I am piling letters at, but not having any sequence information for the position?

The second version uses applyPileups and summarize straight from the bam file, and here I would expect to have at least the 413 counts found by pileLettersAt. But instead I can only find 385 counts. Why is it so?

Until now I have been using version one to get allele counts. But version two is very much faster, so would be very nice to be able to switch to it, without losing counts.

looking forward to hearing your thoughts,
Jesper

> sessionInfo()
R Under development (unstable) (2016-01-08 r69888)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=sv_SE.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=sv_SE.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base     

other attached packages:
 [1] GenomicAlignments_1.7.20    SummarizedExperiment_1.1.22
 [3] Biobase_2.31.3              Rsamtools_1.23.6           
 [5] Biostrings_2.39.12          XVector_0.11.7             
 [7] GenomicRanges_1.23.25       GenomeInfoDb_1.7.6         
 [9] IRanges_2.5.40              S4Vectors_0.9.44           
[11] BiocGenerics_0.17.3        

loaded via a namespace (and not attached):
[1] zlibbioc_1.17.1     BiocParallel_1.5.21 bitops_1.0-6

Rsamtools GAlignments pileup • 1.2k views

ADD COMMENT • link updated 8.8 years ago by Martin Morgan 25k • written 8.8 years ago by jesper.gadin ▴ 10

score 1 · Answer 1 · 2016-05-27

Indels and soft clipping influence coverage, so having more reads than bases is not surprising

qseq <- DNAStringSet("ACG")
seqnames <- Rle(factor("chr1"), 1)
start <- 1L
cigar <- "1M1N1M"

with

> pileLettersAt(qseq, seqnames, start, cigar, GRanges("chr1:1-1"))
  A DNAStringSet instance of length 1
    width seq
[1]     1 A
> pileLettersAt(qseq, seqnames, start, cigar, GRanges("chr1:2-2"))
  A DNAStringSet instance of length 1
    width seq
[1]     0 
> pileLettersAt(qseq, seqnames, start, cigar, GRanges("chr1:3-3"))
  A DNAStringSet instance of length 1
    width seq
[1]     1 C

ApplyPileups and pileLettersAt differ because ApplyPileups follows samtools' (m)pileup convention and does not count secondary alignments or reads not passing quality metrics, e.g., with

flag <- scanBamFlag(isUnmappedQuery = FALSE,
                    isSecondaryAlignment=FALSE,
                    isNotPassingQualityControls=FALSE,
                    isDuplicate=FALSE)
param <- ScanBamParam(flag = flag, which = which, what = what)

one has

> nstr

  C   G 
  1 384

I recommend that you use pileup() rather than ApplyPileups(), it is conceptually easier to work with:

scanBamParam <- ScanBamParam(which=my_IGPOI)
pileupParam <- PileupParam(
    max_depth=1000,
    min_base_quality=0L,
    distinguish_strands=FALSE)

and

> pileup(bf, scanBamParam=scanBamParam, pileupParam=pileupParam)
  seqnames       pos nucleotide count               which_label
1    chr14 104029449          C     1 chr14:104029449-104029449
2    chr14 104029449          G   384 chr14:104029449-104029449

These agree with ApplyPileup().