Question: ChIPQC: reads counts in bam do not match ones from samtools flagstat
0
4.7 years ago by
atisou0
Switzerland
atisou0 wrote:

Hi,

Apologies if the information is already written somewhere and I missed it.

I read the ChIPQC vignette, its manual, the ChIPQC Bioc2014 presentation, but could not find any more details about the "Reads" metrics and how the peaks numbers are calculated.

It says in the vignette (page 9) that Reads metrics corresponds to "Total number of reads in the bam file for each sample".

Here is my data obtained with the following commands:

> exp <- ChIPQC(metadata, annotation = "mm10")

> exp

Samples: 6 : samp1 samp2 ... samp5 samp6            Tissue Factor Condition Treatment Replicate Peaks samp1 brain  TF        WT    treat1         1   621
...
        Reads Map% Filt% Dup% ReadL FragL RelCC   SSD    RiP% samp1 2227533  100  12.9    0    51   207  1.59 0.842 0.02210

> samp1 <- QCsample(exp, 1)

> samp1

                    ChIPQCsample Number of Mapped reads: 2227533 Number of Mapped reads passing MapQ filter: 1939371 Percentage Of Reads as Non-Duplicates (NRF): 100(0) Percentage Of Reads in Blacklisted Regions: NA SSD: 0.842153930509319 Fragment Length Cross-Coverage: 0.000468866812610841 Relative Cross-Coverage: 1.58670988654781 Percentage Of Reads in GenomicFeature:                         ProportionOfCounts Peaks                         0.0002536905 LongPromoter20000to2000       0.0000000000 Promoters2000to500            0.0000000000 Promoters500                  0.0000000000 All5utrs                      0.0000000000 Alltranscripts                0.0000000000 Allcds                        0.0000000000 Allintrons                    0.0000000000 All3utrs                      0.0000000000 Percentage Of Reads in Peaks: 0.03 Number of Peaks: 8

...

The output from samtools flagstat is:

32041419 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 30769799 + 0 mapped (96.03%:-nan%) 0 + 0 paired in sequencing 0 + 0 read1 0 + 0 read2 0 + 0 properly paired (-nan%:-nan%) 0 + 0 with itself and mate mapped 0 + 0 singletons (-nan%:-nan%) 0 + 0 with mate mapped to a different chr 0 + 0 with mate mapped to a different chr (mapQ>=5)

So my questions would be:

1 - where do the "Number of Mapped reads: 2227533" given by ChIPQC come from and why do they differ from samtools total number of reads?

2 - where does "Number of Peaks: 8" come from and why does it differ from the "Peaks = 621" given in exp summary (that's what I see as well in the bed file).

3- the RIP% thus corresponds to the number of reads in the peaks N=8, or N=621?

Thanks a lot in advance for you help,

kind regards,

H.

ps: thanks a lot for making this package available, it is extremely helpful!

ps2: this ChIP does not seem to have worked anyway (very low SSD, very few specific peaks etc), hence the need of thorough QC checks.

chipqc • 1.2k views
modified 4.7 years ago by Thomas Carroll400 • written 4.7 years ago by atisou0
Answer: ChIPQC: reads counts in bam do not match ones from samtools flagstat
0
4.7 years ago by
United States/New York/The Rockefeller University
Thomas Carroll400 wrote:

Hi,

I had already a look at flagtagcounts(), but this still does not explain the discrepancy in reads counts between ChIPQC and samtools flagstat.

flagtagcounts(exp) samp1 UnMapped             0 Mapped         2227533 Duplicates           0 MapQPass       1939371 MapQPassAndDup       0

What worries me is that ChIPQC counts ~2M reads in total, while there are ~30M reads in total in the input bam file.

Why are there ~28M reads missing? or did I miss something important of ChIPQC?

1
Hi, In your case here, ChIPQC is analysing only 1 chromosome, the first chromsome in peak file. This was done to save time as working across all chromosomes from multiple files which can take a longer time. So your counts will be only from the chromosome analysed From the manual: chromosomes Specification of which chromosomes to use for computing QC statistics. If missing, the first chromosome which has a peak is checked. If NULL, all chromosomes will be checked (which may be time-consuming). This can be a character string (e.g. “chr18”) or a vector or list of character strings. If it is an integer or vector of integers, the chromosomes will be checked based on the order that they are listed in a peak set. To run across all chromosomes, then use something like. exp <- ChIPQC(metadata, annotation = "mm10",chromosomes=NULL). I will clarify the entry from manual above and change the default here to something sensible (i.e missing will run all chromosomes not just 1 example chromsome) best, tom On Tue, Mar 17, 2015 at 3:38 PM, atisou [bioc] <noreply@bioconductor.org> wrote: > Activity on a post you are following on support.bioconductor.org > > User atisou <https: support.bioconductor.org="" u="" 7468=""/> wrote Comment: > ChIPQC: reads counts in bam do not match ones from samtools flagstat > <https: support.bioconductor.org="" p="" 65738="" #65748="">: > > Hi, > > I had already a look at flagtagcounts(), but this still not explain the > discrepancy in reads counts between ChIPQC and samtools flagstat. > > flagtagcounts(exp) > samp1 > UnMapped 0 > Mapped 2227533 > Duplicates 0 > MapQPass 1939371 > MapQPassAndDup 0 > > > > What worries me is that ChIPQC counts ~2M reads in total, while there are > ~30M reads in total in the input bam file. > > Why are there ~28M reads missing? or did I miss something important of > ChIPQC? > > > > > > ------------------------------ > > You may reply via email or visit > C: ChIPQC: reads counts in bam do not match ones from samtools flagstat >

Ok, makes sense now. I did see that parameter description, but somehow, it did not hit my brain, apologies.

Thanks a lot again,

Kind regards,

H.