Question

ChIPQC and/or Diffbind - retrieving normalized read counts for peaks from single sample

1

Entering edit mode

JD ▴ 10

@jd-9512

Last seen 8.8 years ago

Hello all,

I am looking for advice on the best way to count the number of reads in each peak of a bed file on a
sample-by-sample basis, rather than from a consensus peak set. In other words, I have a peak file and bam file
for sample 1 and I want to be able to count reads in each peak, normalize by rpkm or rpm, and then export the peak table
with counts.

I understand how to do this using dba.count, but I would like to do this on a case-by-case basis, so that I
can look at the distribution of read counts for each sample in my set (10 total samples) using only the peaks
called in that sample, prior to generating and counting from the consensus set.

Is there a way to do this using either ChIPQC or DiffBind? I really like these two packages and will definitely
use them for downstream analyses on the same samples, so thought I would start there. It seems like one option is
to create a ChIPQC sample object for each sample and retrieve the peaks. However, I am not clear what the "Counts"
vector is in this object. I think it is total reads in the peak? rather than normalized to library size? I guess
I could also retrieve the total number of reads, construct a new data frame and do the calcs in R without that much
extra work.

Any help is greatly appreciated!

DiffBind ChIPQC • 2.5k views

ADD COMMENT • link 8.8 years ago JD ▴ 10

0

Entering edit mode

JD ▴ 10

@jd-9512

Last seen 8.8 years ago

Hi Tom,

Thanks for your helpful and quick response. I'm now extracting the counts using your steps and then tacking a column for total reads onto the GR object to do the rpm calculation. It is working perfectly and exactly what I was looking for.

One more quick question - my .bam files are paired-end. Does ChIPQC detect this automatically and deal with it for generating mySample$Counts?

Thank you!

JD

ADD COMMENT • link 8.8 years ago JD ▴ 10

0

Entering edit mode

hi JD,

Sorry for the delay. ChIPQC doesnt support paired end data directly and I suspect many of the quality metrics would be biased when using both read pairs. ChIPQC doesn't account for paired-end reads when producing its counts.

It should be possible to select first read in pair using a filter with samtools view and working with the resulting BAM file in ChIPQC.

ChiPQC and paired-end data

This will work in ChIPQC but I'm not sure if this fits your data as you will be throwing out all the pair information. Was there a reason for paired-end ChIP-seq for this experiment?

best,

tom

ADD REPLY • link 8.8 years ago Thomas Carroll ▴ 420

1

Entering edit mode

Filtering can also be done with Rsamtools::filterBam, using param = ScanBamParam(flag=scanBamFlag(isFirstMateRead=TRUE, isSecondMateRead=FALSE)).

ADD REPLY • link 8.8 years ago Martin Morgan 25k

0

Entering edit mode

Hi Tom,

Thanks again for the useful information. I will try the samtools method. This is actually an ATAC-seq experiment, my first one. My -seq experience thus far has been exclusively with ChIP-seq and I agree with you, generally no need to do paired-end. Otherwise, I've been using a similar analysis pipeline/tools for my ATAC data (in terms of filtering reads, mapping, peak-calling and motif-finding) as I used to use for TF ChIP-seqs.

Here, with ATAC, we thought getting fragment sizes could be useful but this is also my first time working with PE data (and also it turned out that with the NextSeq, a PE run was the same cost as SR).

I appreciate all of your help!

JD

ADD REPLY • link 8.8 years ago JD ▴ 10

0

Entering edit mode

.

ADD REPLY • link 8.8 years ago Thomas Carroll ▴ 420

score 1 · Accepted Answer · 2016-02-05

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 4 weeks ago

Cambridge, UK

There is no straightforward ay to do this in DiffBind. You are on the right track regarding your thoughts of how to do this one sample at a time, but I'm not sure it would save you much over just using the built-in Bioconductor tools for counting reads against a GRanges object. If you do go the DiffBind route, you can set the score to DBA_SCORE_RPKM to return rpkm counts. ChIPQC actually can count use the peakset for each sample separately, but I would have to think a bit about how to easily extract that info in a usable way.

-Rory

ADD COMMENT • link 8.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

Thanks for the quick reply. I'll go the Bioconductor route.

ADD REPLY • link 8.8 years ago JD ▴ 10

score 1 · Accepted Answer · 2016-02-05

hi JD,

This should be straightforward in ChIPQC.

Using ChIPQCsample will generate the counts in peaks as you require.

Working from a single sample ChIPQCsample object (here - mySample) you can extract counts in peaks as you would a GRanges.

> mySample

mySample
					ChIPQCsample
Number of Mapped reads: 109762060
Number of Mapped reads passing MapQ filter: 92791099
Percentage Of Reads as Non-Duplicates (NRF): 76.64(0.23)
Percentage Of Reads in Blacklisted Regions: 8
SSD: 3.22220172276254
Fragment Length Cross-Coverage: 0.000149372723785495
Relative Cross-Coverage: 0.927703588370934
Percentage Of Reads in GenomicFeature: 
                        ProportionOfCounts
Peaks                           0.53696809
BlackList                       0.09395869
LongPromoter20000to2000         0.26191179
Promoters2000to500              0.04924805
Promoters500                    0.12160107
All5utrs                        0.04664122
Alltranscripts                  0.54511688
Allcds                          0.03104567
Allintrons                      0.47491474
All3utrs                        0.01729326
Percentage Of Reads in Peaks: 53.7
Number of Peaks: 72437
GRanges object with 72437 ranges and 2 metadata columns:
          seqnames             ranges strand   |    Counts bedRangesSummitsTemp
             <Rle>          <IRanges>  <Rle>   | <integer>            <numeric>
      [1]    chr10 [3012435, 3013637]      *   |       588              3012995
      [2]    chr10 [3085419, 3086485]      *   |       128              3085948
      [3]    chr10 [3111976, 3115029]      *   |      1305              3113536
      [4]    chr10 [3131104, 3132199]      *   |       171              3131615
      [5]    chr10 [3133045, 3136316]      *   |      1526              3134275
      ...      ...                ...    ... ...       ...                  ...
  [72433]     chrY [2032538, 2033663]      *   |       146              2033087
  [72434]     chrY [2033844, 2034551]      *   |        68              2034087
  [72435]     chrY [2039354, 2040469]      *   |       137              2039925
  [72436]     chrY [2130004, 2131074]      *   |       133              2130551
  [72437]     chrY [2136830, 2137936]      *   |       127              2137402
  -------
  seqinfo: 21 sequences from an unspecified genome; no seqlengths

> mySample$Counts

> mySample$Counts[1:10]
 [1]  588  128 1305  171 1526  779  838  104  292  343

To get information on total reads you could use flagtagcounts() function.

> flagtagcounts(mySample)

      UnMapped         Mapped     Duplicates       MapQPass MapQPassAndDup 
             0      109762060       32587537       92791099       25641371

Or rip() function for Read In Peaks.

> rip(mySample)

49825859

So you could adjust your counts using a combination of these as required to normalise to library size or total reads in peaks.

Good to remember that ChIPQCsample() sample will use all chromosomes in your dataset (those longer than window length used in CrossCoverage calculation) for count of total reads etc unless specified in function call.

best,

tom