Question

Extracting raw read counts in DiffBind consensus peak set

0

Entering edit mode

hasse.bossenbroek • 0

@hassebossenbroek-23193

Last seen 4.4 years ago

Hi,

I'm trying to extract the consensus peakset that DiffBind constructs, with the raw counts in these regions. I have done this two different ways:

1) After running dba.count(), dba.contrast() and dba.analyze() with default settings:

counts_1 <- dba.report(DBA, th=1, bCounts = TRUE, bNormalized = FALSE, DataType = DBA_DATA_FRAME)

2) (I had to run dba.count() normally first, otherwise the second line of code did not run)

DBA <- dba.count(DBA)
DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_READS)
counts_2 <- dba.peakset(DBA, bRetrieve = TRUE, DataType = DBA_DATA_FRAME)

These two methods both yield an object with 104976 consensus peak regions. However, some of these contain 1s in every column (so there are no reads in any of my samples). My first question is, why is this? If these regions have no reads in any sample, why do they make it into the consensus peak set?

My second question is: The number of these regions without reads in different with the two methods. With method 1, I find 11476 such regions; with method 2, there are 9686, even though the total number of regions is the same. What causes this discrepancy, and what is the right way to extract raw read counts in consensus peaks?

Thank you in advance for your answer!

DiffBind • 2.6k views

ADD COMMENT • link updated 4.5 years ago by Rory Stark ★ 5.2k • written 4.5 years ago by hasse.bossenbroek • 0

score 1 · Answer 1 · 2020-06-05

1

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 5 weeks ago

Cambridge, UK

I'm not sure how you called peaks initially? The defaults for determining a consensus peakset are based strictly on peak overlaps, not on counts within peaks.

The default counting method may undercount overlapping regions if you don't supply a fragmentSize and/or have paired-end reads. One thing to try is to call dba.count() with bUseSummarizeOverlaps=TRUE, and compare the results.

The main difference in count scores between the default scoring method DBA_SCORE_TMM_MINUS_FULL and DBA_SCORE_READS, besides the normalization, is that reads in the control track are subtracted (that's the MINUS part). So you have more peaks with no counts because there are more control counts that ChIP counts overlapping that interval.

Also, you should be able to specify score=DBA_SCORE_READS directly to the initial call to dba.count() (but not peaks=NULL, as that tells the function to use the current counts to recompute the score).

ADD COMMENT • link 4.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

Thank you for your answer! I used bUseSummarizeOverlaps=TRUE, and that did get rid of the empty rows. I think the problem was that my reads are paired-end. Please could you explain why the default counting method undercounts overlapping regions if the reads are paired-end? And also, how can I set this in DBA$config? (Apparently the default is singleEnd = TRUE, but I couldn't figure out how to change this).

Thank you very much for your help!

ADD REPLY • link 4.5 years ago hasse.bossenbroek • 0

0

Entering edit mode

Set

myDBA$config$singleEnd <- FALSE

before calling

dba.count(myDBA, bUseSummarizeOverlaps=TRUE)

ADD REPLY • link 4.5 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

Thank you for your answer. I have done some comparisons, and I have found that I get far fewer differentially accessible regions when I set DBA$config$singleEnd = FALSE than when I don't set this (I use bUseSummarizeOverlaps=TRUE in both cases). Can you explain why that would be the case? And what would you recommend for paired-end data - what is the benefit of setting DBA$config$singleEnd = FALSE?

Thank you very much!

ADD REPLY • link 4.5 years ago hasse.bossenbroek • 0

0

Entering edit mode

There could be a number of reasons for this. It could be that in paired-end mode the reads are overlapping multiple features and not being counted. It would be interesting to look at the difference in the read count matrices when singleEnd is FALSE vs when it is TRUE. If you were to send me DBA objects with the counts done in each way I could have a look.

ADD REPLY • link 4.4 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Hi Rory,

Thank you for getting back to me. I'm not sure what the best way to get these objects to you is (I don't think I can upload them here), so I'm emailing them to you. I hope that is okay.

Thank you for your help with this! Best wishes, Hasse

ADD REPLY • link 4.4 years ago hasse.bossenbroek • 0

0

Entering edit mode

If the bam files contain paired-end reads, and you set bUseSummarizeOverlaps=TRUE, then you need to make sure that singleEnd=FALSE. If not, each end of the read will be counted separately, so you'll get twice as many counts as you should (each fragment will be counted twice). The counts won't come out to exactly double when `singleEnd=TRUE, as in some cases each end will overlap a different peak (these would be discarded in the paired-end case as they can not be uniquely assigned to a single peak interval).

ADD REPLY • link 4.4 years ago Rory Stark ★ 5.2k

score 0 · Answer 2 · 2020-06-04

If you want to do this outside of DiffBind, there's another way of doing it.

Call peaks(e.g. macs2). Convert the peak files to bed files.
Merge all the bed files from your samples with bedtools to make a consensus peakset.
Covert the merged peak file(*.bed file) into an *.saf file, to input into featureCounts(from SubRead package). Check this link how to make a saf file from bed file using simple awk command https://www.biostars.org/p/228636/.
Get the raw read count matrix for your consensus peakset by using that one saf file and all the bam files in featureCounts to get raw count matrix.

You could probably write a bash script to do all these steps in one go.