Extracting raw read counts in DiffBind consensus peak set
2
0
Entering edit mode
@hassebossenbroek-23193
Last seen 4.4 years ago

Hi,

I'm trying to extract the consensus peakset that DiffBind constructs, with the raw counts in these regions. I have done this two different ways:

1) After running dba.count(), dba.contrast() and dba.analyze() with default settings:

counts_1 <- dba.report(DBA, th=1, bCounts = TRUE, bNormalized = FALSE, DataType = DBA_DATA_FRAME)

2) (I had to run dba.count() normally first, otherwise the second line of code did not run)

DBA <- dba.count(DBA)
DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_READS)
counts_2 <- dba.peakset(DBA, bRetrieve = TRUE, DataType = DBA_DATA_FRAME)

These two methods both yield an object with 104976 consensus peak regions. However, some of these contain 1s in every column (so there are no reads in any of my samples). My first question is, why is this? If these regions have no reads in any sample, why do they make it into the consensus peak set?

My second question is: The number of these regions without reads in different with the two methods. With method 1, I find 11476 such regions; with method 2, there are 9686, even though the total number of regions is the same. What causes this discrepancy, and what is the right way to extract raw read counts in consensus peaks?

Thank you in advance for your answer!

DiffBind • 2.6k views
ADD COMMENT
1
Entering edit mode
Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 5 weeks ago
Cambridge, UK

I'm not sure how you called peaks initially? The defaults for determining a consensus peakset are based strictly on peak overlaps, not on counts within peaks.

The default counting method may undercount overlapping regions if you don't supply a fragmentSize and/or have paired-end reads. One thing to try is to call dba.count() with bUseSummarizeOverlaps=TRUE, and compare the results.

The main difference in count scores between the default scoring method DBA_SCORE_TMM_MINUS_FULL and DBA_SCORE_READS, besides the normalization, is that reads in the control track are subtracted (that's the MINUS part). So you have more peaks with no counts because there are more control counts that ChIP counts overlapping that interval.

Also, you should be able to specify score=DBA_SCORE_READS directly to the initial call to dba.count() (but not peaks=NULL, as that tells the function to use the current counts to recompute the score).

ADD COMMENT
0
Entering edit mode

Hi Rory,

Thank you for your answer! I used bUseSummarizeOverlaps=TRUE, and that did get rid of the empty rows. I think the problem was that my reads are paired-end. Please could you explain why the default counting method undercounts overlapping regions if the reads are paired-end? And also, how can I set this in DBA$config? (Apparently the default is singleEnd = TRUE, but I couldn't figure out how to change this).

Thank you very much for your help!

ADD REPLY
0
Entering edit mode

Set

myDBA$config$singleEnd <- FALSE

before calling

dba.count(myDBA, bUseSummarizeOverlaps=TRUE)
ADD REPLY
0
Entering edit mode

Hi Rory,

Thank you for your answer. I have done some comparisons, and I have found that I get far fewer differentially accessible regions when I set DBA$config$singleEnd = FALSE than when I don't set this (I use bUseSummarizeOverlaps=TRUE in both cases). Can you explain why that would be the case? And what would you recommend for paired-end data - what is the benefit of setting DBA$config$singleEnd = FALSE?

Thank you very much!

ADD REPLY
0
Entering edit mode

There could be a number of reasons for this. It could be that in paired-end mode the reads are overlapping multiple features and not being counted. It would be interesting to look at the difference in the read count matrices when singleEnd is FALSE vs when it is TRUE. If you were to send me DBA objects with the counts done in each way I could have a look.

ADD REPLY
0
Entering edit mode

Hi Rory,

Thank you for getting back to me. I'm not sure what the best way to get these objects to you is (I don't think I can upload them here), so I'm emailing them to you. I hope that is okay.

Thank you for your help with this! Best wishes, Hasse

ADD REPLY
0
Entering edit mode

If the bam files contain paired-end reads, and you set bUseSummarizeOverlaps=TRUE, then you need to make sure that singleEnd=FALSE. If not, each end of the read will be counted separately, so you'll get twice as many counts as you should (each fragment will be counted twice). The counts won't come out to exactly double when `singleEnd=TRUE, as in some cases each end will overlap a different peak (these would be discarded in the paired-end case as they can not be uniquely assigned to a single peak interval).

ADD REPLY
0
Entering edit mode
bioinfouser2 ▴ 10
@bioinfouser2-15147
Last seen 4.2 years ago

If you want to do this outside of DiffBind, there's another way of doing it.

  1. Call peaks(e.g. macs2). Convert the peak files to bed files.
  2. Merge all the bed files from your samples with bedtools to make a consensus peakset.
  3. Covert the merged peak file(*.bed file) into an *.saf file, to input into featureCounts(from SubRead package). Check this link how to make a saf file from bed file using simple awk command https://www.biostars.org/p/228636/.
  4. Get the raw read count matrix for your consensus peakset by using that one saf file and all the bam files in featureCounts to get raw count matrix.

You could probably write a bash script to do all these steps in one go.

ADD COMMENT

Login before adding your answer.

Traffic: 611 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6