retreiving peak caller fold enrichments after generating consensus peakset in DiffBind
2
0
Entering edit mode
mm2489 • 0
@mm2489-7705
Last seen 4.2 years ago
United States

Hi

I am using DiffBind for analyses of my Chip-seq results. I was wondering if there is any way I can get the original peak fold enrichment scores (from MACS2 output) for the consensus peak set that DiffBind generates.

Thanks

diffbind • 1.4k views
1
Entering edit mode
Gord Brown ▴ 640
@gord-brown-5664
Last seen 9 months ago
United Kingdom

The problem isn't uniquely defined, because more than one peak in the original peak file may correspond with a consensus peak.  (For example, in sample A, peaks 1 and 2 are near each other; in sample B, peak 3 overlaps both 1 and 2: the merged peak will include all 3 peaks.)

Having said that, DiffBind doesn't actually store the fold change, only the score, so notwithstanding the above, there won't be a direct way.  The following code snippet gives you the peaks in the original peak file corresponding with the consensus peaks, with the fold change, which is not what you want, but might be of some small help:

library(IRanges)
library(GenomicRanges)
library(GenomicAlignments)
library(DiffBind)

macs2granges <- function(filename) {
sep="\t",stringsAsFactors=FALSE)
gr <- GRanges(seqnames=Rle(df$chr),ranges=IRanges(df$start,df$end), strand=Rle("*",nrow(df)),score=df$fold_enrichment)
return(gr)
}

dba2granges <- function(d) {
gr <- GRanges(seqnames=Rle(d$chrmap[d$vectors$CHR]), ranges=IRanges(d$vectors$START,d$vectors$END), strand=Rle("*",nrow(d$vectors)))
return(gr)
}

orig <- macs2granges('your_peaks.xls')
dba_obj <- dba(sampleSheet='your_sampleSheet.csv')

overlap <- subsetByOverlaps(orig,dba2granges(dba_obj))
0
Entering edit mode

So you mentioned that DiffBind stores "scores" for each sample. Could you please explain how exactly those scores are calculated and how I can access the scores and possibly export them as a data frame?

0
Entering edit mode
Rory Stark ★ 4.1k
@rory-stark-5741
Last seen 16 days ago
CRUK, Cambridge, UK

You can get the full matrix of scores, as a data frame,  as follows:

> scoreMatrix <- dba.peakset(DBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

Before counting, the scores are derived from the peak scores in the peak files (in your case MACS2), normalized to a 0..1 scale. After counting via dba.count(), the scores are computed  The default scoring method, DBA_SCORE_TMM_MINUS_FULL, uses the TMM normalization method from the edgeR package, on read counts minus the control read counts (set to a minimum value of 1), and based on the full library size (number of reads in the bam file). Other scores include raw reads, RPKM, and summit heights, as described in the man page for dba.count for the score parameter.

You can re-set the scores without re-counting (and re-export the matrix) by calling dba.count with peaks=NULL, e.g.

> DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_RPKM)
> readCounts <- dba.peakset(DBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

Hope this helps-

Rory

0
Entering edit mode

Specifically, in the case of MACS2, the "peak score" is the "-log10(p-value)" column  by default.  You could set the "scoreCol" parameter to get the fold change instead: use "scoreCol=8" when calling "dba".  But it'll still be normalized before you see it, and the problem with multiple original peaks corresponding with one merged peak still remains.

0
Entering edit mode

Thank you both for your help.

I guess the problem is that I want peak caller scores but only for the consensus peaks, which means only the peak set generated after I call dba.count(), right?

1
Entering edit mode

There's a distinction between the peak *caller* score and the peak score.  The peak caller itself generates a score, by some mechanism specific to the caller.  Once you've called dba.count, the peak *caller* score is no longer applicable, because the peaks have changed.  As Rory has described, you then have a choice of various other peak scores, based on the read counts of the (merged) peaks.  Does that clarify anything?

0
Entering edit mode

Yes, I understand the difference. The peak caller score has to do with peaks being real or not compared with the input control, whereas the peak scores generated by dba.count have to do with intensity of each peak/number of reads mapping to each peak.

I was hoping I can still get some score equivalent to the peak caller score for the consensus peak set though, but if I understand you correctly, you are saying that since the peaks/peak ranges are different in the consensus set compared to the original peak ranges in MACS2 output bed files, it is not possible to get corresponding scores for the new/consensus peak set by going back to the original bed files. Is that right?

1
Entering edit mode

Actually, there is a way to get the a version of the peak caller scores, before calling dba.count(). If you want the consensus set based on peaks in 2 or more samples, you can just retrieve the binding matrix using dba.peakset(). If you want to use a different overlap rate, you can make a new DBA object using that rate. For example, if you want all the peaks in at least 3 peaksets:

> newDNA<-= dba(DBA, minOverlap=3)
> scores <- dba.peakset(newDBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)

There are two issues to deal with there. The first is that the scores will have been normalized to a 0..1 scale, so you won't have the same numbers as reported in the MACS .xls files. You can recover them by multiplying each column by the maximum score in the peakset for that sample. The other complication comes in certain cases where more than one peak for a sample was merged into a single peak (eg because they both overlapped with a peak in a different sample). In this case, the score will be the maximum of the scores across the merged peak for that sample.

Cheers-

Rory

0
Entering edit mode

Thank you so much, this is exactly what I needed.