Hi
I am using DiffBind for analyses of my Chip-seq results. I was wondering if there is any way I can get the original peak fold enrichment scores (from MACS2 output) for the consensus peak set that DiffBind generates.
Thanks
Hi
I am using DiffBind for analyses of my Chip-seq results. I was wondering if there is any way I can get the original peak fold enrichment scores (from MACS2 output) for the consensus peak set that DiffBind generates.
Thanks
Hi, two quick comments:
The problem isn't uniquely defined, because more than one peak in the original peak file may correspond with a consensus peak. (For example, in sample A, peaks 1 and 2 are near each other; in sample B, peak 3 overlaps both 1 and 2: the merged peak will include all 3 peaks.)
Having said that, DiffBind
doesn't actually store the fold change, only the score, so notwithstanding the above, there won't be a direct way. The following code snippet gives you the peaks in the original peak file corresponding with the consensus peaks, with the fold change, which is not what you want, but might be of some small help:
library(IRanges) library(GenomicRanges) library(GenomicAlignments) library(DiffBind) macs2granges <- function(filename) { df <- read.table(filename,blank.lines.skip=TRUE,header=TRUE,comment.char="#", sep="\t",stringsAsFactors=FALSE) gr <- GRanges(seqnames=Rle(df$chr),ranges=IRanges(df$start,df$end), strand=Rle("*",nrow(df)),score=df$fold_enrichment) return(gr) } dba2granges <- function(d) { gr <- GRanges(seqnames=Rle(d$chrmap[d$vectors$CHR]), ranges=IRanges(d$vectors$START,d$vectors$END), strand=Rle("*",nrow(d$vectors))) return(gr) } orig <- macs2granges('your_peaks.xls') dba_obj <- dba(sampleSheet='your_sampleSheet.csv') overlap <- subsetByOverlaps(orig,dba2granges(dba_obj))
You can get the full matrix of scores, as a data frame, as follows:
> scoreMatrix <- dba.peakset(DBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
Before counting, the scores are derived from the peak scores in the peak files (in your case MACS2), normalized to a 0..1 scale. After counting via dba.count()
, the scores are computed The default scoring method, DBA_SCORE_TMM_MINUS_FULL
, uses the TMM normalization method from the edgeR
package, on read counts minus the control read counts (set to a minimum value of 1), and based on the full library size (number of reads in the bam file). Other scores include raw reads, RPKM, and summit heights, as described in the man page for dba.count
for the score
parameter.
You can re-set the scores without re-counting (and re-export the matrix) by calling dba.count
with peaks=NULL
, e.g.
> DBA <- dba.count(DBA, peaks=NULL, score=DBA_SCORE_RPKM) > readCounts <- dba.peakset(DBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
Hope this helps-
Rory
Specifically, in the case of MACS2, the "peak score" is the "-log10(p-value)" column by default. You could set the "scoreCol" parameter to get the fold change instead: use "scoreCol=8" when calling "dba". But it'll still be normalized before you see it, and the problem with multiple original peaks corresponding with one merged peak still remains.
There's a distinction between the peak *caller* score and the peak score. The peak caller itself generates a score, by some mechanism specific to the caller. Once you've called dba.count
, the peak *caller* score is no longer applicable, because the peaks have changed. As Rory has described, you then have a choice of various other peak scores, based on the read counts of the (merged) peaks. Does that clarify anything?
Yes, I understand the difference. The peak caller score has to do with peaks being real or not compared with the input control, whereas the peak scores generated by dba.count have to do with intensity of each peak/number of reads mapping to each peak.
I was hoping I can still get some score equivalent to the peak caller score for the consensus peak set though, but if I understand you correctly, you are saying that since the peaks/peak ranges are different in the consensus set compared to the original peak ranges in MACS2 output bed files, it is not possible to get corresponding scores for the new/consensus peak set by going back to the original bed files. Is that right?
Actually, there is a way to get the a version of the peak caller scores, before calling dba.count()
. If you want the consensus set based on peaks in 2 or more samples, you can just retrieve the binding matrix using dba.peakset()
. If you want to use a different overlap rate, you can make a new DBA object using that rate. For example, if you want all the peaks in at least 3 peaksets:
> newDNA<-= dba(DBA, minOverlap=3) > scores <- dba.peakset(newDBA, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
There are two issues to deal with there. The first is that the scores will have been normalized to a 0..1 scale, so you won't have the same numbers as reported in the MACS .xls files. You can recover them by multiplying each column by the maximum score in the peakset for that sample. The other complication comes in certain cases where more than one peak for a sample was merged into a single peak (eg because they both overlapped with a peak in a different sample). In this case, the score will be the maximum of the scores across the merged peak for that sample.
Cheers-
Rory
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you so much for your helpful response.
So you mentioned that DiffBind stores "scores" for each sample. Could you please explain how exactly those scores are calculated and how I can access the scores and possibly export them as a data frame?