Question

DiffBind Normalization by bFullLibrarySize

simonjean434 ▴ 10

@simonjean434-7535

Last seen 9.3 years ago

Canada

Hello,

I am trying to understand DiffBind so that I can use it to analyze my ATAC-seq data and find differentially open chromatin sites. Could you please explain:

1- Why is the minimum raw read count 1 and not 0?

2- Regarding normalization, when should bFullLibrarySize=FALSE be used?

Thanks

atac-seq • 3.3k views

ADD COMMENT • link updated 9.9 years ago by Rory Stark ★ 5.2k • written 9.9 years ago by simonjean434 ▴ 10

score 0 · Answer 1 · 2016-03-10

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 13 months ago

Cambridge, UK

1. DiffBind sets the minimum read count for consensus peaks to 1. Basically, this avoids divide-by-zero checking etc. One-read differences shouldn't make a difference in meaningful results (although it does skew the read distribution).

2. The bFullLibrarySize option determines the total read count used for normalization. If bFullLibrarySize=FALSE, the number of reads that overlap consensus peaks is used for each sample (basically, the sum of all the counts). This is the best option for cases where most of the peaks are not expected to change their binding affinity significantly. For the more conservative default, bFullLibrarySize=TRUE, the total number of aligned reads in the .bam file is used (basically the sequencing depth). This is more appropriate in cases where you expect dramatic shifts in binding affinities, or if you are not sure what to expect.
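To make the difference concrete, here is a minimal base-R sketch of the two library-size definitions. It is only an illustration under stated assumptions: the counts and sequencing depths are made up, `norm_by` is a hypothetical helper, and the simple scale-to-mean-library-size step stands in for whatever DiffBind and edgeR do internally.

```r
# Toy read counts: rows are consensus peaks, columns are samples (made-up values).
counts <- matrix(c(10, 20, 30,
                   40, 80, 120),
                 nrow = 3,
                 dimnames = list(paste0("peak", 1:3), c("sampleA", "sampleB")))

# Hypothetical total aligned reads per .bam file (sequencing depth).
full_lib <- c(sampleA = 1e6, sampleB = 2e6)

# Analogue of bFullLibrarySize=FALSE: reads overlapping consensus peaks.
peak_lib <- colSums(counts)

# Hypothetical helper: scale each sample relative to the mean library size.
norm_by <- function(counts, lib) sweep(counts, 2, lib / mean(lib), "/")

norm_peaks <- norm_by(counts, peak_lib)  # library size = reads in peaks
norm_full  <- norm_by(counts, full_lib)  # library size = sequencing depth

print(norm_peaks)
print(norm_full)
```

In this toy case sampleB has four times the reads in peaks but only twice the sequencing depth. Scaling by reads in peaks makes the two samples identical, so a genuine global shift would be normalized away. Scaling by full library size leaves sampleB at twice sampleA, so the shift is preserved.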

Is this what you were asking?

-Rory

ADD COMMENT • link 9.9 years ago Rory Stark ★ 5.2k

Dear Rory,

Thanks for your reply. Regarding bFullLibrarySize=FALSE, I understand that you use the sum of all counts in all peaks for each sample. How are the counts normalized before the count data is passed to edgeR?

The reason I am asking is that the DE results (FDR 10%) I get through DiffBind are not even close to what I get when running an edgeR GLM in parallel on the same raw count matrix.

Example -> DiffBind (dba.analyze(my2, bFullLibrarySize=FALSE) ) -> 2542 sites

My analysis -> 18213 sites

I've visually checked some of the results, and there are some obvious peaks that are not called DE by DiffBind. I am not sure why there is such a difference, or whether I failed to set a particular parameter in DiffBind.

Thanks for your help

S

ADD REPLY • link 9.9 years ago simonjean434 ▴ 10

I wonder if you are using the correct score when retrieving the "Raw counts Matrix"? The default is normalized data, so if you gave that to edgeR, it would attempt to re-normalize it, which could explain the difference. Here's what I would try:

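> # Re-score the existing counts as raw reads minus control reads
> my2 <- dba.count(my2, peaks=NULL, score=DBA_SCORE_READS_MINUS)
> # Retrieve the binding matrix as a data frame
> bindingMatrix <- dba.peakset(my2, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
> # Keep only the per-sample count columns (column 4 onward)
> counts <- bindingMatrix[, 4:ncol(bindingMatrix)]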

If you don't want to subtract the control reads, you can use score=DBA_SCORE_READS instead, and then set bSubControl=FALSE when calling dba.analyze().
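Before passing the retrieved matrix to edgeR, a quick sanity check can show whether it holds raw counts or already-normalized scores. The sketch below uses base R only. `looks_like_raw_counts` is a hypothetical helper (not part of DiffBind or edgeR), the example matrices are made up, and the check is only a heuristic, since normalized scores that happen to be whole numbers would also pass.

```r
# Hypothetical helper: TRUE if every entry is a finite, non-negative whole number.
looks_like_raw_counts <- function(m) {
  m <- as.matrix(m)
  is.numeric(m) && all(is.finite(m)) && all(m >= 0) && all(m == round(m))
}

raw_example  <- matrix(c(1, 12, 7, 30), nrow = 2)          # made-up raw counts
norm_example <- matrix(c(1.0, 11.6, 7.3, 28.9), nrow = 2)  # made-up normalized scores

looks_like_raw_counts(raw_example)   # TRUE
looks_like_raw_counts(norm_example)  # FALSE
```

Applied to the `counts` object retrieved above, a FALSE result would suggest the scores are not raw reads and that edgeR would be normalizing a second time.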

See how that works. If it is still very different, it may have to do with the parameters you are setting in edgeR. See the technical notes in the vignette, which explain in more detail how edgeR is used.

Cheers-

Rory

ADD REPLY • link 9.9 years ago Rory Stark ★ 5.2k