Question

Diffbind matrix/batch effect/normalization

0

Entering edit mode

littlefishes20 ▴ 10

@littlefishes20-13054

Last seen 21 months ago

Taiwan

Dear Rory,

Thanks for your effort in this excellent tool! Your tutorial and detailed answers to each post are extremely helpful. However, I feel confused to some results while running DiffBind. To better illustrate my problems, I would like to invite you to go through my case.

library("DiffBind")
samples=read.csv("res.info.csv")
tamoxifen <- dba(sampleSheet=samples)

tamoxifen_out <- dba.peakset(tamoxifen, minOverlap=1, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
write.table(tamoxifen_out, file="res.all_qvalue.txt", quote=F, sep="\t", row.names=F, col.names=T)

Problem: It should output all peaks found in any samples. However, only peaks found in more than two samples were output in file 'res.all_qvalue.txt' although I set 'minOverlap=1'

tamoxifen_consensus <- dba.peakset(tamoxifen, consensus=DBA_CONDITION, minOverlap=2)
tamoxifen <- dba.count(tamoxifen, peaks=consensus_peaks, bParallel=TRUE,summits=0)

tamoxifen_count1 <- dba.peakset(tamoxifen, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
write.table(tamoxifen_count1, file="res.consensus_count1.txt", quote=F, sep="\t", row.names=F, col.names=T)

Question: If I understand correctly, it should output the TMM normalized read number and using the full library sizes. Do you agree with me?

#Raw read count
tamoxifen_raw <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_READS)
tamoxifen_counts <- dba.peakset(tamoxifen_raw, minOverlap=1, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
write.table(tamoxifen_counts, file="res.consensus_raw_count.txt", quote=F, sep="\t", row.names=F, col.names=T)

#RPKM, producing from normalizing raw read count with sequencing depth and gene length
tamoxifen_rpkm <- dba.count(tamoxifen, peaks=NULL, score=DBA_SCORE_RPKM)
tamoxifen_rpkm_counts <- dba.peakset(tamoxifen_rpkm, minOverlap=1, bRetrieve=TRUE, DataType=DBA_DATA_FRAME)
write.table(tamoxifen_rpkm_counts, file="res.consensus_rpkm.txt", quote=F, sep="\t", row.names=F, col.names=T)

tamoxifen <- dba.normalize(tamoxifen)

Question: It is defining the normalization method used in the following differential analysis, default by using full library size on the raw read count. And it is not the same count as that in res.consensus_count1.txt, right? If so, I couldn't obtain the normalized count using for differential analysis. Do you agree with me?

tamoxifen <- dba.contrast(tamoxifen, design="~Condition + Factor",minMembers=2)
tamoxifen <- dba.contrast(tamoxifen,categories=c(DBA_CONDITION), block=DBA_FACTOR,minMembers=2)

Question: Two batch data were produced for some of the conditions and the batch information was stored in the 'Factor' column in csv file. I am wondering if I should specify the batch information in design or in block parameter.

Thanks for your time and effort. Looking forward to hearing from you at your earliest convenience.

DiffBind • 1.1k views

ADD COMMENT • link updated 2.2 years ago by Rory Stark ★ 5.2k • written 2.4 years ago by littlefishes20 ▴ 10

score 0 · Answer 1 · 2023-01-10

Problem: It should output all peaks found in any samples. However, only peaks found in more than two samples were output in file res.all_qvalue.txt although I set minOverlap=1

In dba.peakset(), the minOverlap parameter is only used when it is used to add new a new sample, generating a new DBA object. In your case, where bRetrieve=TRUE, the parameter is ignored. You should instead specify minOverlap=1 in the previous call to dba().

Question: If I understand correctly, it should output the TMM normalized read number and using the full library sizes. Do you agree with me?

No, the default normalization uses full library sizes only, and does not perform a TMM calculation.

Question: It is defining the normalization method used in the following differential analysis, default by using full library size on the raw read count. And it is not the same count as that in res.consensus_count1.txt, right? If so, I couldn't obtain the normalized count using for differential analysis. Do you agree with me?

You are using the default normalization in both cases, so res.consensus_count1.txt reports normalized counts using the same calculations that will be applied to a subsequent analysis. So you do indeed have them. (You should be aware that the retrieved counts are not used explicitly in the analysis: only the normalisation factors are actually used, not the counts themselves.)

Question: Two batch data were produced for some of the conditions and the batch information was stored in the 'Factor' column in csv file. I am wondering if I should specify the batch information in design or in block parameter.

You should specify it in the design formula, and not used the block parameter at all. The block parameter is only there for backward compatibility with pre-DiffBind_3.x versions that did not allow for the design formula to be specified explicitly. So your first version is the correct one.