Question

Does DiffBind make consensus peaks sets for samples from a given condition (e.g. case control)?

0

Entering edit mode

tahir.msajid • 0

@172e2d0e

Last seen 2.8 years ago

Australia

Hi, I want to use DiffBind for differential accessible chromatin regions for my ATACSeq analysis. I am trying to understand the core codes for this analysis as described in DiffBind manual but getting bit confused due to many interlinked options. If the metadata (information of case, control, replicates, bam files, peaks files, and format) are mentioned in a csv file and the analysis is run straight away with the command "dba.analyze("File.csv")". With this scenario I have three questions:

Does DiffBind make two consensus peak sets (one for each: case and control condition) using the replicates within each condition or does it make one consensus peak set for all samples regardless of the condition?
Can we mention the two conditions (Case and Control) with the option "categories=DBA_CONDITION" to make contrast before the differential analysis?
In default "dba.analyze("File.csv")" analysis option, does differential analysis consider only the common consensus peak regions between case and control conditions? or it considers the unique consensus peak regions present in either of the condition (unique to case or control only)?

If anybody can help answer these questions, it will be greatly appreciated. @rory-stark-5741?

Sajid

dba.analyze DiffBind @rory-stark-5741 • 3.0k views

ADD COMMENT • link 2.8 years ago tahir.msajid • 0

score 0 · Answer 1 · 2023-03-13

0

Entering edit mode

Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 12 months ago

Cambridge, UK

In the fully default analysis, the consensus peakset includes merged peaks that are called in at least two samples, regardless of which sample group the belong to. In the default analysis, these peaks are re-centered around their point of greatest enrichment and set to a standard width of 401bp (200 bp up- and down-stream of the point of greatest enrichment).
You can call dba.contrast() and specify to specific contrast prior to initiating an analysis (using dba.analyze()). In versions since DiffBind_3.0, you should not use the categories= option, but rather contrast=c(Condition, "Case", "Control") and an optional design="~Condition".
In the fully default analysis, the consensus peakset as described in point 1 above is used. This considers peaks that may be called in two replicates of one condition (and may be unique to that condition) in addition to peaks that are called in samples belonging to different sample groups (peaks shared between conditions). It is frequently the case that peaks called across conditions have consistent changes in read counts (binding affinity) and are significantly differentially bound.

ADD COMMENT • link 2.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Rory, thank you for your answers. It helped to resolve many puzzles in my mind. Much appreciated.

Just one more question. Specifically, before the differential analysis (may be by DESEQ2 or EDGER) analysis, can I correct the peakset signal (count) data of individual samples for some additional variables like breed, location, health of sampled animals? If so, How can I add these variables in default mode (dba.analyze("File.csv")), or in step-by-step mode just before differential analysis step?

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

Generally it is advisable to include all the steps of the analysis individually so that you can be clear in your mind what is actually going on, and to optimize the processing at specific steps as needed. The equivalent sequence is:

myDB <- dba(sampleSheet="File.csv")
myDB <- dba.blacklist(myDB)
myDB <- dba.count(myDB)
myDB <- dba.normalize(myDB)
myDB <- dba.contrast(myDB)
myDB <- dba.analyze(myDB)

To use a more complex model, you'll need to include the metadata factors in your sample sheet from the beginning. Basically you can treat the pre-defined meta-data columns Tissue, Factor, Condition, Treatment, and Replicate however you like with whatever metadata you choose. For example, besides including the Case/Control metadata in the Condition column, you could include the breed metadata in the Tissue column, and location metadata in the Treatment** column. Then, before you call dba.analyze(), you can call dba.constrast() and specify the design and contrast parameters:

myDB <- dba.contrast(myDB, design="~Tissue + Treatment + Condition", 
                     contrast=c("Condition","Case","Control")

This will control for variance associated with breeds and locations when testing for differences between Case and Control samples.

ADD REPLY • link 2.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Rory, that great!! thank you much for guidance. It made things easier.

Cheers.

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

Hi Rory, I came up with two more questions while analyzing my data with DiffBind with a hope that it is not bothering you.

I ran following code to make consensus peakset between replicates of case and control condition:

consensus <- dba.peakset(dbObj, consensus =-DBA_REPLICATE, minOverlap = 2)

I got following summary:

consensus

6 Samples, 79828 sites in matrix (193306 total):

ID Tissue Condition Replicate Intervals

1 AW_1 Rumen Case 1 90088

2 AW_2 Rumen Case 2 106622

3 BW_1 Rumen Control 1 114659

4 BW_2 Rumen Control 2 54951

5 Case Rumen Case 1-2 60653

6 Control Rumen Control 1-2 33807

The last two rows (5 and 6) contain the consensus intervals specific to Case and Control groups. I am interested to proceed my analysis with these two consensus peak sets only. How can I separate these peak intervals to mention in: dba.count(dbObj, peaks=???) function? I have tried dba.count(dbObj, peaks=consensus) but looks like it uses all intervals (79828) mentioned in above summary.

I managed an alternate way by using overlap function as follow:

consensus_overlap <- dba.overlap(consensus, consensus_$masks$Consensus)

Unique_Case <- consensus_Rep_Overlap$onlyA (It created list of intervals unique to Case samples only)

Unique_Control <- consensus_Rep_Overlap$onlyB (It created list of intervals unique to Control samples only)

Common <- consensus_Rep_Overlap$inAll (It created list of intervals present in both Case and Control)

However, I had to run three separate sets of analyses for these three lists. Is there anyway to run a single analysis using above two unique and one common list of intervals?

Sajid

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

If you want to use all of the sites in either (or both) the Case consensus and/or the Control consensus, you can do it as follows:

consensusDB <- dba.peakset(dbObj, consensus =-DBA_REPLICATE, minOverlap = 2)
consensusDB <- dba(consensusDB, mask=consensusDB$masks$Consensus, minOverlap=1)
consensus.peaks <- dba.peakset(consensusDB, bRetrieve=TRUE)
dbObj <- dba.count(dbObj, peaks=consensus.peaks)

ADD REPLY • link 2.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

Thank you Rory. It worked well. Much appreciated.

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

Hi Rory, I came up with another question while analyzing my data. A scenario of analysis: Path to bam files, path to peaks files (broad peaks by MACS2), peak Format, and scoreCol information is mentioned in the meta-data csv file to make a DBA. Consensus peaksets are made for case and control conditions using the -DBA_REPLICATE option.

When using dba.count, does DiffBind make the affinity matrix by counting the actual number of reads from the bam files, or, does it rely on the score of peaks available in the peak file by MACS2 (also mentioned in scoreCol column in meta-data cvs file)?

If DiffBind counts the reads in peaks by itself, does it use the read counts (after normalization) for differential analysis as it is, or, does it rescale these read counts with some formula?

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

dba.count() counts reads overlapping consensus regions. After that, the original peak scores are ignored. All quantitative analysis (via dba.analyze()) is done using read counts. The peak scores have very limited use for generating clustering heatmaps and PCA plots prior to counting.

DiffBind itself does not alter the raw read counts when invoking an underlying analysis method (DESeq2 and/or edgeR), however it does supply normalization parameters along with the counts. You can retrieve a version of the normalized binding matrix, however these re-scaled counts are approximations computed by DiffBind used only for reporting and plotting, and not the raw data that DESeq2/edgeR see.

ADD REPLY • link 2.8 years ago Rory Stark ★ 5.2k

0

Entering edit mode

That's great. Thank you for clarifying the confusion.

Cheers,

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

Hi Rory, I am normalizing the counts with the following options:

normalized <- dba.normalize(counts, method=DBA_DESEQ2, normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_FULL, bRetrieve = TRUE)

Then I want to retrieve information about the normalizing process with commands normalized$norm.method , normalized$norm.factors and normalized$lib.sizes

With every info command, instead of providing the information, it replies "NULL".

Can you please help me with this?

ADD REPLY • link 2.8 years ago tahir.msajid • 0

0

Entering edit mode

You need to call dba.normalize() twice, first to establish the normalization calculations, and the second time to retrieve them:

counts.norm <- dba.normalize(counts, method=DBA_DESEQ2, 
                             normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_FULL)
norm.info <- dba.normalize(counts.norm, bRetrieve=TRUE)

that way you have both the normalized DBA object (counts.norm) and the normalization information (norm.info).