Hi, I want to use DiffBind for differential accessible chromatin regions for my ATACSeq analysis. I am trying to understand the core codes for this analysis as described in DiffBind manual but getting bit confused due to many interlinked options. If the metadata (information of case, control, replicates, bam files, peaks files, and format) are mentioned in a csv file and the analysis is run straight away with the command "dba.analyze("File.csv")". With this scenario I have three questions:
Does DiffBind make two consensus peak sets (one for each: case and control condition) using the replicates within each condition or does it make one consensus peak set for all samples regardless of the condition?
Can we mention the two conditions (Case and Control) with the option "categories=DBA_CONDITION" to make contrast before the differential analysis?
In default "dba.analyze("File.csv")" analysis option, does differential analysis consider only the common consensus peak regions between case and control conditions? or it considers the unique consensus peak regions present in either of the condition (unique to case or control only)?
If anybody can help answer these questions, it will be greatly appreciated. @rory-stark-5741?
Sajid
Rory, thank you for your answers. It helped to resolve many puzzles in my mind. Much appreciated.
Just one more question. Specifically, before the differential analysis (may be by DESEQ2 or EDGER) analysis, can I correct the peakset signal (count) data of individual samples for some additional variables like breed, location, health of sampled animals? If so, How can I add these variables in default mode (dba.analyze("File.csv")), or in step-by-step mode just before differential analysis step?
Generally it is advisable to include all the steps of the analysis individually so that you can be clear in your mind what is actually going on, and to optimize the processing at specific steps as needed. The equivalent sequence is:
To use a more complex model, you'll need to include the metadata factors in your sample sheet from the beginning. Basically you can treat the pre-defined meta-data columns Tissue, Factor, Condition, Treatment, and Replicate however you like with whatever metadata you choose. For example, besides including the Case/Control metadata in the Condition column, you could include the breed metadata in the Tissue column, and location metadata in the Treatment** column. Then, before you call
dba.analyze()
, you can calldba.constrast()
and specify thedesign
andcontrast
parameters:This will control for variance associated with breeds and locations when testing for differences between Case and Control samples.
Rory, that great!! thank you much for guidance. It made things easier.
Cheers.
Hi Rory, I came up with two more questions while analyzing my data with DiffBind with a hope that it is not bothering you.
I ran following code to make consensus peakset between replicates of case and control condition:
consensus <- dba.peakset(dbObj, consensus =-DBA_REPLICATE, minOverlap = 2)
I got following summary:
consensus
6 Samples, 79828 sites in matrix (193306 total):
ID Tissue Condition Replicate Intervals
1 AW_1 Rumen Case 1 90088
2 AW_2 Rumen Case 2 106622
3 BW_1 Rumen Control 1 114659
4 BW_2 Rumen Control 2 54951
5 Case Rumen Case 1-2 60653
6 Control Rumen Control 1-2 33807
The last two rows (5 and 6) contain the consensus intervals specific to Case and Control groups. I am interested to proceed my analysis with these two consensus peak sets only. How can I separate these peak intervals to mention in:
dba.count(dbObj, peaks=???)
function? I have trieddba.count(dbObj, peaks=consensus)
but looks like it uses all intervals (79828) mentioned in above summary.I managed an alternate way by using overlap function as follow:
consensus_overlap <- dba.overlap(consensus, consensus_$masks$Consensus)
Unique_Case <- consensus_Rep_Overlap$onlyA
(It created list of intervals unique to Case samples only)Unique_Control <- consensus_Rep_Overlap$onlyB
(It created list of intervals unique to Control samples only)Common <- consensus_Rep_Overlap$inAll
(It created list of intervals present in both Case and Control)However, I had to run three separate sets of analyses for these three lists. Is there anyway to run a single analysis using above two unique and one common list of intervals?
Sajid
If you want to use all of the sites in either (or both) the Case consensus and/or the Control consensus, you can do it as follows:
Thank you Rory. It worked well. Much appreciated.
Hi Rory, I came up with another question while analyzing my data. A scenario of analysis: Path to bam files, path to peaks files (broad peaks by MACS2), peak Format, and scoreCol information is mentioned in the meta-data csv file to make a DBA. Consensus peaksets are made for case and control conditions using the -DBA_REPLICATE option.
When using
dba.count
, does DiffBind make the affinity matrix by counting the actual number of reads from the bam files, or, does it rely on the score of peaks available in the peak file by MACS2 (also mentioned in scoreCol column in meta-data cvs file)?If DiffBind counts the reads in peaks by itself, does it use the read counts (after normalization) for differential analysis as it is, or, does it rescale these read counts with some formula?
dba.count()
counts reads overlapping consensus regions. After that, the original peak scores are ignored. All quantitative analysis (viadba.analyze()
) is done using read counts. The peak scores have very limited use for generating clustering heatmaps and PCA plots prior to counting.DiffBind
itself does not alter the raw read counts when invoking an underlying analysis method (DESeq2
and/oredgeR
), however it does supply normalization parameters along with the counts. You can retrieve a version of the normalized binding matrix, however these re-scaled counts are approximations computed byDiffBind
used only for reporting and plotting, and not the raw data thatDESeq2
/edgeR
see.That's great. Thank you for clarifying the confusion.
Cheers,
Hi Rory, I am normalizing the counts with the following options:
normalized <- dba.normalize(counts, method=DBA_DESEQ2, normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_FULL, bRetrieve = TRUE)
Then I want to retrieve information about the normalizing process with commands
normalized$norm.method
,normalized$norm.factors
andnormalized$lib.sizes
With every info command, instead of providing the information, it replies "NULL".
Can you please help me with this?
You need to call
dba.normalize()
twice, first to establish the normalization calculations, and the second time to retrieve them:that way you have both the normalized DBA object (
counts.norm
) and the normalization information (norm.info
).Got it. Thank you Rory.