Question

DiffBind analysis on ChIP-Seq data and design matrix


SFn • 0

@2487a888

Last seen 3.7 years ago

France

Hello everyone,

I’m working on a set of ChIP-Seq samples and I’ve started using DiffBind for my differential analysis.

My experimental design is the following :

2 conditions : Glucose (control condition) and Lactose (Induction condition)

3 epigenetic marks : H3K4me3, H3K9me3 and H3K27me3

3 replicates for each mark and each condition

2 inputs for each condition

1 mock for each condition

The three replicates for each condition are biological replicates from which I IPed my three marks :

Glucose condition :

Biological replicate 1 : H3K4me3 (1) H3K9me3 (1) H3K27me3 (1) Input (1)

Biological replicate 2 : H3K4me3 (2) H3K9me3 (2) H3K27me3 (2) Input (2) Mock

Biological replicate 3 : H3K4me3 (3) H3K9me3 (3) H3K27me3 (3)

And same for lactose condition.

Firstly, I would like to do a differential analysis between glucose vs lactose.

I ran three independent DiffBind analyses, one for each mark, where I used factor=epigenetic mark, condition=glucose or lactose, and I didn’t use the treatment column. It seemed the right way to go, but I’ve been wondering how DiffBind treats its variables «treatment», «condition» and «factor» under the hood. In my case, I’m not 100% sure which one corresponds to my design. Do you have any insight on that matter?

On the other hand, I also ran a single analysis including all three marks in the factor column and condition=glucose or lactose, leaving the treatment column empty. I got very low FRiP scores compared to the independent analyses, so I didn’t go further. Why is there such a difference?

Also, I’m not sure how to use my inputs here. For the differential analysis, should bamControl point to my IP samples from the glucose condition or to my input samples?

Same thing with my mock sample, how can I use it ?

Thank you for your help !

ChIPSeq DiffBind ExperimentalDesign • 1.7k views

ADD COMMENT • link 3.9 years ago • updated 3.8 years ago SFn • 0

score 1 · Accepted Answer · 2021-02-14


Rory Stark ★ 5.2k

@rory-stark-5741

Last seen 8 weeks ago

Cambridge, UK

The metadata designators Tissue, Factor, Condition, and Treatment are all treated the same and are basically arbitrary -- you can put any design factor into any of these. Something like Glucose vs Lactose could be either Condition or Treatment; it's just a matter of how you like the design to appear.

As I understand it, in your first analysis you ran DiffBind three separate times, each time including the six replicates of a single mark (for two conditions), each time testing a single contrast (Glucose vs Lactose). In the second analysis, you included all 18 samples, and stopped when you saw the FRiPs were low.

The lower FRiPs are not unusual and not necessarily cause for concern if the three marks are largely disjoint, meaning that in general there are not loci that have the K4 mark in some conditions but the K9 or K27 mark in the same loci in other conditions. In this case, you will have a lot more consensus peaks (all the K4, K9, and K27 loci together), while each sample only has a subset of these. In this case, I'd probably form separate consensus sets for each mark (using dba.peakset() with consensus=DBA_FACTOR) and plot Venn diagrams to see how much overlap there is between the marks. You should also see that the number of peaks in the binding matrix is greater in the second case than in any of the first three cases.
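The per-mark consensus sets and Venn diagram suggested above could be sketched roughly as follows. This is only an outline, not a tested recipe: the function name `plot_mark_overlap`, the sample sheet path passed to it, and `minOverlap = 2` are assumptions for illustration, and it presumes the marks are stored in the Factor column of the sample sheet.

```r
# Hypothetical helper: build one consensus peakset per mark and plot the overlap.
# `sample_sheet_csv` is an assumed path to a DiffBind sample sheet with all 18 samples.
plot_mark_overlap <- function(sample_sheet_csv) {
  if (!requireNamespace("DiffBind", quietly = TRUE)) {
    stop("The DiffBind package is required for this sketch.")
  }
  # Load all peaksets listed in the sample sheet.
  obj <- DiffBind::dba(sampleSheet = sample_sheet_csv)
  # Add one consensus peakset per level of Factor (one per mark).
  # minOverlap = 2 (assumed): a peak must appear in at least 2 replicates.
  obj <- DiffBind::dba.peakset(obj, consensus = DiffBind::DBA_FACTOR,
                               minOverlap = 2)
  # Keep only the newly added consensus peaksets.
  cons <- DiffBind::dba(obj, mask = obj$masks$Consensus, minOverlap = 1)
  # Venn diagram of the per-mark consensus sets.
  DiffBind::dba.plotVenn(cons, cons$masks$Consensus)
  invisible(cons)
}
```

If the Venn diagram shows little overlap between the K4, K9 and K27 consensus sets, that would be consistent with the marks being largely disjoint.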

To analyze this, you're probably better off including all 18 samples, but then setting up a multi-factor design (design="~Factor + Condition").

ADD COMMENT • link 3.9 years ago Rory Stark ★ 5.2k


Hello Rory, Thank you very much for your answer. I took your advice and tried to include all samples at once. However, the goal of my analysis is to compare the Condition levels lactose (L100) and glucose (G100) for each epigenetic mark, i.e. for each level of Factor. Therefore, I believe that your experimental design formula, ~Factor + Condition, would not allow me to achieve my goal, as it would use Factor as a blocking factor and just compare L100 vs G100. Instead, I have created a design matrix with all samples of all epigenetic marks and conditions, used the design ~ Factor + Factor:Condition, and then tested 3 contrasts:

dba_contrast <- dba.contrast(dba_normalise,design = "~Factor + Factor:Condition",reorderMeta=list(Condition="G100"))

# Contrast for K4
dba_contrast <- dba.contrast(dba_contrast, contrast = "FactorK4.ConditionL100")
# Contrast for K9
dba_contrast <- dba.contrast(dba_contrast, contrast = "FactorK9.ConditionL100")
# Contrast for K27
dba_contrast <- dba.contrast(dba_contrast, contrast = "FactorK27.ConditionL100")

It seems better but I'm still doubting. What do you think?

Additionally, I still do not know how to incorporate the peak files corresponding to my input into the DiffBind analysis.

Thank you very much again for your help

ADD REPLY • link 3.8 years ago SFn • 0


While I would probably take this conjunctive approach myself, it may be of benefit to talk to someone who is more expert in setting up designs for GLMs.
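One way to check what a design such as ~ Factor + Factor:Condition actually estimates is to build the model matrix in base R and look at its columns. The sketch below uses a made-up 18-row metadata table (level names K4/K9/K27 and G100/L100 are taken from the contrasts quoted above; everything else is assumed) and does not involve DiffBind itself. Replacing ":" with "." is meant to mimic how the coefficient names appear in the contrasts above.

```r
# Hypothetical metadata: 3 marks x 2 conditions x 3 replicates = 18 samples.
design_df <- expand.grid(Replicate = 1:3,
                         Factor    = c("K4", "K9", "K27"),
                         Condition = c("G100", "L100"),
                         stringsAsFactors = FALSE)
design_df$Factor    <- factor(design_df$Factor, levels = c("K4", "K9", "K27"))
# G100 (glucose) is the reference level, as in reorderMeta above.
design_df$Condition <- relevel(factor(design_df$Condition), ref = "G100")

# Nested design: one L100-vs-G100 effect per mark.
mm <- model.matrix(~ Factor + Factor:Condition, data = design_df)

# Rewrite "A:B" as "A.B" to match the naming used in the contrasts.
coef_names <- gsub(":", ".", colnames(mm), fixed = TRUE)
print(coef_names)

# The matrix should be full rank: 6 columns for 6 mark/condition groups.
print(qr(mm)$rank)
```

Each Factor<mark>.ConditionL100 column is 1 only for the L100 samples of that mark, so its coefficient is the lactose-vs-glucose effect within that mark.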

Regarding the Inputs, all of the ChIP bam files should be used as bamReads and the Input bam files as bamControls. The control condition is indicated in the design. I'm not sure what you mean by "the peak files corresponding to my input" -- have you called peaks on the Input files themselves? We normally don't do that -- peaks in the Inputs are incorporated in the analysis via the use of Greylists.
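As an illustration of the bamReads/bamControl layout, a sample sheet for this design could be assembled in base R along the following lines. All file names and the `bam/` directory are hypothetical. Since only replicates 1 and 2 have an Input in the design described above, the control for replicate 3 is deliberately left as NA here; which Input to assign there is a decision this sketch does not make.

```r
# Hypothetical sample sheet: every ChIP bam goes in bamReads,
# the matching Input bam (where one exists) goes in bamControl.
marks      <- c("H3K4me3", "H3K9me3", "H3K27me3")
conditions <- c("Glucose", "Lactose")

samples <- expand.grid(Replicate = 1:3,
                       Factor    = marks,
                       Condition = conditions,
                       stringsAsFactors = FALSE)

samples$SampleID <- paste(samples$Condition, samples$Factor,
                          samples$Replicate, sep = "_")
# Assumed file naming scheme: bam/<SampleID>.bam
samples$bamReads <- file.path("bam", paste0(samples$SampleID, ".bam"))

# Inputs exist only for biological replicates 1 and 2 of each condition.
has_input <- samples$Replicate %in% 1:2
input_id  <- paste(samples$Condition, "Input", samples$Replicate, sep = "_")
samples$ControlID  <- ifelse(has_input, input_id, NA)
samples$bamControl <- ifelse(has_input,
                             file.path("bam", paste0(input_id, ".bam")),
                             NA)  # replicate 3: left open, fill in before use

samples <- samples[, c("SampleID", "Factor", "Condition", "Replicate",
                       "bamReads", "ControlID", "bamControl")]
print(head(samples))
```

The three marks from the same biological replicate share one Input, so the same bamControl path appears on three rows.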