dba.analyze settings when input is sequenced deeper than ChIP
eva.pinatel ▴ 10
@evapinatel-7358
Last seen 3.2 years ago
Italy

Dear Dr Stark,

I have a perhaps naïve question about the dba.analyze function.

My experiment consists of:

  • WT-stimulated

  • WT-unstimulated (a few leaky binding events)

  • KO-stimulated (negative control)

immuno-precipitated samples, each performed in duplicate with its own input, on a small bacterial genome, yielding 3 to 13 million reads per library.

I performed IDR calculations as suggested by the authors: I called peaks for each ChIP.bam against the corresponding pooled-input.bam and then ran all the other sub-pooling tests. I ended up with, for each experiment, a single list of peaks obtained from the pooled-ChIP analysis, which I use for the DiffBind analysis with the individual ChIP.bam files and the same pooled-input.bam.

SampleID,Tissue,Factor,Treatment,Condition,Replicate,Peaks,bamReads,bamControl

IP-6C,WT,TF1,Fullmedia,1,1,peaks/Pooled_6_optimalset.bed,../Bam/IP-6C.sort.bam,../Bam/Input-6-merged.bam

IP-6E,WT,TF1,Fullmedia,1,2,peaks/Pooled_6_optimalset.bed,../Bam/IP-6E.sort.bam,../Bam/Input-6-merged.bam
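
For reference, the sheet is loaded in the usual way (the file name samplesheet.csv below is just a placeholder, not my real one):

library(DiffBind)
# build the DBA object from the sample sheet shown above ("samplesheet.csv" is a placeholder)
myDBA <- dba(sampleSheet = "samplesheet.csv")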


As a direct consequence, I usually have two to six times more reads in the input than in the ChIP. Running dba.count with bScaleControl=TRUE I can 'normalize' this disproportion: the input reads are scaled, then subtracted, and finally the ChIP counts are normalized. In dba.analyze, if I understand correctly, the calculations start again from raw reads, but the bScaleControl parameter is not available there. So I suspect that simply subtracting the control reads could introduce some bias, couldn't it? How would you suggest I proceed? Is it better to drop the subtraction step, or to extract the normalized counts and run DESeq2 outside of this package?
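
For reference, my counting call looks roughly like this (a sketch; the exact score argument in my real script may differ):

# the deeper input is scaled to the ChIP depth, subtracted, and the resulting counts are normalized
myDBA <- dba.count(myDBA, bScaleControl = TRUE, score = DBA_SCORE_TMM_MINUS_FULL)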

Finally, just a clarification: as I sometimes have up to 10 identical peak-ranges with different summits and scores, at which step are they merged together? I deduce it happens from dba.count on, but I guess also in dba.peakset, right?

Thank you for your time

Eva

Rory Stark ★ 5.2k
@rory-stark-5741
Last seen 13 days ago
Cambridge, UK

Hi Eva-

To address your main question, dba.analyze() does not start again from raw reads. It will use the scaled (but not normalized) values for the control reads obtained when you called dba.count(), so you should be fine. You can see the "raw" numbers that will be supplied to DESeq2 (after counting) as follows:

> myDBA <- dba.count(myDBA, peaks=NULL, score=DBA_SCORE_READS_MINUS)
> counts <- dba.peakset(myDBA, bRetrieve=TRUE)

For your second question about peak merging, if the peak-ranges are all identical, the merging function doesn't do anything. By default you are correct that merging occurs when dba.count() is called, as everything from that point on needs to use only a single set of consensus peaks to form the rows of the binding matrix.
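
As a sketch, the relevant merging control is the minOverlap argument of the counting call itself (the value below is just the default):

# regions present in at least minOverlap peaksets are merged into a single consensus row
myDBA <- dba.count(myDBA, minOverlap = 2)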

I'm not sure how you are using dba.peakset(): to add or retrieve a peakset?

I do have some other questions as I am curious about your analysis.

To check I understand correctly, you have a total of six (6) ChIP samples, correct? Two each of three condition/treatments (WT-stim, WT-unstim, KO-stim). You also have six corresponding Input samples, correct? 

I'm a little confused as to how you are doing the peak-calling. You mention IDR, which is normally done after peak calling in order to derive a high-confidence set from replicates. It looks like you are calling peaks for each replicate ChIP over a pooled Input, then using IDR on the peaksets generated from each pair of replicates, is that right? I'm not sure why you pooled the Inputs for this: why not call peaks for each ChIP against its corresponding Input? And why not use each Input separately for the rest of the analysis? We generally do not use IDR for a DiffBind analysis, as it is not as important to have a very high-confidence peak set prior to the differential analysis, so I am less familiar with the standard practices regarding controls and IDR.

Also, if you have six ChIPs, how do you get up to 10 peak-ranges?

Finally, I'm not sure what contrasts you want to run, but if it is e.g. WT-stim vs KO-stim, having only two replicates for each sample group will probably leave the analysis underpowered.

Cheers-

Rory

eva.pinatel ▴ 10
@evapinatel-7358
Last seen 3.2 years ago
Italy

Hi Rory,

Thank you for the clarification and the quick reply.

>To check I understand correctly, you have a total of six (6) ChIP samples, correct? Two each of three condition/treatments (WT-stim, WT-unstim, KO-stim). You also have six corresponding Input samples, correct?

Yes, you are correct. In bacteria, in my experience, all standard peak-calling algorithms exhibit strange behaviours: you can get anywhere from no peaks to 2000 peaks for 1000 genes. So I chose a quite permissive caller and decided to use IDR to restrict all the subsequent analysis to the high-confidence, reproducible peaks. For this reason I performed the analysis as the IDR authors suggest, i.e. calling peaks against pooled inputs. For DiffBind I simply thought I would use the same bam files, but I can probably use the original ones as you suggest.
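
So the sample-sheet rows would simply change to something like this (the un-pooled input file names below are my guess at the naming, not the real ones):

SampleID,Tissue,Factor,Treatment,Condition,Replicate,Peaks,bamReads,bamControl
IP-6C,WT,TF1,Fullmedia,1,1,peaks/Pooled_6_optimalset.bed,../Bam/IP-6C.sort.bam,../Bam/Input-6C.sort.bam
IP-6E,WT,TF1,Fullmedia,1,2,peaks/Pooled_6_optimalset.bed,../Bam/IP-6E.sort.bam,../Bam/Input-6E.sort.bam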

About the 10 identical peak-ranges: they are due to the fact that the TF acts as a tetramer, so in the same sample the peak caller finds many summits within the same enriched region, producing several rows with identical peak positions but different fold-change, p-value and peak-summit fields.
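
If it helps, I could also collapse those duplicated rows beforehand with something like this (just a sketch; the output file name is made up):

library(rtracklayer)
# read one of the IDR peak files listed in the sample sheet
peaks <- import("peaks/Pooled_6_optimalset.bed")
# drop the per-summit metadata and keep a single row per identical interval
peaks <- unique(granges(peaks))
export(peaks, "peaks/Pooled_6_optimalset.dedup.bed")  # made-up output name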

dba.peakset was used to retrieve consensus regions (just from the positional point of view) to plot some Venn diagrams, so it was run on the very preliminary DBA object, which contains duplicated lines for the reason explained above.

Peaks <- dba.peakset(dba_raw, consensus = DBA_TISSUE)
dba.plotVenn(Peaks, Peaks$masks$Consensus)


About the lack of power, I know, but for now these are all the replicates I have :-). Fortunately, they are quite homogeneous.

Best regards

Eva

