Question: DESeq2 for ChIP-seq differential peaks
0
3.6 years ago by
tangming2005110
United States
tangming2005110 wrote:

Hi there,

I want to use DESeq2 for differential peak detection for ChIP-seq data.

I know there are many methods implemented for this purpose https://github.com/crazyhottommy/ChIP-seq-analysis#differential-peak-detection

Diffbind internally uses DESeq2 and EdgR, but I want to take the other way:

Say I have untreat and treat group for my ChIP-seq data, each with three replicates.

for ChIP-seq, we usually include a input control, which is just genomic DNA without pulling down with a specific antibody.

I call peaks with MACS2 for all of them, and merge all the peaks into a super-set. https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part1_peak_calling.md#merge-peaks

Now, I can count from the bam files for each replicate and input, how many reads in each peak. https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part2_Preparing-ChIP-seq-count-table.md

A final count table will look like this:

lib size       10million   10million  10 million     20 million               12 million   12 million 12 million    24 million

untreat_1  untreat_2 untreat_3 untreat_input_control  treat_1 treat_2 treat_3 treat_input_control

peak_1     20            25              30                   3                             10              12      13             5

peak_2     10            28              13                   8                              64             43       45           7

........

How should I use this raw count table with DESeq2?

I should subtract input_control raw counts from the IP counts ( but I will need to normalize to the lib size first?),

but then I will get decimals which violates the DESeq2 assumption https://www.biostars.org/p/143458/#157303

Thanks !

Ming Tang

chipseq deseq deseq2 • 8.0k views
ADD COMMENTlink
modified 21 months ago by k.vitting.seerup90 • written 3.6 years ago by tangming2005110
1

To add to the other answers, you should definitely read the casw paper before continuing, especially the section about strategies for combining peak calls across samples. The paper shows that the taking the union of per-sample peak calls is not a good strategy and will result in loss of accurate FDR control. This finding applies regardless of whether or not you actually use the csaw package for your analysis.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Ryan C. Thompson7.2k

I'm not sure that's what they are saying. My understanding is that the FDR issue relates not to merging peak calls from different samples (prior to a differential analysis), but to merging peaks that have been identified as differentially bound. The confidence statistics associated with each differentially bound region can not be easily combined. This is actually more of an issue, requiring a specific solution, in an approach like csaw where windows are tested for differential counts, then clustered/merged into larger differentially bound regions.

The issue in the paper seems to boil down to expanding peak width that comes with taking unions. The wider the peaks, the more background reads they include which raises the FDR. In current versions of DiffBind, a recenter-around-summits strategy is used to control the width of the consensus peaks, which takes care of this problem.

Potentially, the more potent concern in a peak-based approach is that the same information that is used to identify the peak regions is re-used to identify differences in binding affinity within those regions. I'm not sure I agree that this is a problem in practice (or if it is it is not fully sidestepped by a windowing approach where many windows are filtered out), but it is a more applicable observation.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Rory Stark2.8k
1

Your last paragraph is exactly the issue I was referring to. I'm referring specifically to tables 1 and 2 in the csaw paper, and the accompanying text sections "Description of the peak-based strategy", "Many peak-based implementations are possible", and especially "Peaks should be called from pooled libraries". The question above describes a peak calling method that corresponds to method 1 in the csaw paper, which results in highly over-conservative differential binding calls.

ADD REPLYlink written 3.3 years ago by Ryan C. Thompson7.2k
1

Thought I might chip in here (no pun intended), as that paper sounds somewhat familiar. In short, Ryan's interpretation is mostly correct. Using what I call the "at least 2" strategy can be conservative (as shown in the 2014 NAR paper) but it can also be liberal (as shown in the 2015 NAR paper). To explain this behaviour, consider the peaks that are just detected, i.e., are called in exactly two libraries. If these two peaks belong in the same group, you enrich for spurious DB, which results in liberalness. If they belong in different groups, you enrich for peaks with large dispersions, which results in conservativeness. Whether it does one or the other can't be easily predicted in a given data set. This makes the "at least 2" approach a bit unpalatable, as its statistical properties are not well defined.

Of course, this really only matters for the peaks near the limit of detection. Strong peaks that are far away from the detection limit should be fine no matter what you do (with caveats, due to information sharing across peaks in edgeR). That said, as method developers, we earn our money by squeezing blood out of rocks - uh, I mean, getting more information out of the data, including that for weak peaks near the detection limit. If you want reliable statistical inferences for DB in these peaks, then the "at least 2" approach will have some shortcomings.

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Aaron Lun23k

Thanks, I am following Dr. Gordon K. Smyth's tutorial  http://f1000research.com/articles/4-1080/v1

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by tangming2005110
Answer: DESeq2 for ChIP-seq differential peaks
5
3.6 years ago by
Michael Love22k
United States
Michael Love22k wrote:

hi,

You should definitely never subtract or divide the counts you provide to DESeq2.

I would not use the input counts at all for differential binding. I would just compare treated counts vs untreated ChIP counts.

but I would also recommend to also take a look at DiffBind and csaw vignettes and workflows, at the least to understand the best practices they've set out.

ADD COMMENTlink written 3.6 years ago by Michael Love22k
3

From what I can see, there's two choices; either we get erroneous DB calls because of differences in chromatin state and input coverage, or we get errors due to distorted modelling of the mean-variance relationship after input subtraction. Our (i.e., the edgeR development team's) opinion is that the latter is more dangerous for routine analyses of ChIP-seq count data. Inaccurate variance modelling will affect the inferences for every genomic region via EB shrinkage, while spurious calls due to differences in chromatin state should be limited to a smaller number of regions. Rory's blacklist idea might be useful here, as we would be able to screen out such problematic regions before they cause any headaches during interpretation.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Aaron Lun23k
1

While it is certainly the case that altering the read counts using control reads violates an essential assumption underlying DESeq2 (namely the use of unadulterated read counts), ignoring the control tracks can also lead to incorrect results. This is because the binding matrix may include read counts for enriched regions (peaks) in samples where they were not originally identified as enriched compared to the control. As DESeq2 will have no way of detecting an anomaly in the Input control for that sample in that region, the results may be misleading. This is most likely to occur in experiments involving different cell types.

There are alternative ways to incorporate the Input controls. For example, instead of scaling the control read counts, the control libraries can be down-sampled to the same level as each corresponding ChIP sample prior to peak calling and counting. This is what we do in our processing pipelines. This still involves altering the ChIP read counts via subtraction however, and in practice down-sampling and scaling almost always yield the same results.

The other method is the more aggressive use of blacklists. Generating blacklists based on every Input control, and removing reads/peaks from every sample that overlap any blacklisted area, can eliminate false positives in those regions where there is an anomaly in an Input control.  Gordon Brown developed the GreyListChIP package for this purpose.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Rory Stark2.8k

Thanks all for your insight! I now have a much better understanding of the assumptions behind DESeq and EdgR.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by tangming2005110
Answer: DESeq2 for ChIP-seq differential peaks
2
3.6 years ago by
Rory Stark2.8k
CRUK, Cambridge, UK
Rory Stark2.8k wrote:

I'm not sure what you mean by wanting to "take it the other way"? Meaning, why not use the DiffBind package, which does exactly this type of analysis? Is the idea is that you want to really control each step yourself?

If you want to subtract the control reads from the ChIP reads, you should do a simple scaling first. Rounding the nearest integer avoids the DESeq2 issue (you also have to check for negative values as it may be possible there there re more reads in the control than in the ChIP for some merged consensus peaks for some samples). DiffBind does this by default. In your case, it would multiply the control read counts by the ratio of ChIP:Control reads (computed individually for each sample) -- 0.50 in all of your examples -- and round the result before subtracting.

In an experiment  like you describe, where the same tissue type is used for all the samples, subtracting the control reads shouldn't make much difference to the results as the same control is used for all the samples in each group, unless there are significant difference between the Inputs for each sample group. This should only happen if the treatment had a big impact the open chromatin (or be the result of a technical issue in the ChIP). The Input controls are most important for the MACS peak calling step.

Cheers-
Rory

ADD COMMENTlink written 3.6 years ago by Rory Stark2.8k

Thanks Rory for your comment. The reason is that I have a lot of bam files(hundreds..). using DiffBind/R might be too memory demanding. So, I want to prepare the count table like RNA-seq first and then feed into DESeq2.

As you suggested, I will need to scale to library size first (round the number before subtracting), and then subtract IP reads from the input reads(taking care of negative numbers), then feed the table to DESeq2.

Thanks again!

Ming

ADD REPLYlink written 3.6 years ago by tangming2005110

Hi Rory,

I want to know the counts from below are the raw counts or counts rounded after scaling to library size for both IP and Input.

In other words, Reads are from IP (before or after scaling to lib size?)

The same question for cReads.

Thanks!

H3K27ac_InputSubtract$peaks[[8]][1:10,] Chr Start End Score RPKM Reads cRPKM cReads 1 chr1 752472 759042 123 0.9810335 199 0.5289105 76 2 chr1 761042 762921 39 0.8101533 47 0.1946692 8 3 chr1 845566 847753 47 1.2143984 82 0.7317340 35 4 chr1 850918 861812 181 1.1000451 370 0.7932470 189 5 chr1 870871 872043 22 1.2159653 44 0.8582802 22 6 chr1 946879 951106 161 1.7163739 224 0.6814630 63 7 chr1 953846 963471 190 1.1273018 335 0.6888129 145 8 chr1 966517 978024 88 0.7205664 256 0.6675460 168 9 chr1 993603 1000013 77 0.8690936 172 0.6776408 95 10 chr1 1002011 1005956 21 0.8128012 99 0.9040274 78  ADD REPLYlink written 3.3 years ago by tangming2005110 Hi Rory, could you please confirm that the Reads and cReads are lib size scaled and rounded? see my question above. Thanks! ADD REPLYlink written 3.3 years ago by tangming2005110 1 No, these are true, raw counts (ChIP and Control respectively) that are not scaled or normalized in any way. RPKM and cRPKM are normalized to library depth and peak width. If you want the TMM normalized read counts, use dba.count() with score equal to one of the score IDs that start with DBA_SCORE_TMM_.The the normalized values will be present in the Score column of the peaks. You can change the score without re-counting reads by setting peaks=NULL, for example: > H3K27ac_TMMreadsFull <- dba.count(H3K27ac_InputSubtract, peaks=NULL, score=DBA_SCORE_TMM_READS_FULL) > H3K27ac_TMMreadsFull$peaks$Score[1:10] > H3K27ac_TMMsubtractEffective <- dba.count(H3K27ac_InputSubtract, peaks=NULL, score=DBA_SCORE_TMM_MINUS_EFFECTIVE) > H3K27ac_TMMsubtractEffectivel$peaks$Score[1:10] Rather than access $peaks directly, you can get all the scores in one go using dba.peakset():

> H3K27ac_TMMreadsFullScores <- dba.peakset(H3K27ac_TMMreadsFull, bRetrieve=TRUE)

Cheers-

Rory

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by Rory Stark2.8k

Thanks Rory for the clarification.

Ming

ADD REPLYlink written 3.3 years ago by tangming2005110

How can I get the TMM values for the Input control samples?  The reason is that I want to get the TMMs for all the IP and ChIP samples and plot a PCA to see whether my IP and ChIP are mislabeled.( It looks the case for one of my samples)

ADD REPLYlink written 3.3 years ago by tangming2005110
1

I've had to do this before as well. Once way is to run the ChIPQC package and get a report, as that specifically looks at enrichment in Input controls compared to corresponding ChIPs.

The other way is to create a new samplesheet that has extra rows for each of your Input controls. You can supply a consensus peakset for these (or all) samples, then read them in with dba(), run dba.count(), and then do the desired plots.

-R

ADD REPLYlink written 3.3 years ago by Rory Stark2.8k
Answer: DESeq2 for ChIP-seq differential peaks
2
3.6 years ago by
The Scripps Research Institute, La Jolla, CA
Ryan C. Thompson7.2k wrote:

You might want to take at look at the csaw package, which adapts the edgeR method with all the necessary modifications for unbiased ChIP-Seq differential binding analysis. Also, I would not recommend subtracting input from ChIP counts, since all the count-based methods assume that you are analyzing absolute counts, not "counts in excess of background".

ADD COMMENTlink written 3.6 years ago by Ryan C. Thompson7.2k

interesting, I am waiting for @mikelove for his comments.

ADD REPLYlink written 3.6 years ago by tangming2005110
Answer: DESeq2 for ChIP-seq differential peaks
0
21 months ago by
European Union
k.vitting.seerup90 wrote:

Is there a good reason why nobody have suggested incorporating the input data into the linear model @mikelove @aaronlun?

My idea is to make a model that includes both the input and chip libraries and then test the interaction between chip and condition.

More specifically for the following R code example (using knock down (kd) as treatment):

myDesing <- data.frame(sample=1:8, data=c(rep('Input',4), rep('Chip',4)), condition=c('Wt','Kd'))
myDesing$condition <- factor(myDesing$condition, levels=c('Wt','Kd'))
myDesing$data <- factor(myDesing$data, levels=c('Input','Chip'))

> model.matrix(data=myDesing, ~data + condition)
(Intercept) dataChip conditionKd
1           1        0           0
2           1        0           1
3           1        0           0
4           1        0           1
5           1        1           0
6           1        1           1
7           1        1           0
8           1        1           1

You could just test the interaction dataChip::conditionKd with edgeR or DESeq2?

Does this not correspond to measuring the fold change of Input vs WT ( FCiwt ) and the fold change of Input vs KD (FCikd) and then comparing whether there is a difference between FCiwt and FCikd?

ADD COMMENTlink written 21 months ago by k.vitting.seerup90

This should really belong as a separate question (which I will answer, of course).

ADD REPLYlink modified 21 months ago • written 21 months ago by Aaron Lun23k

I have now raised this as a separate question Modeling input data in ChIP-seq experiments with EdgeR, DESeq(2) or Limma.

ADD REPLYlink written 21 months ago by k.vitting.seerup90
Please log in to add an answer.

Content
Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 16.09
Traffic: 249 users visited in the last hour