Hello Everyone,
Let me start by saying I've read all posts that are similar to my question and the insights are tremendous! However, as I've been doing bioinformatics for a little over a year now, I'm still a newbie in all possible respects, and I just want a clarification.
So here's the gist. I have replicates of three conditions (each with its own Input). What I first did, last year (when I was much more primitive in bioinformatics than I am now), I called the peaks then used a IDR to determine consensus peaks between the replicates and then I used a combination of bedtools intersect -v and -wa to determine what peaks are found in one condition vs another. The problem with this is it doesn't address intensity of the binding peaks- it could be that the peaks that are found in condition A and not in b are very low intensity binding and I may be getting rid of regions that even though all conditions have, may have strong binding in one condition and weaker binding in the other condition.
So I saw that Tommy Tang had a pipeline, so I followed his advice on making a Count Table of all the conditions and then run a DESeq2 a la RNA seq. Based on the discussion from his question on biocoductor (DESeq2 for ChIP-seq differential peaks) I am now planning to use GreyListChIP to remove the cell type/condition specific peaks from the Inputs and then running the DESeq2.
I know that diffbind has been designed for this purpose, but I can't place my bam files on my computer which is required (probably should ask a separate question if it is possible to access a bam site from a remote server to do diffbind).
I plan to do DESeq2 after I made a count table adjusting for the " Grey List" I'm wondering if the parameters for the DESeq2 should be the same as RNAseq or should I run different parameters for this ChIPseq. When I tried to run a typical dds -
dds <- DESeq(dds) and I got the following warning:
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship -- note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time. final dispersion estimates fitting model and testing -- replacing outliers and refitting for 2 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing
Again just want to make sure I'm on the right track.
Thank you!
Yonatan
Thank you, Mike!
I saw in the paper you referenced by Aaron Lun they discuss pooling ChIP-seq libraries and calling peaks and using these "consensus peaks" as the regions by which we produce the count table, but I'm not so clear as to how to do this. By meaning do I take all conditions (ie, condition a, b, and c) with their replicates, merge the bam files, then call peaks?
If so, do I use a control when calling peaks (I use MACS2)? Should I merge the Inputs too? What is the threshold should I call for the peaks (q = 0.05) like I would normally do, or do I have to make it less stringent because we're simply looking for all possible peaks for the count table.
I know you didn't write the paper, but I wanted your opinion on the matter, as well as Aaron Lun's (if you're out there reading :) ).
Yes, merging the BAM files prior to peak calling is what we had in mind. Basically, you cannot allow any information about the experimental design to contaminate the peak calls - the best way to do that is the pool the reads together before running MACS or your peak caller of choice. (To be even more sophisticated, you would downsample each library to the same sequencing coverage and then pooling the reads together.)
Calling peaks in each library separately sounds straightforward, and it is - up to the part where you have to try to consolidate all the peaks into a single consensus set. This is where the statistical nightmare begins; the merge must be done in a manner that preserves the uniformity of differential p-values under the null hypothesis, which severely restricts the type of merges that are appropriate. If you don't do this right, you get a pot luck of behaviours ranging from conservativeness (as shown in the NAR paper) to anticonservativeness (depending on the vagaries of empirical Bayes shrinkage for the dispersion estimates). There is also a general problem of standardizing coordinates from the "same" peak across different samples, which is not always straightforward, e.g., clusters of peaks in one sample nested within a larger peak in another sample.
As for the merged peak calling, yes, you can call the merged ChIP samples against the merged input control. Don't worry about the peak calling threshold, this is less useful than it seems when you're using the peak calls in a differential binding analysis. The chosen FDR across peak calls mainly affects the DB result in terms of power, and there is no reason to think that a 5% threshold for the former is appropriate for the latter. Who knows, maybe 1% or 20% is better.
This is incredibly helpful, thank you so much!
So I'll merge the bam files and call the peaks against the merged inputs (I have an input for each condition), then I take that list produce a SAF file and run featureCounts to get a count table which I will then use in either DESeq2 or edgeR.
Am I missing something?
I appreciate all of your insights! (I could've really done this incorrectly).
PS - your suggestion to "downsample each library to the same sequencing coverage" is perfectly understandable and makes a lot of sense. How can I do that?
Yonatan
That's right. If you want to downsample, there should be something in ShortRead to allow you to do that.
Thank you @aaronlun for all your help, I successfully completed my first round of DB analysis!
I am in the process of doing a different DB analysis and I was going to run it the same way, but I wanted to verify it with you first. In this case I'm comparing 4 different histone marks in three different conditions with two replicates per condition. I was originally going to merge the bam files and (because they're all broad peaks) call peaks via macs2 --broad to get my consensus list. The only problem is, unlike the last experiment, which was the same TF but different conditions, I have different histone marks and different conditions. Would I still merge it the same way (ie all the bam files) or just merge the bam files from the same histone mark separately and call peaks on those and then merge together the resultant peak lists that I get from each histone mark to produce the count table?
Yonatan
I would never analyze different histone marks (or generally different protein targets) in the same linear model. What does it even mean to compare, say, H3K4me3 enrichment to H3K27me3? This is uninterpretable unless you can say that the IP efficiency is the same for both antibodies, and I very much doubt that is the case.
More generally, the shapes of the binding profiles will be different across marks; the required normalization and filtering will probably be different across marks; and the dispersion will vary depending on each antibody's reliability. So, all in all, best to keep marks separate, and then do a high-level meta-analysis at the end, e.g., to identify regions with DB in multiple marks. Once you have p-values, this is relatively easy.