Question: Filtering for ATAC-seq
0
16 months ago by
shankasal0 wrote:

So I'm trying to determine a couple things when it comes to filtering called peaks that have low read counts relative to the total read counts in all peaks. I'm using the CQN R package to normalize my data set where I currently have 5 cell types with 2 replicates each and analyzing DE peaks with edgeR.

1) Everything I've read so far suggests filtering before normalization, but I'm not sure what effect this is going to have on the normalization performed by cqn().

2) In the edgeR vignette, all filtering is done with rowSums(cpm(y)>x)>=x2. Is there a good method, aside from trial and error, in which I can select the value for x such that I can remove low count peaks without increasing type I error?

I have tried to read up on some of this stuff as best I can, but I don't have a stats background, so I often end up getting lost in the jargon.

Thanks

edger cqn atac-seq • 1.5k views
modified 16 months ago by Aaron Lun24k • written 16 months ago by shankasal0
4
16 months ago by
Aaron Lun24k
Cambridge, United Kingdom
Aaron Lun24k wrote:

To detour a bit from your actual questions: you don't mention exactly what you've already done, but peak-calling and differential analysis is surprisingly difficult to get right. This is because peak calling is already a filtering step, and if it's not an independent filter, then you'll have lost type I error control before you even start using edgeR (or DESeq2, or other count-based modelling approaches). The statistical arguments are subtle, but perhaps a demonstration will suffice:

# Using a normal simulation for simplicity's sake.
# Just pretend this is some measure of log-binding intensity.
# Each row is a binding site, each column is a sample.
binding <- matrix(rnorm(80000), ncol=4)
group <- gl(2,2)

# Let's say our peak-calling method only detects peaks at
# a log-intensity above 1.
is.peak <- binding >= 1

# Let's say we only keep peaks that are detected in 2+ libraries.
keep <- rowSums(is.peak) >= 2

# Now let's do a t-test on the remaining peaks,
# comparing the first and last two libraries.
p <- apply(binding[keep,], 1, FUN=function(x) t.test(x[1:2], x[3:4])\$p.value)
hist(p)


Woah! That's not uniform, even though the null hypothesis is true for all binding sites. The reason is because you're implicitly selecting for sites that are either (i) differential between conditions, by enriching for peaks where high binding is present in two libraries of the same condition; or (ii) highly variable within conditions, by enriching for peaks with high binding in two libraries of different conditions.

Normally, the exact filter strategy does not matter much when you're using edgeR for RNA-seq, as the density of genes around the filter boundary is low, i.e., changing the filtering strategy will not have a major effect on the identity of the retained genes. However, this is not true for ChIP-seq data, where the frequency of binding sites decreases with increasing binding intensity. This means that the density of sites around the filter boundary is always relatively high compared to the total number of retained sites, such that the choice of filtering strategy will have a larger effect on the type I error rate amongst the retained peaks.

The solution is to do peak calling independently. This involves pooling all libraries together, with equal contributions from each library, and then calling peaks on the pooled library. This ensures that the only information available to the peak caller is the total read count, with no possibility of looking at differences between libraries. A related approach is described in the csaw user's guide, where we use aveLogCPM to perform various flavours of filtering; this provides an independent filter statistic (probably, it's hard to prove but it seems to work well) for negative binomial models. Note: there's no point doing this on your peaks if your peak calling strategy wasn't independent - the damage has already been done.

Anyway, I think I answered question 2 above. For question 1; the bigger question is whether you assume that most bound regions are non-DB. This is because most normalization methods (presumably CQN as well, based on my recollection of the method) assume that most of the input features are not differential across samples. If you can make that assumption, then yes, I would filter to get the bound regions and perform normalization on the filtered sites. However, some applications involve global changes in binding, e.g., drug treatments that suppress histone methylation, global downregulation of the targeted TF. In such cases, you cannot make the non-DB assumption, requiring other approaches. I'll guess that you don't have chromatin spike-ins, so check out the binned TMM method in the csaw user's guide.

Thanks for the detailed comment Aaron, I'll ruminate on what you've said.

To provide a little more detail on the method I used so far, I called peaks on individual samples and pooled replicates using MACS2 q value 0.01. From there I used IDR to select for and merge peaks between replicates using the pooled peak set as a the input reference. I then used HOMER mergePeaks to obtain a consensus peak set with which I generated the count matrix for the 10 samples using bedtools multicov.

Merging peaks is definitely not the way to go and will lead to bias. You should read Lun 2014 in which several strategies for summarizing peaks across samples are tested, including merging (i.e. taking the union) and intersecting, and the only strategy that correctly controls Type I error is the one Aaron has described in his answer.

As a note, merging peaks is not part of the IDR peak calling procedure. You use the IDR calculation between replicates to choose the appropriate number of peaks to call, and then you simply select that number of peaks from the set of peaks called on the reads that were pooled from all samples. There is no merging.

So I've tried to look at the Lun and Smyth 2014 Nuc. Acids. Res. paper, and I see from Table 2 that the whole group pooling strategy most closely follows the expected error rate. I don't really understand why though, if it can be explained in layman's terms. Won't a full pooling strategy reduce the ability to call peaks present in a pair of replicates compared to another even if they are biologically meaningful? Won't this effect become greater as more samples are added into the data set? How does this strategy ensure that I'm calling peaks that are reproducible between replicates? I'm planning on adding 8 more samples, 2 reps each, to this current grouping, but I also intend to compare an additional 16+ samples from a separate group. It's easy enough to pool the bam files to do the peak calling so I can try that and have a look at how that peak set compares to my current consensus peak set.

In the current case, my 10 samples are sequenced to a relatively similar depth (~55-60 million mapped reads each), but this is not always going to be the case, as when I want to add in more samples where I know the sequencing depth is different. Moreover, the samples I've collected are single-end sequenced, which is all I needed, whereas another dataset I'd like to eventually compare it to is paired-end, so I'm not sure how pooling with MACS2 would work in that case.

@Ryan, yes I didn't mean to say that IDR merges.

1

One question at a time.

Won't a full pooling strategy reduce the ability to call peaks present in a pair of replicates compared to another even if they are biologically meaningful?

I don't really understand what you mean here. I will assume you are referring to the act of calling peaks present in a pair of replicates in one condition compared to (i.e., absent in) another condition. This is precisely what we want to avoid - at the stage of peak calling, we do not want to give any preference to DB peaks compared to non-DB peaks. To do so would be to pre-empt the differential testing by enriching for low p-values, resulting in elevated type I error rates (as shown in the simulation above).

Perhaps a more extreme example would be instructive. Let's say that we use MACS2 in differential mode, where we call peaks in one condition A relative to B, and conversely B relative to A. We take the union of the peaks and perform a differential analysis with edgeR on the resulting set. To some people, this might seem like the obvious approach to enrich for peaks that are most likely to be biologically meaningful (in terms of differences between conditions). However, the differential analysis would be a disaster, as we have also enriched for peaks with spurious differences between conditions.

The problem fundamentally arises from the fact that the peaks are estimated from the same data used for the differential testing; this results in issues with multiple testing that are very difficult to resolve. This requires a great deal of care to avoid, hence the focus on independent filtering.

Won't this effect become greater as more samples are added into the data set?

Yes. But to see it as a "reduction in power" would be misleading. We can only sensibly talk about differences in power between methods when the false positives are the same; and clearly, using a non-independent filter does not guarantee control of the type I error rate or FDR.

How does this strategy ensure that I'm calling peaks that are reproducible between replicates?

That is not the peak caller's job. That is edgeR's job, and that of the dispersion estimation. In fact, enriching for peaks that are consistent between replicates will also violate the independence of the filter, resulting in the loss of type I error control when you perform the differential analysis.

I could go further and add that the statistical model of many peak callers do not even consider the concept of replication. For example, digging inside the old MACS code indicated that it assumed Poisson-distributed counts. Perhaps this was unavoidable at the time when sequencing was expensive and most people only had one replicate, but it is clearly inappropriate nowadays as there is demonstrably over-dispersion in biological replicates. As a result, the peak-calling p-values were systematically understated, requiring the statistical band-aid of the IDR to try to get rid of false positives. (And to consolidate peak sets across different labs and operators, a la ENCODE, but I won't go into that.)

These problems are avoided by using a negative binomial model that handles overdispersion. If you call a peak set independently, edgeR will handle variability between replicates for you. Just let it do its job.

Peak-calling is separating relevant regions of the genome from irrelevant regions, so it represents a filtering step. The critical point is that your peak calling procedure (like any other filtering procedure) must be blind to the experimental conditions. This ensures that the filter statistic is independent of the test statistic. This is the same principle behind independent filtering in RNA-seq data. If the filter statistic and test statistic are not independent, then you cannot control Type I error.

For example, the strategy of merging peaks called from individual samples enriches for peaks that have large differences between samples, since this increases the likelihood that at least one sample will pass your detection threshold. Hence, this filter preferentially selects peaks with high variance and/or high logFC. Both the variance and logFC obviously affect the test statistic, so this is not an independent filter. Even worse, since high variance and high logFC have opposite effects on significance, you can't even predict whether you will underestimate or overestimate the significance of your results. Either is possible, depending on the distribution of variances and logFC values in your data set.

I don't know your experimental design, but I can tell you how I implemented independent filtering in mine. In my data set, I have samples of cell cultures from 4 donors in each of 8 experimental conditions, for 32 samples total. In order to call peaks independently of the experimental condition using IDR, I pooled all the samples from each donor together, giving me 4 "all-donor" bam files. I called peaks on these 4 bam files and use the IDR procedure between the 4 peak sets to choose the appropriate number of peaks for my chosen IDR value. Then, I pooled all the reads into one giant bam file and called peaks from this file, and took the chosen number of top peaks. Thus, the peak calling procedure never had any information about the experimental conditions, and the called peaks are independent of any differences between conditions. Note that this procedure might not be appropriate if I cared about differences between donors, since donor ID was used in the peak-calling procedure. I would also note than I used a fairly relaxed IDR threshold, and then implemented a secondary count-based independent filter later in the edgeR analysis, since there's no point in calling a peak if that peak still doesn't have enough reads to analyze for differential abundance. You can see the code used for my filtering strategy here: https://github.com/DarwinAwardWinner/CD4-csaw/blob/master/scripts/chipseq-explore-H3K27me3.Rmd

Lastly, I'll address a few random points. You said that you filtered peaks using a q-value cut-off of 0.01 before running IDR. This is wrong, and will result in nonsensical IDR values. When running the IDR procedure, you need to give it the unfiltered peak list. The idea is that the tail of the list should represent non-significant peaks, from which IDR will estimate the "noise". In order to get a sufficiently large list of peaks, I used a p-value threshold of 0.5 with MACS2 (not a q-value threshold).

Finally, you asked about differing sequencing depths, and how this affects the pooling strategy. For this, I am not sure of the answer. The simplest strategy is to ignore the issue and give each read equal weight. This gives samples with more reads a greater weight during peak calling. An alternate strategy is to divide each read's contribution by the library size. This gives every sample an equal contribution, but now some reads count more than others. I'm not sure which of these strategies is more appropriate, or if there is a third strategy that is better than both. Aaron, do you have any input on this?

I'll assume that the peak caller is using a Poisson model. The Poisson mean is equivalent to the NB mean when the libraries are of the same size, and the NB mean is the independent filter statistic (probably) for a negative binomial GLM. So the solution would be to downsample each library to the same depth, and then pool the downsampled libraries. Clearly, this results in the loss of information, particularly if your libraries are of very different sizes. That's unfortunate but unavoidable, short of writing a peak caller that uses a NB model. I have thought about doing so but it wasn't worth it, because I don't use peaks in my DB analyses anyway.

Thank you both Aaron and Ryan for such detailed responses. I think I understand better now why my current peak calling is not independent of the sample conditions.

So if I understand this correctly now, the most appropriate strategy to use for my data set as is would be the following:

1) Pool bam files of all samples (10 in total, 2 each replicate) into a "master" bam file.

2) Call peaks on the master bam file with MACS2. (IDR does not seem appropriate in this case, so I can use a q value threshold).

3) Establish a count matrix for the samples with the new peaks.

4) Read the count matrix into edgeR and filter for low count peaks.

5) Normalize with CQN and then perform DE analysis.

That sounds right to me. As mentioned before, use aveLogCPM for the peak filtering within edgeR, otherwise you will undo your hard work if you use an "at least X in at least Y libraries" strategy.