Question: Improve limma-voom trend fit to noisy data
jma1991 wrote (7 months ago):

I'm analysing low-cell-number ChIP-seq data (3 ChIP replicates / 3 Input replicates). The replicates are highly variable due to the low amount of starting material and the number of PCR cycles used for amplification. I am counting reads into windows along the genome and quantile normalising the counts to try to overcome some of this technical variation. I decided to use limma-voom to test for differential windows between the ChIP factor and the Input chromatin (mainly because it allows me to use quantile normalisation, unlike edgeR or DESeq2; please correct me if I'm wrong).

I have uploaded an image of the mean-variance plot produced by limma-voom ( https://ibb.co/eDfSxw ). In my opinion there seem to be three distinct components to the data, which I have circled in a copy of the image ( https://ibb.co/eEkGqG ). The windows in red are the low-count, high-variance windows (these generally correspond to windows that have been highly amplified at random in one of the replicates). The windows in green are the increasing-count, decreasing-variance windows (these seem to be windows containing genuine binding). And the windows in orange are the low-to-medium-count, constant-variance windows (this region seems to be a mix of the other two).

My problem is that the low-variance windows in the lower half of the orange circle are (I think) being squeezed up to the trend line, so their variance estimates are inflated. From prior knowledge we know that some of these windows contain genuine binding. topTable reports a positive logFC for them, but the FDR is non-significant (I'm using FDR < 0.1 as a threshold).

Overall I don't think the trend is a very good fit to my data, and I would like to know if there is anything else I can do to improve it. I should mention that I am already using the robust = TRUE and trend = TRUE arguments to eBayes.
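For reference, the pipeline described above can be sketched as follows. This is an illustrative sketch, not the poster's actual code: the simulated `counts` matrix and the `group`/`design` objects are placeholders standing in for the real window counts.

```r
library(limma)

# Placeholder data: simulated negative-binomial window counts,
# 3 Input vs 3 ChIP samples (standing in for the real data).
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 10, size = 2), ncol = 6)
group  <- factor(rep(c("Input", "ChIP"), each = 3))
design <- model.matrix(~group)

# voom with quantile normalisation; plot = TRUE draws the
# mean-variance trend discussed in the question.
v   <- voom(counts, design, normalize.method = "quantile", plot = TRUE)
fit <- eBayes(lmFit(v, design), trend = TRUE, robust = TRUE)
tt  <- topTable(fit, coef = 2, number = Inf)
```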

Aaron Lun (Cambridge, United Kingdom) wrote (7 months ago):

Firstly, I don't see any red/orange/green in your plot.

Secondly, quantile normalization is not appropriate in your situation. Quantile normalization forces all samples to have the same empirical distribution of log-CPMs, but this is not desirable when large-scale differences between samples are expected. Your experiment is one such case: you should only see protein binding in the group of ChIP samples, which should result in a long tail of large log-CPMs (assuming that your IP was successful). This tail will be lost when the ChIP log-CPM distribution is incorrectly coerced to be the same as that of the control samples, reducing your power to detect differential binding events. There is also a possibility that the distortion in the distribution will create spurious binding events in the control samples.

Finally, your plot shows clear evidence of discreteness near the left edge. This is consistent with the presence of low counts that are difficult to model with voom's continuous approximation. edgeR handles this type of data much better; if you haven't done so already, I would suggest reading the csaw user's guide for how it can be applied to ChIP-seq data. This may also help with the different components in your plot, if their presence is caused by having a lower-than-appropriate mean after the log-transformation in voom; in contrast, edgeR computes the mean from the raw counts, which gives greater weight to high-count binding events.
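The edgeR approach suggested here might be sketched along the lines below. This is a hedged sketch of the standard quasi-likelihood workflow, reusing the `counts` and `group` placeholders from the question; see the csaw user's guide for the full window-based ChIP-seq workflow.

```r
library(edgeR)

# Placeholder data, as in the question: simulated window counts.
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 10, size = 2), ncol = 6)
group  <- factor(rep(c("Input", "ChIP"), each = 3))
design <- model.matrix(~group)

y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)        # TMM normalization, not quantile
y <- estimateDisp(y, design)   # NB dispersion estimated from raw counts
fit <- glmQLFit(y, design, robust = TRUE)
res <- glmQLFTest(fit, coef = 2)
topTags(res)
```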

jma1991 replied:

I've just double-checked on some colleagues' computers; the second image should definitely have the highlighted areas ( https://ibb.co/eEkGqG ).

I guess the discreteness issue could be improved by filtering on windows with at least X counts in N replicates (rather than an average count of X, which I'm using currently). I'll try edgeR and see if that improves the detection power (using prior knowledge from a high-cell-number bulk ChIP-seq experiment).

Aaron Lun replied:
The "at least N" filtering strategy is not independent of the test statistic (see this), and will distort the dispersion estimates and p-values. If you're filtering ChIP-seq windows by abundance prior to using edgeR, the average count is a more statistically rigorous approach. In any case, the filtering threshold is more important than the filtering strategy for mitigating discreteness, though as I said before, using edgeR will be much more robust to discreteness in the first place.
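An average-abundance filter might look something like this. It is a sketch under assumptions: `y` is taken to be a DGEList of window counts (as in the edgeR workflow), and the count threshold of 5 is an arbitrary placeholder, not a recommendation. Because `aveLogCPM()` pools information across all samples, the filter statistic is (approximately) independent of the between-group test statistic.

```r
library(edgeR)

# Placeholder DGEList of window counts (standing in for the real data).
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 10, size = 2), ncol = 6)
y <- DGEList(counts = counts)

# Filter on average abundance: keep windows whose average log-CPM
# exceeds the level corresponding to roughly 5 counts at the mean
# library size (the conversion is approximate).
ave.lcpm <- aveLogCPM(y)
cutoff   <- log2(5 / mean(y$samples$lib.size) * 1e6)
y.filt   <- y[ave.lcpm > cutoff, , keep.lib.sizes = FALSE]
```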

Whatever you do, don't use quantile normalization here. I suspect that this is the major cause of the stripes at the left edge of your plot, where zeroes get transformed to different quantiles. Check out some alternatives in the csaw user's guide.

jma1991 replied:

Apologies, I meant at least X counts in *any* N samples, so no group information is used in the filter. That should be independent of the test statistic. You're right that the threshold matters more than the strategy. I'd like to automate the pre-filtering stage somehow (picking a decent threshold based on the data; HTSFilter was developed for this task, but it filters too aggressively on our data), but I realise it might be better to do independent filtering of the results instead (similar to DESeq2).
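For concreteness, the "at least X counts in any N samples" filter described above might be written as follows. X and N stay as placeholders in the discussion; the values 5 and 2 here are purely illustrative.

```r
# Placeholder count matrix (standing in for the real window counts).
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 10, size = 2), ncol = 6)

# Keep windows with at least 5 counts in any 2 samples,
# without reference to the group labels.
keep <- rowSums(counts >= 5) >= 2
counts.filt <- counts[keep, , drop = FALSE]
```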

Aaron Lun replied:
Note that "any N samples" is still not an independent filtering strategy for NB models. It is a common misconception that being blind to the groups guarantees independence of the filter from the test statistic. The p-value for each window is sensitive to things other than the log-fold changes between groups (e.g., its dispersion, and the dispersions of other windows via empirical Bayes shrinkage), and these are affected by the "any N samples" filter.

In any case, I wouldn't worry about the exact value of the filter threshold; just pick something reasonable and go with it. There are a number of semi-automatic strategies for defining "reasonable" in the csaw user's guide. Unfortunately, full automation requires some strong assumptions about how strong you expect the binding to be.