Bias towards significantly negative logFCs in ATAC-Seq differential chromatin openness tests
1
0
Entering edit mode
pgugger ▴ 70
@pgugger-21084
Last seen 14 months ago
United States

Hi, I am analyzing ATAC-Seq data to test for differentially open chromatin among two genotypes of mice (n = 5 per group). Following Lun & Smyth 2014, I combined reads from all 10 samples into a single BAM, called peaks (Genrich), and counted reads per peak region for each sample (featureCounts). I then ran limma following common guidance (example code below), and the resulting MA plot shows many more significantly negative logFCs (~4300) than positive logFCs (~900), especially at high mean logCPM.

MA plot

To evaluate whether this pattern was a result of something specific in limma, I also tried edgeR and DESeq2 and get the same pattern (not shown). Is this something to be concerned about, and if so, do you have any suggestions to remedy it? Perhaps quantile normalization?

A related question is whether we can conclude that these patterns are due to inadequate data processing or if they reflect a true biological signal. We do not have any a priori reason to believe that openness would tend to decline in this comparison, nor do we have any reason to believe that there would be few or many significantly differentially open regions (we observe ~5k significant of 80k total peak regions in these tests).

cnt = DGEList(counts.all)
cnt$samples$genotype <- genotype
design <- model.matrix(~0+genotype)

keep.exprs <- filterByExpr(cnt, design)
cnt <- cnt[keep.exprs, , keep.lib.sizes=FALSE]

#Alternative filtering that also leads to same pattern ultimately
#keep <- aveLogCPM(cnt) > 1
#cnt <- cnt[keep, , keep.lib.sizes=FALSE]

cnt <- calcNormFactors(cnt, method = "TMM")

v <- voom(cnt, design, plot=TRUE)

contr.matrix <- makeContrasts( AvB = genotypeA-genotypeB, levels = colnames(design))

vfit = lmFit(v, v$design) vfit = contrasts.fit(vfit, contr.matrix) efit = eBayes(vfit, robust=TRUE) eres <- decideTests(efit, p.value = 0.05) plotMD(efit, column=1, status=abs(eres[,1]), main=colnames(efit)[1], ylab="log2(fold-change)", xlab="Mean log(CPM)", legend=FALSE, cex=0.5) abline(h=0, col="darkgrey")  limma edger atac atac-seq • 663 views ADD COMMENT 0 Entering edit mode If you're not going to show DESeq2 code, don't tag it with DESeq2. You can hardly expect the DESeq2 authors to provide any debugging advice on their package by looking at your limma code. ADD REPLY 0 Entering edit mode Ok, sorry, my misunderstanding of the posting guidelines. However, I believe my question is not so much about debugging and more about whether there is any reason to be concerned or to change my analysis based on the bias toward significant negative log fold changes at high logCPM in the MA plots - a pattern observed in all three programs when following common guidance. Any thoughts are greatly appreciated. ADD REPLY 0 Entering edit mode The pattern seems surprising. It would helpful to have the plot of the voom command to view the mean-variance trend . I think you should also look at the pair plots of each genotype (2 matrices of 5x5 plots), then pair plots of 3 samples of each genotype together (leading to a matrix of 6 x 6 plots). ADD REPLY 0 Entering edit mode Thanks for your response. Here are two voom plots: v <- voom(cnt, design, plot=TRUE)  voom plot, before plotSA(efit, main="Final model: Mean-variance trend")  voom plot, after And, here are pair plots, in which the first 5 are one genotype and the other five are another genotype: pairs(v$E)


pair plots

0
Entering edit mode

voom plots looks fine. The pair plot would lead me to remove the last sample (last column) as you probably guessed. Hope others will contribute here.

0
Entering edit mode

Thanks, I agree that the voom plots seem fine. I can also add that an MDS plot shows the samples clustering by genotype nicely. And, I tried removing the last sample as you suggested and the results are the same (MA plot bias, MDS plot, voom plots, etc.).

It seems to me limma is behaving as it should, and perhaps the pattern reflects a real biological response. However, I may be missing something more fundamental. For example, if I run voom with quantile normalization (which may or may not be appropriate), the bias disappears. I am curious to hear what others think.

# ...
cnt <- calcNormFactors(cnt, method = "none")
# ...
v <- voom(cnt, design, normalize.method = "quantile")
# ...


MA plot after quantile normalization

2
Entering edit mode
Aaron Lun ★ 27k
@alun
Last seen 19 hours ago
The city by the bay

Your treatment (or whatever the differences between the two conditions are) leads to decreased accessibility for a small proportion of regions, and no change everywhere else. Doesn't seem particularly controversial to me.

If you want some real action, try knocking down a histone mark writer and then measuring the enrichment for that same histone mark. That's usually much more exciting, to the point that you need to use some other normalization strategies (see the csaw user's guide) because the TMM assumptions do not apply for the peaks.

We do not have any a priori reason to believe that openness would tend to decline in this comparison, nor do we have any reason to believe that there would be few or many significantly differentially open regions (we observe ~5k significant of 80k total peak regions in these tests).

Yeah, well, sometimes we do experiments to find new things that we didn't know before.

Check the tracks to make sure that the replicates are behaving consistently. Check the nearby genes, make sure you don't have copy number changes or changes in repeat number/length across conditions. See if it corroborates with other pieces of information about your system. Typical scientific skepticism applies.

For example, if I run voom with quantile normalization (which may or may not be appropriate), the bias disappears. I am curious to hear what others think.

If the observed preference for negative log-fold changes is a genuine biological effect, then quantile normalization is not appropriate. In fact, not only will it get rid of many of the negative log-fold changes, it will introduce artificial positive log-fold changes to "balance things out". For example, see this toy example:

# Making up some log-expression values.
x <- y <- 1:100/10
y[50] <- 12 # some gene in the middle is upregulated in 'y'.

library(limma)
normed <- normalizeQuantiles(cbind(x, y))
plot(rowMeans(normed), normed[,2] - normed[,1]) # last gene is now downregulated in 'y'!


Of course, if you knew that the preference for negative log-fold changes was some technical artifact, then quantile normalization would be an effective solution. But if you knew that we wouldn't need to have this discussion.

0
Entering edit mode

Thank you, Aaron, for offering your expertise on this topic. Your response confirms my thinking on this and addresses any concerns I had. After further investigation and some discussion with colleagues, we are comfortable assuming that it is possible that more sites decrease in accessibility than increase, so we are sticking with the "standard" limma workflow with TMM normalization.