edgeR: effect of adding background reads
Vivek.b ▴ 100

Hello everyone

I am using edgeR to analyse allele-specific expression in a mouse RNA-seq dataset. I use featureCounts to count the reads mapped to the maternal or paternal allele, and also those that map equally to both (i.e. reads that don't overlap a SNP). Then I use edgeR to test for differential expression (maternal over paternal allele) using these counts. However, I found that these counts are mostly low (since I count only allele-specific reads and discard reads with no SNP information).

To work around this problem, someone suggested that I add a proportion of "background reads" (i.e. reads with no allelic information) to the allele-specific read counts on both sides. This increased the number of differentially expressed genes detected. In fact, adding 50% of the background reads also makes the expression status of my "control genes" (mouse imprinted genes) comparable to a previously published dataset from the same cell line (which was sequenced at twice our depth).

However, I am unsure whether my strategy is correct. How is the testing in edgeR affected if you are comparing, for example, 14 vs 12 reads instead of 4 vs 2 reads? What's the best strategy to compute differential expression in this situation?

Aaron Lun ★ 27k
The city by the bay

The size of your counts is an inherent feature of your data. No amount of manipulation will avoid the fact that your counts are low, and that there is limited information available for that gene. In particular, adding a prior count (i.e., what you call "background reads") will shrink the log-fold changes towards zero and reduce detection power for that gene, e.g., the fold change of 14 against 12 will be smaller than the fold change of 4 against 2.
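To make the 14 vs 12 against 4 vs 2 comparison concrete, here is a minimal sketch of the arithmetic in Python. It only computes a plain log2 ratio of two counts; it ignores library sizes and is not how edgeR computes its fold changes. The helper name `log2_fold_change` and the added constant of 10 are assumptions chosen to match the numbers in the question.

```python
import math

def log2_fold_change(a, b, added=0.0):
    """Plain log2 ratio of two counts after adding the same constant to both."""
    return math.log2((a + added) / (b + added))

raw = log2_fold_change(4, 2)                # 4 vs 2
padded = log2_fold_change(4, 2, added=10)   # 14 vs 12 (10 assumed extra reads per side)

print(f"4 vs 2:   log2FC = {raw:.3f}")      # 1.000
print(f"14 vs 12: log2FC = {padded:.3f}")   # 0.222
```

The same difference of two reads corresponds to a two-fold change in the first case and only about a 1.17-fold change in the second.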

I suspect that any "improvement" that you observe is due to the fact that the addition of the prior count reduces the apparent variability of the counts at low abundances. This results in lower dispersion estimates for those genes and, because information is shared across the data set via empirical Bayes shrinkage, lower estimates for all genes. This is particularly true if the low-abundance genes dominate the data set.
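A rough illustration of why the apparent variability drops, again as a Python sketch: adding a constant to every replicate leaves the variance unchanged but raises the mean, so the spread relative to the mean shrinks. The squared coefficient of variation used here is only a crude stand-in and is not edgeR's dispersion estimator; the replicate counts and the constant of 10 are made up for the example.

```python
from statistics import mean, variance

def squared_cv(counts):
    """Sample variance divided by the squared mean (spread relative to abundance)."""
    m = mean(counts)
    return variance(counts) / (m * m)

replicates = [1, 5, 2, 4]   # hypothetical allele-specific counts for one gene
background = 10             # hypothetical constant added to every replicate
padded = [c + background for c in replicates]

cv2_raw = squared_cv(replicates)
cv2_padded = squared_cv(padded)

# The variance is identical, only the mean has changed.
print(f"variance: raw = {variance(replicates):.3f}, padded = {variance(padded):.3f}")
print(f"CV^2:     raw = {cv2_raw:.3f}, padded = {cv2_padded:.3f}")
```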

If this is true, the better approach would be to filter away the low-abundance genes, as described in the edgeR user's guide. You're not going to get any information out of them anyway, as the counts are too low to be useful. Also, by filtering, you can get the benefits of lower dispersions for the higher-abundance genes without having to artificially change the counts.
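As a sketch of what such a filter can look like, the Python snippet below keeps a gene only if its counts-per-million (CPM) value exceeds a threshold in a minimum number of samples. This is plain Python for illustration, not edgeR code; the function names, the threshold of 1 CPM, the minimum of 2 samples, and all the counts and library sizes are assumptions, and sensible values depend on your sequencing depth and group sizes.

```python
def counts_per_million(counts, library_sizes):
    """Scale each raw count by its sample's library size."""
    return [1e6 * c / n for c, n in zip(counts, library_sizes)]

def filter_low_abundance(counts_by_gene, library_sizes, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM exceeds min_cpm in at least min_samples samples."""
    kept = {}
    for gene, counts in counts_by_gene.items():
        cpms = counts_per_million(counts, library_sizes)
        if sum(value > min_cpm for value in cpms) >= min_samples:
            kept[gene] = counts  # raw counts are passed on unchanged
    return kept

# Hypothetical allele-specific counts for three genes across four samples.
counts_by_gene = {
    "geneA": [0, 1, 0, 1],
    "geneB": [30, 25, 40, 35],
    "geneC": [3, 0, 5, 1],
}
library_sizes = [2_000_000, 1_500_000, 2_500_000, 2_000_000]

kept = filter_low_abundance(counts_by_gene, library_sizes)
print(sorted(kept))  # ['geneB', 'geneC']
```

The point of the design is that the surviving genes keep their original counts, so nothing artificial enters the testing. Recent edgeR versions also provide `filterByExpr` for this step.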

WEHI, Melbourne, Australia

To paraphrase Aaron's comments but in blunter terms: No, this strategy is not correct. You must not misrepresent to edgeR the true nature of your counts.

Your anonymous advisor may be misled by the fact that edgeR adds prior counts when computing predictive (shrunk) log-fold changes, but adding imaginary counts must not be done as part of the hypothesis testing.

Thanks Aaron and Dr. Smith for your answers. Indeed, most of the genes have low counts when I count only allele-specific reads, and I also expect the reduction in variability from adding background reads to be the reason for the improved differential expression results. If this is the case, then not adding these counts and filtering out low-count genes (as Aaron suggested) should improve the differential expression results. I can still hope to see the high fold-change genes at the top of the list in both cases.

Can you suggest how I should decide on the count cut-off for filtering these genes?

