Hello,
I am analysing RNA-seq data to find differentially expressed genes between a wild-type (Ctl) and mutant (Mut). The mutant has a large DNA duplication encompassing several hundred genes, so they always come up as ~2x upregulated but this is not biologically interesting. Hence the H0 (null) for these particular genes is that Mut = 2 x Ctl, while H0 for the rest of the genes is Mut = Ctl.
My workaround for this was to do two runs of contrast testing on the limma.voom output, one run where H0 is Mut = Ctl, the second where H0 is Mut = 2 x Ctl. Then I would get 2 topTable outputs, one from which I'd select rows for the non-duplicated genes and the other from which I'd select the duplicated genes. However, I am having difficulty specifying (or perhaps even understanding) the contrasts for these comparisons and the results are not making sense to me. In particular, I would have expected the log-FC values from both comparisons to be well-correlated, but they are actually very different, especially for genes with high expression. So is this even a valid approach?
Here is an example using the zebrafish dataset:
library(zebrafishRNASeq) library(limma) library(edgeR) library(ggplot2) data(zfGenes) filter <- apply(zfGenes, 1, function(x) length(x[x>10]) > 1) cmat <- zfGenes[filter,] design <- matrix(c(rep(1,3), rep(0,6), rep(1,3)), ncol=2, dimnames = list(c(),c('ctl','mut'))) nf <- calcNormFactors(cmat) y <- voom(cmat, design, lib.size=colSums(cmat)*nf) fit <- lmFit(y,design) ## make a function to run eBayes etc. with different null hypotheses ## run_cf_eb <- function(H0 = 1, fit){ contrasts = paste0('mut-', H0, '*ctl') cont.matrix <- makeContrasts(contrasts = contrasts, levels=c('ctl','mut')) print(cont.matrix) fit_b <- contrasts.fit(fit, cont.matrix) fit_b <- eBayes(fit_b) return( topTable(fit_b, n=Inf, sort.by='none') ) } tt_1 <- run_cf_eb(H0 = 1, fit) ## H0 is 'ctl = 1 x mut' ## tt_2 <- run_cf_eb(H0 = 2, fit) ## H0 is 'ctl = 2 x mut' ## tt_all <- merge(tt_1, tt_2, by = 'row.names', suffixes = c('.H0_1','.H0_2')) ## Plot logFC from H0('ctl=mut') vs logFC from H0('ctl=2mut'): ## ggplot(tt_all) + geom_point(aes(x=logFC.H0_1, y=logFC.H0_2, colour = AveExpr.H0_1))
Output:
Contrasts
Levels mut-1*ctl
ctl -1
mut 1
Contrasts
Levels mut-2*ctl
ctl -2
mut 1
I realize that I will probably have to tweak the calcNormFactors bit to account for the duplication and also re-run the p-value adjustment after the analysis, however my question right now is just about how to approach the contrasts.
Thanks,
Stuart
Thank you very much, this was really helpful.
The duplicated genes show an average ~ +0.8 logFC. This is less than +1.0 probably due to homeostatic mechanisms at play, which indicates that I should treat duplicated genes that are downregulated from the logFC +1.0 baseline with caution. Nonetheless it is pretty clear that the duplicated genes are generally overexpressed.
The following slightly modified code worked on limma_3.22.7 in R3.1.3:
Thanks,
Stuart
edit: changed degrees of freedom from '4' to 'fit$df.total' according to comments below
For future reference, it's typically a good idea to be using the latest version of all packages. Currently, limma is at version 3.26.5, so you're two Bioconductor releases behind (for reference, we're currently at BioC version 3.2 on R 3.2.2+). There's no real reason to use an older version unless you're trying to reproduce old results.
Yes, well spotted, there were two typos in my code for the t.stat and p.value. Note that the df for pt() should be the total empirical Bayes df, not just the ordinary residual df.