Question

smallRNA-seq analysis batch correction in limma

0

Entering edit mode

Eleonora • 0

@53e98983

Last seen 19 months ago

Denmark

Good morning,

I am currently trying to analyse a small-RNA sequencing dataset using limma package and the voomLmFit function. However, I am experiencing some issues with the p-value distribution, with some of the comparisons I am testing showing odd p-value distribution. This probably indicates that there is some kind of batch effect that I am not correcting for, even though there is no separation of the samples in the PCA plot based on any of the potential confounding variables.

I have tried to include different variables in the model to see if the p-value distribution improves, but it does not. However, including some variables (for examples RNA_extraction_batch) improves the Students t Q-Q plot (that I read it is good to check?). Do you have any suggestion on why this is the case and if I should use a model that improves the Q-Q plot even though it does not really improve the p-value distribution? I have included the code and two examples of qq plots of the fit2$t with or without including the variables in the model.

Thank you for the help!

y <- DGEList(counts = counts[, -c(1:6)], samples = metadata, genes = counts[, 1:6])
design <- model.matrix(~0 + group +V1 +V2, data = y$samples) #V1 and V2 are optional

keep <- filterByExpr(y,  min.count = 10, design = design)

y <- y[keep,,keep.lib.sizes = FALSE]

y <- calcNormFactors(y)

v <- voomLmFit(counts = y, design = design, sample.weights = TRUE, plot= TRUE)

fit <- eBayes(v)

contrasts <- makeContrasts(....) #all the different contrasts that I want to test

fit2 <- contrasts.fit(fit, contrasts)

fit2 <- eBayes(fit2)

hist(topTable(fit2, coef = x, number = Inf)$P.Value, breaks = 20) #x = specific contrast

#qqplot 

qqt(fit2$t, df=fit2$df.prior+fit2$df.residual,pch=16,cex=0.2)

abline(0,1, col="red", lwd=2)

qq plot with no extra variables in the design qq plot when adding addional variable (v1, v2) in design

BatchEffect limma • 970 views

ADD COMMENT • link 20 months ago • updated 19 months ago Eleonora • 0

score 0 · Answer 1 · 2023-07-19

The Q-Q plot is mainly for theoretical interest and I do not recommend it as a way to judge whether batch factors should be included in the model.

Same remarks for the p-value histogram. I have never claimed that limma will return a uniform p-value distribution for count data and indeed most statistical tests will not do that. The important thing is overall FDR control rather than uniformity of the p-value histogram.

In general, neither Q-Q plots or p-value histograms are primary tools for data exploration or trouble-shooting. The MDS plots, mean-variance plots and MD plots are more important. And of course you can simply try including the putative batch effects in the linear model and see whether or not they account for DE genes.