Hi,
I have an RNA-seq test dataset where I know which genes are differentially expressed. I used RUVseq to get the factors of unwanted variation and ran these data through voom and limma using the following code:
require(RUVSeq)
require(EDASeq)
require(edgeR)
require(limma)
require(dplyr)
#Input data matrix where rownames are gene IDs and colnames are sample IDs.
#Columns are ordered so "Group1" samples are listed before "Group2" samples.
data.matrix
#Create RUV design matrix
Grp1vGrp2 = factor(c(rep("Group1", times = 8),
rep("Group2", times = 8)))
RUV.design_matrix = makeGroups(Grp1vGrp2)
#Normalize input data with EDASeq's UQ normalization.
RUV_input_data = betweenLaneNormalization(data.matrix, which="upper")
#Run RUVs
RUV_results = RUVs(RUV_input_data, cIdx=rownames(RUV_input_data), k=1,
scIdx=RUV.design_matrix)
#Create design matrix integrating factors of unwanted variation
limma.design_matrix = model.matrix(~Grp1vGrp2 + RUV_results$W)
#Run data through voom (using edgeR to perform the UQ normalization)
voom_results =
voom(calcNormFactors(DGEList(data.matrix, group=Grp1vGrp2),
method = "upperquartile"),
design=limma.design_matrix, plot = FALSE, save.plot = FALSE)
#Use limma to perform DE analysis for each normalized dataset
limma_results = eBayes(lmFit(voom_results, design=limma.design_matrix))
#Export the results
topTable(limma_results, n=Inf, adjust.method = "BH")
Looking at the results, it appears the adjusted p-values returned by RUVseq + voom-limma are way smaller than they should be. In fact, voom-limma without RUVseq seems to perform better. Also, while I used RUVs in the example above with k=1 and using all genes as negative controls, I have performed parameter sweeps using various k-values (1-3), RUV modes (RUVs, RUVr, RUVg), and negative control gene lists (all, empirically-determined), and the results are the same across all configurations. It's possible this result is correct, or this could be an artifact of the dataset I'm using, but before I delved further into either of these options, I just wanted to make sure I was integrating RUVseq's results correctly into the voom-limma model.
My dataset is too large to include here, so I'm really just seeking advice from the experts about best practices for using RUVseq with voom-limma. Thanks for your help.
Hi Aaron,
Thanks for your response. Sadly I was lured off course by the Siren's song of the limmavoom tag. I have updated to the appropriate tags.
Apologies for the lack of detail in my question. I'm working with a dataset where I know which genes that are truly differential. The way I was running RUV + limma/voom was resulting in a lot of false positives, because the p-values (and as a result the q-values) were extremely low (< 1e-22) across the board. Like I said, this could be a weird artifact of the data I'm using, so I wanted to make sure I wasn't making any egregious errors in my code before I really started to run this down.
As for your comments on the code, I was mirroring the section in the RUVseq vignette describing how to use RUV with edgeR. In that case, the authors used EDASeq's
betweenLaneNormalization
to perform an upper-quartile for the initial RUV run, and then used edgeR's implementation of upper-quartile normalization for the final DEG analysis. I was trying to keep the limma/voom analysis as close as possible to the edgeR example, which may not be the best choice in this case. If I understand your suggestion correctly, are you saying I should perform UQ andvoom
normalization, then use the results of this normalization as input for RUV? It looks like theE
field stores values in log2. Do you happen to know the base of the log RUVs is expecting whenisLog=TRUE
? I suspect it might be natural log, but I checked the documentation and couldn't find anything explicit. I will also follow your suggestion about the coefficient intopTable
call and see how that affects my results.Thanks for your help.
Well, in hindsight, I don't think the choice of normalization strategy will make a major difference, provided
betweenLaneNormalization
isn't doing anything crazy (and it probably isn't, if the RUVseq authors are using it). But, that being said, yes - I was suggesting the log-transformed values from voom as input intoRUVs
, just to make sure that you're identifying unwanted variation from the same values that you'll ultimately do the DE analysis on. Also, the base of the log-transform shouldn't matter, because that'll just affect the scaling of the covariates returned byRUVs
(which won't affect the final fit). ThetopTable
thing is probably the real problem - make sure you're specifying the right contrast.I agree with Aaron, normalization and the base of the log function will not affect too much your results and that the choice of the coefficient in topTable is the real issue here. You're effectively testing for genes that are associated with either the factor of interest or the factors of unwanted variation, so it's not surprising that you get many false positives.
I just wanted to circle back and confirm that you both were correct and the missing coefficient was the major source of my problems. I set coef=2 and the results are much more in line with what I'd expect.
I haven't yet tested using the
voom
normalized values as input for RUVseq, or tried the original RUV as Davide suggested below. I'll update this post later with those results just for everyone's reference.Thank you both for helping me figure this out.