Question

How to use RUVseq with voom-limma for differential expression analysis?

0

Entering edit mode

brainfood • 0

@brainfood-11349

Last seen 7.7 years ago

Hi,

I have an RNA-seq test dataset where I know which genes are differentially expressed. I used RUVseq to get the factors of unwanted variation and ran these data through voom and limma using the following code:

require(RUVSeq)
require(EDASeq)
require(edgeR)
require(limma)
require(dplyr)

#Input data matrix where rownames are gene IDs and colnames are sample IDs.
#Columns are ordered so "Group1" samples are listed before "Group2" samples.
data.matrix

#Create RUV design matrix
Grp1vGrp2 = factor(c(rep("Group1", times = 8),
rep("Group2", times = 8)))
RUV.design_matrix = makeGroups(Grp1vGrp2)

#Normalize input data with EDASeq's UQ normalization.
RUV_input_data = betweenLaneNormalization(data.matrix, which="upper")

#Run RUVs
RUV_results = RUVs(RUV_input_data, cIdx=rownames(RUV_input_data), k=1,
scIdx=RUV.design_matrix)

#Create design matrix integrating factors of unwanted variation
limma.design_matrix = model.matrix(~Grp1vGrp2 + RUV_results$W)

#Run data through voom (using edgeR to perform the UQ normalization)
voom_results =
    voom(calcNormFactors(DGEList(data.matrix, group=Grp1vGrp2),
                         method = "upperquartile"),
         design=limma.design_matrix, plot = FALSE, save.plot = FALSE)

#Use limma to perform DE analysis for each normalized dataset
limma_results = eBayes(lmFit(voom_results, design=limma.design_matrix))

#Export the results
topTable(limma_results, n=Inf, adjust.method = "BH")

Looking at the results, it appears the adjusted p-values returned by RUVseq + voom-limma are way smaller than they should be. In fact, voom-limma without RUVseq seems to perform better. Also, while I used RUVs in the example above with k=1 and using all genes as negative controls, I have performed parameter sweeps using various k-values (1-3), RUV modes (RUVs, RUVr, RUVg), and negative control gene lists (all, empirically-determined), and the results are the same across all configurations. It's possible this result is correct, or this could be an artifact of the dataset I'm using, but before I delved further into either of these options, I just wanted to make sure I was integrating RUVseq's results correctly into the voom-limma model.

My dataset is too large to include here, so I'm really just seeking advice from the experts about best practices for using RUVseq with voom-limma. Thanks for your help.

ruvseq limma voom • 3.1k views

ADD COMMENT • link updated 7.7 years ago by davide risso ▴ 950 • written 7.7 years ago by brainfood • 0

score 0 · Answer 1 · 2016-08-27

0

Entering edit mode

Aaron Lun ★ 28k

@alun

Last seen 4 hours ago

The city by the bay

For starters, it's not clear what you mean by the p-values being "smaller than they should be". If you knew what the p-values should be, you wouldn't have to run analyses on real data. Besides, if there truly was some hidden factor of variation, then it would inflate the sample variances in a direct voom-limma analysis. Discovering it with RUVs and blocking on it should reduce the variance estimates and decrease the p-values, so your observation would be consistent with expected behaviour. As for why naive voom-limma "seems to perform better"; that's too vague a statement for us to help you with, so you'll have to be more specific.

Now, onto your code. The first issue is that you run RUVs on the output of betweenLaneNormalization, but you then run the rest of the analysis with a different normalization scheme. If you truly wanted to be consistent, you'd use the same normalization scheme throughout the analysis. For example, you could run RUVs with isLog=TRUE on the E field of the voom output. The second issue is that your topTable call will automatically drop all non-intercept coefficients for testing, including the factor of unwanted variation (see the documentation for more details). This is probably not what you want, so set coef appropriately - probably to 2, judging by the design matrix.

Finally, if you want responses to your post, tag with limma and voom separately, otherwise the relevant people won't be notified.

ADD COMMENT • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

Hi Aaron,

Thanks for your response. Sadly I was lured off course by the Siren's song of the limmavoom tag. I have updated to the appropriate tags.

Apologies for the lack of detail in my question. I'm working with a dataset where I know which genes that are truly differential. The way I was running RUV + limma/voom was resulting in a lot of false positives, because the p-values (and as a result the q-values) were extremely low (< 1e-22) across the board. Like I said, this could be a weird artifact of the data I'm using, so I wanted to make sure I wasn't making any egregious errors in my code before I really started to run this down.

As for your comments on the code, I was mirroring the section in the RUVseq vignette describing how to use RUV with edgeR. In that case, the authors used EDASeq's betweenLaneNormalization to perform an upper-quartile for the initial RUV run, and then used edgeR's implementation of upper-quartile normalization for the final DEG analysis. I was trying to keep the limma/voom analysis as close as possible to the edgeR example, which may not be the best choice in this case. If I understand your suggestion correctly, are you saying I should perform UQ and voom normalization, then use the results of this normalization as input for RUV? It looks like the E field stores values in log2. Do you happen to know the base of the log RUVs is expecting when isLog=TRUE? I suspect it might be natural log, but I checked the documentation and couldn't find anything explicit. I will also follow your suggestion about the coefficient in topTable call and see how that affects my results.

Thanks for your help.

ADD REPLY • link 7.7 years ago brainfood • 0

0

Entering edit mode

Well, in hindsight, I don't think the choice of normalization strategy will make a major difference, provided betweenLaneNormalization isn't doing anything crazy (and it probably isn't, if the RUVseq authors are using it). But, that being said, yes - I was suggesting the log-transformed values from voom as input into RUVs, just to make sure that you're identifying unwanted variation from the same values that you'll ultimately do the DE analysis on. Also, the base of the log-transform shouldn't matter, because that'll just affect the scaling of the covariates returned by RUVs (which won't affect the final fit). The topTable thing is probably the real problem - make sure you're specifying the right contrast.

ADD REPLY • link 7.7 years ago Aaron Lun ★ 28k

0

Entering edit mode

I agree with Aaron, normalization and the base of the log function will not affect too much your results and that the choice of the coefficient in topTable is the real issue here. You're effectively testing for genes that are associated with either the factor of interest or the factors of unwanted variation, so it's not surprising that you get many false positives.

ADD REPLY • link 7.7 years ago davide risso ▴ 950

0

Entering edit mode

I just wanted to circle back and confirm that you both were correct and the missing coefficient was the major source of my problems. I set coef=2 and the results are much more in line with what I'd expect.

I haven't yet tested using the voom normalized values as input for RUVseq, or tried the original RUV as Davide suggested below. I'll update this post later with those results just for everyone's reference.

Thank you both for helping me figure this out.

ADD REPLY • link 7.7 years ago brainfood • 0

score 0 · Answer 2 · 2016-08-31

Just to integrate Aaron's answer, note that RUVSeq was designed specifically with a negative binomial model in mind. Hence, it is appropriate to use with edgeR and DESeq / DESeq2, but its behavior with limma/voom was never tested.

However, the RUV methodology was first developed in a linear model setting, implemented in the CRAN package "ruv": https://cran.r-project.org/web/packages/ruv/index.html

So, a possibly better approach to your problem would be to use voom in conjunction with the linear model version of RUV. (You may want to look into the RUV2 and RUV4 algorithms implemented in the CRAN package). Note that this was never the intended use of either of the RUV packages, so don't expect things to "just work", a fair amount of "hacking" could be required.