Question

limma-voom when testing differential expression of a subset of genes.

2

Entering edit mode

ankur.chakravarthy.10 ▴ 40

@ankurchakravarthy10-8368

Last seen 6.8 years ago

United Kingdom

Dear list users,

I am trying to test if a given, predefined set of genes is differentially expressed between two groups in an RNA-seq experiment (count data, using voom) . I am therefore interested in only evaluating the hypothesis of differential expression within my set, so as to limit the burden of multiple-testing, for instance.

In light of this, which of the following is likely to be the most appropriate approach.

1) reduce the counts matrix to my gene set first, estimate mean-variance weights using voom, go on to use lmFit, eBayes and topTable to identify DEGs

or

2) model the mean-variance relationship across the whole matrix of counts with voom, then subset the voom object to only include my genes of interest, and then use lmFit, eBayes and topTable.

or

3) Carry out mean-variance modelling and shrinkage on the whole dataset, then subset the topTable to my gene set and re-estimate false discovery rates from the raw P values, since the only hypotheses I intended to test were those related to my subset of genes.

In terms of numbers, I am looking at a subset of 153 genes out of a total of 20,000.

Best wishes,

Ankur Chakravarthy

limma voom differential gene expression • 2.5k views

ADD COMMENT • link updated 8.8 years ago by Gordon Smyth 50k • written 8.8 years ago by ankur.chakravarthy.10 ▴ 40

3

Entering edit mode

Gordon Smyth 50k

@gordon-smyth

Last seen 7 hours ago

WEHI, Melbourne, Australia

Just to add to Jim's answer, you should filter out unexpressed or very low expressed genes before running voom. So #3 is correct provided that the "whole dataset" is interpreted to mean all expressed genes.

ADD COMMENT • link 8.8 years ago Gordon Smyth 50k

score 3 · Accepted Answer · 2015-07-09

You should use #3, as both voom() and eBayes() are intended to use all available data to make their estimates. And do note that MArrayLM objects can be subsetted as if they were data.frames or matrices. So if you do

v <- voom(dglst.object)
fit <- lmFit(v, design)
fit2 <- contrasts.fit(fit, contrast) ## if necessary
fit2 <- eBayes(fit2)
fit2 <- fit2[<indicator variable to subset>,]
topTable(fit2, 1)

then the BH adjustment for the genes in your topTable will be based on just the subset of genes you care about.