Trying to account for gene length in topGO analysis
Entering edit mode
Last seen 6.2 years ago
Michigan State University

Hello all,

I've attempted to use GOseq, and while I know that, due to the gene length correction, it is more appropriate for RNAseq data than topGO, I like that topGO takes the topology of the GO graph in to account. I'm trying to find a middle ground.

In his great great tutorial (, Bernd Klaus, uses the genefinder function to match 10 genes of similar expression levels (mean normalized count) to his DE genes of interest. He then uses this set as a background set to proceed with enrichment using topGO. I liked this approach, though I have an RNAseq data set with 4 different treatments with varied gene expression among the four and so I thought that matching based on average gene count might yield some inappropriate matches. I then thought couldn't I just use this matching method to pick genes with similar length I by that control for that bias similarly to GOseq. 

Does this make sense? 

Another question is, in the tutorial we search for 10 gene matches. Is this enough? I have a specific set of 70+ genes I'm interested in GO terms for and figured the background matching would only be 700 genes and that wouldn't be so representative of the whole set. Thoughts?

Any comments or suggestions are greatly appreciated.


My code based on the tutorial:


backG <- genefinder(geneLengthMatrix, rownames(DEgeneSet),numResults = 100, method = "manhattan")

backG <- rownames(geneLengthMatrix)[as.vector(sapply(backG, function(x)x$indices))]

backG <- setdiff(backG, rownames(DEgeneSet))


multidensity( list( 

      all= log2(geneLengthMatrix[,1]),

      foreground =log2(geneLengthMatrix[rownames(DEgeneSet),]), 

      background = log2(geneLengthMatrix[backG,])), 

      xlab="log2 gene length", main = "Matching for enrichment analysis")

geneIDs = rownames(dds)

inUniverse = geneIDs %in% c(rownames(DEgeneSet),  backG) 

inSelection =  geneIDs %in% rownames(DEgeneSet) 

alg <- factor( as.integer( inSelection[inUniverse] ) )

names(alg) <- geneIDs[inUniverse]

tgd<-new("topGOdata", ontology="BP", allGenes=alg, nodeSize=ns, annot= annFUN.gene2GO, gene2GO= CukeGO)
topgo goseq deseq2 • 1.2k views
Entering edit mode
Bernd Klaus ▴ 600
Last seen 3.9 years ago

Hi Ben,

thank you for your positive feedback, great that you like my tutorial!

1.) Matching by actual expresssion:

I simply used the basemeans for matching and 10 matches to speed up the whole process and since it gave reasonable results. This workflow was intended for teaching and thus needed to run relatively quickly. 

For the Fisher-test bases GO analysis that topGO performs, it is best to have as many genes in the background as possible in order to have maximal power, having just 700 in the background is probably a bit too few genes.

Also, you can provide an actual expression matrix of your genes to genefinder function, in order to match the genes by their expression pattern across samples, rather than just by base mean.

So you could try to run something like this:

# do rlog to have homoscedastic data

rld_mat <- assay(rlog(dds, blind = FALSE))

de_idx <- which(rownames(rld_mat) %in% de_genes)

genefinder(rld_mat, de_idx, 100, method = "manhattan")


2.) Matching by gene length

if you rather would want to match by gene length, you can adapt the following code to find the background genes

using the MatchIt package, just replace the gene mean count by the gene length.

background <- c()

## prepare data frame for matching, sign indicates wheather
## the gene is differentially expressed or not
df <- data.frame(
  sign=as.numeric( rownames(DESeq_results)  %in% as.character(de_genes)),

df$baseMean <- round( df$baseMean, 0)

## repeat matching multiple times since
## each differentially expressed gene is
## matched by exactly one non-expressed gene
## you can run this as often as you want ...

for( i in 1:3 ){
  match_res <- matchit( sign ~ baseMean, df,
                 method="nearest", distance="mahalanobis")
  background <- c(background, match_res$match.matrix[,1])
  df <- df[which(!rownames(df) %in% background),]

background <- unique( na.exclude(background) )
## total number of matched de_genes
length(background )

## no DE genes in Background:
intersect(background, de_genes)

Hope that helps, feel free to ask for clarifications if needed,


Entering edit mode

Bernd thanks for your great detailed answer.

I had looked into using the variance stabilized count matrix or even the counts(dds, normalized =T) matrix to use in genefinder and this is where I started to think that I would be getting inappropriate matches for the background. For example, I am interested in genes DE with age and genotype - so they are upregulated in a specific sample group out of 4. I used a padj threshold of 0.05 and fold change threshold of 2 to call genes DE. Using the counts matrices would pull out background sets that are all relatively up in one tissue (even if they didn't pass my thresholds). I'm not sure if this is a great group to act as a background. Am I wrong?

Regarding the gene length. Do you think that matching in this way is an appropriate method to address the gene length bias? Is there any precedence to do it this way? Though it makes sense to me, I couldn't find any papers addressing this issue in this way.

What are your thoughts?

What are the benefits of using matchit vs genefinder? I am slightly familiar with mahalanobis matching (my wife uses it in her political science research :-) ) But what am I gaining from this specific method vs mine above?  

Thank you again for all your help, I do appreciate it immensely.


Entering edit mode

Hi Ben,

your first observation is very pertinent: Yes, using genefinder like this can lead to a background that contains all your "false negatives". Running an enrichment analysis with this set might make you loose a lot of power. However, you can always exclude these potential false negative from your background set. You don't have to accept the genefinder results as it is.

To me, the rationale behind matching by gene length is that longer genes have more counts, and as genes instead of samples become your sampling units in an enrichment analysis, it makes sense to select a background set, that has a comparable gene length distribution and thus (most probably) comparable expression. I don't have a direct ref here, but one could potentially cite the goseq paper

which uses a more sophisticated method to correct for gene length / expression strength)

I think there are no other direct benefits to using matchIT compared to genefinder other than matchIT's flexibility in the program options. Note that the mahalanobis distance in the one-dimensional case (i.e. gene length only) is just a scaled eucledian distance).  For example, you could use the mahalnobis distance using both gene length and log-average expression. So I just wanted to point out another nice package that is not obvious to find for Bioconductors as it was written by political science people.

So basically I don't have very strong recommendations overall, but I can just provide you  with some hints. I think as long as you carefully describe what you did, avoid sweeping doubts under the carpet and discuss potential shortcomings, whatever strategy you decide on in the end is fine.

Best wishes,


Entering edit mode


Thanks for your advice and explanations. 

After talking to my PI and discussing the matter I think we will opt to go with matching based on average gene expression as you originally described (not the counts matrix). We also thought of using the two variables (expression and length), as you mentioned, but we figure that expression levels would account for length as well. So basically as you describe in your code above. Maybe I'll run it once that way and see if I get any different results.

Thanks again for all your help.



Login before adding your answer.

Traffic: 441 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6