Question

Limma get all DE genes or filter by logFC

0

Entering edit mode

bharata1803 ▴ 60

@bharata1803-7698

Last seen 5.1 years ago

Japan

Hello,

So, I used Limma package for both microarray and rna-seq data. I have followed the tutorial and it works great. So, most of the tutorial will use topTable function to get the "interesting" gene. The topTable function will filter based on the level of logFC value.

- My first question is, how I get the list of genes with a certain range of logFC. For example, I want to check the similar expressed genes which will have logFC near 0. How do I do topTable like function to get the near 0 logFC?

- My second question is how to get all genes, not filtered by topTable or anything? I want to compare all of the genes' logFC and not only the "top" genes. Currently, I used topTable with the n=total number of genes. If I know the number of genes, I can do that, but I don't know the exact number for microarray.

Thank you for your answer.

limma microarray rnaseq • 3.5k views

ADD COMMENT • link updated 8.4 years ago by Aaron Lun ★ 28k • written 8.4 years ago by bharata1803 ▴ 60

score 3 · Answer 1 · 2015-11-16

I'll answer your second question first; if you want to get all genes, it should be as easy as specifying n=Inf in topTable. This will extract statistics for all genes, sorted by the B-statistic. Of course, you could also use the total number of genes, which can be determined by running nrow on the MArrayLM object you supply to topTable.

As for your first question, topTable only supports filtering on a minimum log-fold change. If you want something more complicated, you'll have to do it yourself. For example, if you assign the output of topTable into res, you could do:

keep <- res$logFC < 1 & res$logFC > -1
res[keep,]

... to get all genes with an estimated log-fold change between -1 and 1 (i.e., no more than 2-fold change in either direction). A more rigorous strategy for identifying genes with near-zero log-fold changes is to use confidence intervals:

res <- topTable(fit,coef=2, confint=0.95) # 95% CIs for the log-fold changes
keep <- res$CI.L > -1 & res$CI.R < 1
res[keep,]

This will identify genes where the 95% confidence interval for the log-fold change lies within -1 and 1. The use of confidence intervals accounts for any uncertainty in log-fold change estimation (e.g., when the data is variable).

However, be warned that filtering on the log-fold change is generally incompatible with filtering on the (adjusted) p-value. Check out the "Note" in ?topTable:

If the fold changes and p-values are not highly correlated, then the use of a fold change cutoff can increase the false discovery rate above the nominal level.

This is because you can get fairly large log-fold changes when genes have highly variable expression between replicates. Selection for large log-fold changes would result in selection of these variable genes that are not significantly DE. Of course, if you're looking for non-DE genes with near-zero log-fold changes, this is less of a problem. Filtering for large p-values would be ad hoc anyway, as the significance statistics won't tell you whether or not the gene is non-DE (absence of evidence does not equal evidence of absence, and all that).