Question

NaN results on some pathways in CAMERA

2

Entering edit mode

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Last seen 6 months ago

Icahn School of Medicine at Mount Sinai…

I'm running CAMERA on one of my datasets with gene set collections imported from MSigDB and graphite, and I am getting NaN results for a large fraction of gene sets that I'm testing (around 50% of pathways for some collections). This is odd because I don't remember that happening a few versions ago the last time I used CAMERA on the same gene sets. I've prepared a test case that demonstrates the problem. You can download the script and data file from here: https://www.dropbox.com/sh/dp5aheqzvf2rij5/AADuA68Hk5BdRIJpL4H_knbHa?dl=0 Simply put them in the same folder and run the script, which is just this:

library(edgeR)
testcase <- readRDS("camera-nan-testcase.RDS")
result <- with(testcase, camera(dge, pathway, design, makeContrasts(contrasts=contrast, levels=design)))
print(result)

The result should be:

Warning message:
Zero sample variances detected, have been offset
data.frame with 1 row and 4 columns
        NGenes Correlation   Direction    PValue
     <numeric>   <numeric> <character> <numeric>
set1        40         NaN          Up       NaN

I have no idea what's causing this. Is there something about this gene set that makes it incompatible with CAMERA, or is there an issue in my data?

camera edger pathway analysis • 2.1k views

ADD COMMENT • link updated 9.8 years ago by Gordon Smyth 52k • written 9.8 years ago by Ryan C. Thompson ★ 7.9k

0

Entering edit mode

Upon further investigation, it appears that the problem occurs with any gene set containing at least one gene that has zero counts in all my samples.

ADD REPLY • link 9.8 years ago Ryan C. Thompson ★ 7.9k

score 5 · Accepted Answer · 2015-06-18

5

Entering edit mode

Gordon Smyth 52k

@gordon-smyth

Last seen 4 hours ago

WEHI, Melbourne, Australia

As you have figured out, this occurs for gene sets that contain at least one gene with no counts. A gene with no counts is essentially missing. It has undefined differential expression, undefined dispersion and undefined correlation with other genes. Hence the differential status of the whole gene set is also undefined, hence the camera output.

In principle, we could put a check in the camera method for DGEList objects to remove genes with no counts, and that would give you non-missing camera results. But it would be much better for you to filter out genes with very low counts before you do the dispersion estimation step prior to running camera. As you know, we recommend that edgeR users always filter genes with very low counts.

ADD COMMENT • link 9.8 years ago Gordon Smyth 52k

0

Entering edit mode

At least, it would probably be good to issue a warning explaining why some gene sets have undefined outputs and reminding the user to filter low counts.

Regarding filtering before dispersion estimation, do very low-count genes actually significantly distort the dispersion trend and thereby interfere with accurate dispersion estimates of higher-count genes? Is that why you need to filter them out before estimating dispersions and not after?

ADD REPLY • link 9.8 years ago Ryan C. Thompson ★ 7.9k

1

Entering edit mode

Yes, they do distort the dispersion trend. The default loess-based fitting procedure is span-based, so if you have a large number of very low-abundance genes (sometimes half of all annotated genes in our data sets), the span will be skewed towards these low abundances. This compromises the ability of the trend to fit accurately at high abundances. Check out Section 5.2.1 in the csaw user's guide for an example (albeit for ChIP-seq). Thus, we need to filter them out to get better accuracy for the higher-abundance genes that are more relevant to us.

In this specific case, the zero counts stuff up camera in several places. The first is in the dispersion trend; the second is in the estimation of the variances, which are now zero (as all values are the same for all-zero rows); the third is in the estimation of the prior degrees of freedom, where the zero variances serve as large negative outliers upon log-transformation; and the fourth is in the actual hypothesis testing itself, where the presence of zero t-statistics results in reduced power.

I think that a warning has been added to camera, as you've suggested in your comment. However, for the reasons listed above, it would be a lot better to remove low-abundance genes beforehand.

ADD REPLY • link 9.8 years ago Aaron Lun ★ 28k