Question

Dependence of FDR and P-values on sample size

ilovesuperheroes1993 • 0

@ilovesuperheroes1993-17038

Last seen 4.8 years ago

Hi, I am doing a differential expression analysis of small RNA using edgeR. I have 4 normal and 4 diseased samples, and all samples are paired. I have very little knowledge of statistics, so I would appreciate clarification on the following:

When I do the analysis without any pre-filtering of low-abundance genes, I get about 10-12 genes with FDR < 0.05. When I impose a filtering criterion, for example that the cpm value must be greater than 1 in at least 4 of the samples, I get only 1 gene with FDR < 0.05.
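For concreteness, the filtering rule described above can be sketched as follows. This is a minimal plain-Python illustration, not edgeR code: the gene names, counts and library sizes are hypothetical, and CPM is computed from the raw library size only, whereas edgeR's `cpm()` also accounts for normalization factors by default.

```python
def cpm(counts, lib_sizes):
    """Counts per million for one gene across all samples."""
    return [c / n * 1e6 for c, n in zip(counts, lib_sizes)]


def keep_gene(counts, lib_sizes, min_cpm=1.0, min_samples=4):
    """True if the gene has CPM > min_cpm in at least min_samples samples."""
    values = cpm(counts, lib_sizes)
    return sum(v > min_cpm for v in values) >= min_samples


# Hypothetical data: 8 samples with 2 million reads each,
# so a CPM of 1 corresponds to a raw count of 2.
lib_sizes = [2_000_000] * 8
counts = {
    "miR-a": [150, 180, 160, 170, 40, 35, 50, 45],  # well expressed
    "miR-b": [0, 1, 0, 3, 0, 0, 1, 0],              # CPM > 1 in one sample
    "miR-c": [5, 0, 1, 6, 4, 0, 3, 1],              # CPM > 1 in four samples
}

kept = [gene for gene, row in counts.items() if keep_gene(row, lib_sizes)]
print(kept)
```

With these made-up numbers, only the gene that passes the threshold in fewer than 4 samples is dropped.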

However, the top genes are more or less the same in both cases. In case two, my results look better, when viewing the cpm values of the samples side by side, as most of the low-abundance genes have been filtered out.

So, my question is, on reducing the data set, is it so that the FDR values also increase? Do the p-values and FDR values have a bearing on the sample size?

Thank you

edgeR FDR p-value Benjamini-Hochberg Differential Analysis • 1.1k views

updated 5.1 years ago by Aaron Lun ★ 28k • written 5.1 years ago by ilovesuperheroes1993 • 0

score 0 · Answer 1 · 2019-03-08

In case two, my results look better, when viewing the cpm values of the samples side by side,

I don't understand what you mean here.

So, my question is, on reducing the data set, is it so that the FDR values also increase?

In theory, no. Upon filtering, the total number of tests should decrease, which should reduce the severity of the FDR correction among the remaining genes. So if your top genes survive filtering (as you describe), they should have lower FDR values in the filtered analysis - in theory.
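To illustrate the theoretical point, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. The p-values are hypothetical, and the sketch assumes the p-values of the surviving genes are identical before and after filtering, which is exactly the assumption that fails in practice because filtering also changes normalization and dispersion estimation.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never decrease as the raw p-values increase.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values: three strong signals plus uninformative tests.
top = [0.0001, 0.0004, 0.002]
unfiltered = top + [0.5] * 997  # 1000 tests in total
filtered = top + [0.5] * 197    # 200 tests after filtering

fdr_unfiltered = bh_adjust(unfiltered)[:3]
fdr_filtered = bh_adjust(filtered)[:3]
print(fdr_unfiltered)
print(fdr_filtered)
```

For the same three raw p-values, the adjusted values are smaller when fewer tests are performed, since each raw p-value is scaled by the total number of tests divided by its rank.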

That this does not occur actually points towards the practical motivation for filtering in RNA-seq data. Specifically, filtering aims to get rid of low-abundance genes that interfere with accurate normalization and modelling of the mean-dispersion trend. For example, without filtering, I would expect to see lots of discreteness in the MA plot that interferes with TMM normalization. I would also expect to see sharp increases in the dispersion at low abundances (possibly due to the increased relative impact of PCR duplication on extra-Poisson variation); this is not only difficult to model in itself, it can actually reduce the accuracy of the fitted trend for high-abundance genes.

By filtering, we avoid these problems and get more accurate inferences for the remaining genes - keeping in mind, of course, that more accurate results do not necessarily mean more DE genes. There is also the benefit of reducing the severity of the FDR correction, though this is not a major consideration for RNA-seq analyses.

Do the p-values and FDR values have a bearing on the sample size?

I think you have your sentence back to front.