Question: Dependence of FDR and P-values on sample size
3 months ago, ilovesuperheroes1993 wrote:

Hi, I am doing a differential expression analysis of small RNA using edgeR. I have 4 normal and 4 diseased samples, all samples are paired. Now, I have very little knowledge of statistics so I would appreciate clarification on the following:

When I do the analysis without any pre-filtering of low-abundance genes, I get about 10-12 genes with FDR < 0.05. When I impose a filtering criterion, for example that the cpm value must be greater than 1 in at least 4 of the samples, I get only 1 gene with FDR < 0.05.

However, the top genes are more or less the same in both cases. In case two, my results look better, when viewing the cpm values of the samples side by side, as most of the low-abundance genes have been filtered out.

So, my question is, on reducing the data set, is it so that the FDR values also increase? Do the p-values and FDR values have a bearing on the sample size?

Thank you

Answer: Dependence of FDR and P-values on sample size
3 months ago, Aaron Lun (Cambridge, United Kingdom) wrote:

> In case two, my results look better, when viewing the cpm values of the samples side by side,

I don't understand what you mean here.

> So, my question is, on reducing the data set, is it so that the FDR values also increase?

In theory, no. Upon filtering, the total number of tests should decrease, which should reduce the severity of the FDR correction among the remaining genes. So if your top genes survive filtering (as you describe), they should have lower FDR values in the filtered analysis - in theory.
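To illustrate the "in theory" part, here is a minimal sketch in plain Python, assuming the Benjamini-Hochberg procedure is used for the FDR correction (it is the default adjustment in edgeR's topTags). The p-values are hypothetical and are held fixed between the two runs, which is exactly the simplification that does not hold in practice.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from largest to smallest.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = m - offset  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values: three "top" genes plus seven unremarkable ones.
top = [0.001, 0.004, 0.010]
rest = [0.20, 0.35, 0.50, 0.60, 0.70, 0.85, 0.95]

unfiltered = bh_adjust(top + rest)    # 10 tests in total
filtered = bh_adjust(top + rest[:2])  # 5 tests left after a hypothetical filter

for p, before, after in zip(top, unfiltered, filtered):
    print(f"p={p:.3f}  FDR with 10 tests={before:.4f}  FDR with 5 tests={after:.4f}")
```

With the p-values unchanged, halving the number of tests halves the adjusted values for the top genes. In a real analysis the p-values themselves also change after filtering, because the normalization and dispersion estimates change.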

That this does not occur actually points towards the practical motivation for filtering in RNA-seq data. Specifically, filtering aims to get rid of low-abundance genes that interfere with accurate normalization and modelling of the mean-dispersion trend. For example, without filtering, I would expect to see lots of discreteness in the MA plot that interferes with TMM normalization. I would also expect to see sharp increases in the dispersion at low abundances (possibly due to the increased relative impact of PCR duplication on extra-Poisson variation); this is not only difficult to model in itself, it can actually reduce the accuracy of the fitted trend for high-abundance genes.

By filtering, we avoid these problems and get more accurate inferences for the remaining genes - keeping in mind, of course, that more accurate results do not necessarily mean more DE genes. There is also the benefit of reducing the severity of the FDR correction, though this is not a major consideration for RNA-seq analyses.
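For concreteness, the filter described in the question (cpm greater than 1 in at least 4 samples) can be sketched as follows. This is plain Python on made-up counts, not edgeR code; the gene names, counts and library sizes are hypothetical, and raw library sizes are used where edgeR's cpm() would use effective library sizes once normalization factors are set. In edgeR itself the same rule is usually written along the lines of `rowSums(cpm(y) > 1) >= 4`.

```python
def cpm(counts, lib_sizes):
    """Counts per million for one gene, given per-sample library sizes."""
    return [c / n * 1e6 for c, n in zip(counts, lib_sizes)]


def keep_gene(counts, lib_sizes, min_cpm=1.0, min_samples=4):
    """True if the CPM exceeds min_cpm in at least min_samples samples."""
    return sum(v > min_cpm for v in cpm(counts, lib_sizes)) >= min_samples


# Hypothetical data: 8 libraries (4 normal, 4 diseased) of 2 million reads each.
lib_sizes = [2_000_000] * 8
counts = {
    "geneA": [150, 140, 160, 155, 300, 310, 290, 305],  # well expressed
    "geneB": [0, 1, 0, 2, 0, 1, 3, 0],                  # low abundance
    "geneC": [0, 0, 0, 0, 5, 6, 4, 7],                  # expressed in one group only
}

kept = [gene for gene, c in counts.items() if keep_gene(c, lib_sizes)]
print(kept)  # ['geneA', 'geneC']
```

Requiring 4 samples rather than all 8 keeps genes such as geneC that are expressed in only one of the two groups, which are often the genes of interest.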

> Do the p-values and FDR values have a bearing on the sample size?

I think you have your sentence back to front.
