Hi,
I have RNA-seq data from 54 samples in total, half female and half male (n=27 of each sex). Every sample has a phenotypic value for a continuous, normally distributed trait, and I want to test the association of that trait with gene expression across the whole transcriptome.
I have tried to figure out the best way to filter out lowly expressed genes, but the most common filtering methods seem to rely on per-group sample sizes in data grouped by treatment etc., and I am not sure how best to apply them to my continuous data.
E.g., as far as I understand, the standard method advised in the edgeR user's guide is to choose a CPM cutoff that corresponds to about 10-15 reads, and to require that cutoff to be met in at least as many libraries as there are samples in the smallest group. For example, in a data set with library sizes of about 20M and a minimum of 2 biological replicates per group, you would use:
keep <- rowSums(cpm(y) > 0.5) >= 2
In the case of my data, this could translate into:
keep <- rowSums(cpm(y) > 0.5) >= 27
This seems like a very strict filter. I could lower the number of libraries required to meet the CPM threshold, but what should I base that decision on?
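To make the arithmetic behind the 0.5 cutoff explicit, here is a minimal base R sketch. The helpers `cpm_cutoff` and `filter_by_cpm` are hypothetical names of my own, not edgeR functions, and the counts are simulated purely for illustration. The CPM is computed by hand from the column sums, which is only a rough stand-in for edgeR's `cpm()`.

```r
# Hypothetical helper: CPM value that corresponds to a target raw count
# in a library of a given size.
cpm_cutoff <- function(min_count, lib_size) min_count / lib_size * 1e6

cutoff_10 <- cpm_cutoff(10, 20e6)  # 10 reads in a 20M library
cutoff_15 <- cpm_cutoff(15, 20e6)  # 15 reads in a 20M library

# Hypothetical helper: keep genes whose CPM exceeds `cutoff`
# in at least `min_samples` libraries.
filter_by_cpm <- function(counts, cutoff, min_samples) {
  cpm_mat <- t(t(counts) / colSums(counts)) * 1e6  # manual CPM per library
  rowSums(cpm_mat > cutoff) >= min_samples
}

# Simulated counts (made-up data): 500 genes x 54 samples.
set.seed(1)
counts <- matrix(rnbinom(500 * 54, mu = 5, size = 1), nrow = 500)

cutoff <- cpm_cutoff(10, mean(colSums(counts)))
keep_2  <- filter_by_cpm(counts, cutoff, min_samples = 2)
keep_27 <- filter_by_cpm(counts, cutoff, min_samples = 27)

# Compare how many genes survive under each sample-number requirement.
c(min_2 = sum(keep_2), min_27 = sum(keep_27))
```

Any gene that passes with 27 required libraries also passes with 2, so the comparison shows how much stricter the larger requirement is on a given data set.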
Another option I have considered is to filter on the distribution of average logCPM across all genes in the data set. You typically see a bimodal distribution, with one mode representing expressed genes and the other unexpressed genes, and you choose a threshold between the two modes. This method would be blind to the experimental design (it does not rely on how or whether the data is grouped), but it does not seem to be widely used.
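As a rough sketch of what I mean, assuming simulated counts and a hand-computed average logCPM (in edgeR one would presumably use `aveLogCPM(y)` instead), the threshold could be taken at the lowest point of the density between the two modes. Picking that trough automatically is my own assumption here; choosing the threshold by eye from a histogram would work just as well.

```r
set.seed(1)

# Simulated counts (made-up data): 1000 essentially unexpressed genes
# and 1000 clearly expressed genes across 54 samples.
n_genes   <- 2000
n_samples <- 54
mu <- c(rep(0.5, 1000), rep(200, 1000))
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 5),
                 nrow = n_genes)

# Manual CPM and a simple average logCPM with a small offset to avoid log2(0).
cpm_mat     <- t(t(counts) / colSums(counts)) * 1e6
ave_log_cpm <- log2(rowMeans(cpm_mat) + 0.5)

# Locate the local maxima of the kernel density and keep the two highest.
d     <- density(ave_log_cpm)
peaks <- which(diff(sign(diff(d$y))) == -2) + 1
top2  <- sort(peaks[order(d$y[peaks], decreasing = TRUE)[1:2]])

# Threshold = lowest density point between the two modes.
between   <- top2[1]:top2[2]
threshold <- d$x[between][which.min(d$y[between])]

keep <- ave_log_cpm > threshold
table(keep)
```

Plotting `hist(ave_log_cpm, breaks = 50)` with `abline(v = threshold)` is a quick way to check that the cut really falls between the two modes.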
I would be grateful for some educated advice on how to do the filtering!
Thank you for the reply, which has helped to convince me of the validity of the filtering method by average logCPM!