Question

Filtering lowly expressed genes in an RNASeq dataset with a continuous variable instead of grouping variables

anniina.mattila • 0

@anniinamattila-22282

Hi,

I have RNASeq data from 54 samples in total, half female and half male (n=27 of each sex). Every sample has a phenotypic value for a continuous, normally distributed trait, and I want to test the association between that trait and gene expression across the whole transcriptome.

I have tried to figure out the best way to filter out lowly expressed genes, but the most common filtering methods rely on per-group sample sizes in data grouped by treatment or similar, and I am not sure how best to apply them to my continuous data.

For example, as far as I understand, the standard method advised in the edgeR manual is to choose a CPM cutoff that corresponds to about 10-15 reads, and to require it in a number of libraries equal to the minimum per-group sample size. In a data set with library sizes of about 20M and a minimum of 2 biological replicates per group, you would use:

keep <- rowSums( cpm(y) > 0.5 ) >=2
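For reference, the 0.5 in this filter is just the read-count target converted to CPM. A minimal sketch of that arithmetic in base R, using the 10-read target and 20M library size from the example (the variable names are illustrative and are not edgeR arguments):

```r
# Values taken from the example in the text, not from real data
min_count <- 10      # target number of reads per library
lib_size  <- 20e6    # typical library size (20 million reads)

# CPM = count / library size * 1e6
cpm_cutoff <- min_count / lib_size * 1e6
cpm_cutoff
```

With different library sizes the same read target maps to a different CPM value, so the cutoff should be recomputed for each data set.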

In the case of my data, this could translate into:

keep <- rowSums( cpm(y) > 0.5 ) >=27

This seems like a very strict filter. I could require fewer libraries to meet the CPM cutoff, but what should I base that number on?

Another option I have considered is to filter on the distribution of average logCPM across all genes in the data set. This distribution is typically bimodal, with one mode representing expressed genes and the other representing unexpressed genes, and you choose a threshold between the two modes. This method is blind to the experimental design (it does not rely on how or whether the data are grouped), but it does not seem to be widely used.
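A rough sketch of the average-logCPM approach, assuming edgeR is installed. The counts are simulated purely for illustration, and the threshold of 7 is a hypothetical value placed between the two modes of this toy data; with real data it would have to be read off the density plot.

```r
library(edgeR)

set.seed(1)
# Toy data: 1000 genes x 54 samples (simulated, not real RNASeq counts)
n_genes <- 1000
n_samples <- 54
mu <- c(rep(0.5, 400), rep(200, 600))  # 400 near-silent genes, 600 expressed genes
counts <- matrix(rnbinom(n_genes * n_samples, mu = mu, size = 10),
                 nrow = n_genes)
y <- DGEList(counts = counts)

# Average log2-CPM per gene across all samples; no grouping information is used
a <- aveLogCPM(y)

# Inspect the distribution and look for the dip between the two modes
plot(density(a), main = "Average logCPM")

threshold <- 7            # hypothetical cutoff, chosen by eye for this toy data
keep <- a > threshold
y_filtered <- y[keep, , keep.lib.sizes = FALSE]
```

Because the filter only looks at the per-gene average, it is independent of the trait being tested.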

I would be grateful for some educated advice on how to do the filtering!

limma edger • 1.4k views

updated 5.1 years ago by James W. MacDonald 67k • written 5.1 years ago by anniina.mattila • 0

score 3 · Answer 1 · 2019-11-05

The standard method for edgeR is (these days) to use filterByExpr. If you have groups, it keeps genes that have about 10 counts or more (converted to a CPM cutoff) in at least N samples, where N is the smallest group size. This obviously doesn't work in the context of a continuous variable, in which case it sets N to the minimum inverse leverage of any fitted value, which is essentially the inverse of the maximum diagonal value of the 'hat' matrix.
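To see what the leverage-based N looks like for a design like the one in the question, here is a minimal sketch. The sex and trait vectors and the counts are simulated stand-ins, and the `~ sex + trait` model is an assumption about how the design would be set up.

```r
library(edgeR)

set.seed(1)
# Simulated stand-ins for the real samples: 27 per sex, one continuous trait
sex    <- factor(rep(c("F", "M"), each = 27))
trait  <- rnorm(54)
design <- model.matrix(~ sex + trait)

# Leverages are the diagonal of the hat matrix of the design;
# intercept = FALSE because the design already has an intercept column
h <- hat(design, intercept = FALSE)
n_leverage <- 1 / max(h)   # the inverse of the largest leverage
n_leverage

# Passing the design (and no group) lets filterByExpr choose N itself
counts <- matrix(rnbinom(1000 * 54, mu = 20, size = 10), nrow = 1000)
y <- DGEList(counts = counts)
keep <- filterByExpr(y, design = design)
y_filtered <- y[keep, , keep.lib.sizes = FALSE]
```

As far as I know, recent edgeR versions also scale N down when it exceeds the `large.n` argument, so check `?filterByExpr` for the version you have installed.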

In my experience this can set the estimated N to something pretty small, in which case you might consider filtering on the average of the logCPM values, as you mention, which is also a valid thing to do. It's probably less common simply because a continuous outcome is pretty rare as well.