parameters of filterByExpr
1
1
Entering edit mode
@3f9f9566
Last seen 1 day ago
Germany

I have been advised to use filterByExpr after running DESeq2 in order to get rid of a relatively large number of differentially expressed genes which turned out to be expressed in only a few individuals in a few treatment levels.

We have performed RNAseq on a lot of samples of individual flies (~ 20 per condition). Without filterByExpr, I was unable to use group-aware filtering, which I need. Therefore I am now doing :

mm <- model.matrix(~0 + host_treatment_generation, data = sample_data)
mm

keep <- filterByExpr(dds, mm, min.count = 20, min.total.count = 20, large.n = 18, min.prop = 0.7)

dds_keep <- dds[keep,]

thinking this would select genes which have more than 10 counts in at least 70 % of 18 individuals (out of 20, which I would consider a large n).

However, this is not the case, as you can see from the plotted counts of one gene which was detected as being differentially expressed between treatment levels : enter image description here

there is a maximum of 10 individuals per treatment level which have counts for this gene. Am I doing something wrong ?

edgeR DESeq2 filterByExpr • 1.1k views
ADD COMMENT
1
Entering edit mode

It might be worth trying limma-voom (or limma-trend) instead of a GLM based method. In our experience these are less prone to this problem. Apologies, I can only give anecdotal experience here. I speculate that fitting the data on a log scale tends to reduce the influence of positive outliers, compared to GLMs which fit the data on a linear scale (leaving out a lot of details!).

The gene you show will not be filtered by filterByExpr since there are overall plenty of counts > 10, and you will still get spurious results, eg when comparing E and F. However, if you subset the data down to just the E and F conditions when comparing E and F then filtering would work, and I think is valid.

ADD REPLY
2
Entering edit mode
@gordon-smyth
Last seen just now
WEHI, Melbourne, Australia

You mention "group-aware filtering", and I am guessing that you are expecting filterByExpr to select genes that are expressed a certain number of times in any particular group. But filterByExpr does not do that and, indeed, any group orientated filtering strategy like that would be statistically invalid, either before or after conducting the DE analysis, because it would result in error rate inflation.

filterByExpr instead conducts independent filtering that can be done before the DE analysis without affecting the error rate control. It selects genes that are expressed in a minimum number of samples across the whole experiment, regardless of which groups those samples below to. It uses the design argument only to compute the minimum group size. The filtering conducted by filterByExpr would be exactly the same even if you permuted the group labels or permuted the rows of the design matrix.

ADD COMMENT
0
Entering edit mode

You are guessing right. It makes sense thank you. I suppose this means n (in large.n) is relative to the total number of samples we have, not the number of samples per treatment level ?

ADD REPLY
0
Entering edit mode

It is best to read the documentation for the function. The large.n value is used only to adjust the MinSampleSize estimate. If you type ?filterByExpr then the help page says:

This function implements the filtering strategy that was described informally by Chen et al (2016). Roughly speaking, the strategy keeps genes that have at least min.count reads in a worthwhile number samples. More precisely, the filtering keeps genes that have CPM >= CPM.cutoff in MinSampleSize samples, where CPM.cutoff = min.count / median(lib.size) * 1e6 and MinSampleSize is the smallest group sample size or, more generally, the minimum inverse leverage computed from the design matrix.

If all the group samples sizes are large, then the above filtering rule is relaxed slightly. If MinSampleSize > large.n, then genes are kept if CPM >= CPM.cutoff in k samples where k = large.n + (MinSampleSize - large.n) * min.prop. This rule requires that genes are expressed in at least min.prop * MinSampleSize samples, even when MinSampleSize is large.

In addition, each kept gene is required to have at least min.total.count reads across all the samples.

For your experiment, the filtering is very easy to understand. You have about 20 flies per condition but, presumably, it varies a bit from condition to condition. What is the smallest group size? Let's say the smallest group is 15 flies. Then filterByExpr will select genes that are expressed at the specified level in at least 15 samples. It is as simple as that.

ADD REPLY
0
Entering edit mode

Our smallest group is 19 flies. So in my case, MinSampleSize > large.n : k = 18+(19-18)*0.7 = 12.6 samples ~ 12 samples should have at least 20 counts to pass the filter ? And these would not be 12 samples per treatment level, but 12 samples across all treatment levels ?

Thank you and sorry for the naive questions !

ADD REPLY
0
Entering edit mode

k = 18+(19-18)*0.7 is 18.7, not 12.6. And it is not rounded down.

The operation of the function on your data is simple and unambiguous. Your minimum sample size is 19, so it keeps genes expressed in at least 19 samples.

ADD REPLY

Login before adding your answer.

Traffic: 410 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6