Question

filterByExpr() in edgeR

Simon.garinet ▴ 10

@edc5aa0a

France

Dear R users, I want to perform a GSEA on RNA-seq data, and I use the filterByExpr() function on a DGEList in edgeR. I am not sure about the arguments I use: I would like to keep all genes with an expression of at least 10 cpm in 10% of samples. Is it filterByExpr(DGElistobject, min.count = 10, min.prop = 0.1)?

It keeps almost the same number of genes as the default filter, which I think is min.count = 10, min.prop = 0.7.

Best

Simon

edgeR • 2.7k views

ADD COMMENT • link updated 13 months ago by Yunshun Chen ▴ 900 • written 3.8 years ago by Simon.garinet ▴ 10

score 2 · Answer 1 · 2022-02-03

min.count is a threshold for actual counts rather than cpm. If you want to filter genes based on their cpm values, you may need to take into account the library sizes and get an equivalent cut-off for min.count. E.g., if the average library size of all the samples is about 20 million, then 10 cpm would be equivalent to min.count = 200.
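
To make the arithmetic above concrete, here is a minimal base-R sketch of the conversion. The library sizes are made-up values for illustration; with a real DGEList they would normally come from the lib.size column of its samples table. The sketch uses the median library size because, as far as I read the filterByExpr() help page, that is what the function uses when it turns min.count into a CPM cut-off.

```r
# Hypothetical library sizes (total reads per sample), for illustration only.
# With a DGEList `y`, these would typically be y$samples$lib.size.
lib_size <- c(18e6, 20e6, 22e6, 19e6, 21e6)

# The CPM threshold we would like to apply.
target_cpm <- 10

# Count that corresponds to the target CPM at the median library size.
min_count <- target_cpm * median(lib_size) / 1e6
print(min_count)  # 200
```

The resulting value could then be passed as the min.count argument, assuming the library sizes do not differ too much between samples.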

score 0 · Answer 2 · 2022-02-03

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

The arguments are already set to optimal values for differential expression analyses so you don't need to change them at all.

The only thing you need to do is to make sure that you specify the design of your experiment, either by setting the group variable of your DGEList or specifying the design or group arguments to filterByExpr().
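
As a rough illustration of why the group matters, here is a base-R sketch of the filtering rule as it is described on the filterByExpr() help page, for the case where group is supplied. The helper name filter_sketch and the toy counts are invented for this example; it is an approximation for reading purposes, not edgeR's own code. It also suggests why changing min.prop may make little difference in small experiments: in this sketch min.prop only enters once the smallest group has more than large.n samples.

```r
# Hypothetical helper: approximates the documented rule when `group` is given.
filter_sketch <- function(counts, group, min.count = 10, min.total.count = 15,
                          large.n = 10, min.prop = 0.7) {
  lib.size <- colSums(counts)
  # Smallest group size sets the number of samples a gene must be expressed in.
  n_min <- min(table(droplevels(as.factor(group))))
  if (n_min > large.n) {
    # min.prop only relaxes the requirement for groups larger than large.n.
    n_min <- large.n + (n_min - large.n) * min.prop
  }
  # min.count is converted to a CPM cut-off at the median library size.
  cpm_cutoff <- min.count / median(lib.size) * 1e6
  cpm <- t(t(counts) / lib.size) * 1e6
  rowSums(cpm >= cpm_cutoff) >= n_min & rowSums(counts) >= min.total.count
}

# Toy counts (3 genes x 6 samples), purely illustrative.
counts <- rbind(
  geneA = c(50, 60, 55, 40, 45, 52),  # expressed in every sample
  geneB = c(0, 0, 0, 30, 25, 0),      # expressed in only two samples
  geneC = c(1, 0, 2, 0, 1, 0)         # essentially unexpressed
)
group <- c("KO", "KO", "KO", "WT", "WT", "WT")

keep <- filter_sketch(counts, group)
keep_low_prop <- filter_sketch(counts, group, min.prop = 0.1)
print(keep)
print(identical(keep, keep_low_prop))
```

With groups of three samples each, the two calls give the same result, because the smallest group is below large.n and min.prop is never used.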

ADD COMMENT • link 3.8 years ago Gordon Smyth 53k

I used filterByExpr with the design parameter. I have 7 donors with paired samples (Condition = {KO, WT}), except for Donor 1, which has only a KO sample.

design <- model.matrix(~0 + Donor + Condition, data=cluster_metadata)
keep.exprs <- filterByExpr(x, design)

> design
            DonorD_01 DonorD_02 DonorD_03 DonorD_05 DonorD_06 DonorD_07 DonorD_08 ConditionT
01-KO         1         0         0         0         0         0         0          1
02-WT         0         1         0         0         0         0         0          0
02-KO         0         1         0         0         0         0         0          1
03-WT         0         0         1         0         0         0         0          0
03-KO         0         0         1         0         0         0         0          1
05-WT         0         0         0         1         0         0         0          0
05-KO         0         0         0         1         0         0         0          1
06-WT         0         0         0         0         1         0         0          0
06-KO         0         0         0         0         1         0         0          1
07-WT         0         0         0         0         0         1         0          0
07-KO         0         0         0         0         0         1         0          1
08-WT         0         0         0         0         0         0         1          0
08-KO         0         0         0         0         0         0         1          1

Count matrix:

       01-KO 02-WT 02-KO 03-WT 03-KO 05-WT 05-KO 06-WT 06-KO 07-WT 07-KO 08-WT 08-KO
Gene_X     0   345     0     0     0     0     0     0     0     0     0     0     0

This gene is odd because it's only expressed in one sample, but it's still kept by filterByExpr(). What is the minimum number of samples by default?

ADD REPLY • link 13 months ago picasa1983 • 0

You probably need to print out your design here for us to see where the problem might be.

ADD REPLY • link 13 months ago Yunshun Chen ▴ 900

Thanks for your reply. I have edited my initial post.

ADD REPLY • link 13 months ago picasa1983 • 0

Since you only have 1 sample in DonorD_01, your Gene_X will not be filtered if it exceeds the CPM cut-off in at least 1 sample. In a typical RNA-seq dataset, a read count of 345 is large enough to pass the CPM threshold.
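
For anyone wanting to check this on their own design, the following base-R sketch rebuilds a design like the one posted above and computes the leverage of each sample. My understanding from the filterByExpr() help page is that, when a design is supplied, the minimum number of samples is taken as the inverse of the largest leverage; treat that as an assumption of this sketch rather than a statement of edgeR's exact code. The factor levels are reconstructed from the printed sample labels, so the column names differ slightly from the posted matrix.

```r
# Rebuild the layout from the thread: donor D_01 has a single KO sample,
# the other six donors each have a WT and a KO sample.
Donor <- factor(c("D_01",
                  rep(c("D_02", "D_03", "D_05", "D_06", "D_07", "D_08"),
                      each = 2)))
Condition <- factor(c("KO", rep(c("WT", "KO"), times = 6)),
                    levels = c("WT", "KO"))
design <- model.matrix(~0 + Donor + Condition)

# Leverage (hat value) of each sample under this design.
h <- hat(design, intercept = FALSE)

# Assumed rule: minimum sample size is the inverse of the largest leverage.
min_sample_size <- 1 / max(h)
print(round(h, 3))
print(min_sample_size)
```

The lone sample from D_01 has a leverage of 1, so the minimum sample size comes out as 1, which is consistent with a gene being kept when it is expressed in a single sample.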

ADD REPLY • link 13 months ago Yunshun Chen ▴ 900