Question

how to understand cpm=1 and how filterByExpr works?

0

Entering edit mode

linouhao ▴ 20

@linouhao-15901

Last seen 12 months ago

United States

how to understand cpm=1 and how filterByExpr works?
I have read the doc, but can not understand it easily

edger • 7.6k views

ADD COMMENT • link updated 23 months ago by gowthamee ▴ 10 • written 3.6 years ago by linouhao ▴ 20

score 2 · Answer 1 · 2020-09-04

2

Entering edit mode

Kevin Blighe ★ 3.9k

@kevin

Last seen 4 days ago

Republic of Ireland

Hi again,

The documentation for filterByExpr() is clear. Please take a look: https://rdrr.io/bioc/edgeR/man/filterByExpr.html

If you have any doubt or misunderstanding, then please elaborate on that specifically (not generally).

Kevin

ADD COMMENT • link 3.6 years ago Kevin Blighe ★ 3.9k

0

Entering edit mode

The filterByExpr function keeps rows that have worthwhile counts in a minumum number of samples (two samples in this case because the smallest group size is two). The function accesses the group factor contained in y in order to compute the minimum group size, but the filtering is performed independently of which sample belongs to which group so that no bias is introduced. It is recommended to recalculate the library sizes of the DGEList object after the filtering, although the downstream analysis is robust to whether this is done or not.

the sentence in the edger user guide is hard to understand, especially the bold part, did not tell why we need to do so.

Users should also filter with count-per-million (CPM) rather than filtering on the counts directly, as the latter does not account for differences in library sizes between samples.

and cpm has relationship with library size, if we sequence 100M reads, and cpm=1 for egfr gene in sample A, what does cpm=1 stand for in reality but the code in the user guide does not calculate cpm, I did not see this part.

y <- DGEList(counts=x) A grouping factor can be added at the same time: group <- c(1,1,2,2) y <- DGEList(counts=x, group=group) keep <- filterByExpr(y) y <- y[keep, , keep.lib.sizes=FALSE]

and you can see

ADD REPLY • link 3.6 years ago linouhao ▴ 20

2

Entering edit mode

the sentence in the edger user guide is hard to understand, especially the bold part, did not tell why we need to do so.

After the filtering, the library sizes would slightly change as some genes are filtered out. It is recommended to recalculate the library sizes to get more precise values, although it won't make any noticeable difference in general.

if we sequence 100M reads, and cpm=1 for egfr gene in sample A, what does cpm=1 stand for in reality

In reality, a cpm of 1 stands for 1 count in every 1m reads. In your case, it would be equivalent to a read count of 100.

but the code in the user guide does not calculate cpm, I did not see this part.

It has been taken care of in filterByExpr. The value of the min.count argument in filterByExpre (default to 10) is first converted into cpm and then used as a threshold.

ADD REPLY • link 3.6 years ago Yunshun Chen ▴ 840

1

Entering edit mode

as the most view question, I guess most people has the same question with me

ADD REPLY • link 3.6 years ago linouhao ▴ 20

score 1 · Answer 2 · 2022-05-18

I found the following two articles very helpful in explaining the functions "cpm" and "filterByExpr" Article#1: From reads to genes to pathways: differential expression analysis of RNA-Seq experiments using Rsubread and the edgeR quasi-likelihood pipeline [version 2; peer review: 5 approved] https://f1000researchdata.s3.amazonaws.com/manuscripts/9996/37711c4a-d061-4af3-8ca0-5e8714e43c8e_8987_-_gordon_smyth_v2.pdf?doi=10.12688/f1000research.8987.2&numberOfBrowsableCollections=50&numberOfBrowsableInstitutionalCollections=4&numberOfBrowsableGateways=40

Article#2: RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved] https://f1000researchdata.s3.amazonaws.com/manuscripts/18347/846f3e9f-1806-4c61-8fcf-254e56340236_9005_-__matt_ritchie_v3.pdf?doi=10.12688/f1000research.9005.3&numberOfBrowsableCollections=50&numberOfBrowsableInstitutionalCollections=4&numberOfBrowsableGateways=40

P.S: I strongly agree - the documentation that accompanies functions are very often vague and hard to understand.