Question

total count filter cutoff (edgeR)

Gordon Smyth 53k

@gordon-smyth

WEHI, Melbourne, Australia

Hi Mahnaz,

Why don't you follow the advice of the edgeR User's Guide (as Mark has suggested)? All the case studies in the User's Guide describe how the filtering was done in a principled way.

Total count filtering is not so bad, but it is susceptible to being driven by one library, especially by one library with a large sequence depth. The procedure described by Mark and used in the guide is a compromise of several considerations.

BTW, there are newer versions of R and edgeR available than what you are using.

Best wishes
Gordon

> Date: Wed, 30 Apr 2014 21:34:50 +0200
> From: Mark Robinson <mark.robinson at imls.uzh.ch>
> To: "Ryan C. Thompson" <rct at thompsonclan.org>
> Cc: bioconductor at r-project.org, Mahnaz Kiani <mahnazkiani at gmail.com>
> Subject: Re: [BioC] total count filter cutoff
>
> In my lab, we typically follow a "CPM of at least X in at least Y samples" rule, where X=1 (arbitrary but reasonable, can be changed) and Y=size of smallest replicate group, according to one of the case studies in the user's guide, for example:
>
> ------
> 4.3.6 Filtering
> We filter out very lowly expressed tags, keeping genes that are expressed at a reasonable level in at least one treatment condition. Since the smallest group size is three, we keep genes that achieve at least one count per million (cpm) in at least three samples:
>
> > keep <- rowSums(cpm(y)>1) >= 3
> > y <- y[keep,]
> ------
>
> (http://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf)
>
> Cheers, Mark
>
> ----------
> Prof. Dr. Mark Robinson
> Statistical Bioinformatics, Institute of Molecular Life Sciences
> University of Zurich
> http://ow.ly/riRea

> Date: Wed, 30 Apr 2014 11:29:28 -0700 (PDT)
> From: "mahnaz Kiani [guest]" <guest at bioconductor.org>
> To: bioconductor at r-project.org, mahnazkiani at gmail.com
> Subject: [BioC] total count filter cutoff
>
> I'm using edgeR for analysis of my data and I'm not sure what total count filter cutoff value I should use. My reads are paired 50 bp reads and the total reads per sample is about 80,000,000. I tried cutoff values of 5, 10, 15, 30, 50 and 100 and I only saw differences between 50 and 100, but I am still looking for a logical reason to choose the cutoff value.
>
> Appreciate your help,
> Mahnaz
>
> -- output of sessionInfo():
>
> R 3.0.2
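To illustrate why a total count filter can be driven by a single deep library while the "CPM of at least X in at least Y samples" rule is not, here is a minimal base-R sketch. The counts, gene names and library sizes are invented for illustration, and CPM is computed by hand as count / library size * 1e6 rather than with edgeR's cpm(), so treat it as a rough approximation of the idea and not as the User's Guide procedure itself.

```r
# Hypothetical toy data: 3 genes x 6 libraries (two groups of three).
# lib6 is assumed to be sequenced about ten times deeper than the others.
counts <- matrix(
  c( 0,  0,  0, 0, 0, 60,   # geneA: reads only in the deep library
    12, 10, 11, 0, 0,  0,   # geneB: expressed in group 1 (lib1 to lib3)
     0,  0,  0, 0, 0,  3),  # geneC: essentially unexpressed
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("lib", 1:6))
)
lib.size <- c(1e6, 1e6, 1e6, 1e6, 1e6, 1e7)  # assumed library sizes

# Counts per million, computed by hand: each column scaled by its library size.
cpm_manual <- t(t(counts) * 1e6 / lib.size)

# Total count filter with a cutoff of 50 (one of the values tried in the question).
keep_total <- rowSums(counts) >= 50

# CPM rule from the quoted case study: cpm > 1 in at least 3 samples.
keep_cpm <- rowSums(cpm_manual > 1) >= 3

print(data.frame(total = rowSums(counts), keep_total, keep_cpm))
```

In this toy example the total count filter keeps geneA, whose reads all come from the one deep library, and drops geneB, which is consistently expressed in one group. The CPM rule does the opposite.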

edgeR • 3.9k views

score 0 · Answer 1 · 2014-05-03

Mahnaz,

Just one more comment: with the large sequence depth that you have, you can afford to go down to a low cpm cutoff in order to include very lowly expressed genes and transcripts in your analysis. You could try cpm>0.2 or cpm>0.1.

Best
Gordon

On Fri, 2 May 2014, Gordon K Smyth wrote:

> Hi Mahnaz,
>
> Why don't you follow the advice of the edgeR User's Guide (as Mark has suggested)? All the case studies in the User's Guide describe how the filtering was done in a principled way.
>
> Total count filtering is not so bad, but it is susceptible to being driven by one library, especially by one library with a large sequence depth. The procedure described by Mark and used in the guide is a compromise of several considerations.
>
> BTW, there are newer versions of R and edgeR available than what you are using.
>
> Best wishes
> Gordon
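As a rough check on what these cutoffs mean at this depth, a CPM threshold can be converted back to a read count. The sketch below assumes a library size of 80 million (the approximate total reads per sample stated in the question); the effective library size in edgeR is the number of reads actually assigned to genes, which may be smaller, so the numbers are only indicative.

```r
# Assumed library size, taken from the "about 80,000,000" reads in the question.
lib.size <- 80e6

# CPM cutoffs discussed in the thread.
cpm_cutoff <- c(1, 0.2, 0.1)

# A gene reaches a given CPM when its count is at least cpm * lib.size / 1e6.
min_reads <- cpm_cutoff * lib.size / 1e6

print(data.frame(cpm_cutoff, min_reads))
```

Under that assumption, cpm>1 corresponds to roughly 80 reads per library, cpm>0.2 to roughly 16, and cpm>0.1 to roughly 8, which is why a lower cutoff is still affordable at this depth.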