Question

edgeR dataset filtering using pnas_expression.txt


Dave Tang ▴ 210

@dave-tang-4661

Last seen 6.5 years ago

Australia/Perth/UWA

Hi list,

I have a question about edgeR and dataset processing/filtering prior to calling differential expression.

Case Study 12 (RNA-seq of Hormone-Treated LNCaP Cells) from the edgeR manual mentions that:

"We filter out lowly expressed tags and those which are only expressed in a small number of samples. We keep only those tags that have at least one count per million in at least three samples."

Then section 6 of the manual says:

"The edgeR methodology needs to work with the original digital expression counts, so these should not be transformed in any way by users prior to analysis. edgeR automatically takes into account the total size (total read number) of each library in all calculations of fold-changes, concentration and statistical significance."

My question is whether filtering counts as "transforming" the data. Since filtering would affect the total size of each library, and thus all downstream calculations, is it OK to use such filters? And what should one be cautious about when applying such filters (e.g. at least n tags in n samples) prior to performing the edgeR analysis?

Many thanks,

-- Dave

edgeR • 1.3k views

updated 12.9 years ago by Wolfgang Huber ★ 13k • written 12.9 years ago by Dave Tang ▴ 210

score 0 · Answer 1 · 2012-01-04

Hi Dave,

Dave Tang wrote on 01/04/2012 03:04 PM:

> Hi list,
>
> Just a question regarding edgeR and dataset processing/filtering prior
> to calling differential expression.
>
> Case Study 12 (RNA-seq of Hormone-Treated LNCaP Cells) from the edgeR
> manual mentions that:
>
> "We filter out lowly expressed tags and those which are only expressed
> in a small number of samples. We keep only those tags that have at least
> one count per million in at least three samples."
>
> Then in section 6 of the manual it mentions that:
>
> "The edgeR methodology needs to work with the original digital
> expression counts, so these should not be transformed in any way by
> users prior to analysis. edgeR automatically takes into account the
> total size (total read number) of each library in all calculations of
> fold-changes, concentration and statistical significance."
>
> My question is whether filtering counts as "transforming" the data.
> Since this would affect the total size of each library and thus
> affecting all downstream calculations, is it OK to use such filters?

Typically, filtering of the kind suggested by the edgeR manual, as cited above, has negligible impact on the size factor and dispersion estimates. Yet by doing away with many gene-by-gene tests whose null hypothesis never has a chance of being rejected anyway, it will improve your statistical power experiment-wide.

If your data were peculiar enough that the filtering did affect size factor or dispersion estimation, then you would have a problem. To address that, you would need to look more closely at data QA/QC and your overall analytical strategy.

Some more on filtering is here:

- http://www.pnas.org/content/107/21/9546.long (Bourgon et al., PNAS 2010)
- Section 5 "Independent filtering" in the vignette of a recent version of the DESeq package (e.g. version >= 1.7.3)

Best wishes
Wolfgang

> And what should one be cautious about when applying such filters e.g. at
> least n tags in n samples, prior to performing the edgeR analysis?
>
> Many thanks,

Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
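To make the filter rule and the "negligible impact on library size" point concrete, here is a minimal sketch in plain Python. It is not edgeR code: the count matrix is invented toy data, and the function names (`library_sizes`, `cpm`, `filter_tags`) are hypothetical helpers written only for this illustration. It keeps tags with at least one count per million in at least three samples, then reports what fraction of each library's reads survives the filter.

```python
# Toy count matrix: keys are tags, values are raw counts per sample.
# All numbers are invented for illustration; they are not from pnas_expression.txt.
counts = {
    "high":     [400000, 420000, 380000, 410000],
    "mid":      [600000, 580000, 620000, 590000],
    "low":      [0, 1, 0, 2],
    "sporadic": [0, 0, 30, 0],
    "modest":   [5, 8, 0, 12],
}


def library_sizes(counts):
    """Total read count per sample (column sums of the count matrix)."""
    n_samples = len(next(iter(counts.values())))
    return [sum(row[j] for row in counts.values()) for j in range(n_samples)]


def cpm(counts, lib_sizes):
    """Counts per million: raw count divided by library size, times 1e6."""
    return {
        tag: [c / size * 1e6 for c, size in zip(row, lib_sizes)]
        for tag, row in counts.items()
    }


def filter_tags(counts, min_cpm=1.0, min_samples=3):
    """Keep tags with CPM >= min_cpm in at least min_samples samples.

    The raw counts themselves are returned unchanged; only rows are dropped.
    """
    cpms = cpm(counts, library_sizes(counts))
    return {
        tag: row
        for tag, row in counts.items()
        if sum(value >= min_cpm for value in cpms[tag]) >= min_samples
    }


kept = filter_tags(counts)
before = library_sizes(counts)
after = library_sizes(kept)
# Fraction of each library's reads that remain after filtering.
retained = [a / b for a, b in zip(after, before)]

print("kept tags:", sorted(kept))
print("fraction of reads retained per sample:", retained)
```

In this toy example the dropped tags carry only a handful of reads, so the library sizes barely move, which is the situation described in the answer above. The surviving counts are untouched integers, so filtering rows is not a transformation of the counts in the manual's sense. In edgeR itself the same rule is usually expressed with `cpm()` and `rowSums()` on the count object.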