Search
Question: Filtering out tags with low counts in edgeR results in fewer differentially expressed genes.....
0
5.4 years ago by
Mark Christie30 wrote:
Dear edgeR users, Contrary to expectation, when I filter out tags with with low counts in edgeR I get fewer genes that are differentially expressed. More importantly, the MDS plot shows no differences between groups after tags have been filtered, versus substantial differences between groups when no filtering was conducted. RNA-Seq experiment details: 24 biological replicates averaging 10 million mapped reads per replicate. 12 samples belong to group one and 12 samples belong to group two. Individuals from both groups were reared in common environments, but had different genetic backgrounds such that the differences between groups may be subtle (or at least more subtle than typical gene expression studies with a strict control and treatment). When filtering, I used the recommended approach: keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group size y <- y[keep,] dim(y) The above filtering reduces the number of tags from 90,000 to ~ 50,000. After filtering a total of 29 genes are differentially expressed (adjusted p-value <= 0.05), versus 97 when the count data are not filtered. Could this result simply be because there are actual biological differences between the two groups that result in many individuals of one group having cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On perhaps a related note, should I alter my code because I have a fairly large number of biological replicates. Below is the entire code I am running. Many thanks for your advice! dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t", na.strings="?",dec=".", strip.white=TRUE,row.names=1) group=factor(c("group1","group1","group1","group1","group1","group1"," group1","group1","group1","group1","group1","group1","group2","group2" ,"group2","group2","group2","group2","group2","group2","group2","group 2","group2","group2")) y <- DGEList(counts=dat,group=group) colnames(y) dim(y) #total number of unique tags y$samples #number of tags ("genes") per sample #filter (optional?) keep <- rowSums(cpm(y)>1) >= 12 y <- y[keep,] dim(y) #recompute library sizes y$samples$lib.size <- colSums(y$counts) y$samples #normalize y <- calcNormFactors(y) y$samples #plot MDS of treatments plotMDS(y) #estimate dispersion y <- estimateCommonDisp(y, verbose=TRUE) y <- estimateTagwiseDisp(y) #uses empirical Bayes method, if prior null then df=20 #y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if this is the correct implementaion, put more weight on tagwise vs. common y plotBCV(y) #test for DE et <- exactTest(y) top <- topTags(et) top cpm(y)[rownames(top), ] #total number of genes at 5% FDR summary(de <- decideTestsDGE(et)) #-1 equals upregulate, 1 equals down regulated: need to be carefuul though as order (control,vs treatment matters) top <- topTags(et,n=40) #take top n genes write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",appen d=FALSE) #move column headers one column to the right (inapproporiate format due to as.matrix) detags <- rownames(y)[as.logical(de)] plotSmear(et, de.tags=detags) abline(h=c(-1, 1), col="blue") -- Mark R. Christie, PhD Department of Zoology, Oregon State University http://www.science.oregonstate.edu/~christim/ [[alternative HTML version deleted]]
modified 5.4 years ago by Gordon Smyth33k • written 5.4 years ago by Mark Christie30
0
5.4 years ago by
Gordon Smyth33k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth33k wrote:
Dear Mark, Getting few DE genes is to be expected. When you filter in this way, you are saying that you don't want to see genes for which differential expression is driven by only a subset of individuals in the up-group. If you are happy to see genes that are DE but are only expressed in (say) half of the samples in the up-group, then use rowSums(cpm(y)>1) >= 6. Best wishes Gordon > Date: Mon, 4 Feb 2013 18:27:33 -0500 > From: Mark Christie <christim at="" science.oregonstate.edu=""> > To: bioconductor at r-project.org > Subject: [BioC] Filtering out tags with low counts in edgeR results in > fewer differentially expressed genes..... > > Dear edgeR users, > > Contrary to expectation, when I filter out tags with with low counts in > edgeR I get fewer genes that are differentially expressed. More > importantly, the MDS plot shows no differences between groups after tags > have been filtered, versus substantial differences between groups when no > filtering was conducted. RNA-Seq experiment details: 24 biological > replicates averaging 10 million mapped reads per replicate. 12 samples > belong to group one and 12 samples belong to group two. Individuals from > both groups were reared in common environments, but had different genetic > backgrounds such that the differences between groups may be subtle (or at > least more subtle than typical gene expression studies with a strict > control and treatment). > > When filtering, I used the recommended approach: > > keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group size > y <- y[keep,] > dim(y) > > The above filtering reduces the number of tags from 90,000 to ~ 50,000. > After filtering a total of 29 genes are differentially expressed (adjusted > p-value <= 0.05), versus 97 when the count data are not filtered. Could > this result simply be because there are actual biological differences > between the two groups that result in many individuals of one group having > cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On > perhaps a related note, should I alter my code because I have a fairly > large number of biological replicates. Below is the entire code I am > running. Many thanks for your advice! > > > > > dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t", > na.strings="?",dec=".", strip.white=TRUE,row.names=1) > group=factor(c("group1","group1","group1","group1","group1","group1" ,"group1","group1","group1","group1","group1","group1","group2","group 2","group2","group2","group2","group2","group2","group2","group2","gro up2","group2","group2")) > > y <- DGEList(counts=dat,group=group) > colnames(y) > dim(y) #total number of unique tags > y$samples #number of tags ("genes") per sample > > > #filter (optional?) > keep <- rowSums(cpm(y)>1) >= 12 > y <- y[keep,] > dim(y) > > #recompute library sizes > y$samples$lib.size <- colSums(y$counts) > y$samples > > #normalize > y <- calcNormFactors(y) > y$samples > > #plot MDS of treatments > plotMDS(y) > > > #estimate dispersion > y <- estimateCommonDisp(y, verbose=TRUE) > y <- estimateTagwiseDisp(y) #uses empirical Bayes > method, if prior null then df=20 > #y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if this is > the correct implementaion, put more weight on tagwise vs. common > y > plotBCV(y) > > #test for DE > et <- exactTest(y) > top <- topTags(et) > top > cpm(y)[rownames(top), ] > > #total number of genes at 5% FDR > summary(de <- decideTestsDGE(et)) #-1 equals upregulate, > 1 equals down regulated: need to be carefuul though as order (control,vs > treatment matters) > top <- topTags(et,n=40) #take top n genes > write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",app end=FALSE) > #move column headers one column to the right (inapproporiate format due to > as.matrix) > > detags <- rownames(y)[as.logical(de)] > plotSmear(et, de.tags=detags) > abline(h=c(-1, 1), col="blue") > > > > > > > > -- > Mark R. Christie, PhD > Department of Zoology, > Oregon State University > http://www.science.oregonstate.edu/~christim/ > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
Thanks Gordon! My impression was that filtering in this way was purposefully without respect to grouping (i.e., it could be any 6 or 12 individuals) in order to avoid increasing the type I error rate. I also thought that this step would increase the number of detected DE genes for a given alpha because it drastically cuts down on the number of comparisons (which affects the false discovery rate). My guess is that, as you pointed out below, this step does eliminate counts that are only expressed in subsets of individuals (which may be a problem in my particular case and also why the number of detected DE genes decreases). More generally though, what numbers to choose seems to be a subjective decision. Is this a required step? I am averaging 10 million mapped reads per individual and have 24 individuals. Basically I would like to have as much power as possible to detect DE genes, but I obviously don't want those differences driven by a single individual. Is there some rule of thumb for how many reads each individual in a group should have? Because the MDS plots change depending on what values I use, I am assuming that any downstream analyses would also differ. Maybe I should try comparing gene lists with several different filtering choices? Thanks for any tips! Mark On Tue, Feb 5, 2013 at 8:07 PM, Gordon K Smyth <smyth@wehi.edu.au> wrote: > Dear Mark, > > Getting few DE genes is to be expected. When you filter in this way, you > are saying that you don't want to see genes for which differential > expression is driven by only a subset of individuals in the up- group. If > you are happy to see genes that are DE but are only expressed in (say) half > of the samples in the up-group, then use > > rowSums(cpm(y)>1) >= 6. > > Best wishes > Gordon > > Date: Mon, 4 Feb 2013 18:27:33 -0500 >> From: Mark Christie <christim@science.oregonstate.**edu<christim@science.oregonstate.edu> >> > >> To: bioconductor@r-project.org >> Subject: [BioC] Filtering out tags with low counts in edgeR results in >> fewer differentially expressed genes..... >> >> Dear edgeR users, >> >> Contrary to expectation, when I filter out tags with with low counts in >> edgeR I get fewer genes that are differentially expressed. More >> importantly, the MDS plot shows no differences between groups after tags >> have been filtered, versus substantial differences between groups when no >> filtering was conducted. RNA-Seq experiment details: 24 biological >> replicates averaging 10 million mapped reads per replicate. 12 samples >> belong to group one and 12 samples belong to group two. Individuals from >> both groups were reared in common environments, but had different genetic >> backgrounds such that the differences between groups may be subtle (or at >> least more subtle than typical gene expression studies with a strict >> control and treatment). >> >> When filtering, I used the recommended approach: >> >> keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group >> size >> y <- y[keep,] >> dim(y) >> >> The above filtering reduces the number of tags from 90,000 to ~ 50,000. >> After filtering a total of 29 genes are differentially expressed (adjusted >> p-value <= 0.05), versus 97 when the count data are not filtered. Could >> this result simply be because there are actual biological differences >> between the two groups that result in many individuals of one group having >> cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On >> perhaps a related note, should I alter my code because I have a fairly >> large number of biological replicates. Below is the entire code I am >> running. Many thanks for your advice! >> >> >> >> >> dat<- read.table("GeneCounts_males.**txt", header=TRUE, sep="\t", >> na.strings="?",dec=".", strip.white=TRUE,row.names=1) >> group=factor(c("group1","**group1","group1","group1","** >> group1","group1","group1","**group1","group1","group1","** >> group1","group1","group2","**group2","group2","group2","** >> group2","group2","group2","**group2","group2","group2","** >> group2","group2")) >> >> y <- DGEList(counts=dat,group=**group) >> colnames(y) >> dim(y) #total number of unique tags >> y$samples #number of tags ("genes") per sample >> >> >> #filter (optional?) >> keep <- rowSums(cpm(y)>1) >= 12 >> y <- y[keep,] >> dim(y) >> >> #recompute library sizes >> y$samples$lib.size <- colSums(y$counts) >> y$samples >> >> #normalize >> y <- calcNormFactors(y) >> y$samples >> >> #plot MDS of treatments >> plotMDS(y) >> >> >> #estimate dispersion >> y <- estimateCommonDisp(y, verbose=TRUE) >> y <- estimateTagwiseDisp(y) #uses empirical Bayes >> method, if prior null then df=20 >> #y <- estimateTagwiseDisp(y,prior.**df=5) #I am not sure if >> this is >> the correct implementaion, put more weight on tagwise vs. common >> y >> plotBCV(y) >> >> #test for DE >> et <- exactTest(y) >> top <- topTags(et) >> top >> cpm(y)[rownames(top), ] >> >> #total number of genes at 5% FDR >> summary(de <- decideTestsDGE(et)) #-1 equals upregulate, >> 1 equals down regulated: need to be carefuul though as order (control,vs >> treatment matters) >> top <- topTags(et,n=40) #take top n genes >> write.table(top,file="**DEgenesMales2.txt",col.names=** >> TRUE,sep="\t",append=FALSE) >> #move column headers one column to the right (inapproporiate format due to >> as.matrix) >> >> detags <- rownames(y)[as.logical(de)] >> plotSmear(et, de.tags=detags) >> abline(h=c(-1, 1), col="blue") >> >> >> >> >> >> >> >> -- >> Mark R. Christie, PhD >> Department of Zoology, >> Oregon State University >> http://www.science.**oregonstate.edu/~christim/<http: www.science.="" oregonstate.edu="" ~christim=""/> >> >> > ______________________________**______________________________**____ ______ > The information in this email is confidential and inte...{{dropped:18}}