Question

Filtering out tags with low counts in edgeR results in fewer differentially expressed genes.....

0

Entering edit mode

Mark Christie ▴ 30

@mark-christie-5745

Last seen 9.6 years ago

Dear edgeR users, Contrary to expectation, when I filter out tags with with low counts in edgeR I get fewer genes that are differentially expressed. More importantly, the MDS plot shows no differences between groups after tags have been filtered, versus substantial differences between groups when no filtering was conducted. RNA-Seq experiment details: 24 biological replicates averaging 10 million mapped reads per replicate. 12 samples belong to group one and 12 samples belong to group two. Individuals from both groups were reared in common environments, but had different genetic backgrounds such that the differences between groups may be subtle (or at least more subtle than typical gene expression studies with a strict control and treatment). When filtering, I used the recommended approach: keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group size y <- y[keep,] dim(y) The above filtering reduces the number of tags from 90,000 to ~ 50,000. After filtering a total of 29 genes are differentially expressed (adjusted p-value <= 0.05), versus 97 when the count data are not filtered. Could this result simply be because there are actual biological differences between the two groups that result in many individuals of one group having cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On perhaps a related note, should I alter my code because I have a fairly large number of biological replicates. Below is the entire code I am running. Many thanks for your advice! dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t", na.strings="?",dec=".", strip.white=TRUE,row.names=1) group=factor(c("group1","group1","group1","group1","group1","group1"," group1","group1","group1","group1","group1","group1","group2","group2" ,"group2","group2","group2","group2","group2","group2","group2","group 2","group2","group2")) y <- DGEList(counts=dat,group=group) colnames(y) dim(y) #total number of unique tags y$samples #number of tags ("genes") per sample #filter (optional?) keep <- rowSums(cpm(y)>1) >= 12 y <- y[keep,] dim(y) #recompute library sizes y$samples$lib.size <- colSums(y$counts) y$samples #normalize y <- calcNormFactors(y) y$samples #plot MDS of treatments plotMDS(y) #estimate dispersion y <- estimateCommonDisp(y, verbose=TRUE) y <- estimateTagwiseDisp(y) #uses empirical Bayes method, if prior null then df=20 #y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if this is the correct implementaion, put more weight on tagwise vs. common y plotBCV(y) #test for DE et <- exactTest(y) top <- topTags(et) top cpm(y)[rownames(top), ] #total number of genes at 5% FDR summary(de <- decideTestsDGE(et)) #-1 equals upregulate, 1 equals down regulated: need to be carefuul though as order (control,vs treatment matters) top <- topTags(et,n=40) #take top n genes write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",appen d=FALSE) #move column headers one column to the right (inapproporiate format due to as.matrix) detags <- rownames(y)[as.logical(de)] plotSmear(et, de.tags=detags) abline(h=c(-1, 1), col="blue") -- Mark R. Christie, PhD Department of Zoology, Oregon State University http://www.science.oregonstate.edu/~christim/ [[alternative HTML version deleted]]

edgeR edgeR • 7.0k views

ADD COMMENT • link updated 11.2 years ago by Gordon Smyth 50k • written 11.2 years ago by Mark Christie ▴ 30

score 0 · Answer 1 · 2013-02-06

Dear Mark, Getting few DE genes is to be expected. When you filter in this way, you are saying that you don't want to see genes for which differential expression is driven by only a subset of individuals in the up-group. If you are happy to see genes that are DE but are only expressed in (say) half of the samples in the up-group, then use rowSums(cpm(y)>1) >= 6. Best wishes Gordon > Date: Mon, 4 Feb 2013 18:27:33 -0500 > From: Mark Christie <christim at="" science.oregonstate.edu=""> > To: bioconductor at r-project.org > Subject: [BioC] Filtering out tags with low counts in edgeR results in > fewer differentially expressed genes..... > > Dear edgeR users, > > Contrary to expectation, when I filter out tags with with low counts in > edgeR I get fewer genes that are differentially expressed. More > importantly, the MDS plot shows no differences between groups after tags > have been filtered, versus substantial differences between groups when no > filtering was conducted. RNA-Seq experiment details: 24 biological > replicates averaging 10 million mapped reads per replicate. 12 samples > belong to group one and 12 samples belong to group two. Individuals from > both groups were reared in common environments, but had different genetic > backgrounds such that the differences between groups may be subtle (or at > least more subtle than typical gene expression studies with a strict > control and treatment). > > When filtering, I used the recommended approach: > > keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group size > y <- y[keep,] > dim(y) > > The above filtering reduces the number of tags from 90,000 to ~ 50,000. > After filtering a total of 29 genes are differentially expressed (adjusted > p-value <= 0.05), versus 97 when the count data are not filtered. Could > this result simply be because there are actual biological differences > between the two groups that result in many individuals of one group having > cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On > perhaps a related note, should I alter my code because I have a fairly > large number of biological replicates. Below is the entire code I am > running. Many thanks for your advice! > > > > > dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t", > na.strings="?",dec=".", strip.white=TRUE,row.names=1) > group=factor(c("group1","group1","group1","group1","group1","group1" ,"group1","group1","group1","group1","group1","group1","group2","group 2","group2","group2","group2","group2","group2","group2","group2","gro up2","group2","group2")) > > y <- DGEList(counts=dat,group=group) > colnames(y) > dim(y) #total number of unique tags > y$samples #number of tags ("genes") per sample > > > #filter (optional?) > keep <- rowSums(cpm(y)>1) >= 12 > y <- y[keep,] > dim(y) > > #recompute library sizes > y$samples$lib.size <- colSums(y$counts) > y$samples > > #normalize > y <- calcNormFactors(y) > y$samples > > #plot MDS of treatments > plotMDS(y) > > > #estimate dispersion > y <- estimateCommonDisp(y, verbose=TRUE) > y <- estimateTagwiseDisp(y) #uses empirical Bayes > method, if prior null then df=20 > #y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if this is > the correct implementaion, put more weight on tagwise vs. common > y > plotBCV(y) > > #test for DE > et <- exactTest(y) > top <- topTags(et) > top > cpm(y)[rownames(top), ] > > #total number of genes at 5% FDR > summary(de <- decideTestsDGE(et)) #-1 equals upregulate, > 1 equals down regulated: need to be carefuul though as order (control,vs > treatment matters) > top <- topTags(et,n=40) #take top n genes > write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",app end=FALSE) > #move column headers one column to the right (inapproporiate format due to > as.matrix) > > detags <- rownames(y)[as.logical(de)] > plotSmear(et, de.tags=detags) > abline(h=c(-1, 1), col="blue") > > > > > > > > -- > Mark R. Christie, PhD > Department of Zoology, > Oregon State University > http://www.science.oregonstate.edu/~christim/ > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}