Question: Filtering out tags with low counts in edgeR results in fewer differentially expressed genes.....
5.5 years ago by
Mark Christie • 30
Mark Christie • 30 wrote:
Dear edgeR users, Contrary to expectation, when I filter out tags with with low counts in edgeR I get fewer genes that are differentially expressed. More importantly, the MDS plot shows no differences between groups after tags have been filtered, versus substantial differences between groups when no filtering was conducted. RNA-Seq experiment details: 24 biological replicates averaging 10 million mapped reads per replicate. 12 samples belong to group one and 12 samples belong to group two. Individuals from both groups were reared in common environments, but had different genetic backgrounds such that the differences between groups may be subtle (or at least more subtle than typical gene expression studies with a strict control and treatment). When filtering, I used the recommended approach: keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum group size y <- y[keep,] dim(y) The above filtering reduces the number of tags from 90,000 to ~ 50,000. After filtering a total of 29 genes are differentially expressed (adjusted p-value <= 0.05), versus 97 when the count data are not filtered. Could this result simply be because there are actual biological differences between the two groups that result in many individuals of one group having cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On perhaps a related note, should I alter my code because I have a fairly large number of biological replicates. Below is the entire code I am running. Many thanks for your advice! dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t", na.strings="?",dec=".", strip.white=TRUE,row.names=1) group=factor(c("group1","group1","group1","group1","group1","group1"," group1","group1","group1","group1","group1","group1","group2","group2" ,"group2","group2","group2","group2","group2","group2","group2","group 2","group2","group2")) y <- DGEList(counts=dat,group=group) colnames(y) dim(y) #total number of unique tags y$samples #number of tags ("genes") per sample #filter (optional?) keep <- rowSums(cpm(y)>1) >= 12 y <- y[keep,] dim(y) #recompute library sizes y$samples$lib.size <- colSums(y$counts) y$samples #normalize y <- calcNormFactors(y) y$samples #plot MDS of treatments plotMDS(y) #estimate dispersion y <- estimateCommonDisp(y, verbose=TRUE) y <- estimateTagwiseDisp(y) #uses empirical Bayes method, if prior null then df=20 #y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if this is the correct implementaion, put more weight on tagwise vs. common y plotBCV(y) #test for DE et <- exactTest(y) top <- topTags(et) top cpm(y)[rownames(top), ] #total number of genes at 5% FDR summary(de <- decideTestsDGE(et)) #-1 equals upregulate, 1 equals down regulated: need to be carefuul though as order (control,vs treatment matters) top <- topTags(et,n=40) #take top n genes write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",appen d=FALSE) #move column headers one column to the right (inapproporiate format due to as.matrix) detags <- rownames(y)[as.logical(de)] plotSmear(et, de.tags=detags) abline(h=c(-1, 1), col="blue") -- Mark R. Christie, PhD Department of Zoology, Oregon State University http://www.science.oregonstate.edu/~christim/ [[alternative HTML version deleted]]
ADD COMMENT • link •