Entering edit mode
Mark Christie
▴
30
@mark-christie-5745
Last seen 10.6 years ago
Dear edgeR users,
Contrary to expectation, when I filter out tags with with low counts
in
edgeR I get fewer genes that are differentially expressed. More
importantly, the MDS plot shows no differences between groups after
tags
have been filtered, versus substantial differences between groups when
no
filtering was conducted. RNA-Seq experiment details: 24 biological
replicates averaging 10 million mapped reads per replicate. 12 samples
belong to group one and 12 samples belong to group two. Individuals
from
both groups were reared in common environments, but had different
genetic
backgrounds such that the differences between groups may be subtle (or
at
least more subtle than typical gene expression studies with a strict
control and treatment).
When filtering, I used the recommended approach:
keep <- rowSums(cpm(y)>1) >= 12 # where 12 is the minimum
group size
y <- y[keep,]
dim(y)
The above filtering reduces the number of tags from 90,000 to ~
50,000.
After filtering a total of 29 genes are differentially expressed
(adjusted
p-value <= 0.05), versus 97 when the count data are not filtered.
Could
this result simply be because there are actual biological differences
between the two groups that result in many individuals of one group
having
cpm < 1. Or could this somehow be an artifact of how I ran edgeR? On
perhaps a related note, should I alter my code because I have a fairly
large number of biological replicates. Below is the entire code I am
running. Many thanks for your advice!
dat<- read.table("GeneCounts_males.txt", header=TRUE, sep="\t",
na.strings="?",dec=".", strip.white=TRUE,row.names=1)
group=factor(c("group1","group1","group1","group1","group1","group1","
group1","group1","group1","group1","group1","group1","group2","group2"
,"group2","group2","group2","group2","group2","group2","group2","group
2","group2","group2"))
y <- DGEList(counts=dat,group=group)
colnames(y)
dim(y) #total number of unique tags
y$samples #number of tags ("genes") per sample
#filter (optional?)
keep <- rowSums(cpm(y)>1) >= 12
y <- y[keep,]
dim(y)
#recompute library sizes
y$samples$lib.size <- colSums(y$counts)
y$samples
#normalize
y <- calcNormFactors(y)
y$samples
#plot MDS of treatments
plotMDS(y)
#estimate dispersion
y <- estimateCommonDisp(y, verbose=TRUE)
y <- estimateTagwiseDisp(y) #uses empirical Bayes
method, if prior null then df=20
#y <- estimateTagwiseDisp(y,prior.df=5) #I am not sure if
this is
the correct implementaion, put more weight on tagwise vs. common
y
plotBCV(y)
#test for DE
et <- exactTest(y)
top <- topTags(et)
top
cpm(y)[rownames(top), ]
#total number of genes at 5% FDR
summary(de <- decideTestsDGE(et)) #-1 equals
upregulate,
1 equals down regulated: need to be carefuul though as order
(control,vs
treatment matters)
top <- topTags(et,n=40) #take top n genes
write.table(top,file="DEgenesMales2.txt",col.names=TRUE,sep="\t",appen
d=FALSE)
#move column headers one column to the right (inapproporiate format
due to
as.matrix)
detags <- rownames(y)[as.logical(de)]
plotSmear(et, de.tags=detags)
abline(h=c(-1, 1), col="blue")
--
Mark R. Christie, PhD
Department of Zoology,
Oregon State University
http://www.science.oregonstate.edu/~christim/
[[alternative HTML version deleted]]