Hi Mark:
Hello,
Sorry for the unclear demonstration in the previous email. My
question
is how to find "true" normalization factors in edgeR given non-DE
genes? I
think that if edgeR knows non-DE genes, it would have lower false
positive
rates (find fewer wrong DE genes). By studying this, I want to have a
better understanding of how edgeR calculates normalization factors.
Here is my sessionInfo():
> sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin13.1.0 (64-bit)
locale:
[1]
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] grid splines parallel stats graphics grDevices
utils datasets methods base
other attached packages:
[1] gdata_2.13.3 tm_0.6 NLP_0.1-3
gridExtra_0.9.1 tweeDEseqCountData_1.2.0
[6] Biobase_2.24.0 BiocGenerics_0.10.0
edgeR_3.6.7
limma_3.20.8
loaded via a namespace (and not attached):
[1] gtools_3.4.1 slam_0.1-32 tools_3.1.0
Here is a reproducible example:
1. Run the source code written by Xiaobei and you:
Pipeline:
http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf
Source:
http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R
2. Simulate NB data:
library(tweeDEseqCountData)
data(pickrell)
pickrell<-as.matrix(exprs(pickrell.eset))
set.seed(3)
nSamp<-8
grp <- as.factor(rep(0:1, each = nSamp/2)) ## set up
group: 4
vs 4
data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000,
group =
grp, verbose = TRUE, add.outlier = TRUE,
outlierMech = "S", pOutlier = 0.1,
drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9)
3. Plot the true DE genes I have put in:
d <- DGEList(counts=data_2$counts, group=grp,
lib.size=colSums(data_2$counts))
rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="")
d <- estimateCommonDisp(d)
true.tags<-rownames(data_2$counts[data_2$indDE,])
plotSmear(d, de.tags=true.tags)
4. Introduce two methods of calculating "true" normalization
factors
with non-DE genes. The first method works on log of non-DE counts. It
first
calculates the difference of log non-DE counts between each individual
(column) and first individual (column). Then I take exponential of the
median of each column to get "true" normalization factors. The second
method is using calcNormFactors() function, but with non-DE genes.
calNorm<-function(counts,group,indNonDE){
counts<-counts+1 ## Avoid zero counts issue
## First method: Calculate true normalization factors
based on
median of log counts difference, and stored them in
"norm_edgeR_median"
## variable.
norm<-rep(NA,8)
baseline<-counts[indNonDE,1]
for (i in 1:length(norm))
{
norm_temp<-data_2$counts[indNonDE,i]
norm[i]<-median(log(norm_temp)-log(baseline))
}
norm_edgeR_median<-exp(norm)
## Forced all factors multiple to 1
norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median)))
## Second method: Calculate true normalization factors for
edgeR with none-DE genes, using calcNormFactors() function,
## and stored them in "norm_edgeR_calc" variable.
counts_edgeR<-counts[indNonDE,]
de <- DGEList(counts = counts_edgeR, group = group )
de <- calcNormFactors(de,method="TMM")
norm_edgeR_calc<-de$samples$norm.factors
list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc=norm_edgeR_ca
lc)
}
library(gridExtra)
norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE)
edgeR_median.pfun <-
function(counts, group, design = NULL, mc.cores = 4,
prior.df=10)
{
## edgeR standard pipeline ##
library(edgeR)
d <- DGEList(counts = counts, group = group )
d <- calcNormFactors(d,method="TMM")
### Use median of log difference of none-DE genes as
normalization factors
d$samples$norm.factors<-norm_factors$norm_edgeR_median
d <- estimateGLMCommonDisp(d, design = design)
d <- estimateGLMTrendedDisp(d,design=design)
d <- estimateGLMTagwiseDisp(d, design = design,
prior.df =
prior.df)
f <- glmFit(d, design = design)
lr <- glmLRT(f, coef=2)
pval = lr$table$PValue
padj = p.adjust(pval, "BH")
cbind(pval = pval, padj = padj)
}
edgeR_calc.pfun <-
function(counts, group, design = NULL, mc.cores = 4,
prior.df=10)
{
## edgeR standard pipeline ##
library(edgeR)
d <- DGEList(counts = counts, group = group )
d <- calcNormFactors(d,method="TMM")
### Use calcNormFactors() with none-DE genes to
calculate
normalization factors.
d$samples$norm.factors<-norm_factors$norm_edgeR_calc
d <- estimateGLMCommonDisp(d, design = design)
d <- estimateGLMTrendedDisp(d,design=design)
d <- estimateGLMTagwiseDisp(d, design = design,
prior.df =
prior.df)
f <- glmFit(d, design = design)
lr <- glmLRT(f, coef=2)
pval = lr$table$PValue
padj = p.adjust(pval, "BH")
cbind(pval = pval, padj = padj)
}
5. Generate false positive plot:
pvals_2 <- pval(data_2, method
=c("edgeR","edgeR_median","edgeR_calc"), count.type =
"counts",mc.cores = 4)
fdPlot(pvals_2, cex = 0.6,xlim=c(0,700))
I think the result is unexpected because edgeR_median and
edgeR_calc
have higher false positive rate than default edgeR. The reason would
be
that I am not using a proper way to calculate "true" normalization
factors
with non-DE genes? Thanks for your time!
Best regards,
Tianyu Zhan
2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson@imls.uzh.ch>:
> And, please keep the discussion on the list. Others can benefit.
>
> Also, re-read the Posting guide:
>
http://www.bioconductor.org/help/mailing-list/posting-guide/
>
> M.
>
>
>
>
> On 10.08.2014, at 13:08, Mark Robinson <mark.robinson@imls.uzh.ch>
wrote:
>
> > Hi Tianyu,
> >
> > Good questions get good answers.
> >
> > If you want my help, you should make it easier for me:
> >
> > 1. Answer my original questions. In particular, look at your data
to see
> what normalization factors you've introduced. That may give you
some
> indication of if you've done something wrong.
> >
> > 2. Instead of sending me a bunch of files, tell me exactly what I
should
> look at.
> >
> > Mark
> >
> >
> > ----------
> > Prof. Dr. Mark Robinson
> > Statistical Bioinformatics, Institute of Molecular Life Sciences
> > University of Zurich
> >
http://ow.ly/riRea
> >
> >
> >
> >
> >
> >
> >
> > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67@gmail.com> wrote:
> >
> >> Hi Professor Robinson:
> >> Hello,
> >> Thanks for your response. It is true that in reality we never
know
> what is non-DE genes. However, I want to have a better understanding
of
> edgeR and its performance. I thank that if edgeR knows non-DE genes
in
> advance, it would have lower false positive rates (find more right
DE
> genes).
> >> The attachments contains R scripts for data simulation and
false
> positive plots. You could first run the source code of "R_robust.R",
> "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by
you and
> Xiaobei, and I also added three functions of edgeR:
edgdR_median.pfun,
> edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take
advantage
> of known non-DE genes. The details of how I calculate "true"
normalization
> factors are in the file "calNorm.R" file.
> >> Then I tested the default edgeR, and three functions with
known
> non-DE genes in two simulations: normal simulation ( Nsim() function
in the
> file "simulation_normal.R"), and NB simulation ( NBsim() in the file
> "R_robust"). However, the false positive plots show that default
edgeR
> performed the best (it has the lowest false positive rates). I think
the
> reason is that I am not using the right way to calculate true
normalization
> factors for edgeR, assuming that I know non-DE genes. I am wondering
what
> is a proper way to find the true normalization factors edgeR with
non-DE
> genes? Thanks for your time!
> >>
> >> Best regards,
> >> Tianyu Zhan
> >>
> >>
> >> 2014-08-09 10:34 GMT-04:00 Mark Robinson
<mark.robinson@imls.uzh.ch>:
> >> Hi Tianyu,
> >>
> >> Sorry for the slow response (was away).
> >>
> >> I have read your attachment, but your question and simulation is
still
> a little unclear to me. First, some useful diagnostics for me to
see
> whether this simulation is in any way reasonable to a real RNA-seq
dataset
> are an M-A plot -- plotSmear(), perhaps highlighting the true DE
genes you
> have put in -- or a dispersion-mean plot -- plotBCV(). Also, there
is no
> code in there to show exactly what you did when you say 'I replaced
the
> "norm" in third line of above codes with the "true normalization
factors"';
> what did you exactly do? Maybe send a completely reproducible
example ?
> Give your sessionInfo() too.
> >>
> >>> I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given
which
> genes are
> >>> none-DE?
> >>
> >> The other concern is that, in practice, you never know what is
non-DE.
> So, how is this useful?
> >>
> >> Cheers, Mark
> >>
> >>
> >> ----------
> >> Prof. Dr. Mark Robinson
> >> Statistical Bioinformatics, Institute of Molecular Life Sciences
> >> University of Zurich
> >>
http://ow.ly/riRea
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I have a question concerning the normalization factors in
edgeR. I
> >>> think that if we know which genes are none-DE genes in advance,
we
> could
> >>> calculate "true" normalization factors based on those
information.
> Given
> >>> "true normalization factors", edgeR could find more right DE
genes, or
> even
> >>> find all the right DE genes when the simulation is simple.
> >>> The procedures I used in edgeR are:
> >>> library(edgeR)
> >>> dge <- DGEList(counts = counts, group = group )
> >>> norm <- calcNormFactors(dge,method="TMM")
> >>> d <- estimateGLMCommonDisp(norm, design = design)
> >>> d <- estimateGLMTrendedDisp(d,design=design)
> >>> d <- estimateGLMTagwiseDisp(d, design = design,
prior.df = 10)
> >>> f <- glmFit(d, design = design)
> >>> lr <- glmLRT(f, coef=2)
> >>> pval = lr$table$PValue
> >>> padj = p.adjust(pval, "BH")
> >>> cbind(pval = pval, padj = padj)
> >>> I used the order of p-value or adjust p-value to label DE
genes
> >>> (for example, the first 300 genes in the order of p value from
low to
> >>> hight, in a scenario with 30% DE in 1000 genes). In a simple
simulation
> >>> with 30% asymmetry DE in 1000 total genes, the default edgeR
would
> find 80
> >>> wrong DE genes in 300 DE genes. The simulation method is in the
> attachment,
> >>> and I used 10 genes as a demonstration.
> >>> Then I tried two methods to calculate "true" normalization
> >>> factors". The first one is using calnormFactors() function, but
using
> >>> counts from all none-DE genes. The second one is taking log of
all the
> >>> none-DE counts, and then calculate the median of
> >>> log(counts)[,i]-log(counts)[,1], where i is the index of each
> individual
> >>> (each row in the count matrix). Then I used
> norm<-norm/exp(mean(log(norm)))
> >>> to make sure that all factors multiple to one.
> >>> After that, I replaced the "norm" in third line of above
codes
> >>> with the "true normalization factors". However, both methods
would
> find 110
> >>> wrong DE genes, which is higher than the false discovery rates
of
> default
> >>> edgeR method (80 wrong DE genes). I am wondering what is a
better
> procedure
> >>> of finding the "true" normalization factors in edgeR, given
which
> genes are
> >>> none-DE?
> >>> Thank you!
> >>>
> >>> Best regards,
> >>> Tianyu Zhan
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor@r-project.org
> >>>
https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
>
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> <r scripts="" tianyu="" zhan.zip="">
> >
>
>
[[alternative HTML version deleted]]