Hi Mark:
    Hello,
    Sorry for the unclear demonstration in the previous email. My
question
is how to find "true" normalization factors in edgeR given non-DE
genes? I
think that if edgeR knows non-DE genes, it would have lower false
positive
rates (find fewer wrong DE genes). By studying this, I want to have a
better understanding of how edgeR calculates normalization factors.
    Here is my sessionInfo():
    > sessionInfo()
       R version 3.1.0 (2014-04-10)
       Platform: x86_64-apple-darwin13.1.0 (64-bit)
       locale:
       [1]
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
       attached base packages:
       [1] grid      splines   parallel  stats     graphics  grDevices
utils     datasets  methods   base
       other attached packages:
       [1] gdata_2.13.3             tm_0.6                   NLP_0.1-3
           gridExtra_0.9.1          tweeDEseqCountData_1.2.0
       [6] Biobase_2.24.0           BiocGenerics_0.10.0
edgeR_3.6.7
           limma_3.20.8
       loaded via a namespace (and not attached):
       [1] gtools_3.4.1 slam_0.1-32  tools_3.1.0
    Here is a reproducible example:
    1. Run the source code written by Xiaobei and you:
        Pipeline:
http://130.60.190.4/robinson_lab/edgeR_robust/pipeline.pdf
        Source:
http://130.60.190.4/robinson_lab/edgeR_robust/robust_simulation.R
    2. Simulate NB data:
        library(tweeDEseqCountData)
        data(pickrell)
        pickrell<-as.matrix(exprs(pickrell.eset))
        set.seed(3)
        nSamp<-8
        grp <- as.factor(rep(0:1, each = nSamp/2))       ## set up
group: 4
vs 4
        data_2 <- NBsim(foldDiff=3,dataset =pickrell, nTags = 1000,
group =
grp, verbose = TRUE, add.outlier = TRUE,
                                  outlierMech = "S", pOutlier = 0.1,
drop.extreme.dispersion = 0.1,pDiff=0.3,pUp=0.9)
    3. Plot the true DE genes I have put in:
       d <- DGEList(counts=data_2$counts, group=grp,
lib.size=colSums(data_2$counts))
       rownames(d$counts) <- paste("ids",1:nrow(d$counts),sep="")
       d <- estimateCommonDisp(d)
       true.tags<-rownames(data_2$counts[data_2$indDE,])
       plotSmear(d, de.tags=true.tags)
    4. Introduce two methods of calculating "true" normalization
factors
with non-DE genes. The first method works on log of non-DE counts. It
first
calculates the difference of log non-DE counts between each individual
(column) and first individual (column). Then I take exponential of the
median of each column to get "true" normalization factors. The second
method is using calcNormFactors() function, but with non-DE genes.
        calNorm<-function(counts,group,indNonDE){
            counts<-counts+1            ## Avoid zero counts issue
            ## First method: Calculate true normalization factors
based on
median of log counts difference, and stored them in
"norm_edgeR_median"
            ## variable.
            norm<-rep(NA,8)
            baseline<-counts[indNonDE,1]
            for (i in 1:length(norm))
            {
                norm_temp<-data_2$counts[indNonDE,i]
                norm[i]<-median(log(norm_temp)-log(baseline))
            }
            norm_edgeR_median<-exp(norm)
            ## Forced all factors multiple to 1
norm_edgeR_median<-norm_edgeR_median/exp(mean(log(norm_edgeR_median)))
            ## Second method: Calculate true normalization factors for
edgeR with none-DE genes, using calcNormFactors() function,
            ## and stored them in "norm_edgeR_calc" variable.
            counts_edgeR<-counts[indNonDE,]
            de <- DGEList(counts = counts_edgeR, group = group )
            de <- calcNormFactors(de,method="TMM")
            norm_edgeR_calc<-de$samples$norm.factors
list(norm_edgeR_median=norm_edgeR_median,norm_edgeR_calc=norm_edgeR_ca
lc)
         }
        library(gridExtra)
        norm_factors<-calNorm(data_2$counts,grp,data_2$indNonDE)
       edgeR_median.pfun <-
           function(counts, group, design = NULL, mc.cores = 4,
prior.df=10)
           {
                ## edgeR standard pipeline ##
                library(edgeR)
                d <- DGEList(counts = counts, group = group )
                d <- calcNormFactors(d,method="TMM")
                ### Use median of log difference of none-DE genes as
normalization factors
                d$samples$norm.factors<-norm_factors$norm_edgeR_median
                d <- estimateGLMCommonDisp(d, design = design)
                d <- estimateGLMTrendedDisp(d,design=design)
                d <- estimateGLMTagwiseDisp(d, design = design,
prior.df =
prior.df)
                f <- glmFit(d, design = design)
                lr <- glmLRT(f, coef=2)
                pval = lr$table$PValue
                padj = p.adjust(pval, "BH")
                cbind(pval = pval, padj = padj)
             }
        edgeR_calc.pfun <-
           function(counts, group, design = NULL, mc.cores = 4,
prior.df=10)
           {
                ## edgeR standard pipeline ##
                library(edgeR)
                d <- DGEList(counts = counts, group = group )
                d <- calcNormFactors(d,method="TMM")
                ### Use calcNormFactors() with none-DE genes to
calculate
normalization factors.
                d$samples$norm.factors<-norm_factors$norm_edgeR_calc
                d <- estimateGLMCommonDisp(d, design = design)
                d <- estimateGLMTrendedDisp(d,design=design)
                d <- estimateGLMTagwiseDisp(d, design = design,
prior.df =
prior.df)
                f <- glmFit(d, design = design)
                lr <- glmLRT(f, coef=2)
                pval = lr$table$PValue
                padj = p.adjust(pval, "BH")
                cbind(pval = pval, padj = padj)
            }
       5. Generate false positive plot:
           pvals_2 <- pval(data_2, method
=c("edgeR","edgeR_median","edgeR_calc"), count.type =
"counts",mc.cores = 4)
           fdPlot(pvals_2, cex = 0.6,xlim=c(0,700))
       I think the result is unexpected because edgeR_median and
edgeR_calc
have higher false positive rate than default edgeR. The reason would
be
that I am not using a proper way to calculate "true" normalization
factors
with non-DE genes? Thanks for your time!
Best regards,
Tianyu Zhan
2014-08-10 7:24 GMT-04:00 Mark Robinson <mark.robinson@imls.uzh.ch>:
> And, please keep the discussion on the list.  Others can benefit.
>
> Also, re-read the Posting guide:
> 
http://www.bioconductor.org/help/mailing-list/posting-guide/
>
> M.
>
>
>
>
> On 10.08.2014, at 13:08, Mark Robinson <mark.robinson@imls.uzh.ch>
wrote:
>
> > Hi Tianyu,
> >
> > Good questions get good answers.
> >
> > If you want my help, you should make it easier for me:
> >
> > 1. Answer my original questions. In particular, look at your data
to see
> what normalization factors you've introduced.  That may give you
some
> indication of if you've done something wrong.
> >
> > 2. Instead of sending me a bunch of files, tell me exactly what I
should
> look at.
> >
> > Mark
> >
> >
> > ----------
> > Prof. Dr. Mark Robinson
> > Statistical Bioinformatics, Institute of Molecular Life Sciences
> > University of Zurich
> > 
http://ow.ly/riRea
> >
> >
> >
> >
> >
> >
> >
> > On 09.08.2014, at 20:28, Zhan Tianyu <sewen67@gmail.com> wrote:
> >
> >> Hi Professor Robinson:
> >>     Hello,
> >>     Thanks for your response. It is true that in reality we never
know
> what is non-DE genes. However, I want to have a better understanding
of
> edgeR and its performance. I thank that if edgeR knows non-DE genes
in
> advance, it would have lower false positive rates (find more right
DE
> genes).
> >>     The attachments contains R scripts for data simulation and
false
> positive plots. You could first run the source code of "R_robust.R",
> "calNorm.R", and "simulation_normal.R". "R_robust.R" is written by
you and
> Xiaobei, and I also added three functions of edgeR:
edgdR_median.pfun,
> edgeR_calc.pfun, and edgeR_ratio.pfun. All the three methods take
advantage
> of known non-DE genes. The details of how I calculate "true"
normalization
> factors are in the file "calNorm.R" file.
> >>     Then I tested the default edgeR, and three functions with
known
> non-DE genes in two simulations: normal simulation ( Nsim() function
in the
> file "simulation_normal.R"), and NB simulation ( NBsim() in the file
> "R_robust"). However, the false positive plots show that default
edgeR
> performed the best (it has the lowest false positive rates). I think
the
> reason is that I am not using the right way to calculate true
normalization
> factors for edgeR, assuming that I know non-DE genes. I am wondering
what
> is a proper way to find the true normalization factors edgeR with
non-DE
> genes? Thanks for your time!
> >>
> >> Best regards,
> >> Tianyu Zhan
> >>
> >>
> >> 2014-08-09 10:34 GMT-04:00 Mark Robinson
<mark.robinson@imls.uzh.ch>:
> >> Hi Tianyu,
> >>
> >> Sorry for the slow response (was away).
> >>
> >> I have read your attachment, but your question and simulation is
still
> a little unclear to me.  First, some useful diagnostics for me to
see
> whether this simulation is in any way reasonable to a real RNA-seq
dataset
> are an M-A plot -- plotSmear(), perhaps highlighting the true DE
genes you
> have put in -- or a dispersion-mean plot -- plotBCV().  Also, there
is no
> code in there to show exactly what you did when you say 'I replaced
the
> "norm" in third line of above codes with the "true normalization
factors"';
> what did you exactly do?  Maybe send a completely reproducible
example ?
>  Give your sessionInfo() too.
> >>
> >>> I am wondering what is a better procedure
> >>> of finding the "true" normalization factors in edgeR, given
which
> genes are
> >>> none-DE?
> >>
> >> The other concern is that, in practice, you never know what is
non-DE.
>  So, how is this useful?
> >>
> >> Cheers, Mark
> >>
> >>
> >> ----------
> >> Prof. Dr. Mark Robinson
> >> Statistical Bioinformatics, Institute of Molecular Life Sciences
> >> University of Zurich
> >> 
http://ow.ly/riRea
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> On 04.08.2014, at 22:44, Zhan Tianyu <sewen67@gmail.com> wrote:
> >>
> >>> Hi all,
> >>>
> >>>      I have a question concerning the normalization factors in
edgeR. I
> >>> think that if we know which genes are none-DE genes in advance,
we
> could
> >>> calculate "true" normalization factors based on those
information.
> Given
> >>> "true normalization factors", edgeR could find more right DE
genes, or
> even
> >>> find all the right DE genes when the simulation is simple.
> >>>      The procedures I used in edgeR are:
> >>>          library(edgeR)
> >>>          dge <- DGEList(counts = counts, group = group )
> >>>          norm <- calcNormFactors(dge,method="TMM")
> >>>          d <- estimateGLMCommonDisp(norm, design = design)
> >>>          d <- estimateGLMTrendedDisp(d,design=design)
> >>>          d <- estimateGLMTagwiseDisp(d, design = design,
prior.df = 10)
> >>>          f <- glmFit(d, design = design)
> >>>          lr <- glmLRT(f, coef=2)
> >>>          pval = lr$table$PValue
> >>>          padj = p.adjust(pval, "BH")
> >>>          cbind(pval = pval, padj = padj)
> >>>       I used the order of p-value or adjust p-value to label DE
genes
> >>> (for example, the first 300 genes in the order of p value from
low to
> >>> hight, in a scenario with 30% DE in 1000 genes). In a simple
simulation
> >>> with 30% asymmetry DE in 1000 total genes, the default edgeR
would
> find 80
> >>> wrong DE genes in 300 DE genes. The simulation method is in the
> attachment,
> >>> and I used 10 genes as a demonstration.
> >>>       Then I tried two methods to calculate "true" normalization
> >>> factors". The first one is using calnormFactors() function, but
using
> >>> counts from all none-DE genes. The second one is taking log of
all the
> >>> none-DE counts, and then calculate the median of
> >>> log(counts)[,i]-log(counts)[,1], where i is the index of each
> individual
> >>> (each row in the count matrix). Then I used
> norm<-norm/exp(mean(log(norm)))
> >>> to make sure that all factors multiple to one.
> >>>        After that, I replaced the "norm" in third line of above
codes
> >>> with the "true normalization factors". However, both methods
would
> find 110
> >>> wrong DE genes, which is higher than the false discovery rates
of
> default
> >>> edgeR method (80 wrong DE genes). I am wondering what is a
better
> procedure
> >>> of finding the "true" normalization factors in edgeR, given
which
> genes are
> >>> none-DE?
> >>>        Thank you!
> >>>
> >>> Best regards,
> >>> Tianyu Zhan
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor@r-project.org
> >>> 
https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> 
http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> <r scripts="" tianyu="" zhan.zip="">
> >
>
>
        [[alternative HTML version deleted]]