Search
Question: EdgeR LogCpm and LogFC values calculation
0
gravatar for Gordon Smyth
3.8 years ago by
Gordon Smyth32k
Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Gordon Smyth32k wrote:
Dear Koen, > Date: Fri, 31 Jan 2014 21:09:12 +0100 > From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> > To: "P.D. Moerland" <p.d.moerland at="" amc.uva.nl=""> > Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> > Subject: Re: [BioC] EdgeR LogCpm and LogFC values calculation > > Dear Perry, > > Thank you for this clarification. I required this information as I > wanted to plot the change in Log2 Cpm for my various DE genes. I am having some trouble understanding what you are trying to do. Why would you need to recompute a value (the average logCPM) that is already provided as a column of the topTags output table? The logFC (which is the change in log2 cpm) is already provided as well. Note that the logCPM column of the topTags output represents average logCPM over all the libraries. This is the quantity used as the x-axis for plotBCV() and plotSmear() and other similar functions in edgeR. > Would you then consider my prior way of Log2 Cpm calculation (first > calculating cpm through cpm() function, then taking the log2) If you want log2 CPM for each individual sample, please use cpm(y, log=TRUE) rather than log2(cpm(y)) > as wrong for plotting this certain result as it might not be a very > correct way in dealing with the count data, or are it still log 2 cpm > values, but simply calculated in an other fashion and maybe with a > somewhat different interpretation? Yes, it makes sense to plot the logCPM values that you get from cpm(), but you should view this as being a descriptive plot of the data rather than an exact representation of the fitted model. You cannot reproduce the logFC values from topTags() from them. The edgeR logFC is more complicated and better. > Likewise, starting from the code you have provided me with, would you > consider > > abundance2 <- t(t(y)+prior.count.scaled) > > to be more correct information for plotting the log2 cpm values for DE > genes; or are these values only relevant when combined with the > mglmOneGroup() function for calculating the context specific average? I really feel that you should not need to hack the edgeR code in this way. If the above hasn't already answered your questions, can you describe more precisely the plot you are tring to create? Are you trying to plot the logCPM for each sample, or for each experimental group, or just the logFC? What are you trying to learn? Best wishes Gordon > Thank you in advance, > Koen Van den Berge > On 31 Jan 2014, at 14:13, P.D. Moerland <p.d.moerland at="" amc.uva.nl=""> wrote: > >> Dear Koen, >> >> The source code of the function aveLogCPM (included in the package >> source of the current release version 3.4.2) shows that the function >> mglmOneGroup needs some additional parameters and other default values >> than the ones you used: >> >> prior.count <- 2 >> dispersion <- 0.05 >> # y is the matrix of the raw counts >> y <- as.matrix(y) >> if(is.null(lib.size)) lib.size <- colSums(y) >> prior.count.scaled <- lib.size/mean(lib.size) * prior.count >> offset <- log(lib.size+2*prior.count.scaled) >> abundance <- mglmOneGroup(t(t(y)+prior.count.scaled),dispersion=dis persion,offset=offset) >> (abundance+log(1e6)) / log(2) >> >> On a small test dataset this gives me the same values as the AveLogCPM component of a DGELRT object. Could you give this a try? >> >> best, >> Perry >> >> --- >> Perry Moerland, PhD >> Room J1B-215 >> Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics and Bioinformatics >> Academic Medical Center, University of Amsterdam >> Postbus 22660, 1100 DD Amsterdam, The Netherlands >> tel: +31 20 5666945 >> p.d.moerland at amc.uva.nl, http://www.bioinformaticslaboratory.nl/ >> >> >> -----Original Message----- >> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Koen Van den Berge >> Sent: Friday, January 31, 2014 1:08 PM >> To: bioconductor at r-project.org >> Subject: [BioC] EdgeR LogCpm and LogFC values calculation >> >> Dear Bioconductor mailing list, >> >> I am currently researching the differences in RNA-Seq data analysis, >> comparing the two well known EdgeR and Voom methods. However, there is >> one thing I can not manage to reproduce, namely the logCPM value in the >> output of the LRT table of EdgeR, after analyzing a certain contrast or >> coefficient. I understand from the manual and helpfile, that this >> logCPM value is a log2 counts per million, normalized for library >> sizes. I also understand that it is not a simple mean that is being >> used, but rather the mglmOneGroup() function. However, when I try to >> recalculate this myself, I can not obtain the same value. My workflow >> in doing so looks like this: >> >> #Calculate through the table >> counts <- read.delim("file.txt") >> Treat <- factor(rep(c(rep(c("Cont","DPN","OHT"),4)),each=2))[-19] >> #Delete removed sample 19 by authors >> >> y.s <- DGEList(counts=counts.filtered.smart,group=Treat) >> y.s <- calcNormFactors(y.s) >> y.common <- estimateGLMCommonDisp(y, design, verbose=TRUE) y.trended <- estimateGLMTrendedDisp(y.common, design) y.tagwise <- estimateGLMTagwiseDisp(y.trended, design,prior.df=22) lrt <- glmLRT(fit,coef= c(5,6,9,8)) lrt$table$logCPM >> >> #Calculate manually >> cpmstest <- cpm(y, normalized.lib.sizes = TRUE) LogCpmstest <- log(cpmstest + 0.5, base = 2) >> mglmOneGroup(LogCpmstest) #not identical >> >> mean(LogCpmstest) #not identical >> >> >> mglmOneGroup(y.tagwise) #also non-identical >> >> Have you got any recommendations in what should be changed in this manual calculation workflow? What am I doing wrong? >> >> Any help would be greatly appreciated. >> >> Sincerely, Koen. ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD COMMENTlink modified 3.8 years ago by Koen Van den Berge90 • written 3.8 years ago by Gordon Smyth32k
0
gravatar for Koen Van den Berge
3.8 years ago by
Ghent University, Belgium
Koen Van den Berge90 wrote:
Dear Gordon, Thank you for your reply. Since the LRT table only gives a LogCPM for the average of each gene, my desire was to calculate this for each sample separately, for all genes. In this way, I wanted to create a dot plot - as added in the appendix - which could visually show the changes in expression values for the genes we have found to be significantly differentially expressed (note I had three samples in the control group and four in the treatment group). This is my motivation in reaching out to the individual LogCPM values and since these did not coincide with the ones provided by the table, I was worried. Since the fitted model reproduces a single coefficient for each treatment group, I guess it will not be possible to calculate sample- wise log2 cpms that would be similar to the ones provided by the table, and maybe I should stick to my descriptive log2 cpm plots as provided in the appendix? Thank you in advance, Koen -------------- next part -------------- A non-text attachment was scrubbed... Name: DPN24Genes.pdf Type: application/pdf Size: 10620 bytes Desc: not available URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20140202="" 5ed0edcd="" attachment.pdf=""> -------------- next part -------------- On 02 Feb 2014, at 03:47, Gordon K Smyth <smyth at="" wehi.edu.au=""> wrote: > Dear Koen, > >> Date: Fri, 31 Jan 2014 21:09:12 +0100 >> From: Koen Van den Berge <koen.vdberge at="" gmail.com=""> >> To: "P.D. Moerland" <p.d.moerland at="" amc.uva.nl=""> >> Cc: "bioconductor at r-project.org" <bioconductor at="" r-project.org=""> >> Subject: Re: [BioC] EdgeR LogCpm and LogFC values calculation >> >> Dear Perry, >> >> Thank you for this clarification. I required this information as I wanted to plot the change in Log2 Cpm for my various DE genes. > > I am having some trouble understanding what you are trying to do. Why would you need to recompute a value (the average logCPM) that is already provided as a column of the topTags output table? The logFC (which is the change in log2 cpm) is already provided as well. > > Note that the logCPM column of the topTags output represents average logCPM over all the libraries. This is the quantity used as the x-axis for plotBCV() and plotSmear() and other similar functions in edgeR. > >> Would you then consider my prior way of Log2 Cpm calculation (first calculating cpm through cpm() function, then taking the log2) > > If you want log2 CPM for each individual sample, please use > > cpm(y, log=TRUE) > > rather than > > log2(cpm(y)) > >> as wrong for plotting this certain result as it might not be a very correct way in dealing with the count data, or are it still log 2 cpm values, but simply calculated in an other fashion and maybe with a somewhat different interpretation? > > Yes, it makes sense to plot the logCPM values that you get from cpm(), but you should view this as being a descriptive plot of the data rather than an exact representation of the fitted model. You cannot reproduce the logFC values from topTags() from them. The edgeR logFC is more complicated and better. > >> Likewise, starting from the code you have provided me with, would you consider >> >> abundance2 <- t(t(y)+prior.count.scaled) >> >> to be more correct information for plotting the log2 cpm values for DE genes; or are these values only relevant when combined with the mglmOneGroup() function for calculating the context specific average? > > I really feel that you should not need to hack the edgeR code in this way. > > If the above hasn't already answered your questions, can you describe more precisely the plot you are tring to create? Are you trying to plot the logCPM for each sample, or for each experimental group, or just the logFC? What are you trying to learn? > > Best wishes > Gordon > > >> Thank you in advance, >> Koen Van den Berge > > > >> On 31 Jan 2014, at 14:13, P.D. Moerland <p.d.moerland at="" amc.uva.nl=""> wrote: >> >>> Dear Koen, >>> >>> The source code of the function aveLogCPM (included in the package source of the current release version 3.4.2) shows that the function mglmOneGroup needs some additional parameters and other default values than the ones you used: >>> >>> prior.count <- 2 >>> dispersion <- 0.05 >>> # y is the matrix of the raw counts >>> y <- as.matrix(y) >>> if(is.null(lib.size)) lib.size <- colSums(y) >>> prior.count.scaled <- lib.size/mean(lib.size) * prior.count >>> offset <- log(lib.size+2*prior.count.scaled) >>> abundance <- mglmOneGroup(t(t(y)+prior.count.scaled),dispersion=di spersion,offset=offset) >>> (abundance+log(1e6)) / log(2) >>> >>> On a small test dataset this gives me the same values as the AveLogCPM component of a DGELRT object. Could you give this a try? >>> >>> best, >>> Perry >>> >>> --- >>> Perry Moerland, PhD >>> Room J1B-215 >>> Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics and Bioinformatics >>> Academic Medical Center, University of Amsterdam >>> Postbus 22660, 1100 DD Amsterdam, The Netherlands >>> tel: +31 20 5666945 >>> p.d.moerland at amc.uva.nl, http://www.bioinformaticslaboratory.nl/ >>> >>> >>> -----Original Message----- >>> From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Koen Van den Berge >>> Sent: Friday, January 31, 2014 1:08 PM >>> To: bioconductor at r-project.org >>> Subject: [BioC] EdgeR LogCpm and LogFC values calculation >>> >>> Dear Bioconductor mailing list, >>> >>> I am currently researching the differences in RNA-Seq data analysis, comparing the two well known EdgeR and Voom methods. However, there is one thing I can not manage to reproduce, namely the logCPM value in the output of the LRT table of EdgeR, after analyzing a certain contrast or coefficient. I understand from the manual and helpfile, that this logCPM value is a log2 counts per million, normalized for library sizes. I also understand that it is not a simple mean that is being used, but rather the mglmOneGroup() function. However, when I try to recalculate this myself, I can not obtain the same value. My workflow in doing so looks like this: >>> >>> #Calculate through the table >>> counts <- read.delim("file.txt") >>> Treat <- factor(rep(c(rep(c("Cont","DPN","OHT"),4)),each=2))[-19] >>> #Delete removed sample 19 by authors >>> >>> y.s <- DGEList(counts=counts.filtered.smart,group=Treat) >>> y.s <- calcNormFactors(y.s) >>> y.common <- estimateGLMCommonDisp(y, design, verbose=TRUE) y.trended <- estimateGLMTrendedDisp(y.common, design) y.tagwise <- estimateGLMTagwiseDisp(y.trended, design,prior.df=22) lrt <- glmLRT(fit,coef= c(5,6,9,8)) lrt$table$logCPM >>> >>> #Calculate manually >>> cpmstest <- cpm(y, normalized.lib.sizes = TRUE) LogCpmstest <- log(cpmstest + 0.5, base = 2) >>> mglmOneGroup(LogCpmstest) #not identical >>> >>> mean(LogCpmstest) #not identical >>> >>> >>> mglmOneGroup(y.tagwise) #also non-identical >>> >>> Have you got any recommendations in what should be changed in this manual calculation workflow? What am I doing wrong? >>> >>> Any help would be greatly appreciated. >>> >>> Sincerely, Koen. > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:9}}
ADD COMMENTlink written 3.8 years ago by Koen Van den Berge90
On Sun, 2 Feb 2014, Koen Van den Berge wrote: > Dear Gordon, > > Thank you for your reply. > Since the LRT table only gives a LogCPM for the average of each gene, my > desire was to calculate this for each sample separately, for all genes. > In this way, I wanted to create a dot plot - as added in the appendix - > which could visually show the changes in expression values for the genes > we have found to be significantly differentially expressed (note I had > three samples in the control group and four in the treatment group). > This is my motivation in reaching out to the individual LogCPM values > and since these did not coincide with the ones provided by the table, I > was worried. > Since the fitted model reproduces a single coefficient for each > treatment group, I guess it will not be possible to calculate > sample-wise log2 cpms that would be similar to the ones provided by the > table, and maybe I should stick to my descriptive log2 cpm plots as > provided in the appendix? Yes. As you say, the fitted model doesn't estimate sample-wise values. You need to plot the data, not the model. Gordon > Thank you in advance, > Koen > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}
ADD REPLYlink written 3.8 years ago by Gordon Smyth32k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 208 users visited in the last hour