EdgeR LogCpm and LogFC values calculation

0

Entering edit mode

Koen Van den Berge ▴ 180

@koen-van-den-berge-6369

Last seen 13 months ago

Ghent University, Belgium

Dear Bioconductor mailing list, I am currently researching the differences in RNA-Seq data analysis, comparing the two well known EdgeR and Voom methods. However, there is one thing I can not manage to reproduce, namely the logCPM value in the output of the LRT table of EdgeR, after analyzing a certain contrast or coefficient. I understand from the manual and helpfile, that this logCPM value is a log2 counts per million, normalized for library sizes. I also understand that it is not a simple mean that is being used, but rather the mglmOneGroup() function. However, when I try to recalculate this myself, I can not obtain the same value. My workflow in doing so looks like this: #Calculate through the table counts <- read.delim("file.txt") Treat <- factor(rep(c(rep(c("Cont","DPN","OHT"),4)),each=2))[-19] #Delete removed sample 19 by authors y.s <- DGEList(counts=counts.filtered.smart,group=Treat) y.s <- calcNormFactors(y.s) y.common <- estimateGLMCommonDisp(y, design, verbose=TRUE) y.trended <- estimateGLMTrendedDisp(y.common, design) y.tagwise <- estimateGLMTagwiseDisp(y.trended, design,prior.df=22) lrt <- glmLRT(fit,coef= c(5,6,9,8)) lrt$table$logCPM #Calculate manually cpmstest <- cpm(y, normalized.lib.sizes = TRUE) LogCpmstest <- log(cpmstest + 0.5, base = 2) mglmOneGroup(LogCpmstest) #not identical mean(LogCpmstest) #not identical mglmOneGroup(y.tagwise) #also non-identical Have you got any recommendations in what should be changed in this manual calculation workflow? What am I doing wrong? Any help would be greatly appreciated. Sincerely, Koen. [[alternative HTML version deleted]]

edgeR edgeR • 4.3k views

ADD COMMENT • link updated 10.2 years ago by Perry Moerland ▴ 120 • written 10.2 years ago by Koen Van den Berge ▴ 180

0

Entering edit mode

Perry Moerland ▴ 120

@perry-moerland-1109

Last seen 2.1 years ago

Bioinformatics Laboratory, Academic Med…

Dear Koen, The source code of the function aveLogCPM (included in the package source of the current release version 3.4.2) shows that the function mglmOneGroup needs some additional parameters and other default values than the ones you used: prior.count <- 2 dispersion <- 0.05 # y is the matrix of the raw counts y <- as.matrix(y) if(is.null(lib.size)) lib.size <- colSums(y) prior.count.scaled <- lib.size/mean(lib.size) * prior.count offset <- log(lib.size+2*prior.count.scaled) abundance <- mglmOneGroup(t(t(y)+prior.count.scaled),dispersion=disper sion,offset=offset) (abundance+log(1e6)) / log(2) On a small test dataset this gives me the same values as the AveLogCPM component of a DGELRT object. Could you give this a try? best, Perry --- Perry Moerland, PhD Room J1B-215 Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics and Bioinformatics Academic Medical Center, University of Amsterdam Postbus 22660, 1100 DD Amsterdam, The Netherlands tel: +31 20 5666945 p.d.moerland at amc.uva.nl, http://www.bioinformaticslaboratory.nl/ -----Original Message----- From: bioconductor-bounces@r-project.org [mailto:bioconductor- bounces@r-project.org] On Behalf Of Koen Van den Berge Sent: Friday, January 31, 2014 1:08 PM To: bioconductor at r-project.org Subject: [BioC] EdgeR LogCpm and LogFC values calculation Dear Bioconductor mailing list, I am currently researching the differences in RNA-Seq data analysis, comparing the two well known EdgeR and Voom methods. However, there is one thing I can not manage to reproduce, namely the logCPM value in the output of the LRT table of EdgeR, after analyzing a certain contrast or coefficient. I understand from the manual and helpfile, that this logCPM value is a log2 counts per million, normalized for library sizes. I also understand that it is not a simple mean that is being used, but rather the mglmOneGroup() function. However, when I try to recalculate this myself, I can not obtain the same value. My workflow in doing so looks like this: #Calculate through the table counts <- read.delim("file.txt") Treat <- factor(rep(c(rep(c("Cont","DPN","OHT"),4)),each=2))[-19] #Delete removed sample 19 by authors y.s <- DGEList(counts=counts.filtered.smart,group=Treat) y.s <- calcNormFactors(y.s) y.common <- estimateGLMCommonDisp(y, design, verbose=TRUE) y.trended <- estimateGLMTrendedDisp(y.common, design) y.tagwise <- estimateGLMTagwiseDisp(y.trended, design,prior.df=22) lrt <- glmLRT(fit,coef= c(5,6,9,8)) lrt$table$logCPM #Calculate manually cpmstest <- cpm(y, normalized.lib.sizes = TRUE) LogCpmstest <- log(cpmstest + 0.5, base = 2) mglmOneGroup(LogCpmstest) #not identical mean(LogCpmstest) #not identical mglmOneGroup(y.tagwise) #also non-identical Have you got any recommendations in what should be changed in this manual calculation workflow? What am I doing wrong? Any help would be greatly appreciated. Sincerely, Koen. [[alternative HTML version deleted]] _______________________________________________ Bioconductor mailing list Bioconductor at r-project.org https://stat.ethz.ch/mailman/listinfo/bioconductor Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ________________________________ AMC Disclaimer : http://www.amc.nl/disclaimer

ADD COMMENT • link 10.2 years ago Perry Moerland ▴ 120

0

Entering edit mode

Dear Perry, Thank you for this clarification. I required this information as I wanted to plot the change in Log2 Cpm for my various DE genes. Would you then consider my prior way of Log2 Cpm calculation (first calculating cpm through cpm() function, then taking the log2) as wrong for plotting this certain result as it might not be a very correct way in dealing with the count data, or are it still log 2 cpm values, but simply calculated in an other fashion and maybe with a somewhat different interpretation? Likewise, starting from the code you have provided me with, would you consider abundance2 <- t(t(y)+prior.count.scaled) to be more correct information for plotting the log2 cpm values for DE genes; or are these values only relevant when combined with the mglmOneGroup() function for calculating the context specific average? Thank you in advance, Koen Van den Berge On 31 Jan 2014, at 14:13, P.D. Moerland <p.d.moerland at="" amc.uva.nl=""> wrote: > Dear Koen, > > The source code of the function aveLogCPM (included in the package source of the current release version 3.4.2) shows that the function mglmOneGroup needs some additional parameters and other default values than the ones you used: > > prior.count <- 2 > dispersion <- 0.05 > # y is the matrix of the raw counts > y <- as.matrix(y) > if(is.null(lib.size)) lib.size <- colSums(y) > prior.count.scaled <- lib.size/mean(lib.size) * prior.count > offset <- log(lib.size+2*prior.count.scaled) > abundance <- mglmOneGroup(t(t(y)+prior.count.scaled),dispersion=disp ersion,offset=offset) > (abundance+log(1e6)) / log(2) > > On a small test dataset this gives me the same values as the AveLogCPM component of a DGELRT object. Could you give this a try? > > best, > Perry > > --- > Perry Moerland, PhD > Room J1B-215 > Bioinformatics Laboratory, Department of Clinical Epidemiology, Biostatistics and Bioinformatics > Academic Medical Center, University of Amsterdam > Postbus 22660, 1100 DD Amsterdam, The Netherlands > tel: +31 20 5666945 > p.d.moerland at amc.uva.nl, http://www.bioinformaticslaboratory.nl/ > > > -----Original Message----- > From: bioconductor-bounces at r-project.org [mailto:bioconductor- bounces at r-project.org] On Behalf Of Koen Van den Berge > Sent: Friday, January 31, 2014 1:08 PM > To: bioconductor at r-project.org > Subject: [BioC] EdgeR LogCpm and LogFC values calculation > > Dear Bioconductor mailing list, > > I am currently researching the differences in RNA-Seq data analysis, comparing the two well known EdgeR and Voom methods. However, there is one thing I can not manage to reproduce, namely the logCPM value in the output of the LRT table of EdgeR, after analyzing a certain contrast or coefficient. I understand from the manual and helpfile, that this logCPM value is a log2 counts per million, normalized for library sizes. I also understand that it is not a simple mean that is being used, but rather the > mglmOneGroup() function. However, when I try to recalculate this myself, I can not obtain the same value. My workflow in doing so looks like this: > > #Calculate through the table > counts <- read.delim("file.txt") > Treat <- factor(rep(c(rep(c("Cont","DPN","OHT"),4)),each=2))[-19] > #Delete removed sample 19 by authors > > y.s <- DGEList(counts=counts.filtered.smart,group=Treat) > y.s <- calcNormFactors(y.s) > y.common <- estimateGLMCommonDisp(y, design, verbose=TRUE) y.trended <- estimateGLMTrendedDisp(y.common, design) y.tagwise <- estimateGLMTagwiseDisp(y.trended, design,prior.df=22) lrt <- glmLRT(fit,coef= c(5,6,9,8)) lrt$table$logCPM > > #Calculate manually > cpmstest <- cpm(y, normalized.lib.sizes = TRUE) LogCpmstest <- log(cpmstest + 0.5, base = 2) > mglmOneGroup(LogCpmstest) #not identical > > mean(LogCpmstest) #not identical > > > mglmOneGroup(y.tagwise) #also non-identical > > Have you got any recommendations in what should be changed in this manual calculation workflow? What am I doing wrong? > > Any help would be greatly appreciated. > > Sincerely, Koen. > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > ________________________________ > > AMC Disclaimer : http://www.amc.nl/disclaimer > > ________________________________

ADD REPLY • link 10.2 years ago Koen Van den Berge ▴ 180

Login before adding your answer.