Question about median of replicates


Sandra Fernandez Moya ▴ 20

@sandra-fernandez-moya-6682

Last seen 9.6 years ago

Hello,

I have an urgent question: we are about to submit a paper in a few hours and have just realized that we may have an error, so I want to check with the experts beforehand.

I used edgeR to compare 2 groups, Control and Group1, with 3 samples in total (Control: 1, Group1: 2). I followed the basic protocol, because I don't know much about this kind of analysis, and I got a final logFC that makes sense and that we also verified in the lab. Now the referees have asked whether the logFC was computed from the normalized Group1 data using the mean of both replicates. We realized that it does not seem to be the mean; instead, the software seems to use the sum of the counts from each Group1 sample for the analysis.

Maybe I did something wrong, but can you confirm this? It shouldn't be the case, but does edgeR sum the counts of the replicates and use that sum for the analysis?

Thanks a lot, I look forward to your answer!

Sandra

edgeR edgeR • 1.2k views

ADD COMMENT • link 9.7 years ago Sandra Fernandez Moya ▴ 20


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi,

On Wed, Jul 30, 2014 at 3:44 PM, Sandra Fernandez Moya <dedeusan at hotmail.com> wrote:

> Hello, I have a very important question, because we are going to submit a paper in a few hours and now we realize that maybe we have an error, so I want to check with experts beforehand. I used edgeR for a comparison of 2 groups, Control and Group1, and 3 samples, Control: 1 and Group1: 2. I followed the basic protocol, because I don't know much about this analysis, and I get a final logFC that makes sense and that we also checked in the lab. But now the referees asked if the logFC was from the normalized Group1 data and the mean of both, and we realize that it was not the mean but seems to be the sum of the counts from each sample of Group1 that the software takes for the analysis.

I don't follow what you are saying here.

> Maybe I did something wrong, but can you confirm this? It shouldn't be, but does edgeR sum the counts of each replicate and use it for the analysis? Thanks a lot, and I wait for the answer! Sandra

We can't say much without looking at your code and data.

In edgeR you construct a DGEList to hold your data and run the analysis on. Let's say this DGEList is called `dlist` in your code. What is the output of:

    R> dlist$samples

That way we know what your experimental design looks like. Then please provide all the code you used to do the analysis, and finally the output of `sessionInfo()`.

-steve

--
Steve Lianoglou
Computational Biologist
Genentech

ADD COMMENT • link 9.7 years ago Steve Lianoglou ★ 13k


James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 4 hours ago

United States

Hi Sandra,

The logFC is the coefficient from your model estimating the difference in logCPM between groups. It will not be the sum of the counts from each group. It will be rather close to the difference between the mean log counts per million (logCPM) of the two groups, which you can compute from your data, but not exactly equal to it. This is because the coefficients are estimated internally by edgeR and cannot be computed directly.

As an example, I ran the example for glmFit (just to get some data), and then modified it slightly to conform to your experiment:

    example(glmFit)
    # y (the count matrix) and dispersion.true come from the glmFit example,
    # adapted here to 6 libraries
    nlibs <- 6
    # two groups of three libraries each
    x <- factor(rep(1:2, each=3), labels = c("Trt","Cont"))
    design <- model.matrix(~x)
    d <- DGEList(y)
    d <- calcNormFactors(d)
    fit <- glmFit(d, design, dispersion=dispersion.true)
    # test the second coefficient (xCont), i.e. Cont vs Trt
    results <- glmLRT(fit, coef=2)
    topTags(results)

    Coefficient:  xCont
               logFC   logCPM        LR       PValue        FDR
    Gene60 -2.510450 13.90319 11.493249 0.0006984944 0.06984944
    Gene95 -2.006865 13.82370  7.636606 0.0057195447 0.27359986
    Gene18  2.191870 13.56029  6.987521 0.0082079958 0.27359986
    Gene23 -1.873228 13.74293  6.450792 0.0110902864 0.27725716

Then we can compute the difference between the mean logCPM of the two groups for the top gene (Gene60):

    z <- rowMeans(cpm(d, log=TRUE)[,4:6]) - rowMeans(cpm(d, log=TRUE)[,1:3])
    z[60]

       Gene60
    -2.797037

So you can see that the value I get when I compute by hand is close to the value reported by edgeR, but not the same. This is because there is no closed-form solution for the model we are fitting (i.e., you can't just calculate the answer by hand), so the coefficients have to be estimated iteratively.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
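The question of summing versus averaging replicates can also be illustrated without edgeR. The following is a minimal base R sketch with made-up numbers; `gene_counts`, `lib_sizes` and `cpm_simple` are hypothetical names, and `cpm_simple` is only a rough stand-in for edgeR's `cpm()` (no normalization factors, no prior count). It compares a naive log2 fold change from summed raw counts with two hand calculations based on counts per million. None of these is what edgeR reports; they only show how far apart the hand calculations themselves can be.

```r
# Hypothetical counts for one gene and library sizes for 3 samples:
# one Control and two Group1 replicates (made-up numbers).
gene_counts <- c(Control = 100, Group1_a = 180, Group1_b = 260)
lib_sizes   <- c(Control = 1e6, Group1_a = 1e6, Group1_b = 2e6)

# Simple counts per million for each sample.
cpm_simple <- gene_counts / lib_sizes * 1e6

# (a) Naive log2 fold change from the summed raw Group1 counts vs Control.
lfc_sum <- log2(sum(gene_counts[2:3]) / gene_counts[[1]])

# (b) log2 fold change from the mean CPM of the Group1 replicates.
lfc_mean_cpm <- log2(mean(cpm_simple[2:3]) / cpm_simple[[1]])

# (c) Difference of mean log2-CPM, the same kind of hand calculation
#     as the z[60] value above.
lfc_mean_log <- mean(log2(cpm_simple[2:3])) - log2(cpm_simple[[1]])

print(c(sum = lfc_sum, mean_cpm = lfc_mean_cpm, mean_log = lfc_mean_log))
```

If the two replicates had equal library sizes, (a) would exceed (b) by exactly log2(2) = 1, so a fold change built from summed counts of two replicates would be inflated by about one log2 unit relative to one built from their mean.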

ADD COMMENT • link 9.7 years ago James W. MacDonald 65k


Sandra Fernandez Moya ▴ 20

@sandra-fernandez-moya-6682

Last seen 9.6 years ago

Dear Steve,

Thanks for your fast reply. About the CC, sorry, I didn't realize. I will repeat the analysis now with the new information you provided.

We realized that the number of DE genes was too high, so for the later analysis we only take into account the highest ones.

Also, the exons are not properly named. This organism has no introns, so there is no need to analyze splice variants. For some reason, whoever made the GFF file that I used for the alignment labeled the features this way, but they are considered genes.

About edgeR: the analysis was done some time ago, so I will do it again with the new version.

Again, thank you for your patience and the long reply.

Sandra

> From: lianoglou.steve@gene.com
> To: dedeusan@hotmail.com
> Subject: Re: [BioC] Question about median of replicates
> Date: Thu, 31 Jul 2014 14:19:29 -0700
>
> Sandra,
>
> First, as I mentioned previously: PLEASE include the bioconductor list when seeking and replying to help. This way others can benefit from the help, and also help you better than I can.
>
> I would normally just CC the list in this reply, but I won't here.
>
> Your analysis looks (more or less) correct, but:
>
> (1) you have missed an "estimateTagwiseDisp" call after your "estimateCommonDisp"
>
> (2) one normally filters out rows by logCPM, and not by the raw counts
>
> (3) you *still* haven't provided your sessionInfo so we can verify you are using the latest versions of the software, but if you aren't -- you should upgrade.
>
> You end with asking:
>
> > So that is why we asked ourselves what the basics of edgeR are, because now all our data are at least 1 fold less than before, but the most important thing is that we still don't know why. So I was relieved, because I think it is not such a big deal and the analysis is giving real results, but I still don't understand why this difference exists. Can you give me a short explanation if possible? Maybe I put something wrong in the analysis... Sandra
>
> One thing to understand is that edgeR (or DESeq2, or limma) is not basic, so it's hard to understand "the basics" without a certain degree of statistical sophistication.
>
> I didn't quite follow the math example that you provided as the formatting came through weird, so I'm not sure what the "1 logFC difference" you are describing is.
>
> James' email to you outlined an easy example (with real/simulated data that you can generate using the same code he wrote in his email) of how the logFCs calculated by edgeR will be different than those you calculate by hand -- simply because you can't just calculate it by hand.
>
> The details of why don't really matter for you. edgeR is a widely used piece of software written by card-carrying statisticians and published under peer review, so as far as you should be concerned, its results are correct as long as you perform the analysis correctly.
>
> I'll end with just pointing out that the number of genes you are identifying as DE is quite high, so if it were me, I'd be suspicious of something and double check lots of things.
>
> Also, I just reviewed your code again and it looks like you are counting exon expression instead of gene expression? If this is the case you should have mentioned this from the get go! And also you are doing it wrong. You can try to use edgeR::spliceVariants or the DEXSeq package.
>
> -steve
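Point (2) in the quoted reply, filtering on CPM rather than raw counts, can be sketched in base R. This is only an illustration with made-up data: the matrix `counts`, the library sizes and the cutoff of 1 CPM in at least 2 samples are all assumptions, not values from this thread. In an actual edgeR analysis, `cpm()` applied to the DGEList would take the place of the manual scaling shown here.

```r
# Made-up counts: 4 genes x 3 samples (1 Control, 2 Group1 replicates).
counts <- matrix(c( 0,  1,   2,
                    5,  6,  30,
                   50, 40, 200,
                    3,  0,   1),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4),
                                 c("Control", "Group1_a", "Group1_b")))

# Hypothetical total library sizes; the third sample is sequenced 5x deeper.
lib_sizes <- c(1e6, 1e6, 5e6)

# Counts per million: divide each column by its library size.
cpm_mat <- t(t(counts) / lib_sizes) * 1e6

# Filter on raw counts: the result depends on sequencing depth.
keep_raw <- rowSums(counts) >= 3

# Filter on CPM: keep genes above 1 CPM in at least 2 samples.
keep_cpm <- rowSums(cpm_mat > 1) >= 2
filtered <- counts[keep_cpm, , drop = FALSE]

print(keep_raw)
print(filtered)
```

In this toy example the raw-count rule keeps every gene, including ones that are barely detected, while the CPM rule keeps only the genes expressed at a comparable level across libraries of different depth.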

ADD COMMENT • link 9.7 years ago Sandra Fernandez Moya ▴ 20
