Question: edgeR and sagenhaft
0
10.4 years ago by
Naomi Altman6.0k
Naomi Altman6.0k wrote:
I have 4 large tag datasets A1, A2 and B1, B2. The purpose of the experiment was to determine differences in gene expression between A and B. A1 and B1 were done together as batch 1, and A2 and B2 were done together as batch 2. I several analyses and am completely puzzled. First I ran sage.test (Fisher's exact test) on A1, B1 and on A2, B2. The results were strongly concordant in that there was a lot of overlap in the significant gene list, and the same genes were up/down regulated (on the whole). Then I ran edgeR on all 4 samples. A large number of genes were declared significantly differentially expressed, but it was almost completely disjoint from the genes "found" by sage.test. (Fewer than 10 out of 4000). The $r$ values were strongly clustered around 2, although some were huge. Incidentally, the "exact" component of the output does not seem to be described in ?edgeR, but I understand it to be the p-value from the test. Then I tested for batch effects by using sage.test on A1, A2 and on B1, B2 and finally on A1 U B1 and A2 U B2. A fairly large number of genes showed strong batch effects. These overlapped more with the genotype within batch sage.test results than with the edgeR results. Just to make things more confusing, the grad student who ran the samples used the normal approximation to the Poisson to test genotype effects within batch. These were highly concordant between batches as well, but did not match the sage.test results. I thought the p-values would be similar at least for genes with large counts, but they were not. I am inclined to go with combining the sage.test results, but any advice would be very welcome Thanks, Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111
sage go edger • 555 views
modified 10.4 years ago • written 10.4 years ago by Naomi Altman6.0k
0
10.4 years ago by
Mark Robinson1.1k
Mark Robinson1.1k wrote:
Hi Naomi. Curious. A bit difficult to diagnose without digging into it. There is probably a reasonable explanation for all of this. For what its worth, a few comments/queries below. > I have 4 large tag datasets A1, A2 and B1, B2. The purpose of the > experiment was to determine differences in gene expression between A and > B. > A1 and B1 were done together as batch 1, and A2 and B2 were done > together as batch 2. First question: are these technical replicates or biological? If technical, you may consider the 'doPoisson=TRUE' option of deDGE() since that effectively sets r large (dispersion small), making it a Poisson calculation. > I several analyses and am completely puzzled. > > First I ran sage.test (Fisher's exact test) on A1, B1 and on A2, > B2. The results were strongly concordant in that there was a lot of > overlap in the significant gene list, > and the same genes were up/down regulated (on the whole). > > Then I ran edgeR on all 4 samples. A large number of genes were > declared significantly differentially expressed, but it was almost > completely disjoint from the genes "found" by sage.test. (Fewer than > 10 out of 4000). The $r$ values were strongly clustered around 2, > although some were huge. Incidentally, the "exact" component of the > output does not seem to be described in ?edgeR, but I understand it > to be the p-value from the test. 'r' values around 2 suggest there is significant variation over and above Poisson. But, maybe this is due to batch effects. Indeed, the 'exact' element is the p-value from the exact test proposed in the paper. What do you use for 'lib.size' -- total number of reads? Are they drastically different from batch-to-batch/sample-to-sample? How do the batch effects manifest -- more total reads giving higher overall counts, or something different? > Then I tested for batch effects by using sage.test on A1, A2 and on > B1, B2 and finally on A1 U B1 and A2 U B2. A fairly large number of > genes showed strong batch effects. These overlapped more with the > genotype within batch sage.test results than with the edgeR results. Strong batch effects that aren't explained by total counts would result in higher dispersion estimates (lower values of 'r') in edgeR, thus giving fewer DE genes. So, maybe this explains some of the lower overlap here. > Just to make things more confusing, the grad student who ran the > samples used the normal approximation to the Poisson to test genotype > effects within batch. These > were highly concordant between batches as well, but did not match the > sage.test results. I thought the p-values would be similar at least > for genes with large counts, but they were not. > > I am inclined to go with combining the sage.test results, but any > advice would be very welcome Not sure I've really contributed much, but there must be a reasonable explanation. Mark > > Thanks, > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 (Statistics) > University Park, PA 16802-2111 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: > http://news.gmane.org/gmane.science.biology.informatics.conductor >
0
10.4 years ago by
Naomi Altman6.0k
Naomi Altman6.0k wrote:
Here is some of the discrepant output. A1, A2, B1, B2 are the 4 samples with tag counts. My commands were y=cbind(A1,B1,A2,B2) d<- DGEList(data = y, group = c(1,2,1,2), lib.size = apply(y,2,sum)) alpha <- alpha.approxeb(d) ms <- deDGE(d, alpha = alpha$alpha) Here are the first 5 cases where both samples had Fisher's exact test p-values <.001 and the edgeR exact statistic >.05. These are raw p-values. A1 B1 A2 B2 Sagenhaft p-value A1 vs B1 Sagenhaft p-value A1 vs B1 ms$exact 46 5 39 12 7.546362e-11 3.605042e-06 0.76524756 33 4 45 13 1.128468e-07 3.791248e-07 0.97325389 0 55 13 49 1.070343e-15 1.668152e-04 0.08834922 69 179 92 526 2.088333e-09 3.816029e-55 0.56544378 109 36 122 63 3.988028e-12 4.556175e-09 0.86574462 Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111 [[alternative HTML version deleted]]
What are the library sizes? On 19/02/2009, at 3:06 PM, Naomi Altman wrote: > Here is some of the discrepant output. A1, A2, B1, B2 are the 4 > samples with tag counts. > > My commands were > y=cbind(A1,B1,A2,B2) > d<- DGEList(data = y, group = c(1,2,1,2), lib.size = apply(y,2,sum)) > alpha <- alpha.approxeb(d) > ms <- deDGE(d, alpha = alpha$alpha) > > Here are the first 5 cases where both samples had Fisher's exact test > p-values <.001 and the edgeR exact statistic >.05. These > are raw p-values. > > > A1 B1 A2 B2 > Sagenhaft p-value A1 vs B1 Sagenhaft p-value A1 vs B1 ms >$exact > 46 5 39 12 > 7.546362e-11 3.605042e-06 > 0.76524756 > 33 4 45 13 > 1.128468e-07 3.791248e-07 > 0.97325389 > 0 55 13 49 > 1.070343e-15 1.668152e-04 > 0.08834922 > 69 179 92 526 > 2.088333e-09 3.816029e-55 > 0.56544378 > 109 36 122 63 > 3.988028e-12 4.556175e-09 > 0.86574462 > > > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 > (Statistics) > University Park, PA 16802-2111 > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852
A1 B1 A2 B2 1376421 1577198 1948700 2448499 At 11:48 PM 2/18/2009, Mark Robinson wrote: >What are the library sizes? > > >On 19/02/2009, at 3:06 PM, Naomi Altman wrote: > >>Here is some of the discrepant output. A1, A2, B1, B2 are the 4 >>samples with tag counts. >> >>My commands were >>y=cbind(A1,B1,A2,B2) >>d<- DGEList(data = y, group = c(1,2,1,2), lib.size = apply(y,2,sum)) >>alpha <- alpha.approxeb(d) >>ms <- deDGE(d, alpha = alpha$alpha) >> >>Here are the first 5 cases where both samples had Fisher's exact test >>p-values <.001 and the edgeR exact statistic >.05. These >>are raw p-values. >> >> >> A1 B1 A2 B2 >>Sagenhaft p-value A1 vs B1 Sagenhaft p-value A1 vs B1 ms$exact >> 46 5 39 12 >> 7.546362e-11 3.605042e-06 >> 0.76524756 >> 33 4 45 13 >> 1.128468e-07 3.791248e-07 >> 0.97325389 >> 0 55 13 49 >> 1.070343e-15 1.668152e-04 >> 0.08834922 >> 69 179 92 526 >>2.088333e-09 3.816029e-55 >> 0.56544378 >> 109 36 122 63 >>3.988028e-12 4.556175e-09 >> 0.86574462 >> >> >> >>Naomi S. Altman 814-865-3791 (voice) >>Associate Professor >>Dept. of Statistics 814-863-7114 (fax) >>Penn State University 814-865-1348 >>(Statistics) >>University Park, PA 16802-2111 >> >> [[alternative HTML version deleted]] >> >>_______________________________________________ >>Bioconductor mailing list >>Bioconductor at stat.math.ethz.ch >>https://stat.ethz.ch/mailman/listinfo/bioconductor >>Search the archives: >>http://news.gmane.org/gmane.science.biology.informatics.conductor > >------------------------------ >Mark Robinson >Epigenetics Laboratory, Garvan >Bioinformatics Division, WEHI >e: m.robinson at garvan.org.au >e: mrobinson at wehi.edu.au >p: +61 (0)3 9345 2628 >f: +61 (0)3 9347 0852 >------------------------------ > > > > Naomi S. Altman 814-865-3791 (voice) Associate Professor Dept. of Statistics 814-863-7114 (fax) Penn State University 814-865-1348 (Statistics) University Park, PA 16802-2111
Hi Naomi. Thanks for this. Well, maybe this isn't too surprising. The sagenhaft p-value does not even attempt to take into account that there is variability between the replicates. edgeR does and I believe this is the source of the discrepancy. These genes all have some amount of variability that based on the dispersion estimates (your phi=2?), is too large for these genes to be considered significantly different. A couple other considerations: 1. When you built your DGEList, did you do any row filtering? I typically remove any rows in the table that have <k total="" counts="" (say="" k="3)" since="" those="" will="" have="" very="" little="" information.="" usually="" i="" keep="" the="" library="" sizes="" fixed="" from="" the="" outset.="" the="" reason="" i="" ask="" is="" maybe="" there="" are="" a="" lot="" of="" these="" rows="" and="" i="" don't="" know="" how="" much="" this="" influences="" the="" dispersion="" smoothing.="" you="" could="" think="" of="" this="" as="" including="" a="" whole="" bunch="" of="" empty="" spots="" in="" a="" limma="" analysis,="" which="" you="" probably="" wouldn't="" want="" to="" do="" ...="" 2.="" you="" could="" also="" look="" at="" the="" 'pseudo'="" element="" of="" your="" 'ms'="" dedgelist="" object.="" this="" is="" the="" table="" that="" the="" exact="" test="" operates="" on.="" we="" can="" talk="" more="" offline.="" hth.="" mark=""> A1 B1 A2 B2 > 1376421 1577198 1948700 2448499 > > > > At 11:48 PM 2/18/2009, Mark Robinson wrote: > >>What are the library sizes? >> >> >>On 19/02/2009, at 3:06 PM, Naomi Altman wrote: >> >>>Here is some of the discrepant output. A1, A2, B1, B2 are the 4 >>>samples with tag counts. >>> >>>My commands were >>>y=cbind(A1,B1,A2,B2) >>>d<- DGEList(data = y, group = c(1,2,1,2), lib.size = apply(y,2,sum)) >>>alpha <- alpha.approxeb(d) >>>ms <- deDGE(d, alpha = alpha$alpha) >>> >>>Here are the first 5 cases where both samples had Fisher's exact test >>>p-values <.001 and the edgeR exact statistic >.05. These >>>are raw p-values. >>> >>> >>> A1 B1 A2 B2 >>>Sagenhaft p-value A1 vs B1 Sagenhaft p-value A1 vs B1 ms >>>$exact >>> 46 5 39 12 >>> 7.546362e-11 3.605042e-06 >>> 0.76524756 >>> 33 4 45 13 >>> 1.128468e-07 3.791248e-07 >>> 0.97325389 >>> 0 55 13 49 >>> 1.070343e-15 1.668152e-04 >>> 0.08834922 >>> 69 179 92 526 >>>2.088333e-09 3.816029e-55 >>> 0.56544378 >>> 109 36 122 63 >>>3.988028e-12 4.556175e-09 >>> 0.86574462 >>> >>> >>> >>>Naomi S. Altman 814-865-3791 (voice) >>>Associate Professor >>>Dept. of Statistics 814-863-7114 (fax) >>>Penn State University 814-865-1348 >>>(Statistics) >>>University Park, PA 16802-2111 >>> >>> [[alternative HTML version deleted]] >>> >>>_______________________________________________ >>>Bioconductor mailing list >>>Bioconductor at stat.math.ethz.ch >>>https://stat.ethz.ch/mailman/listinfo/bioconductor >>>Search the archives: >>>http://news.gmane.org/gmane.science.biology.informatics.conductor >> >>------------------------------ >>Mark Robinson >>Epigenetics Laboratory, Garvan >>Bioinformatics Division, WEHI >>e: m.robinson at garvan.org.au >>e: mrobinson at wehi.edu.au >>p: +61 (0)3 9345 2628 >>f: +61 (0)3 9347 0852 >>------------------------------ >> >> >> >> > > Naomi S. Altman 814-865-3791 (voice) > Associate Professor > Dept. of Statistics 814-863-7114 (fax) > Penn State University 814-865-1348 (Statistics) > University Park, PA 16802-2111 > >