edgeR: common vs tagwise dispersion
1
0
Entering edit mode
Ann Hess ▴ 340
@ann-hess-251
Last seen 10.2 years ago
I am using edgeR to look for differentially abundant ?segments? between two groups (data generated using high throughput sequencing). I have 3 (pooled) biological reps per group and a total of 18760 segments (83 rows with zero count are removed by edger). As a first approach, I used the common dispersion method and found the estimated common dispersion to be 0.135. After looking at the top 10 segments, I find that there tends to be a single large value (different for each segment) that is bringing up the logFC. I tried using moderated tagwise dispersion (using prior.n=50 and 25) and found that the results are largely the same as common dispersion approach (not shown). When I look at the tagwise dispersion values for the top 10 hits, I find that the estimated tagwise dispersion values are greater that the estimated common dispersion (not shown). To look into things further, I ran the same analysis but now with prior.n=0 (no moderation/squeezing). The top 10 hits are now completely different and the estimated tagwise dispersion values for the top 10 are very small. (Looking at the top 10 seems to suggest that I could use a Poisson distribution.) Questions: 1. Should I be concerned that the results are so different depending on whether common dispersion (almost equivalent to moderated tagwise dispersion) or no-moderation tagwise dispersion is used? Based on FDR<0.05, there is only about 10% overlap between the two approaches. 2. I?m not sure how to interpret the tagwise dispersion values for the top hits: common dispersion method picks up segments with large tagwise dispersion, no moderation method picks up segments with small tagwise dispersion. I am using edgeR_1.4.7 with R version 2.10.1. > #COMMON DISPERSION APPROACH > library(edgeR) > df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description) > df$samples group lib.size C1 0 4488940 C2 0 2437107 C3 0 2600316 T1 1 3935852 T2 1 3806079 T3 1 3913694 > df <- estimateCommonDisp(df) > df$common.dispersion [1] 0.1346658 > df.com<-exactTest(df) Comparison of groups: 1 ? 0 > CDtop10<-topTagsdf.com)$table > CDtop10[,-1] logConc logFC PValue FDR 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14 > CDtopIDs<-as.numeric(row.names(CDtop10)) > df$counts[CDtopIDs,] C1 C2 C3 T1 T2 T3 [1,] 31 36 28 45 57 22440 [2,] 77 45 61 22745 47 55 [3,] 49 68 85 76 210 21738 [4,] 35 3729 34 37 22 32 [5,] 28 69 3636 25 25 89 [6,] 4 2 3 7 25 668 [7,] 2 3 188 1 0 2 [8,] 51 8 23 301 296 1619 [9,] 12 10 652 14 13 11 [10,] 4 5 3 540 6 10 > #TAGWISE DISPERSION APPROACH > fprior <- estimateSmoothing(df) > fprior [1] 6329.643 #I also tried prior.n=25 and prior.n=50, but results not shown. > df<-estimateTagwiseDisp(df, prior.n = 0) > quantile(df$tagwise.dispersion) 0% 25% 50% 75% 100% 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02 > df.tgw<-exactTest(df,common.disp=FALSE) > TGWtop10<-topTags(df.tgw)$table > TGWtop10[,-1] logConc logFC PValue FDR 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08 > TGWtopIDs<-as.numeric(row.names(TGWtop10)) > df$counts[TGWtopIDs,] C1 C2 C3 T1 T2 T3 [1,] 685 346 337 340 340 313 [2,] 382 199 256 121 103 142 [3,] 97 59 55 29 36 35 [4,] 171 91 111 87 73 72 [5,] 54 24 35 12 8 14 [6,] 629 295 345 354 352 347 [7,] 138 58 84 41 47 38 [8,] 200 89 96 93 75 87 [9,] 231 138 157 149 115 128 [10,] 145 87 81 74 75 53 > df$tagwise.dispersion[TGWtopIDs] [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 > df$tagwise.dispersion[CDtopIDs] [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105 [7] 3.2246046 0.6054731 2.0582920 2.1059294
edgeR edgeR • 1.3k views
ADD COMMENT
0
Entering edit mode
Mark Robinson ★ 1.1k
@mark-robinson-2171
Last seen 10.2 years ago
Hi Ann. See comments below. On 2010-04-21, at 12:18 AM, Ann Hess wrote: > I am using edgeR to look for differentially abundant ?segments? > between two groups (data generated using high throughput sequencing). > I have 3 (pooled) biological reps per group and a total of 18760 > segments (83 rows with zero count are removed by edger). > > As a first approach, I used the common dispersion method and found the > estimated common dispersion to be 0.135. The common dispersion value shows substantial biological variation between your replicates, with a CV of nearly 40% in expression levels between samples. This is typical of what we have observed when the biological replicates are separate individuals or animals. > After looking at the top 10 > segments, I find that there tends to be a single large value > (different for each segment) that is bringing up the logFC. We've seen this in other datasets also. It seems to be a type of biological variation that RNA-seq makes very evident. > I tried using moderated tagwise dispersion (using prior.n=50 and 25) > and found that the results are largely the same as common dispersion > approach (not shown). The prior.n values you're choosing are basically too large. We generally recommend around 30-50 prior df. Since you have 4 residual df per segment, this translates to prior.n=10. You could even go a little lower, but not prior.n=0. > When I look at the tagwise dispersion values > for the top 10 hits, I find that the estimated tagwise dispersion > values are greater that the estimated common dispersion (not shown). > > To look into things further, I ran the same analysis but now with > prior.n=0 (no moderation/squeezing). The top 10 hits are now > completely different and the estimated tagwise dispersion values for > the top 10 are very small. (Looking at the top 10 seems to suggest > that I could use a Poisson distribution.) We don't recommend no moderation. There are just too few df to estimate the dispersion reliably for individual transcripts. The top segments in such a list will naturally tend to have small dispersions, because you're essentially selecting for this. It isn't necessarily evidence that the Poisson distribution is a good fit. > Questions: > 1. Should I be concerned that the results are so different depending > on whether common dispersion (almost equivalent to moderated tagwise > dispersion) or no-moderation tagwise dispersion is used? Based on > FDR<0.05, there is only about 10% overlap between the two approaches. No, this is to be expected. You are comparing the two extremes of dispersion estimation. The reason we developed moderated methods is that no-moderation is just not reliable. > 2. I?m not sure how to interpret the tagwise dispersion values for the > top hits: common dispersion method picks up segments with large > tagwise dispersion, no moderation method picks up segments with small > tagwise dispersion. > > I am using edgeR_1.4.7 with R version 2.10.1. You'd probably be better off choosing a smaller prior.n which gives you a more even handed compromise between common and tagwise dispersion. Hope that helps. Cheers, Mark >> #COMMON DISPERSION APPROACH >> library(edgeR) >> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description) >> df$samples > group lib.size > C1 0 4488940 > C2 0 2437107 > C3 0 2600316 > T1 1 3935852 > T2 1 3806079 > T3 1 3913694 >> df <- estimateCommonDisp(df) >> df$common.dispersion > [1] 0.1346658 >> df.com<-exactTest(df) > Comparison of groups: 1 ? 0 >> CDtop10<-topTagsdf.com)$table >> CDtop10[,-1] > logConc logFC PValue FDR > 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33 > 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28 > 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26 > 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24 > 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20 > 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18 > 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15 > 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15 > 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15 > 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14 >> CDtopIDs<-as.numeric(row.names(CDtop10)) >> df$counts[CDtopIDs,] > C1 C2 C3 T1 T2 T3 > [1,] 31 36 28 45 57 22440 > [2,] 77 45 61 22745 47 55 > [3,] 49 68 85 76 210 21738 > [4,] 35 3729 34 37 22 32 > [5,] 28 69 3636 25 25 89 > [6,] 4 2 3 7 25 668 > [7,] 2 3 188 1 0 2 > [8,] 51 8 23 301 296 1619 > [9,] 12 10 652 14 13 11 > [10,] 4 5 3 540 6 10 > > >> #TAGWISE DISPERSION APPROACH >> fprior <- estimateSmoothing(df) >> fprior > [1] 6329.643 > > #I also tried prior.n=25 and prior.n=50, but results not shown. >> df<-estimateTagwiseDisp(df, prior.n = 0) >> quantile(df$tagwise.dispersion) > 0% 25% 50% 75% 100% > 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02 >> df.tgw<-exactTest(df,common.disp=FALSE) >> TGWtop10<-topTags(df.tgw)$table >> TGWtop10[,-1] > logConc logFC PValue FDR > 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21 > 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18 > 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11 > 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11 > 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11 > 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11 > 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09 > 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09 > 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09 > 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08 >> TGWtopIDs<-as.numeric(row.names(TGWtop10)) >> df$counts[TGWtopIDs,] > C1 C2 C3 T1 T2 T3 > [1,] 685 346 337 340 340 313 > [2,] 382 199 256 121 103 142 > [3,] 97 59 55 29 36 35 > [4,] 171 91 111 87 73 72 > [5,] 54 24 35 12 8 14 > [6,] 629 295 345 354 352 347 > [7,] 138 58 84 41 47 38 > [8,] 200 89 96 93 75 87 > [9,] 231 138 157 149 115 128 > [10,] 145 87 81 74 75 53 > >> df$tagwise.dispersion[TGWtopIDs] > [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 > [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 >> df$tagwise.dispersion[CDtopIDs] > [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105 > [7] 3.2246046 0.6054731 2.0582920 2.1059294 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}
ADD COMMENT

Login before adding your answer.

Traffic: 580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6