edgeR: common vs tagwise dispersion

0

Entering edit mode

Ann Hess ▴ 340

@ann-hess-251

Last seen 11.4 years ago

I am using edgeR to look for differentially abundant ?segments? between two groups (data generated using high throughput sequencing). I have 3 (pooled) biological reps per group and a total of 18760 segments (83 rows with zero count are removed by edger). As a first approach, I used the common dispersion method and found the estimated common dispersion to be 0.135. After looking at the top 10 segments, I find that there tends to be a single large value (different for each segment) that is bringing up the logFC. I tried using moderated tagwise dispersion (using prior.n=50 and 25) and found that the results are largely the same as common dispersion approach (not shown). When I look at the tagwise dispersion values for the top 10 hits, I find that the estimated tagwise dispersion values are greater that the estimated common dispersion (not shown). To look into things further, I ran the same analysis but now with prior.n=0 (no moderation/squeezing). The top 10 hits are now completely different and the estimated tagwise dispersion values for the top 10 are very small. (Looking at the top 10 seems to suggest that I could use a Poisson distribution.) Questions: 1. Should I be concerned that the results are so different depending on whether common dispersion (almost equivalent to moderated tagwise dispersion) or no-moderation tagwise dispersion is used? Based on FDR<0.05, there is only about 10% overlap between the two approaches. 2. I?m not sure how to interpret the tagwise dispersion values for the top hits: common dispersion method picks up segments with large tagwise dispersion, no moderation method picks up segments with small tagwise dispersion. I am using edgeR_1.4.7 with R version 2.10.1. > #COMMON DISPERSION APPROACH > library(edgeR) > df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description) > df$samples group lib.size C1 0 4488940 C2 0 2437107 C3 0 2600316 T1 1 3935852 T2 1 3806079 T3 1 3913694 > df <- estimateCommonDisp(df) > df$common.dispersion [1] 0.1346658 > df.com<-exactTest(df) Comparison of groups: 1 ? 0 > CDtop10<-topTagsdf.com)$table > CDtop10[,-1] logConc logFC PValue FDR 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14 > CDtopIDs<-as.numeric(row.names(CDtop10)) > df$counts[CDtopIDs,] C1 C2 C3 T1 T2 T3 [1,] 31 36 28 45 57 22440 [2,] 77 45 61 22745 47 55 [3,] 49 68 85 76 210 21738 [4,] 35 3729 34 37 22 32 [5,] 28 69 3636 25 25 89 [6,] 4 2 3 7 25 668 [7,] 2 3 188 1 0 2 [8,] 51 8 23 301 296 1619 [9,] 12 10 652 14 13 11 [10,] 4 5 3 540 6 10 > #TAGWISE DISPERSION APPROACH > fprior <- estimateSmoothing(df) > fprior [1] 6329.643 #I also tried prior.n=25 and prior.n=50, but results not shown. > df<-estimateTagwiseDisp(df, prior.n = 0) > quantile(df$tagwise.dispersion) 0% 25% 50% 75% 100% 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02 > df.tgw<-exactTest(df,common.disp=FALSE) > TGWtop10<-topTags(df.tgw)$table > TGWtop10[,-1] logConc logFC PValue FDR 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08 > TGWtopIDs<-as.numeric(row.names(TGWtop10)) > df$counts[TGWtopIDs,] C1 C2 C3 T1 T2 T3 [1,] 685 346 337 340 340 313 [2,] 382 199 256 121 103 142 [3,] 97 59 55 29 36 35 [4,] 171 91 111 87 73 72 [5,] 54 24 35 12 8 14 [6,] 629 295 345 354 352 347 [7,] 138 58 84 41 47 38 [8,] 200 89 96 93 75 87 [9,] 231 138 157 149 115 128 [10,] 145 87 81 74 75 53 > df$tagwise.dispersion[TGWtopIDs] [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 > df$tagwise.dispersion[CDtopIDs] [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105 [7] 3.2246046 0.6054731 2.0582920 2.1059294

edgeR edgeR • 1.5k views

ADD COMMENT • link updated 15.7 years ago by Mark Robinson ★ 1.1k • written 15.8 years ago by Ann Hess ▴ 340

0

Entering edit mode

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 11.4 years ago

Hi Ann. See comments below. On 2010-04-21, at 12:18 AM, Ann Hess wrote: > I am using edgeR to look for differentially abundant ?segments? > between two groups (data generated using high throughput sequencing). > I have 3 (pooled) biological reps per group and a total of 18760 > segments (83 rows with zero count are removed by edger). > > As a first approach, I used the common dispersion method and found the > estimated common dispersion to be 0.135. The common dispersion value shows substantial biological variation between your replicates, with a CV of nearly 40% in expression levels between samples. This is typical of what we have observed when the biological replicates are separate individuals or animals. > After looking at the top 10 > segments, I find that there tends to be a single large value > (different for each segment) that is bringing up the logFC. We've seen this in other datasets also. It seems to be a type of biological variation that RNA-seq makes very evident. > I tried using moderated tagwise dispersion (using prior.n=50 and 25) > and found that the results are largely the same as common dispersion > approach (not shown). The prior.n values you're choosing are basically too large. We generally recommend around 30-50 prior df. Since you have 4 residual df per segment, this translates to prior.n=10. You could even go a little lower, but not prior.n=0. > When I look at the tagwise dispersion values > for the top 10 hits, I find that the estimated tagwise dispersion > values are greater that the estimated common dispersion (not shown). > > To look into things further, I ran the same analysis but now with > prior.n=0 (no moderation/squeezing). The top 10 hits are now > completely different and the estimated tagwise dispersion values for > the top 10 are very small. (Looking at the top 10 seems to suggest > that I could use a Poisson distribution.) We don't recommend no moderation. There are just too few df to estimate the dispersion reliably for individual transcripts. The top segments in such a list will naturally tend to have small dispersions, because you're essentially selecting for this. It isn't necessarily evidence that the Poisson distribution is a good fit. > Questions: > 1. Should I be concerned that the results are so different depending > on whether common dispersion (almost equivalent to moderated tagwise > dispersion) or no-moderation tagwise dispersion is used? Based on > FDR<0.05, there is only about 10% overlap between the two approaches. No, this is to be expected. You are comparing the two extremes of dispersion estimation. The reason we developed moderated methods is that no-moderation is just not reliable. > 2. I?m not sure how to interpret the tagwise dispersion values for the > top hits: common dispersion method picks up segments with large > tagwise dispersion, no moderation method picks up segments with small > tagwise dispersion. > > I am using edgeR_1.4.7 with R version 2.10.1. You'd probably be better off choosing a smaller prior.n which gives you a more even handed compromise between common and tagwise dispersion. Hope that helps. Cheers, Mark >> #COMMON DISPERSION APPROACH >> library(edgeR) >> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description) >> df$samples > group lib.size > C1 0 4488940 > C2 0 2437107 > C3 0 2600316 > T1 1 3935852 > T2 1 3806079 > T3 1 3913694 >> df <- estimateCommonDisp(df) >> df$common.dispersion > [1] 0.1346658 >> df.com<-exactTest(df) > Comparison of groups: 1 ? 0 >> CDtop10<-topTagsdf.com)$table >> CDtop10[,-1] > logConc logFC PValue FDR > 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33 > 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28 > 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26 > 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24 > 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20 > 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18 > 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15 > 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15 > 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15 > 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14 >> CDtopIDs<-as.numeric(row.names(CDtop10)) >> df$counts[CDtopIDs,] > C1 C2 C3 T1 T2 T3 > [1,] 31 36 28 45 57 22440 > [2,] 77 45 61 22745 47 55 > [3,] 49 68 85 76 210 21738 > [4,] 35 3729 34 37 22 32 > [5,] 28 69 3636 25 25 89 > [6,] 4 2 3 7 25 668 > [7,] 2 3 188 1 0 2 > [8,] 51 8 23 301 296 1619 > [9,] 12 10 652 14 13 11 > [10,] 4 5 3 540 6 10 > > >> #TAGWISE DISPERSION APPROACH >> fprior <- estimateSmoothing(df) >> fprior > [1] 6329.643 > > #I also tried prior.n=25 and prior.n=50, but results not shown. >> df<-estimateTagwiseDisp(df, prior.n = 0) >> quantile(df$tagwise.dispersion) > 0% 25% 50% 75% 100% > 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02 >> df.tgw<-exactTest(df,common.disp=FALSE) >> TGWtop10<-topTags(df.tgw)$table >> TGWtop10[,-1] > logConc logFC PValue FDR > 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21 > 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18 > 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11 > 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11 > 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11 > 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11 > 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09 > 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09 > 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09 > 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08 >> TGWtopIDs<-as.numeric(row.names(TGWtop10)) >> df$counts[TGWtopIDs,] > C1 C2 C3 T1 T2 T3 > [1,] 685 346 337 340 340 313 > [2,] 382 199 256 121 103 142 > [3,] 97 59 55 29 36 35 > [4,] 171 91 111 87 73 72 > [5,] 54 24 35 12 8 14 > [6,] 629 295 345 354 352 347 > [7,] 138 58 84 41 47 38 > [8,] 200 89 96 93 75 87 > [9,] 231 138 157 149 115 128 > [10,] 145 87 81 74 75 53 > >> df$tagwise.dispersion[TGWtopIDs] > [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 > [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001 >> df$tagwise.dispersion[CDtopIDs] > [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105 > [7] 3.2246046 0.6054731 2.0582920 2.1059294 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at stat.math.ethz.ch > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: m.robinson at garvan.org.au e: mrobinson at wehi.edu.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD COMMENT • link 15.7 years ago Mark Robinson ★ 1.1k

Login before adding your answer.