Entering edit mode
Dear Ann,
On 2010-04-21, at 12:18 AM, Ann Hess wrote:
> I am using edgeR to look for differentially abundant segments
between
> two groups (data generated using high throughput sequencing). I have
3
> (pooled) biological reps per group and a total of 18760 segments (83
> rows with zero count are removed by edger).
>
> As a first approach, I used the common dispersion method and found
the
> estimated common dispersion to be 0.135.
Your common dispersion value shows substantial biological variation
between your replicates, with a CV of nearly 40% in expression levels
between samples. This is typical of what we have observed when the
biological replicates are separate individuals or animals.
> After looking at the top 10 segments, I find that there tends to be
a
> single large value (different for each segment) that is bringing up
the
> logFC.
We've seen this in other datasets also. It seems to be a type of
biological variation that RNA-seq makes very evident.
> I tried using moderated tagwise dispersion (using prior.n=50 and 25)
and
> found that the results are largely the same as common dispersion
> approach (not shown).
The prior.n values you're choosing are basically too large. We
generally
recommend around 30-50 prior df. Since you have 4 residual df per
segment, this translates to prior.n=10. You could even go a little
lower,
but not prior.n=0.
> When I look at the tagwise dispersion values for the top 10 hits, I
find
> that the estimated tagwise dispersion values are greater that the
> estimated common dispersion (not shown).
>
> To look into things further, I ran the same analysis but now with
> prior.n=0 (no moderation/squeezing). The top 10 hits are now
completely
> different and the estimated tagwise dispersion values for the top 10
are
> very small. (Looking at the top 10 seems to suggest that I could
use a
> Poisson distribution.)
We don't recommend no moderation. There are just too few df to
estimate
the dispersion reliable for individual transcripts. The top segments
in
such a list will naturally tend to have small dispersions, because
you're
essentially selecting for this. It isn't good evidence that the
Poisson
distribution is a good fit.
> Questions:
> 1. Should I be concerned that the results are so different
depending
> on whether common dispersion (almost equivalent to moderated tagwise
> dispersion) or no-moderation tagwise dispersion is used? Based on
> FDR<0.05, there is only about 10% overlap between the two
approaches.
No, this is to be expected. The reason we developed moderated methods
is
that no-moderation is just not reliable.
> 2. Im not sure how to interpret the tagwise dispersion values for
the
> top hits: common dispersion method picks up segments with large
tagwise
> dispersion, no moderation method picks up segments with small
tagwise
> dispersion.
In a genome wide experiment, some segments will have large and some
will
have very small sample variances just by chance. Furthermore, some
features are genuinely hypervariable between individuals. The best
strategy is likely to be to choose a smaller prior.n, which will give
you
a more even handed compromise between common and tagwise dispersion.
This will downweight the segments with very large variation between
individuals in the toplist, while still giving stable estimates of
tagwise
dispersion.
Best wishes
Gordon
-----------------------------------------------
Associate Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
> I am using edgeR_1.4.7 with R version 2.10.1.
>
>> #COMMON DISPERSION APPROACH
>> library(edgeR)
>> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1),
genes=Annotation$Description)
>> df$samples
> group lib.size
> C1 0 4488940
> C2 0 2437107
> C3 0 2600316
> T1 1 3935852
> T2 1 3806079
> T3 1 3913694
>> df <- estimateCommonDisp(df)
>> df$common.dispersion
> [1] 0.1346658
>> df.com<-exactTest(df)
> Comparison of groups: 1 ? 0
>> CDtop10<-topTagsdf.com)$table
>> CDtop10[,-1]
> logConc logFC PValue FDR
> 6145 -12.77011 7.490945 2.002637e-37 3.740325e-33
> 15580 -12.32428 6.621865 2.854360e-32 2.665544e-28
> 1565 -12.21365 6.311500 2.936737e-30 1.828315e-26
> 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24
> 1154 -13.69624 -5.326718 1.143794e-23 4.272527e-20
> 15630 -17.02012 5.975869 1.743145e-21 5.426120e-18
> 341 -18.60859 -6.565039 4.125655e-19 1.100784e-15
> 16351 -14.64956 4.565285 1.375746e-18 3.211850e-15
> 6468 -15.86990 -4.712918 3.248713e-18 6.741802e-15
> 4891 -16.96347 5.181179 7.436516e-18 1.388918e-14
>> CDtopIDs<-as.numeric(row.names(CDtop10))
>> df$counts[CDtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 31 36 28 45 57 22440
> [2,] 77 45 61 22745 47 55
> [3,] 49 68 85 76 210 21738
> [4,] 35 3729 34 37 22 32
> [5,] 28 69 3636 25 25 89
> [6,] 4 2 3 7 25 668
> [7,] 2 3 188 1 0 2
> [8,] 51 8 23 301 296 1619
> [9,] 12 10 652 14 13 11
> [10,] 4 5 3 540 6 10
>
>
>> #TAGWISE DISPERSION APPROACH
>> fprior <- estimateSmoothing(df)
>> fprior
> [1] 6329.643
>
> #I also tried prior.n=25 and prior.n=50, but results not shown.
>> df<-estimateTagwiseDisp(df, prior.n = 0)
>> quantile(df$tagwise.dispersion)
> 0% 25% 50% 75% 100%
> 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02
>> df.tgw<-exactTest(df,common.disp=FALSE)
>> TGWtop10<-topTags(df.tgw)$table
>> TGWtop10[,-1]
> logConc logFC PValue FDR
> 2659 -13.14601 -0.7454279 2.722274e-25 5.084391e-21
> 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18
> 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11
> 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11
> 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11
> 1172 -13.15737 -0.5535657 1.838202e-14 5.722016e-11
> 8678 -15.78098 -1.4312623 5.231726e-13 1.395899e-09
> 251 -15.03996 -0.8806583 1.137238e-12 2.655024e-09
> 8466 -14.50024 -0.7195308 1.861731e-12 3.863507e-09
> 8472 -15.35444 -0.9235374 5.519857e-12 1.030944e-08
>> TGWtopIDs<-as.numeric(row.names(TGWtop10))
>> df$counts[TGWtopIDs,]
> C1 C2 C3 T1 T2 T3
> [1,] 685 346 337 340 340 313
> [2,] 382 199 256 121 103 142
> [3,] 97 59 55 29 36 35
> [4,] 171 91 111 87 73 72
> [5,] 54 24 35 12 8 14
> [6,] 629 295 345 354 352 347
> [7,] 138 58 84 41 47 38
> [8,] 200 89 96 93 75 87
> [9,] 231 138 157 149 115 128
> [10,] 145 87 81 74 75 53
>
>> df$tagwise.dispersion[TGWtopIDs]
> [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
> [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
>> df$tagwise.dispersion[CDtopIDs]
> [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105
> [7] 3.2246046 0.6054731 2.0582920 2.1059294
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:4}}