Entering edit mode
Hi Michael,
Here's a link to a previous reply that I made to a similar question
last
month:
https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-May/045483
.html
The short answer is to set the argument prior.n of the
estimateTagwiseDisp() function in edgeR to a smaller value. The
default
in edgeR is to use a largish value for the prior df, which means that
edgeR squeezes the tagwise dispersions strongly towards the global
value,
meaning that it isn't able to adapt sufficiently to individual genes
with
outliers such as yours.
Your data has 14 libraries and two groups, so there are 12 residual
degrees of freedom for each gene. The prior degrees of freedom are
set to
20 by default, so the prior number of observations defaults to prior.n
=
20/12 = 1.67 for your data. Try instead a smaller value like prior.n
=
6/12. The smaller prior.n is, the more edgeR will de-prioritize those
hyper variable genes.
It would preferable if edgeR did this for you, adapting to the
characteristics of your data automatically. A new version of edgeR
should
be able to do that in a few months.
Best wishes
Gordon
> Date: Thu, 28 Jun 2012 17:43:09 -0500
> From: "Michael Salbaum" <michael.salbaum at="" pbrc.edu="">
> To: <bioconductor at="" r-project.org="">
> Subject: [BioC] Differential gene expression: EdgeR / DESeq and
> identifying noise/outliers
>
> Hi everyone,
>
> I am working on a differential gene expression paradigm; n=7,
3?-expression tag sequencing, AB SOLiD5500XL. We look for expression
of ~26,000 RefSeq genes; as input, I?m using a bottom-shaved counts
table (~16,500 genes), i.e. rows with a total count sum of 12 or less
have been trimmed off. Using edgeR, I find 2569 genes (1224 up / 1345
down) to be differentially expressed at FDR better than 0.05.
> I?ve noticed that in this list are many genes where the differential
expression call appears to be driven by one ?outlier? on one side of
the paradigm, for instance:
> Nodal WT: 7 7 5 4 19 6 1 Het: 320 2 16 8 6 1 13 logFC: 2.65 FDR:
0.00814198
>
> Of course, this is not desirable for follow-up studies, and I?m
wondering whether there?s a way to filter out such situations.
>
> Right now, I?ve resorted to looking at the coefficient of variation
calculated from normalized data as a potential discriminator.
>
> I?ve also used DESeq on the same count data set, which identifies
1569 genes (719 up / 850 down) at padj<0.05. Of these 1569, 1544 are
also called by edgeR, so, excellent agreement. Relaxing the cutoff to
padj<0.1 gives another 761 genes, of which 585 make the FDR<0.05
cutoff in edgeR.
>
> Plotting ?log10(FDR) against the coefficient of variation shows that
~90% the genes called by both DESeq and edgeR (at padj<0.05) have a CV
of 0.3 or better. The same plot for the genes called only by edgeR (at
padj<0.05) seems to show that this group contains many genes at higher
CV ? I suspect these are the outlier-driven ones mentioned above.
>
> I?m perfectly happy with the outcome (?and with the procedure of
using
> two programs and continue with the intersect between the two?), but
> those single outlier-driven genes were irksome ? particularly so
because
> they were of a nature that got us excited, in biological terms ?.
So, to
> avoid the letdown, I?d appreciate advice how not to get those in the
> first place.
>
> And I apologize for the long post, but as a newbie, I figured I
better include what info I have.
>
> I?m not sure this is proper etiquette here, but here is a link to
the plots:
> http://inlinethumb04.webshots.com/51523/2373514640050256648S600x600Q
85.jpg
>
> R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
> ?edgeR? version 2.6.7
>
> library(edgeR)
> Loading required package: limma
> x <-
read.delim("/Users/jms/Desktop/SAGE3/SAGE3F.txt",row.names="Gene")
>> head(x)
> WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3
HET_4 HET_5 HET_6 HET_7
> 0610005C13Rik 5 1 6 0 5 2 18 34 7 13
18 28 1 13
>> group <- factor(c(1,1,1,1,1,1,1,2,2,2,2,2,2,2))
>> y <- DGEList(counts=x,group=group)
> Calculating library sizes from column totals.
>> y <- estimateCommonDisp(y, verbose=TRUE)
> Disp = 0.06254 , BCV = 0.2501
>> y <- estimateTagwiseDisp(y)
>> et <- exactTest(y)
>> topTags(et)
>> summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))
> [,1]
> -1 1345
> 0 14155
> 1 1224
>
>
>
> ?DESeq? version 1.8.3
> library(DESeq)
> Loading required package: Biobase
> Loading required package: BiocGenerics
> Attaching package: ?BiocGenerics?
> The following object(s) are masked from ?package:stats?:
> xtabs
> The following object(s) are masked from ?package:base?:
> anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
get, intersect, lapply, Map, mapply,
> mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
rbind, Reduce, rep.int, rownames, sapply,
> setdiff, table, tapply, union, unique
> Welcome to Bioconductor
> Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
> 'citation("Biobase")', and for packages 'citation("pkgname")'.
> Loading required package: locfit
> locfit 1.5-8 2012-04-25
>> f = "/Users/jms/Desktop/SAGE3/SAGE3F.txt"
>> countsTable <- read.table( f, header=TRUE, row.names=1,
stringsAsFactors=TRUE )
>> head(countsTable)
> WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3
HET_4 HET_5 HET_6 HET_7
> 0610005C13Rik 5 1 6 0 5 2 18 34 7 13
18 28 1 13
>> conds <- c( "WT", "WT", "WT", "WT", "WT", "WT", "WT", "HET", "HET",
"HET", "HET", "HET", "HET", "HET" )
>> cds <- newCountDataSet( countsTable, conds )
>> cds <- estimateSizeFactors( cds )
>> sizeFactors( cds )
> WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7
HET_1 HET_2 HET_3 HET_4 HET_5
> 1.3598468 0.9759626 1.0788248 1.0080744 1.0097703 0.5502473
0.8747237 1.0877753 0.5544828 0.9594141 1.6036028 1.4868868
> HET_6 HET_7
> 0.8422686 1.5328149
>> cds <- estimateDispersions( cds )
>> res <- nbinomTest( cds, "WT", "HET" )
>> write.table (res, sep = "\t", file =
"/Users/jms/Desktop/SAGE3/SAGE3F_DE.txt", col.names = NA)
>> write.table (counts( cds, normalized=TRUE ), sep = "\t", file =
"/Users/jms/Desktop/SAGE3/SAGE3F_NORM.txt", col.names = NA)
>
> Cheers, michael
>
>
> J. Michael Salbaum, Ph.D.
> Associate Professor
> Pennington Biomedical Research Center
> Louisiana State University System
> 6400 Perkins Road
> Baton Rouge, LA 70808
>
> (225) 763-2782
______________________________________________________________________
The information in this email is confidential and
intend...{{dropped:5}}