Entering edit mode
Michael Salbaum
▴
80
@michael-salbaum-5309
Last seen 10.2 years ago
Hi everyone,
I am working on a differential gene expression paradigm; n=7,
3-expression tag sequencing, AB SOLiD5500XL. We look for expression
of ~26,000 RefSeq genes; as input, Im using a bottom-shaved counts
table (~16,500 genes), i.e. rows with a total count sum of 12 or less
have been trimmed off. Using edgeR, I find 2569 genes (1224 up / 1345
down) to be differentially expressed at FDR better than 0.05.
Ive noticed that in this list are many genes where the differential
expression call appears to be driven by one outlier on one side of
the paradigm, for instance:
Nodal WT: 7 7 5 4 19 6 1 Het: 320 2 16 8 6 1 13 logFC: 2.65 FDR:
0.00814198
Of course, this is not desirable for follow-up studies, and Im
wondering whether theres a way to filter out such situations.
Right now, Ive resorted to looking at the coefficient of variation
calculated from normalized data as a potential discriminator.
Ive also used DESeq on the same count data set, which identifies 1569
genes (719 up / 850 down) at padj<0.05. Of these 1569, 1544 are also
called by edgeR, so, excellent agreement. Relaxing the cutoff to
padj<0.1 gives another 761 genes, of which 585 make the FDR<0.05
cutoff in edgeR.
Plotting log10(FDR) against the coefficient of variation shows that
~90% the genes called by both DESeq and edgeR (at padj<0.05) have a CV
of 0.3 or better. The same plot for the genes called only by edgeR (at
padj<0.05) seems to show that this group contains many genes at higher
CV I suspect these are the outlier-driven ones mentioned above.
Im perfectly happy with the outcome (
and with the procedure of using two programs and continue with the
intersect between the two
), but those single outlier-driven genes were irksome particularly
so because they were of a nature that got us excited, in biological
terms ?. So, to avoid the letdown, Id appreciate advice how not to
get those in the first place.
And I apologize for the long post, but as a newbie, I figured I better
include what info I have.
Im not sure this is proper etiquette here, but here is a link to the
plots:
http://inlinethumb04.webshots.com/51523/2373514640050256648S600x600Q85
.jpg
R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
edgeR version 2.6.7
library(edgeR)
Loading required package: limma
x <-
read.delim("/Users/jms/Desktop/SAGE3/SAGE3F.txt",row.names="Gene")
> head(x)
WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3
HET_4 HET_5 HET_6 HET_7
0610005C13Rik 5 1 6 0 5 2 18 34 7 13
18 28 1 13
> group <- factor(c(1,1,1,1,1,1,1,2,2,2,2,2,2,2))
> y <- DGEList(counts=x,group=group)
Calculating library sizes from column totals.
> y <- estimateCommonDisp(y, verbose=TRUE)
Disp = 0.06254 , BCV = 0.2501
> y <- estimateTagwiseDisp(y)
> et <- exactTest(y)
> topTags(et)
> summary(de <- decideTestsDGE(et, p=0.05, adjust="BH"))
[,1]
-1 1345
0 14155
1 1224
DESeq version 1.8.3
library(DESeq)
Loading required package: Biobase
Loading required package: BiocGenerics
Attaching package: BiocGenerics
The following object(s) are masked from package:stats:
xtabs
The following object(s) are masked from package:base:
anyDuplicated, cbind, colnames, duplicated, eval, Filter, Find,
get, intersect, lapply, Map, mapply,
mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position,
rbind, Reduce, rep.int, rownames, sapply,
setdiff, table, tapply, union, unique
Welcome to Bioconductor
Vignettes contain introductory material; view with
'browseVignettes()'. To cite Bioconductor, see
'citation("Biobase")', and for packages 'citation("pkgname")'.
Loading required package: locfit
locfit 1.5-8 2012-04-25
> f = "/Users/jms/Desktop/SAGE3/SAGE3F.txt"
> countsTable <- read.table( f, header=TRUE, row.names=1,
stringsAsFactors=TRUE )
> head(countsTable)
WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7 HET_1 HET_2 HET_3
HET_4 HET_5 HET_6 HET_7
0610005C13Rik 5 1 6 0 5 2 18 34 7 13
18 28 1 13
> conds <- c( "WT", "WT", "WT", "WT", "WT", "WT", "WT", "HET", "HET",
"HET", "HET", "HET", "HET", "HET" )
> cds <- newCountDataSet( countsTable, conds )
> cds <- estimateSizeFactors( cds )
> sizeFactors( cds )
WT_1 WT_2 WT_3 WT_4 WT_5 WT_6 WT_7
HET_1 HET_2 HET_3 HET_4 HET_5
1.3598468 0.9759626 1.0788248 1.0080744 1.0097703 0.5502473 0.8747237
1.0877753 0.5544828 0.9594141 1.6036028 1.4868868
HET_6 HET_7
0.8422686 1.5328149
> cds <- estimateDispersions( cds )
> res <- nbinomTest( cds, "WT", "HET" )
> write.table (res, sep = "\t", file =
"/Users/jms/Desktop/SAGE3/SAGE3F_DE.txt", col.names = NA)
> write.table (counts( cds, normalized=TRUE ), sep = "\t", file =
"/Users/jms/Desktop/SAGE3/SAGE3F_NORM.txt", col.names = NA)
Cheers, michael
J. Michael Salbaum, Ph.D.
Associate Professor
Pennington Biomedical Research Center
Louisiana State University System
6400 Perkins Road
Baton Rouge, LA 70808
(225) 763-2782
[[alternative HTML version deleted]]