Question

3 replicates vs 2 replicates biased outcome?

0

Entering edit mode

kan.jiang • 0

@kanjiang-11217

Last seen 7.8 years ago

Recently I used both cuffdiff and Deseq2 to find the differential genes. The outcome is very different. Cuffdiff gives me about 1600 changed genes and Deseq2 gives me 12347 genes. The overlapping is only 25%, which is very low. I expected some number like 50%. Cuffdiff unique is about 5% and Deseq2 unique is about 70%. The huge difference actually comes out from the downregulated genes, which is 12347(hard to believehttp://imgur.com/a/jq6Qw). I am not sure if there is something fundamentally wrong when I fed my data to the Deseq2. Please help me to find it out. Many thanks in advance!

There are 5 samples, 3 in con and 2 in exp. There is no warning message. The code I used

deseqSampleMatrix <- data.frame( 

    row.names=sampleNames,

condition = as.factor(conditions),

samplename = sampleNames)

dds <- DESeqDataSetFromMatrix(countData = as.matrix(exprMat), colData=deseqSampleMatrix, design=~condition)

dds

dds <- DESeq(dds)

res <- results(dds, contrast=c("condition","exp","con"))

res <- res[complete.cases(res),]

sum(res$padj < 0.05, na.rm=TRUE)

sum( res$padj < 0.1, na.rm=TRUE ) # 14514

sum(res$log2FoldChange > 2 & res$padj < 0.05, na.rm = TRUE) # 853

sum(res$log2FoldChange < 2 & res$padj < 0.05, na.rm = TRUE) # 12347

sum(res$log2FoldChange == 2 & res$padj < 0.05, na.rm = TRUE) # 0

as.data.frame(colData(dds))
     condition samplename sizeFactor
con1       con       con1 0.6391152
con2       con       con2 1.1517159
con3       con       con3 0.8991909
exp1       exp       exp1 1.4485410
exp2       exp       exp2 1.0539740

optional parameter: independentFiltering=FALSE, cooksCutoff=FALSE does not make big difference for the outcome

sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] pheatmap_1.0.8             gplots_3.0.1               RColorBrewer_1.1-2
[4] ggplot2_2.1.0              DESeq2_1.12.3              SummarizedExperiment_1.2.3
[7] Biobase_2.32.0             GenomicRanges_1.24.2       GenomeInfoDb_1.8.3
[10] IRanges_2.6.1              S4Vectors_0.10.2           BiocGenerics_0.18.0

loaded via a namespace (and not attached):
[1] Rcpp_0.12.6          plyr_1.8.4           XVector_0.12.1       bitops_1.0-6
[5] tools_3.3.1          zlibbioc_1.18.0      digest_0.6.10        rpart_4.1-10
[9] RSQLite_1.0.0        annotate_1.50.0      gtable_0.2.0         lattice_0.20-33
[13] Matrix_1.2-6         DBI_0.4-1            gridExtra_2.2.1      genefilter_1.54.2
[17] cluster_2.0.4        caTools_1.17.1       gtools_3.5.0         locfit_1.5-9.1
[21] grid_3.3.1           nnet_7.3-12          data.table_1.9.6     AnnotationDbi_1.34.4
[25] XML_3.98-1.4         survival_2.39-5      BiocParallel_1.6.3   foreign_0.8-66
[29] gdata_2.17.0         latticeExtra_0.6-28 Formula_1.2-1        geneplotter_1.50.0
[33] Hmisc_3.17-4         scales_0.4.0         splines_3.3.1        colorspace_1.2-6
[37] xtable_1.8-2         labeling_0.3         KernSmooth_2.23-15   acepack_1.3-3.3
[41] munsell_0.4.3        chron_2.3-47

The dispersion plot is attached for the convenience,which is shown a lot of outliers here to me.

deseq2 • 615 views

ADD COMMENT • link 7.8 years ago kan.jiang • 0

score 0 · Answer 1 · 2016-08-03

Your criteria for counting genes that are significantly up and down are wrong. You're using a log2 fold change threshold of 2, which is equivalent to 4-fold upregulation. This means that if a gene is 3.9-fold up and has a significant FDR, you are counting it as a down-regulated gene. So it's pretty clear why you are getting so many more "down-regulated" genes. If you want to split the genes into upregulated and downregulated, you should simply look for positive and negative log2 fold change values (i.e. greater than or less than 0). Also, it's not clear what the last line, which yields a count of zero, is supposed to be measuring. You're never going to get a gene with a fold change of exactly 4.

score 0 · Answer 2 · 2016-08-03

Hi Thompson,

Thanks for the quick response. Thanks point it out. Sorry! I made mistake for the post: my updated code:

sum( res$padj < 0.05, na.rm=TRUE ) # 13255
sum(res$log2FoldChange > 2 & res$padj < 0.05, na.rm = TRUE) # 854
sum(res$log2FoldChange < -2 & res$padj < 0.05, na.rm = TRUE) # 5426

sum(res$log2FoldChange == 2 & res$padj < 0.05, na.rm = TRUE) # 0

Still Deseq2 give me 5426 down genes and 854 up genes, which is 6.35 fold than that of the down genes. Cuffdiff give me 876 down and 661 up genes, which is not very different from that of the up genes. Is it artifact or possible fact?