Question

different statiscal results in DESeq2 from the same samples

0

Entering edit mode

Mehmet Ilyas Cosacak • 0

@mehmet-ilyas-cosacak-9020

Last seen 6.1 years ago

Germany/Dresden/ CRTD - DZNE

Hi,

On 02.01.2017, I ran an analysis with DESeq2. I ran a similar analysis recently. I used the same input data for both runs! Please see "raw reads" in the excel file (https://www.dropbox.com/s/g7fyqvn1chdnesh/deseq2_bugreport_23102017.xlsx?dl=0). Even tough the normalized reads and basemean are same in both runs, the log2FoldChange, padj, etc are completely different. Moreover, I think there are two bugs:

In the previous one ( I do not have a session info for that, ran on 02.01.2017), for "gene21828" the log2FoldChange is calculated wrongly!

In the new one; for "gene23073","gene8822", the log2FoldChange is calculated wrongly or in another words: even though the reads number are blow 30 reads, the log2FoldChange and padj values are calculated as significantly expressed genes. (see gene8822).

Especially the number of DE genes in the new run is over 600 genes [abs(log2FoldChange) >= 1.0]. In the previous one [abs(log2FoldChange) >= 1.0], it is around 30 genes. 20 fold differences in the number of DE genes!

I could not find anything that I may doing wrong, that is why I am posting.

p.s. : I have a pipeline for RNASeq data analysis that is why there are several packages, such as goseq, GOStats, topGO, ...

thanks,

ilyas.

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.4.1            xtable_1.8-2                  WriteXLS_4.0.0                stringr_1.2.0                
 [5] statmod_1.4.30                scales_0.5.0                  splitstackshape_1.4.2         Rgraphviz_2.20.0             
 [9] reshape2_1.4.2                topGO_2.28.0                  SparseM_1.77                  ReportingTools_2.16.0        
[13] knitr_1.17                    RColorBrewer_1.1-2            png_0.1-7                     plotrix_3.6-6                
[17] PerformanceAnalytics_1.4.3541 xts_0.10-0                    zoo_1.8-0                     pathview_1.16.5              
[21] org.Hs.eg.db_3.4.1            pheatmap_1.0.8                org.Dr.eg.db_3.4.1            LSD_3.0                      
[25] KEGGREST_1.16.1               KEGG.db_3.2.3                 GSEABase_1.38.1               gdata_2.18.0                 
[29] gplots_3.0.1                  GOstats_2.42.0                graph_1.54.0                  Category_2.42.1              
[33] Matrix_1.2-10                 goseq_1.28.0                  BiasedUrn_1.07                GO.db_3.4.1                  
[37] ggrepel_0.6.5                 ggplot2_2.2.1                 GenomicFeatures_1.28.4        geneplotter_1.54.0           
[41] annotate_1.54.0               XML_3.98-1.9                  lattice_0.20-35               geneLenDataBase_1.12.0       
[45] genefilter_1.58.1             fdrtool_1.2.15                edgeR_3.18.1                  limma_3.32.5                 
[49] dtplyr_0.0.2                  plyr_1.8.4                    DESeq2_1.16.1                 SummarizedExperiment_1.6.3   
[53] DelayedArray_0.2.7            matrixStats_0.52.2            GenomicRanges_1.28.4          GenomeInfoDb_1.12.2          
[57] data.table_1.10.4             calibrate_1.7.2               MASS_7.3-47                   biomaRt_2.32.1               
[61] BiocParallel_1.10.1           AnnotationForge_1.18.2        AnnotationDbi_1.38.2          IRanges_2.10.3               
[65] S4Vectors_0.14.3              Biobase_2.36.2                BiocGenerics_0.22.0           BiocInstaller_1.26.1         

loaded via a namespace (and not attached):
 [1] backports_1.1.0               Hmisc_4.0-3                   AnnotationHub_2.8.2           lazyeval_0.2.0               
 [5] splines_3.4.1                 digest_0.6.12                 ensembldb_2.0.4               htmltools_0.3.6              
 [9] magrittr_1.5                  checkmate_1.8.3               memoise_1.1.0                 BSgenome_1.44.1              
[13] cluster_2.0.6                 Biostrings_2.44.2             R.utils_2.5.0                 ggbio_1.24.1                 
[17] colorspace_1.3-2              blob_1.1.0                    dplyr_0.7.2                   RCurl_1.95-4.8               
[21] bindr_0.1                     survival_2.41-3               VariantAnnotation_1.22.3      glue_1.1.1                   
[25] gtable_0.2.0                  zlibbioc_1.22.0               XVector_0.16.0                DBI_0.7                      
[29] GGally_1.3.2                  Rcpp_0.12.12                  htmlTable_1.9                 foreign_0.8-69               
[33] bit_1.1-12                    OrganismDbi_1.18.0            Formula_1.2-2                 htmlwidgets_0.9              
[37] httr_1.3.1                    acepack_1.4.1                 R.methodsS3_1.7.1             pkgconfig_2.0.1              
[41] reshape_0.8.7                 nnet_7.3-12                   locfit_1.5-9.1                rlang_0.1.2                  
[45] munsell_0.4.3                 tools_3.4.1                   RSQLite_2.0                   yaml_2.1.14                  
[49] bit64_0.9-7                   caTools_1.17.1                AnnotationFilter_1.0.0        bindrcpp_0.2                 
[53] RBGL_1.52.0                   nlme_3.1-131                  mime_0.5                      R.oo_1.21.0                  
[57] KEGGgraph_1.38.1              compiler_3.4.1                curl_2.8.1                    interactiveDisplayBase_1.14.0
[61] PFAM.db_3.4.1                 tibble_1.3.4                  stringi_1.1.5                 ProtGenerics_1.8.0           
[65] bitops_1.0-6                  httpuv_1.3.5                  rtracklayer_1.36.4            hwriter_1.3.2                
[69] R6_2.2.2                      latticeExtra_0.6-28           KernSmooth_2.23-15            gridExtra_2.2.1              
[73] dichromat_2.0-0               gtools_3.5.0                  assertthat_0.2.0              GenomicAlignments_1.12.2     
[77] Rsamtools_1.28.0              GenomeInfoDbData_0.99.0       mgcv_1.8-17                   rpart_4.1-11                 
[81] biovizBase_1.24.0             shiny_1.0.5                   base64enc_0.1-3

deseq2 • 938 views

ADD COMMENT • link updated 6.5 years ago by Michael Love 41k • written 6.5 years ago by Mehmet Ilyas Cosacak • 0

score 0 · Answer 1 · 2017-10-22

0

Entering edit mode

Michael Love 41k

@mikelove

Last seen 32 minutes ago

United States

It’s possible for results to change if you change software versions, although typically the changes are not large in terms of ranking of test statistic. These can be exaggerated when thresholding on adj pvalue.

ADD COMMENT • link 6.5 years ago Michael Love 41k

0

Entering edit mode

Hi Michael,

thanks for the reply.

Yes, it is possible to have different results. But, the difference is definitely high. 28 genes versus 428 genes!

If we focus on the current one, I still think there is some statistical error.

For Instance, if you use the input_file (https://www.dropbox.com/s/ophijagobfsb1zw/input_file.tab?dl=0) and the script below:

library("DESeq2") rawCounts = read.delim("input_file.tab", header = TRUE, row.names = "GeneNo") head(rawCounts) conditions = gsub("\\_R[1-3]","",colnames(rawCounts)) coldata <- data.frame(row.names = colnames(rawCounts), condition = as.factor(conditions)) dds <- DESeqDataSetFromMatrix(countData = rawCounts, colData = coldata, design = ~ condition) dds <- DESeq(dds, parallel = T) res = as.data.frame(results(dds)) head(res) normalizedReads = as.data.frame(counts(dds, normalized = TRUE)) head(normalizedReads) resWithNormReads = merge(res,round(normalizedReads,1), by = "row.names") rownames(resWithNormReads) = resWithNormReads$Row.names resWithNormReads$Row.names = NULL head(resWithNormReads)

subset.baseMean.le.10 = subset(resWithNormReads, abs(log2FoldChange) >= 1.0 & padj < 0.1 & baseMean <= 10) nrow(subset.baseMean.le.10)

subset.baseMean.le.10 = subset.baseMean.le.10[order(subset.baseMean.le.10$baseMean, decreasing = FALSE),] head(subset.baseMean.le.10)

# as you can see it calculates 43 DE genes, though they have low read numbers (baseMean <= 10).

with(subset.baseMean.le.10, plot(baseMean,log2FoldChange, pch = 20, cex = 1)) abline(h =c(-1,1))

subset.sd = subset(resWithNormReads, baseMean <= 100 & abs(log2FoldChange) >= 1.0 & padj < 0.1 & rowSds(as.matrix(resWithNormReads[,c("Treatment_R1","Treatment_R2","Treatment_R3")])) > 1.5*baseMean)

subset.sd = subset.sd[orderrowSdsas.matrixsubset.sd[,c("Treatment_R1","Treatment_R2","Treatment_R3")])), decreasing = TRUE),] subset.sd = subset.sd[ordersubset.sd$baseMean, decreasing = TRUE),]

nrowsubset.sd) headsubset.sd) subset.sd

# as you can see, with high standart deviation, the p-value and adj p-value are significant for 47 genes.

t.pvals = c() for(i in 1:nrowsubset.sd)) t.pvals = c(t.pvals,t.testsubset.sd[i, 7:9],subset.sd[i, 10:12])$p.value)

subset.sd = cbindsubset.sd, t.pvals) subset.sd

# if we use t.test on the normalized reads in for each gene, there is no significant genes.

I hope I could explain the points that concern me.

best,

ilyas.

ADD REPLY • link 6.5 years ago Mehmet Ilyas Cosacak • 0

1

Entering edit mode

It is known that DESeq2 is more powerful than a row-wise t-test, as it shares information across genes and uses a count-based generalized linear model to model the count data. It's not surprising then that DESeq2 finds genes with high LFC and low mean of normalized counts differentially expressed although they are not called DE by a row-wise t-test of normalized counts. It helps to view the LFC of the genes in context. You can see with the code below that they deviate from the bulk of genes more compatible with the null. The model, using information from all genes, determines that these LFCs are not compatible with LFC=0, with an expected FDR bound of 10%. If you want to hone in on genes with more evidence of strong DE by testing an effect size larger than LFC=0, you can adjust the the lfcThreshold when you call results().

res <- results(dds)
plotMA(res, ylim=c(-8,8))
with(res[which(res$baseMean < 10 & res$padj < .1),], 
  points(baseMean, log2FoldChange, cex=2, col="blue"))

ADD REPLY • link 6.5 years ago Michael Love 41k

0

Entering edit mode

Hi Micheal,

thanks for your replies. I am not comparing the power of DESeq2 with t.test. What I am trying to say, with high standard deviation, still we get significantly expressed genes. Moreover, with low read numbers the DESeq2 calculates a log2FoldChange as significant.

Please check the following genes. I just choose the genes with highest standard deviation in treatments.

GeneNo,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,Control_R1,Control_R2,Control_R3,Treatment_R1,Treatment_R2,Treatment_R3,Control_R1,Control_R2,Control_R3,Treatment_R1,Treatment_R2,Treatment_R3
gene10745,8.7,-2.9,0.9,-3.3,0.001,0.018,16,15,15,4,1,1,17,14,15,4,1,1
gene23244,5.3,3.9,1.4,2.8,0.006,0.071,0,0,2,2,8,20,0,0,2,2,8,20
gene10950,5.6,3.3,1.3,2.6,0.008,0.092,1,0,2,2,6,23,1,0,2,2,6,23
gene31422,8.4,2.6,1.0,2.7,0.007,0.082,3,1,3,2,13,29,3,1,3,2,13,29
gene4276,8.2,4.6,1.4,3.3,0.001,0.021,1,0,1,0,18,30,1,0,1,0,18,30
gene26244,10.9,2.4,0.9,2.7,0.008,0.089,1,7,3,3,12,41,1,6,3,3,12,40
gene31956,12.6,2.5,0.8,3.0,0.003,0.044,1,5,6,6,12,47,1,5,6,6,12,46
gene11865,13.6,2.3,0.8,2.9,0.004,0.053,4,6,4,3,19,47,4,5,4,3,18,46
gene31541,13.4,4.7,1.7,2.8,0.005,0.067,0,1,2,1,22,56,0,1,2,1,21,55
gene9356,15.8,3.7,0.9,4.0,0.000,0.002,1,4,2,2,30,58,1,4,2,2,29,57
gene24596,16.5,3.1,0.8,3.8,0.000,0.005,1,5,5,5,23,62,1,5,5,5,22,61
gene1462,17.7,3.0,0.8,3.8,0.000,0.005,0,6,6,9,17,70,0,5,6,10,17,69
gene26360,19.2,2.5,0.7,3.3,0.001,0.018,1,8,9,7,22,70,1,7,9,7,21,69
gene16582,24.1,5.6,1.6,3.5,0.000,0.011,0,1,2,3,45,96,0,1,2,3,44,95
gene4051,24.8,4.9,0.9,5.5,0.000,0.000,0,2,3,18,21,106,0,2,3,19,20,104
gene23242,38.7,6.8,1.3,5.4,0.000,0.000,1,0,1,0,94,141,1,0,1,0,91,139
gene18091,37.8,8.7,2.1,4.1,0.000,0.002,0,0,0,6,68,157,0,0,0,6,66,155
gene30410,46.4,6.2,1.8,3.3,0.001,0.017,0,2,2,1,109,170,0,2,2,1,106,167
gene12518,38.3,5.6,1.6,3.6,0.000,0.009,0,3,2,5,53,171,0,3,2,5,52,168
gene28454,53.3,5.8,1.8,3.3,0.001,0.022,0,2,4,2,108,210,0,2,4,2,105,207
gene19387,57.6,3.9,1.2,3.3,0.001,0.019,3,13,6,16,45,268,3,12,6,17,44,264
gene7378,96.6,4.4,1.2,3.6,0.000,0.008,5,5,16,12,198,354,5,5,16,13,193,349
gene24726,88.3,6.7,1.6,4.2,0.000,0.001,1,0,4,12,164,358,1,0,4,13,159,353
gene4110,74.8,8.7,2.4,3.6,0.000,0.007,1,0,0,2,37,416,1,0,0,2,36,410
gene2082,203.7,4.0,1.2,3.4,0.001,0.015,32,19,20,20,397,755,34,17,20,21,386,743
gene30052,248.4,5.1,1.4,3.6,0.000,0.009,7,3,32,22,540,914,8,3,32,23,525,900

best,

ilyas.

ADD REPLY • link 6.5 years ago Mehmet Ilyas Cosacak • 0

1

Entering edit mode

I took a look at your data earlier and I’m not concerned with the genes I highlighted in the plot above. You can follow the suggestions above if you want to raise the LFC or lower the expected FDR.