Search
Question: different statiscal results in DESeq2 from the same samples
0
gravatar for Mehmet Ilyas Cosacak
26 days ago by
Germany/Dresden/ CRTD - DZNE
Mehmet Ilyas Cosacak0 wrote:

Hi,

On 02.01.2017, I ran an analysis with DESeq2. I ran a similar analysis recently. I used the same input data for both runs! Please see "raw reads" in the excel file (https://www.dropbox.com/s/g7fyqvn1chdnesh/deseq2_bugreport_23102017.xlsx?dl=0). Even tough the normalized reads and basemean are same in both runs, the log2FoldChange, padj, etc are completely different. Moreover, I think there are two bugs:

In the previous one ( I do not have a session info for that, ran on 02.01.2017), for "gene21828" the log2FoldChange is calculated wrongly!

In the new one; for "gene23073","gene8822", the log2FoldChange is calculated wrongly or in another words: even though the reads number are blow 30 reads, the log2FoldChange and padj values are calculated as significantly expressed genes. (see gene8822).

Especially the number of DE genes in the new run is over 600 genes [abs(log2FoldChange) >= 1.0]. In the previous one [abs(log2FoldChange) >= 1.0], it is around 30 genes. 20 fold differences in the number of DE genes!

I could not find anything that I may doing wrong, that is why I am posting.

p.s. : I have a pipeline for RNASeq data analysis that is why there are several packages, such as goseq, GOStats, topGO,  ...

thanks,

ilyas.

> sessionInfo()
R version 3.4.1 (2017-06-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Mm.eg.db_3.4.1            xtable_1.8-2                  WriteXLS_4.0.0                stringr_1.2.0                
 [5] statmod_1.4.30                scales_0.5.0                  splitstackshape_1.4.2         Rgraphviz_2.20.0             
 [9] reshape2_1.4.2                topGO_2.28.0                  SparseM_1.77                  ReportingTools_2.16.0        
[13] knitr_1.17                    RColorBrewer_1.1-2            png_0.1-7                     plotrix_3.6-6                
[17] PerformanceAnalytics_1.4.3541 xts_0.10-0                    zoo_1.8-0                     pathview_1.16.5              
[21] org.Hs.eg.db_3.4.1            pheatmap_1.0.8                org.Dr.eg.db_3.4.1            LSD_3.0                      
[25] KEGGREST_1.16.1               KEGG.db_3.2.3                 GSEABase_1.38.1               gdata_2.18.0                 
[29] gplots_3.0.1                  GOstats_2.42.0                graph_1.54.0                  Category_2.42.1              
[33] Matrix_1.2-10                 goseq_1.28.0                  BiasedUrn_1.07                GO.db_3.4.1                  
[37] ggrepel_0.6.5                 ggplot2_2.2.1                 GenomicFeatures_1.28.4        geneplotter_1.54.0           
[41] annotate_1.54.0               XML_3.98-1.9                  lattice_0.20-35               geneLenDataBase_1.12.0       
[45] genefilter_1.58.1             fdrtool_1.2.15                edgeR_3.18.1                  limma_3.32.5                 
[49] dtplyr_0.0.2                  plyr_1.8.4                    DESeq2_1.16.1                 SummarizedExperiment_1.6.3   
[53] DelayedArray_0.2.7            matrixStats_0.52.2            GenomicRanges_1.28.4          GenomeInfoDb_1.12.2          
[57] data.table_1.10.4             calibrate_1.7.2               MASS_7.3-47                   biomaRt_2.32.1               
[61] BiocParallel_1.10.1           AnnotationForge_1.18.2        AnnotationDbi_1.38.2          IRanges_2.10.3               
[65] S4Vectors_0.14.3              Biobase_2.36.2                BiocGenerics_0.22.0           BiocInstaller_1.26.1         

loaded via a namespace (and not attached):
 [1] backports_1.1.0               Hmisc_4.0-3                   AnnotationHub_2.8.2           lazyeval_0.2.0               
 [5] splines_3.4.1                 digest_0.6.12                 ensembldb_2.0.4               htmltools_0.3.6              
 [9] magrittr_1.5                  checkmate_1.8.3               memoise_1.1.0                 BSgenome_1.44.1              
[13] cluster_2.0.6                 Biostrings_2.44.2             R.utils_2.5.0                 ggbio_1.24.1                 
[17] colorspace_1.3-2              blob_1.1.0                    dplyr_0.7.2                   RCurl_1.95-4.8               
[21] bindr_0.1                     survival_2.41-3               VariantAnnotation_1.22.3      glue_1.1.1                   
[25] gtable_0.2.0                  zlibbioc_1.22.0               XVector_0.16.0                DBI_0.7                      
[29] GGally_1.3.2                  Rcpp_0.12.12                  htmlTable_1.9                 foreign_0.8-69               
[33] bit_1.1-12                    OrganismDbi_1.18.0            Formula_1.2-2                 htmlwidgets_0.9              
[37] httr_1.3.1                    acepack_1.4.1                 R.methodsS3_1.7.1             pkgconfig_2.0.1              
[41] reshape_0.8.7                 nnet_7.3-12                   locfit_1.5-9.1                rlang_0.1.2                  
[45] munsell_0.4.3                 tools_3.4.1                   RSQLite_2.0                   yaml_2.1.14                  
[49] bit64_0.9-7                   caTools_1.17.1                AnnotationFilter_1.0.0        bindrcpp_0.2                 
[53] RBGL_1.52.0                   nlme_3.1-131                  mime_0.5                      R.oo_1.21.0                  
[57] KEGGgraph_1.38.1              compiler_3.4.1                curl_2.8.1                    interactiveDisplayBase_1.14.0
[61] PFAM.db_3.4.1                 tibble_1.3.4                  stringi_1.1.5                 ProtGenerics_1.8.0           
[65] bitops_1.0-6                  httpuv_1.3.5                  rtracklayer_1.36.4            hwriter_1.3.2                
[69] R6_2.2.2                      latticeExtra_0.6-28           KernSmooth_2.23-15            gridExtra_2.2.1              
[73] dichromat_2.0-0               gtools_3.5.0                  assertthat_0.2.0              GenomicAlignments_1.12.2     
[77] Rsamtools_1.28.0              GenomeInfoDbData_0.99.0       mgcv_1.8-17                   rpart_4.1-11                 
[81] biovizBase_1.24.0             shiny_1.0.5                   base64enc_0.1-3 
ADD COMMENTlink modified 26 days ago by Michael Love14k • written 26 days ago by Mehmet Ilyas Cosacak0
0
gravatar for Michael Love
26 days ago by
Michael Love14k
United States
Michael Love14k wrote:

It’s possible for results to change if you change software versions, although typically the changes are not large in terms of ranking of test statistic. These can be exaggerated when thresholding on adj pvalue.

ADD COMMENTlink written 26 days ago by Michael Love14k

Hi Michael,

thanks for the reply.

Yes, it is possible to have different results. But, the difference is definitely high. 28 genes versus 428 genes!

If we focus on the current one, I still think there is some statistical error.

For Instance, if you use the input_file (https://www.dropbox.com/s/ophijagobfsb1zw/input_file.tab?dl=0) and the script below:

library("DESeq2")
rawCounts = read.delim("input_file.tab", header = TRUE, row.names = "GeneNo")
head(rawCounts)
conditions = gsub("\\_R[1-3]","",colnames(rawCounts))
coldata <- data.frame(row.names = colnames(rawCounts), condition = as.factor(conditions))
dds <- DESeqDataSetFromMatrix(countData = rawCounts, colData = coldata, design = ~ condition)
dds <- DESeq(dds, parallel = T)
res = as.data.frame(results(dds))
head(res)
normalizedReads = as.data.frame(counts(dds, normalized = TRUE))
head(normalizedReads)
 
resWithNormReads = merge(res,round(normalizedReads,1), by = "row.names")
rownames(resWithNormReads) = resWithNormReads$Row.names
resWithNormReads$Row.names = NULL
head(resWithNormReads)

subset.baseMean.le.10 = subset(resWithNormReads, abs(log2FoldChange) >= 1.0 & padj < 0.1 & baseMean <= 10)
nrow(subset.baseMean.le.10)

subset.baseMean.le.10 = subset.baseMean.le.10[order(subset.baseMean.le.10$baseMean, decreasing = FALSE),]
head(subset.baseMean.le.10)

# as you can see it calculates 43 DE genes, though they have low read numbers (baseMean <= 10).

with(subset.baseMean.le.10, plot(baseMean,log2FoldChange, pch = 20, cex = 1))
abline(h =c(-1,1))

subset.sd = subset(resWithNormReads, baseMean <= 100 & abs(log2FoldChange) >= 1.0 & padj < 0.1 & rowSds(as.matrix(resWithNormReads[,c("Treatment_R1","Treatment_R2","Treatment_R3")])) > 1.5*baseMean)

subset.sd = subset.sd[orderrowSdsas.matrixsubset.sd[,c("Treatment_R1","Treatment_R2","Treatment_R3")])), decreasing = TRUE),]
subset.sd = subset.sd[ordersubset.sd$baseMean, decreasing = TRUE),]

nrowsubset.sd)
headsubset.sd)
subset.sd

# as you can see, with high standart deviation, the p-value and adj p-value are significant for 47 genes.

t.pvals = c()
for(i in 1:nrowsubset.sd)) t.pvals = c(t.pvals,t.testsubset.sd[i, 7:9],subset.sd[i, 10:12])$p.value)

subset.sd = cbindsubset.sd, t.pvals)
subset.sd

# if we use t.test on the normalized reads in for each gene, there is  no significant genes.

I hope I could explain the points that concern me.

best,

ilyas.

 

 

 

ADD REPLYlink written 26 days ago by Mehmet Ilyas Cosacak0
1

It is known that DESeq2 is more powerful than a row-wise t-test, as it shares information across genes and uses a count-based generalized linear model to model the count data. It's not surprising then that DESeq2 finds genes with high LFC and low mean of normalized counts differentially expressed although they are not called DE by a row-wise t-test of normalized counts. It helps to view the LFC of the genes in context. You can see with the code below that they deviate from the bulk of genes more compatible with the null. The model, using information from all genes, determines that these LFCs are not compatible with LFC=0, with an expected FDR bound of 10%. If you want to hone in on genes with more evidence of strong DE by testing an effect size larger than LFC=0, you can adjust the the lfcThreshold when you call results().

res <- results(dds)
plotMA(res, ylim=c(-8,8))
with(res[which(res$baseMean < 10 & res$padj < .1),], 
  points(baseMean, log2FoldChange, cex=2, col="blue"))
ADD REPLYlink modified 26 days ago • written 26 days ago by Michael Love14k

Hi Micheal,

thanks for your replies. I am not comparing the power of DESeq2 with t.test. What I am trying to say, with high standard deviation, still we get significantly expressed genes.  Moreover, with low read numbers the DESeq2 calculates a log2FoldChange as significant. 

Please check the following genes. I just choose the genes with highest standard deviation in treatments.

GeneNo,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,Control_R1,Control_R2,Control_R3,Treatment_R1,Treatment_R2,Treatment_R3,Control_R1,Control_R2,Control_R3,Treatment_R1,Treatment_R2,Treatment_R3
gene10745,8.7,-2.9,0.9,-3.3,0.001,0.018,16,15,15,4,1,1,17,14,15,4,1,1
gene23244,5.3,3.9,1.4,2.8,0.006,0.071,0,0,2,2,8,20,0,0,2,2,8,20
gene10950,5.6,3.3,1.3,2.6,0.008,0.092,1,0,2,2,6,23,1,0,2,2,6,23
gene31422,8.4,2.6,1.0,2.7,0.007,0.082,3,1,3,2,13,29,3,1,3,2,13,29
gene4276,8.2,4.6,1.4,3.3,0.001,0.021,1,0,1,0,18,30,1,0,1,0,18,30
gene26244,10.9,2.4,0.9,2.7,0.008,0.089,1,7,3,3,12,41,1,6,3,3,12,40
gene31956,12.6,2.5,0.8,3.0,0.003,0.044,1,5,6,6,12,47,1,5,6,6,12,46
gene11865,13.6,2.3,0.8,2.9,0.004,0.053,4,6,4,3,19,47,4,5,4,3,18,46
gene31541,13.4,4.7,1.7,2.8,0.005,0.067,0,1,2,1,22,56,0,1,2,1,21,55
gene9356,15.8,3.7,0.9,4.0,0.000,0.002,1,4,2,2,30,58,1,4,2,2,29,57
gene24596,16.5,3.1,0.8,3.8,0.000,0.005,1,5,5,5,23,62,1,5,5,5,22,61
gene1462,17.7,3.0,0.8,3.8,0.000,0.005,0,6,6,9,17,70,0,5,6,10,17,69
gene26360,19.2,2.5,0.7,3.3,0.001,0.018,1,8,9,7,22,70,1,7,9,7,21,69
gene16582,24.1,5.6,1.6,3.5,0.000,0.011,0,1,2,3,45,96,0,1,2,3,44,95
gene4051,24.8,4.9,0.9,5.5,0.000,0.000,0,2,3,18,21,106,0,2,3,19,20,104
gene23242,38.7,6.8,1.3,5.4,0.000,0.000,1,0,1,0,94,141,1,0,1,0,91,139
gene18091,37.8,8.7,2.1,4.1,0.000,0.002,0,0,0,6,68,157,0,0,0,6,66,155
gene30410,46.4,6.2,1.8,3.3,0.001,0.017,0,2,2,1,109,170,0,2,2,1,106,167
gene12518,38.3,5.6,1.6,3.6,0.000,0.009,0,3,2,5,53,171,0,3,2,5,52,168
gene28454,53.3,5.8,1.8,3.3,0.001,0.022,0,2,4,2,108,210,0,2,4,2,105,207
gene19387,57.6,3.9,1.2,3.3,0.001,0.019,3,13,6,16,45,268,3,12,6,17,44,264
gene7378,96.6,4.4,1.2,3.6,0.000,0.008,5,5,16,12,198,354,5,5,16,13,193,349
gene24726,88.3,6.7,1.6,4.2,0.000,0.001,1,0,4,12,164,358,1,0,4,13,159,353
gene4110,74.8,8.7,2.4,3.6,0.000,0.007,1,0,0,2,37,416,1,0,0,2,36,410
gene2082,203.7,4.0,1.2,3.4,0.001,0.015,32,19,20,20,397,755,34,17,20,21,386,743
gene30052,248.4,5.1,1.4,3.6,0.000,0.009,7,3,32,22,540,914,8,3,32,23,525,900

best,

ilyas.

ADD REPLYlink written 23 days ago by Mehmet Ilyas Cosacak0
1

I took a look at your data earlier and I’m not concerned with the genes I highlighted in the plot above. You can follow the suggestions above if you want to raise the LFC or lower the expected FDR.

 

ADD REPLYlink written 23 days ago by Michael Love14k

Thanks a lot for your suggestions. best, ilyas.

ADD REPLYlink written 23 days ago by Mehmet Ilyas Cosacak0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.2.0
Traffic: 157 users visited in the last hour