I ran the following code on a RNAseq dataset and found a strange problem in the result. The raw data set, the code, and the DESeq2 output were shared in a Google Drive.
In the DESeq2 output, I ranked the genes by the ascending order of the column "padj". So, the most significant genes are at the top of the list. The comparison is between 10 females vs. 20 males. I appended the normalized read counts of the 30 samples, calculated using counts(dds, normalized=T), to the DESeq2 statistics. When we examined the gene list, we found something strange. Three genes, ENSG00000274049.4, ENSG00000248871.1, and ENSG00000067048.16 (shaded in the Excel file in the Google Drive), came to our attention.
For ENSG00000067048.16, the normalized read counts for female is small (nine 0 and one 6.065) while that for male are large (min 719, mean 1612, most >1000). Based on the normalized counts, this genes looks like a DEG. Biologically, ENSG00000067048.16 is a Y-linked gene (DDX3Y) and it makes good sense that this gene is a DEG between males and females. However, it did not pass the statistical test.
On the contrary, for ENSG00000274049.4 and ENSG00000248871.1, their expression was detected in one female (similar to ENSG00000067048.16, but with higher expression value) and in 3 or 6 males (in fewer males and lower expression than ENSG00000067048.16). Based on the normalized counts, these two genes are less likely to be DEGs than ENSG00000067048.16. However, it is quite the contrary.
I tried two versions of DESeq2 (v1.30.0 and v1.22.1), and obtained similar results. How to explain the contradiction between DESeq2 statistics and the normalized read counts? How to improve the ranking of the genes? Thank you!
# R code and output
> countData <- read.csv("test_read_counts.csv", header = T, row.names = 1)
> colData <- data.frame(row.names = colnames(countData), sampleCondition = c(rep("F", 10), rep("M", 20)) )
> dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ sampleCondition)
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
some variables in design formula are characters, converting to factors
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1417 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> normcnts <- counts(dds, normalized=T)
> vs <- results(dds,contrast = c("sampleCondition", "F", "M"))
> all( rownames(vs) == rownames(normcnts) ) # TRUE
[1] TRUE
> vs <- cbind(genes=rownames(vs), as.data.frame(vs), normcnts)
> write.xlsx(vs, file="DESeq2_fulltable.xlsx", row.names=FALSE)
> vs[c("ENSG00000274049.4", "ENSG00000248871.1", "ENSG00000067048.16"),,drop=F]
genes baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000274049.4 ENSG00000274049.4 31.494597 -22.884342 3.128491 -7.314818 2.577305e-13 3.842762e-10
ENSG00000248871.1 ENSG00000248871.1 3.892869 -20.862786 3.133571 -6.657831 2.778978e-11 3.824729e-08
ENSG00000067048.16 ENSG00000067048.16 1111.375194 -3.870912 1.074958 -3.600988 3.170103e-04 2.181519e-01
F01 F02 F03 F04 F05 F06 F07 F08 F09 F10 M01 M02 M03 M04
ENSG00000274049.4 835.4399 0.00000 0.000000 0 0 0 0 0 0 0 0.000 318.2762 0.000 0.000
ENSG00000248871.1 0.0000 20.52886 0.000000 0 0 0 0 0 0 0 0.000 0.0000 0.000 0.000
ENSG00000067048.16 0.0000 0.00000 6.065433 0 0 0 0 0 0 0 1975.981 1804.6816 1355.837 1386.681
M05 M06 M07 M08 M09 M10 M11 M12 M13
ENSG00000274049.4 0.00000 298.5307 73.1588 0.00000 0.000 0.000 2.440631 0.000 0.000
ENSG00000248871.1 37.78198 0.0000 0.0000 79.00411 0.000 0.000 0.000000 0.000 0.000
ENSG00000067048.16 1997.18402 719.5798 1818.6417 1654.91244 1164.236 1558.293 1647.426216 1479.445 2173.444
M14 M15 M16 M17 M18 M19 M20
ENSG00000274049.4 778.514 0.000 0.000 0.000 0.0000 0.000 252.4316
ENSG00000248871.1 0.000 0.000 0.000 0.000 364.8889 0.000 0.0000
ENSG00000067048.16 1353.131 1896.412 2005.148 2492.951 1135.7470 1431.869 1194.8429
>
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.30.0 SummarizedExperiment_1.20.0 Biobase_2.50.0
[4] MatrixGenerics_1.2.0 matrixStats_0.57.0 GenomicRanges_1.42.0
[7] GenomeInfoDb_1.26.2 IRanges_2.24.1 S4Vectors_0.28.1
[10] BiocGenerics_0.36.0 openxlsx_4.2.3
loaded via a namespace (and not attached):
[1] genefilter_1.72.0 locfit_1.5-9.4 tidyselect_1.0.0 purrr_0.3.4
[5] splines_4.0.3 lattice_0.20-41 colorspace_1.4-1 vctrs_0.2.4
[9] blob_1.2.1 XML_3.99-0.3 survival_3.2-7 rlang_0.4.6
[13] pillar_1.4.4 glue_1.4.0 DBI_1.1.0 BiocParallel_1.24.1
[17] bit64_0.9-7 RColorBrewer_1.1-2 GenomeInfoDbData_1.2.3 lifecycle_0.2.0
[21] zlibbioc_1.36.0 munsell_0.5.0 gtable_0.3.0 zip_2.1.1
[25] memoise_1.1.0 geneplotter_1.68.0 AnnotationDbi_1.52.0 Rcpp_1.0.4.6
[29] xtable_1.8-4 scales_1.1.0 DelayedArray_0.16.0 annotate_1.68.0
[33] XVector_0.30.0 bit_1.1-15.2 ggplot2_3.3.0 digest_0.6.25
[37] stringi_1.4.6 dplyr_0.8.5 grid_4.0.3 tools_4.0.3
[41] bitops_1.0-6 magrittr_1.5 RCurl_1.98-1.2 RSQLite_2.2.0
[45] tibble_3.0.1 pkgconfig_2.0.3 crayon_1.3.4 ellipsis_0.3.0
[49] Matrix_1.2-18 assertthat_0.2.1 httr_1.4.1 rstudioapi_0.11
[53] R6_2.4.1 compiler_4.0.3
Thank you for your quick reply, Michael. I followed your suggestion and re-ran the analysis. The two previous genes, ENSG00000274049.4 and ENSG00000248871.1, were filtered out, but the problem still exists. I shared the updated output (DESeq2_fulltable_Jan4.2021.xlsx) in the previous Google Drive. I singled out two genes, ENSG00000183878.15 (UTY) and ENSG00000067048.16 (DDX3Y), to show the problem (Please refer to the picture below or the Excel sheet "genes").
For ENSG00000183878.15 (UTY), it's detected in 1 female with normalized count 6.5 and in all 20 males (min 319.8, max 1664.3, mean 844.9).
For ENSG00000067048.16 (DDX3Y), it's also detected in 1 female with normalized count 6.1 and in all 20 males (min 719.6, max 2493.0, mean 1612.3). The normalized counts for DDX3Y are higher than that of UTY in 19 out of 20 males.
Based on the normalized counts, I will expect the fold change (male vs. female) for DDX3Y is larger than that of UTY (because DDX3Y has larger expression values in male and smaller expression values in female compared to UTY). But it's quite the contrary - the log2FC for DDX3Y is 3.87 while that for UTY is 11.37. The padj for DDX3Y is 0.7997 while that for UTY is 1.096737e-137. It turns out that UTY is an extremely significant gene while DDX3Y is far from being a DEG, which does not make sense to me.
Could you please take a deeper look at this problem, Michael? Thank you!
This is solved with
minReplicatesForReplace=Inf
. The outlier replacement routine is impairing inference here, and its not needed after the pre-filtering step.You can remove the files from Drive if you want.
Perfect! Thank you!