Something strange in the ranking of genes by DESeq2
1
0
Entering edit mode
Dejian Zhao ▴ 10
@dejian-zhao
Last seen 3.2 years ago

I ran the following code on a RNAseq dataset and found a strange problem in the result. The raw data set, the code, and the DESeq2 output were shared in a Google Drive.

In the DESeq2 output, I ranked the genes by the ascending order of the column "padj". So, the most significant genes are at the top of the list. The comparison is between 10 females vs. 20 males. I appended the normalized read counts of the 30 samples, calculated using counts(dds, normalized=T), to the DESeq2 statistics. When we examined the gene list, we found something strange. Three genes, ENSG00000274049.4, ENSG00000248871.1, and ENSG00000067048.16 (shaded in the Excel file in the Google Drive), came to our attention.

For ENSG00000067048.16, the normalized read counts for female is small (nine 0 and one 6.065) while that for male are large (min 719, mean 1612, most >1000). Based on the normalized counts, this genes looks like a DEG. Biologically, ENSG00000067048.16 is a Y-linked gene (DDX3Y) and it makes good sense that this gene is a DEG between males and females. However, it did not pass the statistical test.

On the contrary, for ENSG00000274049.4 and ENSG00000248871.1, their expression was detected in one female (similar to ENSG00000067048.16, but with higher expression value) and in 3 or 6 males (in fewer males and lower expression than ENSG00000067048.16). Based on the normalized counts, these two genes are less likely to be DEGs than ENSG00000067048.16. However, it is quite the contrary.

I tried two versions of DESeq2 (v1.30.0 and v1.22.1), and obtained similar results. How to explain the contradiction between DESeq2 statistics and the normalized read counts? How to improve the ranking of the genes? Thank you!


# R code and output

> countData <- read.csv("test_read_counts.csv", header = T, row.names = 1)
> colData <- data.frame(row.names = colnames(countData), sampleCondition = c(rep("F", 10), rep("M", 20)) )
> dds <- DESeqDataSetFromMatrix(countData = countData,colData = colData,design = ~ sampleCondition)
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
> dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1417 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> normcnts <- counts(dds, normalized=T)
> vs <- results(dds,contrast = c("sampleCondition", "F", "M"))
> all( rownames(vs) == rownames(normcnts) ) # TRUE
[1] TRUE
> vs <- cbind(genes=rownames(vs), as.data.frame(vs), normcnts)
> write.xlsx(vs, file="DESeq2_fulltable.xlsx", row.names=FALSE)
> vs[c("ENSG00000274049.4", "ENSG00000248871.1", "ENSG00000067048.16"),,drop=F]
                                genes    baseMean log2FoldChange    lfcSE      stat       pvalue         padj
ENSG00000274049.4   ENSG00000274049.4   31.494597     -22.884342 3.128491 -7.314818 2.577305e-13 3.842762e-10
ENSG00000248871.1   ENSG00000248871.1    3.892869     -20.862786 3.133571 -6.657831 2.778978e-11 3.824729e-08
ENSG00000067048.16 ENSG00000067048.16 1111.375194      -3.870912 1.074958 -3.600988 3.170103e-04 2.181519e-01
                        F01      F02      F03 F04 F05 F06 F07 F08 F09 F10      M01       M02      M03      M04
ENSG00000274049.4  835.4399  0.00000 0.000000   0   0   0   0   0   0   0    0.000  318.2762    0.000    0.000
ENSG00000248871.1    0.0000 20.52886 0.000000   0   0   0   0   0   0   0    0.000    0.0000    0.000    0.000
ENSG00000067048.16   0.0000  0.00000 6.065433   0   0   0   0   0   0   0 1975.981 1804.6816 1355.837 1386.681
                          M05      M06       M07        M08      M09      M10         M11      M12      M13
ENSG00000274049.4     0.00000 298.5307   73.1588    0.00000    0.000    0.000    2.440631    0.000    0.000
ENSG00000248871.1    37.78198   0.0000    0.0000   79.00411    0.000    0.000    0.000000    0.000    0.000
ENSG00000067048.16 1997.18402 719.5798 1818.6417 1654.91244 1164.236 1558.293 1647.426216 1479.445 2173.444
                        M14      M15      M16      M17       M18      M19       M20
ENSG00000274049.4   778.514    0.000    0.000    0.000    0.0000    0.000  252.4316
ENSG00000248871.1     0.000    0.000    0.000    0.000  364.8889    0.000    0.0000
ENSG00000067048.16 1353.131 1896.412 2005.148 2492.951 1135.7470 1431.869 1194.8429
>
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] DESeq2_1.30.0               SummarizedExperiment_1.20.0 Biobase_2.50.0             
 [4] MatrixGenerics_1.2.0        matrixStats_0.57.0          GenomicRanges_1.42.0       
 [7] GenomeInfoDb_1.26.2         IRanges_2.24.1              S4Vectors_0.28.1           
[10] BiocGenerics_0.36.0         openxlsx_4.2.3             

loaded via a namespace (and not attached):
 [1] genefilter_1.72.0      locfit_1.5-9.4         tidyselect_1.0.0       purrr_0.3.4           
 [5] splines_4.0.3          lattice_0.20-41        colorspace_1.4-1       vctrs_0.2.4           
 [9] blob_1.2.1             XML_3.99-0.3           survival_3.2-7         rlang_0.4.6           
[13] pillar_1.4.4           glue_1.4.0             DBI_1.1.0              BiocParallel_1.24.1   
[17] bit64_0.9-7            RColorBrewer_1.1-2     GenomeInfoDbData_1.2.3 lifecycle_0.2.0       
[21] zlibbioc_1.36.0        munsell_0.5.0          gtable_0.3.0           zip_2.1.1             
[25] memoise_1.1.0          geneplotter_1.68.0     AnnotationDbi_1.52.0   Rcpp_1.0.4.6          
[29] xtable_1.8-4           scales_1.1.0           DelayedArray_0.16.0    annotate_1.68.0       
[33] XVector_0.30.0         bit_1.1-15.2           ggplot2_3.3.0          digest_0.6.25         
[37] stringi_1.4.6          dplyr_0.8.5            grid_4.0.3             tools_4.0.3           
[41] bitops_1.0-6           magrittr_1.5           RCurl_1.98-1.2         RSQLite_2.2.0         
[45] tibble_3.0.1           pkgconfig_2.0.3        crayon_1.3.4           ellipsis_0.3.0        
[49] Matrix_1.2-18          assertthat_0.2.1       httr_1.4.1             rstudioapi_0.11       
[53] R6_2.4.1               compiler_4.0.3
DESeq2 • 786 views
ADD COMMENT
2
Entering edit mode
@mikelove
Last seen 1 hour ago
United States

The number of outliers reported is indicating that you'd benefit from more strict pre-filtering. It will help both genes with spiky single large counts, and with the multiple testing burden.

I'd recommend:

keep <- rowSums(counts(dds) >= 10) >= 8
dds <- dds[keep,]
dds <- DESeq(dds, test="LRT", reduced=~1)
res <- results(dds)
ADD COMMENT
0
Entering edit mode

Thank you for your quick reply, Michael. I followed your suggestion and re-ran the analysis. The two previous genes, ENSG00000274049.4 and ENSG00000248871.1, were filtered out, but the problem still exists. I shared the updated output (DESeq2_fulltable_Jan4.2021.xlsx) in the previous Google Drive. I singled out two genes, ENSG00000183878.15 (UTY) and ENSG00000067048.16 (DDX3Y), to show the problem (Please refer to the picture below or the Excel sheet "genes").

For ENSG00000183878.15 (UTY), it's detected in 1 female with normalized count 6.5 and in all 20 males (min 319.8, max 1664.3, mean 844.9).

For ENSG00000067048.16 (DDX3Y), it's also detected in 1 female with normalized count 6.1 and in all 20 males (min 719.6, max 2493.0, mean 1612.3). The normalized counts for DDX3Y are higher than that of UTY in 19 out of 20 males.

Two typical genes showing the ranking problem

Based on the normalized counts, I will expect the fold change (male vs. female) for DDX3Y is larger than that of UTY (because DDX3Y has larger expression values in male and smaller expression values in female compared to UTY). But it's quite the contrary - the log2FC for DDX3Y is 3.87 while that for UTY is 11.37. The padj for DDX3Y is 0.7997 while that for UTY is 1.096737e-137. It turns out that UTY is an extremely significant gene while DDX3Y is far from being a DEG, which does not make sense to me.

Could you please take a deeper look at this problem, Michael? Thank you!


# Re-run the analysis after filtering the data

> countData <- read.csv("../test_read_counts.csv", header = T, row.names = 1)
> GTFinfo    = read.table(file="hg38_gencode_v27_GTF_info.txt", header = T)
> colData <- data.frame(row.names = colnames(countData), sampleCondition = c(rep("F", 10), rep("M", 20)) )
> dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ sampleCondition)
> GTFinfo<-GTFinfo[match(rownames(dds), GTFinfo$GeneVer),]
> all(rownames(dds) == GTFinfo$GeneVer)
[1] TRUE
> rownames(GTFinfo)<-GTFinfo$GeneVer
> mcols(dds) <- DataFrame(mcols(dds), GTFinfo)
> keep <- rowSums(counts(dds) >= 10) >= 8
> dds <- dds[keep,]
> dds <- DESeq(dds, test="LRT", reduced=~1)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 285 genes
-- DESeq argument 'minReplicatesForReplace' = 7 
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> res <- results(dds)
> normcnts <- counts(dds, normalized=T)
> res <- cbind(GTFinfo[rownames(res),c("GeneType","GeneName")], as.data.frame(res), normcnts)
> res[c("ENSG00000183878.15", "ENSG00000067048.16"),,drop=F]
                         GeneType GeneName  baseMean log2FoldChange     lfcSE       stat        pvalue          padj F01
ENSG00000183878.15 protein_coding      UTY  563.4732       11.37203 0.5566464 641.648757 1.463389e-141 1.096737e-137   0
ENSG00000067048.16 protein_coding    DDX3Y 1111.3864        3.87080 1.0749658   8.509714  3.532556e-03  7.997314e-01   0
                   F02      F03 F04 F05 F06 F07 F08      F09 F10      M01       M02       M03       M04       M05      M06
ENSG00000183878.15   0 0.000000   0   0   0   0   0 6.542583   0 1035.212  777.7389  417.3154  667.5174  961.2629 964.8319
ENSG00000067048.16   0 6.064568   0   0   0   0   0 0.000000   0 1975.756 1804.4883 1355.7902 1386.5047 1997.1313 719.5795
                        M07      M08       M09       M10      M11       M12       M13       M14       M15       M16
ENSG00000183878.15 1084.470  937.316  793.7273  836.2263 1085.177  788.4023  792.8797  760.0187  768.2168  785.6713
ENSG00000067048.16 1818.638 1654.911 1164.5196 1558.4218 1647.907 1479.4448 2173.4442 1353.2040 1896.4079 2004.7481
                        M17       M18       M19       M20
ENSG00000183878.15 1664.332  824.0207  633.4982  319.8196
ENSG00000067048.16 2492.957 1135.7470 1431.9451 1195.1155
ADD REPLY
2
Entering edit mode

This is solved with minReplicatesForReplace=Inf. The outlier replacement routine is impairing inference here, and its not needed after the pre-filtering step.

 > res[c("ENSG00000183878.15", "ENSG00000067048.16"),]
 log2 fold change (MLE): sampleCondition M vs F
 LRT p-value: '~ sampleCondition' vs '~ 1'
 DataFrame with 2 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat       pvalue         padj
                    <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
 ENSG00000183878.15   563.473        11.3720  0.556646   641.649 1.46339e-141 1.08123e-137
 ENSG00000067048.16  1075.091        10.4764  0.458752   351.760  1.75355e-78  3.70175e-75

You can remove the files from Drive if you want.

ADD REPLY
0
Entering edit mode

Perfect! Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 762 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6