I have RNA-seq samples from yeast (~6000 genes), and there is very large variability in the total number of counts in each sample. In one of the samples a large number of genes have very low expression (949 genes with counts=0 and 3643 genes with counts<50). The size factor for this sample is much smaller than what would be expected simply from its total number of reads. Here is a plot of the inverse of the size factors calculated by DESeq2 versus the inverse of those based on the total number of counts, with the low-count sample circled in red:
I think that because this sample has many genes with few counts, the DESeq2 normalization, which takes the median of each gene's expression relative to its geometric mean across samples, yields a lower value than it should.
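For reference, here is a minimal base-R sketch of the median-of-ratios calculation as I understand it. The count matrix is a made-up toy example (not my data), and `median_ratio_size_factors` is a name I chose for this sketch, not a DESeq2 function.

```r
# Toy count matrix (invented numbers): rows are genes, columns are samples.
counts <- matrix(
  c(100, 200,  50,
    400, 800, 200,
     10,  20,   0,
     60, 120,  30),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper: median-of-ratios size factors in base R.
median_ratio_size_factors <- function(counts) {
  log_counts <- log(counts)                 # log(0) is -Inf
  log_geo_means <- rowMeans(log_counts)     # -Inf for any gene with a zero count
  usable <- is.finite(log_geo_means)        # such genes are left out of the median
  apply(log_counts, 2, function(sample_log_counts) {
    # Median log-ratio of this sample to the per-gene geometric mean
    exp(median(sample_log_counts[usable] - log_geo_means[usable]))
  })
}

size_factors <- median_ratio_size_factors(counts)
print(size_factors)
```

In this sketch a gene with a zero in any sample drops out of the median, but a gene with small nonzero counts still contributes a ratio, which is why I suspect the low-count genes matter.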
I repeated the calculation of size factors using only the genes that had more than 50 counts in every sample. This had a large impact on the size factor of the low-count sample, shifting it from ~0.2 to ~0.5, and also made the correlation with the total number of counts look much better:
Should the DESeq2 normalization always ignore genes with a low number of counts?
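To make the comparison concrete, here is a sketch of the filtering I describe above, again in base R on an invented toy matrix rather than my real counts. The helper name `size_factors_for` and the numbers are assumptions for illustration only. The toy data are built so that the low-expressed genes drop much more sharply in sample `s3` than the high-expressed genes do.

```r
# Toy counts (invented): three high-expressed and four low-expressed genes.
# In s3 the high genes are halved, the low genes fall to 1/20 of s1.
counts <- cbind(
  s1 = c(1000, 2000, 4000, 100, 200, 160, 120),
  s2 = c(1000, 2000, 4000, 100, 200, 160, 120),
  s3 = c( 500, 1000, 2000,   5,  10,   8,   6)
)
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# Hypothetical helper: median-of-ratios size factors in base R.
size_factors_for <- function(m) {
  log_geo_means <- rowMeans(log(m))
  usable <- is.finite(log_geo_means)
  apply(log(m), 2, function(x) exp(median(x[usable] - log_geo_means[usable])))
}

# Size factors from all genes
sf_all <- size_factors_for(counts)

# Size factors from genes with more than 50 counts in every sample
keep <- apply(counts, 1, min) > 50
sf_filtered <- size_factors_for(counts[keep, , drop = FALSE])

# Library-size scaling for comparison, centred on the geometric mean
depth <- colSums(counts)
sf_depth <- depth / exp(mean(log(depth)))

print(rbind(all_genes = sf_all, filtered = sf_filtered, total_counts = sf_depth))
```

In this toy case the median for `s3` falls on a low-count gene when all genes are used, so its size factor is far below the library-size value, and filtering brings it much closer. Whether that is the right correction for real data is exactly what I am asking.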
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

locale:
 LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8
 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
 parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages:
 zoo_1.7-12 GenomicFeatures_1.20.6 AnnotationDbi_1.30.1 Biobase_2.28.0 BiocParallel_1.2.22
 amap_0.8-14 matrixStats_0.50.1 DESeq2_1.8.2 RcppArmadillo_0.6.600.4.0 Rcpp_0.12.4
 GenomicRanges_1.20.8 GenomeInfoDb_1.4.3 IRanges_2.2.9 S4Vectors_0.6.6 BiocGenerics_0.14.0
 ggplot2_2.1.0 gplots_3.0.1 RColorBrewer_1.1-2 dplyr_0.4.3

loaded via a namespace (and not attached):
 genefilter_1.50.0 gtools_3.5.0 locfit_1.5-9.1 splines_3.2.1 lattice_0.20-33
 colorspace_1.2-6 rtracklayer_1.28.10 survival_2.38-3 XML_3.98-1.4 foreign_0.8-66
 DBI_0.3.1 lambda.r_1.1.7 plyr_1.8.3 zlibbioc_1.14.0 Biostrings_2.36.4
 munsell_0.4.3 gtable_0.2.0 futile.logger_1.4.1 caTools_1.17.1 latticeExtra_0.6-28
 biomaRt_2.24.1 geneplotter_1.46.0 acepack_1.3-3.3 KernSmooth_2.23-15 xtable_1.8-2
 scales_0.4.0 gdata_2.17.0 Hmisc_3.17-2 annotate_1.46.1 XVector_0.8.0
 Rsamtools_1.20.5 gridExtra_2.2.1 grid_3.2.1 tools_3.2.1 bitops_1.0-6
 magrittr_1.5 RCurl_1.95-4.8 RSQLite_1.0.0 Formula_1.2-1 cluster_2.0.3
 futile.options_1.0.0 assertthat_0.1 R6_2.1.2 rpart_4.1-10 GenomicAlignments_1.4.2
 nnet_7.3-12