"Over-correction" in the size-factors of the DESeq2 package
2
0
Entering edit mode
@gilhornung-9503
Last seen 3.0 years ago

Hi,

I have RNA-seq samples from yeast (~6000 genes), and there is very large variability in the total number of counts per sample. In one of the samples I have a large number of genes with very low expression (949 genes with counts = 0 and 3643 genes with counts < 50). The size factor for this sample is much smaller than what would be expected simply from the total number of reads. Here is a plot of the inverse of the size factors, calculated by DESeq vs. from the total number of counts; the sample with the low number of counts is circled in red:

I think that because this sample has many genes with few counts, the DESeq normalization, which takes the median of the relative expression levels, arrives at a lower value than it should.

I repeated the calculation of size factors, but used only the genes that had more than 50 counts in all samples. This indeed had a large impact on the size factor of the sample with the low number of counts, shifting it from ~0.2 to ~0.5, and it also made the correlation with the total number of counts look much better:

Do you suppose that the DESeq2 normalization should always ignore genes with a low number of counts?

Gil

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.5 (Santiago)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8
[6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] zoo_1.7-12                GenomicFeatures_1.20.6    AnnotationDbi_1.30.1      Biobase_2.28.0            BiocParallel_1.2.22
[6] amap_0.8-14               matrixStats_0.50.1        DESeq2_1.8.2              RcppArmadillo_0.6.600.4.0 Rcpp_0.12.4
[11] GenomicRanges_1.20.8      GenomeInfoDb_1.4.3        IRanges_2.2.9             S4Vectors_0.6.6           BiocGenerics_0.14.0
[16] ggplot2_2.1.0             gplots_3.0.1              RColorBrewer_1.1-2        dplyr_0.4.3

loaded via a namespace (and not attached):
[1] genefilter_1.50.0       gtools_3.5.0            locfit_1.5-9.1          splines_3.2.1           lattice_0.20-33
[6] colorspace_1.2-6        rtracklayer_1.28.10     survival_2.38-3         XML_3.98-1.4            foreign_0.8-66
[11] DBI_0.3.1               lambda.r_1.1.7          plyr_1.8.3              zlibbioc_1.14.0         Biostrings_2.36.4
[16] munsell_0.4.3           gtable_0.2.0            futile.logger_1.4.1     caTools_1.17.1          latticeExtra_0.6-28
[21] biomaRt_2.24.1          geneplotter_1.46.0      acepack_1.3-3.3         KernSmooth_2.23-15      xtable_1.8-2
[26] scales_0.4.0            gdata_2.17.0            Hmisc_3.17-2            annotate_1.46.1         XVector_0.8.0
[31] Rsamtools_1.20.5        gridExtra_2.2.1         grid_3.2.1              tools_3.2.1             bitops_1.0-6
[36] magrittr_1.5            RCurl_1.95-4.8          RSQLite_1.0.0           Formula_1.2-1           cluster_2.0.3
[41] futile.options_1.0.0    assertthat_0.1          R6_2.1.2                rpart_4.1-10            GenomicAlignments_1.4.2
[46] nnet_7.3-12  
deseq2 rna-seq normalization
@mikelove
Last seen 20 hours ago
United States

The median ratio method is usually quite robust. The biggest problem for library size correction is not actually the genes with low counts, but the genes with very high counts, which have high variance and disproportionately influence estimators like the total sum. This is why the total sum is a notoriously bad estimator for library size.
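For intuition, the median-of-ratios estimator can be sketched in a few lines of base R on a toy matrix (hypothetical numbers; DESeq2's estimateSizeFactors does essentially this, restricted to genes with nonzero counts in all samples):

```r
# Toy counts, 5 genes x 2 samples: sample 2 is sequenced at 2x depth, except
# for one very highly expressed gene that would distort a total-count estimate.
counts <- matrix(c(  10,    20,
                    100,   200,
                     50,   100,
                    200,   400,
                   5000, 40000),  # outlier gene: an 8x ratio, not 2x
                 ncol = 2, byrow = TRUE)

# Median-of-ratios: the reference is the per-gene geometric mean across
# samples; each sample's size factor is the median of its counts over
# that reference.
loggeomeans <- rowMeans(log(counts))
sf <- apply(counts, 2, function(s) exp(median(log(s) - loggeomeans)))

sf[2] / sf[1]                            # 2: the outlier gene is ignored
colSums(counts)[2] / colSums(counts)[1]  # ~7.6: total count is dragged off by it
```

The median simply passes over the handful of extreme ratios, which is why a few very highly expressed genes do not move the estimate the way they move the column sums.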

Can you report back this:

round(quantile(rowMeans(counts(dds)), 0:10/10))

Also, can you try:

dds <- estimateSizeFactors(dds, type="iterate")
sf2 <- sizeFactors(dds)

And then compare this size factor estimate with the others?


Wow, fast answer, Michael!

Here is what you asked for:

> round(quantile(rowMeans(counts(dds)), 0:10/10))
0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
0     7    51    93   146   219   314   461   705  1278 30096 

the "regular" size factor of the problematic sample:

0.2152499

the size factor with type="iterate" (the command took ages to run):

0.2808473

Not much difference there...


I would trust the normal size factors. Basically, the x-axis in your plot is a really bad estimator according to the RNA-seq literature, so getting close to the y=x line is not necessarily progress.

Given the global profile of counts, this sample might be worth removing due to technical problems in generating the reads. Have you done QC plots, e.g. PCA, to see if it is an outlier there? If you have sufficient replicates for all conditions, you may be better off removing it.


Hi Michael,

To the best of my understanding (and you are welcome to correct me), the DESeq normalization is there to control for changes in genes that take up most of the counts and can therefore bias the total number of counts. I am not too concerned about this in this sample, because the most highly expressed gene takes up only ~2.8% of the counts. However, the fact that ~16% of the genes in this sample have zero counts seems like a more detrimental effect. The large deviation of this specific sample from the y=x line, while the other points remain close to the line, is also suspicious.

This sample did not seem too different on the PCA / clustering, and I only have 2 replicates (unfortunately, I didn't design the experiment but still have to analyse it). That's why I chose to keep it. But maybe it is best to disregard it and put all of the uncertainty behind us...


I'll just say again, the x-axis is typically a bad estimator of library size. It will be ok for some samples but worse for others. So just guessing I'd say the x-axis (total count) is really bad for that one sample.

What you have observed is that removing more than half the genes (3643/6000) and then obtaining size factors gives you an estimate similar to the total count (typically a bad estimator), but for me that is not evidence that this is a good approach. Do you see what I mean? Correlation to bad is not necessarily a desirable property.

A better way to figure out which might be a better vector of size factors is to look at MA plots for pairs of samples where one of the pair is that sample in question:

library(airway)
data(airway)
library(DESeq2)
x <- assay(airway)
sf <- estimateSizeFactorsForMatrix(x) # or however you want to estimate this

# MA-style plot for a pair of samples: x-axis is the product of the two
# samples' counts, y-axis their ratio. The red line marks the ratio of the
# two size factors; a good pair of size factors should run through the
# center of the point cloud.
plot(x[,1] * x[,2], x[,2] / x[,1], log = "xy")
abline(h = sf[2] / sf[1], col = "red")


If you only have two replicates, I would keep that sample if possible.


Thanks Michael.

I understand that "correlation to bad is not a desirable property"; however, I am concerned that one of the assumptions behind the DESeq normalization, namely that the expression of most genes is reliable, does not hold in this case.

Below is the MA plot. The red line is the size-factor ratio from the regular DESeq normalization, the blue from the DESeq normalization using only genes with min count > 50, and the green from the normalization based on total counts.

The sample with the low number of counts is 194L_S_1.


It's hard to tell, but you may be right that the default SF is not good for all genes here.

For me the red line is more in the center in the range 1e3 to 1e4, and the blue is better for the range 1e6 to 1e8. The green looks too low for most of the range, and it seems to be mostly affected by a few genes in the 1e8 range.

The counts for the problematic sample are a little less than the square root of these x-axis values, so red is better for counts in the range 30-100. The counts in this range are certainly reliable enough for calculating size factors, and there are many such genes, which is good. It's just that including these genes gives a different number than looking only at the highest-expressed ones.
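The arithmetic here, as a quick sanity check with toy numbers: the x-axis of the plot is the product of the two samples' counts, so its square root is their geometric mean, and the lower-depth sample's count sits below that by the square root of the count ratio:

```r
c1 <- 50       # count of a gene in the low-depth sample
c2 <- 200      # count of the same gene in a 4x-deeper sample
A  <- c1 * c2  # the x-axis value in the MA plot: 10000
sqrt(A)        # geometric mean of the two counts: 100
c1 / sqrt(A)   # sqrt(c1 / c2) = 0.5, i.e. "a little less than sqrt()"
```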

You can choose whichever size-factor vector you think makes sense for this dataset. It is a bit problematic when the range of library sizes is so large (it looks like it covers a ~16-fold range from smallest to largest?). You might try to find out how to avoid such a range in future experiments.


Thank you, Michael.

I learned from this discussion.

Indeed, the experiment was very noisy, with large variation in the number of reads. I'm just trying to salvage what I can. And, of course, I will take the results with a grain of salt.


To get a better idea of where the majority of genes lie in this plot, I would regenerate the MA plot using smoothScatter. But I agree with Michael that the red line seems to fit better to the lower counts while the blue one fits better to the higher counts. Basically, it seems the optimal normalization would trace a downward-sloping line on the MA plot, rather than a horizontal line.

As such, you might want to try something more complex than a scaling normalization. You could try full between-sample quantile normalization from the EDASeq package. If you do, make sure to use betweenLaneNormalization with offset=TRUE and pass those offsets to DESeq2 (I don't remember exactly how to do this, but I believe it is documented in the manual).
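For intuition, here is a base-R illustration of what full-quantile normalization does, on a hypothetical toy matrix (EDASeq's betweenLaneNormalization with which="full" implements this properly, and the offset=TRUE route is what you would use in practice with DESeq2):

```r
# Toy counts, 4 genes x 2 samples; sample 2 is deeper, with a distorted tail.
counts <- matrix(c(  2,  10,
                    10,  40,
                    50, 210,
                   100, 390), ncol = 2, byrow = TRUE)

# Full-quantile normalization: replace each sample's counts, rank for rank,
# with the mean of the sorted counts across samples.
ranks  <- apply(counts, 2, rank, ties.method = "first")
sorted <- apply(counts, 2, sort)
target <- rowMeans(sorted)   # the common reference distribution
normed <- apply(ranks, 2, function(r) target[r])

normed  # both columns now have exactly the same distribution of values
```

Unlike a single scaling factor, this maps every quantile of each sample onto a common distribution, which is what you want if the required correction changes with expression level.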


What is a good range for size factors? In the manual I saw that they should be near 1. How much can they vary?


There isn't a good range really.

If the range is large, it means that some samples were sequenced very little and others much more. You wouldn't want sequencing depth to be confounded with the biological condition, though.

They will be centered around 1 by construction.