DESeq Normalization Question
1
0
Entering edit mode
@stephen-turner-4916
Last seen 5.7 years ago
United States
Simon et al., I'm sure this issue has come up before, but I couldn't find an appropriate thread or answer either here or SEQanswers. What feature of the data or the distribution of counts among my samples can cause the sizeFactors to vary much more than the raw counts / library sizes? More detail: I'm using DESeq to analyze RNA-seq data mapped with STAR, counted with htseq-count. Comparing the "doubleTerm" samples to the "wt" samples, there are many genes that appear downregulated. While these samples were sequenced, on average, to a similar sequencing depth, the normalization factors are much smaller for WT, resulting in much larger normalized counts, resulting in more apparently downregulated genes in doubleTerm vs WT. > cds <- newCountDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory) > cds <- estimateSizeFactors(cds) > cds <- estimateDispersions(cds) > data.frame(sizefactors=sizeFactors(cds), rawcounts=colSums(counts(cds, normalized=FALSE))) sizefactors rawcounts S01_wt1 0.9016089 23466349 S02_wt2 0.7679168 22428603 S03_wt3 0.7952564 19841959 S04_wt4 0.7839629 18363384 S05_pten8w1 1.0301769 20859853 S06_pten8w2 0.9949514 16809588 S07_pten8w3 0.9425865 16731071 S08_pten22w1 1.0826846 18906329 S09_pten22w2 1.1640354 20164026 S10_pten22w3 1.0111748 17306468 S11_double8w1 0.7949001 17671986 S12_double8w2 1.4509978 23673557 S13_double8w3 1.1703853 22127841 S14_doubleterm2 1.0786455 19063010 S15_doubleterm4 1.1265935 19279814 S16_doubleterm6 1.3059472 22750403 Thank you. Stephen > sessionInfo() R version 3.0.0 (2013-04-03) Platform: x86_64-apple-darwin10.8.0 (64-bit) locale: [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9 Biobase_2.20.0 [5] BiocGenerics_0.6.0 edgeR_3.2.3 limma_3.16.2 BiocInstaller_1.10.1 loaded via a namespace (and not attached): [1] annotate_1.38.0 AnnotationDbi_1.22.3 DBI_0.2-6 DESeq2_1.0.9 [5] genefilter_1.42.0 geneplotter_1.38.0 GenomicRanges_1.12.2 grid_3.0.0 [9] IRanges_1.18.0 RColorBrewer_1.0-5 RSQLite_0.11.3 splines_3.0.0 [13] stats4_3.0.0 survival_2.37-4 tools_3.0.0 XML_3.95-0.2 [17] xtable_1.7-1
Normalization DESeq Normalization DESeq • 1.3k views
ADD COMMENT
0
Entering edit mode
@ryan-c-thompson-5618
Last seen 7 months ago
Scripps Research, La Jolla, CA
Roughly speaking, the size factors are calculated such that the MA plot between any two samples will be centered vertically on zero logFC. If you have just a few genes with high abundance in just a few samples, then the counts for all other genes in those samples will be depressed because the high-abundance genes will be taking a larger fraction of the more-or-less fixed number of total reads in the sample. Hence, to normalize the counts in these samples to others, they would have to be scaled up. One example where this could happen is if you sequenced healthy cells and virus-infected cells, where a majority of mRNA might be viral. On Fri 10 May 2013 09:38:30 AM PDT, Stephen Turner wrote: > Simon et al., > > I'm sure this issue has come up before, but I couldn't find an > appropriate thread or answer either here or SEQanswers. > > What feature of the data or the distribution of counts among my > samples can cause the sizeFactors to vary much more than the raw > counts / library sizes? > > More detail: I'm using DESeq to analyze RNA-seq data mapped with STAR, > counted with htseq-count. Comparing the "doubleTerm" samples to the > "wt" samples, there are many genes that appear downregulated. While > these samples were sequenced, on average, to a similar sequencing > depth, the normalization factors are much smaller for WT, resulting in > much larger normalized counts, resulting in more apparently > downregulated genes in doubleTerm vs WT. > >> cds <- newCountDataSetFromHTSeqCount(sampleTable=sampleTable, directory=directory) >> cds <- estimateSizeFactors(cds) >> cds <- estimateDispersions(cds) >> data.frame(sizefactors=sizeFactors(cds), rawcounts=colSums(counts(cds, normalized=FALSE))) > sizefactors rawcounts > S01_wt1 0.9016089 23466349 > S02_wt2 0.7679168 22428603 > S03_wt3 0.7952564 19841959 > S04_wt4 0.7839629 18363384 > S05_pten8w1 1.0301769 20859853 > S06_pten8w2 0.9949514 16809588 > S07_pten8w3 0.9425865 16731071 > S08_pten22w1 1.0826846 18906329 > S09_pten22w2 1.1640354 20164026 > S10_pten22w3 1.0111748 17306468 > S11_double8w1 0.7949001 17671986 > S12_double8w2 1.4509978 23673557 > S13_double8w3 1.1703853 22127841 > S14_doubleterm2 1.0786455 19063010 > S15_doubleterm4 1.1265935 19279814 > S16_doubleterm6 1.3059472 22750403 > > Thank you. > > Stephen > >> sessionInfo() > R version 3.0.0 (2013-04-03) > Platform: x86_64-apple-darwin10.8.0 (64-bit) > > locale: > [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 > > attached base packages: > [1] parallel stats graphics grDevices utils datasets > methods base > > other attached packages: > [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9 > Biobase_2.20.0 > [5] BiocGenerics_0.6.0 edgeR_3.2.3 limma_3.16.2 > BiocInstaller_1.10.1 > > loaded via a namespace (and not attached): > [1] annotate_1.38.0 AnnotationDbi_1.22.3 DBI_0.2-6 > DESeq2_1.0.9 > [5] genefilter_1.42.0 geneplotter_1.38.0 GenomicRanges_1.12.2 > grid_3.0.0 > [9] IRanges_1.18.0 RColorBrewer_1.0-5 RSQLite_0.11.3 > splines_3.0.0 > [13] stats4_3.0.0 survival_2.37-4 tools_3.0.0 > XML_3.95-0.2 > [17] xtable_1.7-1 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
ADD COMMENT
0
Entering edit mode
hi Stephen, Ryan description of what can cause differences here is right on. It might be worthwhile to look at boxplots of the log counts as well: counts <- counts(cds) boxplot(log10(counts[rowSums(counts) > 0,] + 1)) The estimateSizeFactors method is pretty simple to explain through code, it calls estimateSizeFactorsForMatrix which looks like: estimateSizeFactorsForMatrix <- function (counts, locfunc = median) { loggeomeans <- rowMeans(log(counts)) apply(counts, 2, function(cnts) exp(locfunc((log(cnts) - loggeomeans)[is.finite(loggeomeans)]))) } So the size factor is the median ratio of a sample over a pseudosample, which is the geometric mean of all the samples. You can examine these ratios on the log scale with an MA plot, as Ryan pointed out: loggeomeans <- rowMeans(log(counts)) sf <- estimateSizeFactorsForMatrix(counts) smoothScatter(.5*(log(counts[,1]) + loggeomeans), log(counts[,1]) - loggeomeans) abline(h=log(sf[1]),col="red") Mike On Sat, May 11, 2013 at 2:39 AM, Ryan C. Thompson <rct@thompsonclan.org>wrote: > Roughly speaking, the size factors are calculated such that the MA plot > between any two samples will be centered vertically on zero logFC. If you > have just a few genes with high abundance in just a few samples, then the > counts for all other genes in those samples will be depressed because the > high-abundance genes will be taking a larger fraction of the more- or-less > fixed number of total reads in the sample. Hence, to normalize the counts > in these samples to others, they would have to be scaled up. > > One example where this could happen is if you sequenced healthy cells and > virus-infected cells, where a majority of mRNA might be viral. > > > On Fri 10 May 2013 09:38:30 AM PDT, Stephen Turner wrote: > >> Simon et al., >> >> I'm sure this issue has come up before, but I couldn't find an >> appropriate thread or answer either here or SEQanswers. >> >> What feature of the data or the distribution of counts among my >> samples can cause the sizeFactors to vary much more than the raw >> counts / library sizes? >> >> More detail: I'm using DESeq to analyze RNA-seq data mapped with STAR, >> counted with htseq-count. Comparing the "doubleTerm" samples to the >> "wt" samples, there are many genes that appear downregulated. While >> these samples were sequenced, on average, to a similar sequencing >> depth, the normalization factors are much smaller for WT, resulting in >> much larger normalized counts, resulting in more apparently >> downregulated genes in doubleTerm vs WT. >> >> cds <- newCountDataSetFromHTSeqCount(**sampleTable=sampleTable, >>> directory=directory) >>> cds <- estimateSizeFactors(cds) >>> cds <- estimateDispersions(cds) >>> data.frame(sizefactors=**sizeFactors(cds), >>> rawcounts=colSums(counts(cds, normalized=FALSE))) >>> >> sizefactors rawcounts >> S01_wt1 0.9016089 23466349 >> S02_wt2 0.7679168 22428603 >> S03_wt3 0.7952564 19841959 >> S04_wt4 0.7839629 18363384 >> S05_pten8w1 1.0301769 20859853 >> S06_pten8w2 0.9949514 16809588 >> S07_pten8w3 0.9425865 16731071 >> S08_pten22w1 1.0826846 18906329 >> S09_pten22w2 1.1640354 20164026 >> S10_pten22w3 1.0111748 17306468 >> S11_double8w1 0.7949001 17671986 >> S12_double8w2 1.4509978 23673557 >> S13_double8w3 1.1703853 22127841 >> S14_doubleterm2 1.0786455 19063010 >> S15_doubleterm4 1.1265935 19279814 >> S16_doubleterm6 1.3059472 22750403 >> >> Thank you. >> >> Stephen >> >> sessionInfo() >>> >> R version 3.0.0 (2013-04-03) >> Platform: x86_64-apple-darwin10.8.0 (64-bit) >> >> locale: >> [1] en_US.UTF-8/en_US.UTF-8/en_US.**UTF-8/C/en_US.UTF-8/en_US.UTF-**8 >> >> attached base packages: >> [1] parallel stats graphics grDevices utils datasets >> methods base >> >> other attached packages: >> [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9 >> Biobase_2.20.0 >> [5] BiocGenerics_0.6.0 edgeR_3.2.3 limma_3.16.2 >> BiocInstaller_1.10.1 >> >> loaded via a namespace (and not attached): >> [1] annotate_1.38.0 AnnotationDbi_1.22.3 DBI_0.2-6 >> DESeq2_1.0.9 >> [5] genefilter_1.42.0 geneplotter_1.38.0 GenomicRanges_1.12.2 >> grid_3.0.0 >> [9] IRanges_1.18.0 RColorBrewer_1.0-5 RSQLite_0.11.3 >> splines_3.0.0 >> [13] stats4_3.0.0 survival_2.37-4 tools_3.0.0 >> XML_3.95-0.2 >> [17] xtable_1.7-1 >> >> ______________________________**_________________ >> Bioconductor mailing list >> Bioconductor@r-project.org >> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.e="" thz.ch="" mailman="" listinfo="" bioconductor=""> >> Search the archives: http://news.gmane.org/gmane.** >> science.biology.informatics.**conductor<http: news.gmane.org="" gmane="" .science.biology.informatics.conductor=""> >> > > ______________________________**_________________ > Bioconductor mailing list > Bioconductor@r-project.org > https://stat.ethz.ch/mailman/**listinfo/bioconductor<https: stat.et="" hz.ch="" mailman="" listinfo="" bioconductor=""> > Search the archives: http://news.gmane.org/gmane.** > science.biology.informatics.**conductor<http: news.gmane.org="" gmane.="" science.biology.informatics.conductor=""> > [[alternative HTML version deleted]]
ADD REPLY

Login before adding your answer.

Traffic: 813 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6