estimateDispersions in DESeq

0

Entering edit mode

Flavia Nunes ▴ 10

@flavia-nunes-5354

Last seen 9.6 years ago

Dear List, I am trying to use DESeq to analyse a dataset where we have samples of 3 healthy and 3 diseased microbial communities, and we are trying to establish which OTUs are significantly more or less abundant in the healthy vs diseased samples. I tried running the new version of DESeq (1.8.3) on both a Mac and a PC running the latest version of R (2.15). Both versions give a strange result, where all OTUs have a padj value that is >0.7. I found this to be strange, because when looking at the raw count data, it is obvious that some OTUs are abundant in the one treatment (say, high counts in all of the heathy samples) and absent in the other (0 or close to 0 on all of the diseased samples). I asked a colleague to help me with the analysis, and he ran the analysis on an older version of DESeq (1.4), using the estimateVarianceFunction command instead of estimateDispersions. We saw that in the help file for estimateDispersions, that by using the sharingMode="fit-only", fitType="local" options, we should be able to get the same result as the estimateVarianceFunction. However, this is not the case. DESeq 1.4 was able to find 54 OTUs that were significantly different from healthy vs diseased samples, while DESeq 1.8.3 found that none of the OTUs were significantly different in healthy vs diseased. In a second attempt, we used the option method="per-condition" and this worked - I got the same 54 significant p-values as in the analysis with DESeq 1.4 But when I continued the analysis on other datasets (we have a number of different conditions), I again started to get odd p-values, such as 1.00 for every OTU. I changed the setting for the estimateDispersions command, trying different methods, and each time I would get a different set of p-values, but usually very high numbers, close to 1. It seems to me that the results are really sensitive to the method used to estimate dispersions, and I was wondering what are the properties of the data that I might have to look for in order to select the best method. Another unusual thing that I have noticed is that when I plot the Dispersion Estimates, the fit line deviates from the points towards the right side of the graph. This suggests to me that there must be something wrong with the fit estimate, but I do not know how I might be able to change the settings to get a better fit. I wanted to know if anyone on the list has come across a similar problem? I am using the commands below in DESeq. I can provide files of the data, as well as the results that I am receiving to anyone that might be interested in taking a closer look. WBDCountTable <- read.table( file.choose(), header=TRUE, row.names=1 ) WBDDesign <- data.frame(row.names = colnames( WBDCountTable ), condition = c( "D1", "D2", "D3", "H1", "H2", "H3"), libType = c( "single-end", "single-end", "single-end", "single-end", "single-end", "single-end" ) ) conds <- factor( c( "D", "D", "D", "H", "H", "H" ) ) cds <- newCountDataSet( WBDCountTable, conds ) cds <- estimateSizeFactors( cds ) cds <- estimateDispersions( cds, method="per-condition", fitType="local" ) plotDispEsts <- function( cds ) { plot( rowMeans( counts( cds ) ), fitInfo(cds)$perGeneDispEsts, pch = 16, cex=1, log="xy" ) xg <- 10^seq( -.5, 5, length.out=300 ) lines( xg, fitInfo(cds)$dispFun( xg ), col="red" , lwd=3) } res <- nbinomTest( cds, "D", "H" ) plotDE <- function( res ) plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col = ifelse( res$padj < .1, "red", "black" ) ) plotDE( res ) res -- Flávia Nunes EMBO Postdoctoral Fellow Laboratory of Artificial & Natural Evolution Department of Genetics & Evolution University of Geneva Sciences III, 30 quai Ernest Ansermet 1211 Geneva 4 Switzerland <flavia.nunes@unige.ch> [[alternative HTML version deleted]]

Genetics graph DESeq Genetics graph DESeq • 2.0k views

ADD COMMENT • link updated 11.8 years ago by Wolfgang Huber ★ 13k • written 11.8 years ago by Flavia Nunes ▴ 10

0

Entering edit mode

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 10 days ago

EMBL European Molecular Biology Laborat…

Dear Flavia Thank you for your detailed and informative feedback! Simon will be better able to address the question regarding what exactly changed between the versions with respect to the different parameter choices (sharingMode, fitType, method). Two other comments: 1. The fact that you see such a strong dependence on these choices might be a symptom of there being outliers (either whole samples being outliers, or certain measurements in some samples). Outlier detection is generally difficult to automate, yet outliers can have an excessive impact on inference, especially with such small sample sizes. Did you have a look at the 'pairs' plot between all 6 communities (on a log or log-like scale)? 2. The dispersion-mean model was developed on RNA-Seq data. I have no experience how well it fits to OTU counts in metagenomics. Your observation on the plot may indicate a problem. If you don't mind, I'd be interested in having a look at your data (you can anonymize the OTUs) to see how well the dispersion-mean model used by DESeq fits that. Best wishes Wolfgang -> More below. -> Jun/21/12 2:50 PM, Flavia Nunes scripsit:: > Dear List, > > I am trying to use DESeq to analyse a dataset where we have samples of 3 > healthy and 3 diseased microbial communities, and we are trying to > establish which OTUs are significantly more or less abundant in the healthy > vs diseased samples. > > I tried running the new version of DESeq (1.8.3) on both a Mac and a PC > running the latest version of R (2.15). Both versions give a strange > result, where all OTUs have a padj value that is >0.7. I found this to be > strange, because when looking at the raw count data, it is obvious that > some OTUs are abundant in the one treatment (say, high counts in all of the > heathy samples) and absent in the other (0 or close to 0 on all of the > diseased samples). Due to the variance-sharing of 'genes' with similar mean counts, this could easily happen if there are *other* OTU with similar counts, but highly discordant values within groups, forcing a high dispersion estimate, and costing power even for those OTUs that you mention. To avoid the dispersion-mean model, if you are willing to make the jump: with 3 vs 3 samples you're in a place where sharingMode = "gene-est-only" might already be useful - but I would consider the above suggestions first. > > I asked a colleague to help me with the analysis, and he ran the analysis > on an older version of DESeq (1.4), using the estimateVarianceFunction > command instead of estimateDispersions. We saw that in the help file for > estimateDispersions, that by using the sharingMode="fit-only", > fitType="local" options, we should be able to get the same result as the > estimateVarianceFunction. However, this is not the case. DESeq 1.4 was > able to find 54 OTUs that were significantly different from healthy vs > diseased samples, while DESeq 1.8.3 found that none of the OTUs were > significantly different in healthy vs diseased. > > In a second attempt, we used the option method="per-condition" and this > worked - I got the same 54 significant p-values as in the analysis with > DESeq 1.4 But when I continued the analysis on other datasets (we have a > number of different conditions), I again started to get odd p-values, such > as 1.00 for every OTU. I changed the setting for the estimateDispersions > command, trying different methods, and each time I would get a different > set of p-values, but usually very high numbers, close to 1. > > It seems to me that the results are really sensitive to the method used to > estimate dispersions, and I was wondering what are the properties of the > data that I might have to look for in order to select the best method. > > Another unusual thing that I have noticed is that when I plot the > Dispersion Estimates, the fit line deviates from the points towards the > right side of the graph. This suggests to me that there must be something > wrong with the fit estimate, but I do not know how I might be able to > change the settings to get a better fit. > > I wanted to know if anyone on the list has come across a similar problem? > > I am using the commands below in DESeq. I can provide files of the data, > as well as the results that I am receiving to anyone that might be > interested in taking a closer look. > > WBDCountTable <- read.table( file.choose(), header=TRUE, row.names=1 ) > WBDDesign <- data.frame(row.names = colnames( WBDCountTable ), condition = > c( "D1", "D2", "D3", "H1", "H2", "H3"), libType = c( "single-end", > "single-end", "single-end", "single-end", "single-end", "single-end" ) ) > conds <- factor( c( "D", "D", "D", "H", "H", "H" ) ) > cds <- newCountDataSet( WBDCountTable, conds ) > cds <- estimateSizeFactors( cds ) > cds <- estimateDispersions( cds, method="per-condition", fitType="local" ) > > plotDispEsts <- function( cds ) > { > plot( > rowMeans( counts( cds ) ), > fitInfo(cds)$perGeneDispEsts, > pch = 16, cex=1, log="xy" ) > xg <- 10^seq( -.5, 5, length.out=300 ) > lines( xg, fitInfo(cds)$dispFun( xg ), col="red" , lwd=3) > } > > res <- nbinomTest( cds, "D", "H" ) > > plotDE <- function( res ) > plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.3, col = > ifelse( res$padj < .1, "red", "black" ) ) > plotDE( res ) > > res > > > > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- Best wishes Wolfgang Wolfgang Huber EMBL http://www.embl.de/research/units/genome_biology/huber

ADD COMMENT • link 11.8 years ago Wolfgang Huber ★ 13k

Login before adding your answer.