Question

EdgeR GOF non-linear

0

Entering edit mode

Thomas Frederick Willems ▴ 20

@thomas-frederick-willems-5527

Last seen 10.6 years ago

Hi Gordon, I've been trying to use edgeR to analyze some mRNA-seq data. I've begun looking at a data set where the cells are either unstimulated or stimulated with one single treatment and I have three replicates for each class. When I use the GLM framework with tagwise dispersion to fit a model for the effects of the single treatment and plot the GOF, it is distinctly non-linear. When I remove one replicate from each class by looking at the replicate most distant from its counterparts in the plotMDS graph, I obtain a plot that appears more linear but still deviates from the diagonal. I've attached these gof plot as well as the results from plotBCV for the situation in which 3 replicates were used. I have a few questions regarding this plot: 1) Do the combination of gof plots suggest that there are gross batch differences between the replicates, or have you seen other RNA-seq datasets with a fit similar to the first gof plot? 2) Is the fit sufficiently poor that any conclusions drawn from the GLM coefficients are likely of little use? 3) In another dataset I'm analyzing, the line also lies below the y=x plot but is linear and has the same slope. What sort of issue might this suggest? My sequencing depths for the samples, as given by samples.libsize, are: 26693672 61963337 64385230 65093734 79193403 57877925 My code is as follows: library(edgeR) # File containing file names where sample counts can be read targets <- read.delim("/home/twillems/Desktop/IL6-paired- end/GLM/files.txt", stringsAsFactors = FALSE) treatment <- c(0,0,0,1,1,1) design <- model.matrix(~treatment) y <- readDGE(targets) colnames(y) <- y$samples[['description']] min_counts_per_million <- 0.2 min_samples <- 3 y <- y[rowSums(cpm(y) > min_counts_per_million) >= nrow(y$samples), ] y$samples$lib.size <- colSums(y$counts) y <- calcNormFactors(y) y <- estimateGLMCommonDisp(y, design) y <- estimateGLMTrendedDisp(y, design) y <- estimateGLMTagwiseDisp(y,design) plotBCV(y) plotMDS(y) fit1 <- glmFit(y, design) g <- gof(fit1, plot=TRUE) And the result of sessionInfo(): R version 2.14.1 (2011-12-22) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8 attached base packages: [1] splines stats graphics grDevices utils datasets methods [8] base other attached packages: [1] edgeR_2.6.12 limma_3.10.3 rkward_0.5.6 loaded via a namespace (and not attached): [1] tools_2.14.1 Thanks for your help Thomas Willems -------------- next part -------------- A non-text attachment was scrubbed... Name: gof plot_all_3_rep.PNG Type: image/png Size: 36903 bytes Desc: gof plot_all_3_rep.PNG URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20121019="" 9d926986="" attachment.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: BCV plot_all_3_rep.PNG Type: image/png Size: 31803 bytes Desc: BCV plot_all_3_rep.PNG URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20121019="" 9d926986="" attachment-0001.png=""> -------------- next part -------------- A non-text attachment was scrubbed... Name: gof plot_2_rep.PNG Type: image/png Size: 35610 bytes Desc: gof plot_2_rep.PNG URL: <https: stat.ethz.ch="" pipermail="" bioconductor="" attachments="" 20121019="" 9d926986="" attachment-0002.png="">

Sequencing graph edgeR Sequencing graph edgeR • 1.4k views

ADD COMMENT • link updated 12.5 years ago by Gordon Smyth 52k • written 12.5 years ago by Thomas Frederick Willems ▴ 20

score 0 · Answer 1 · 2012-10-20

Dear Thomas, Could you please upgrade to the current Bioconductor release? Your questions might not go way, but you will find some improvements and, as a general principle, I don't like to spend time discussing older software. I think you're over-interpreting the gof plots. You worry that the second gof plot looks nonlinear, but to me it's beautifully linear, or as close to linear as real data gets. The gof plots are intended to judge whether the prior.df parameter has been chosen too large or small. They are not intended to judge whether you have outlier samples -- this is better done from the MDS plot. They may help to judge whether you have outlier genes -- but this is more clearly done from the BCV plot. The most notable feature of your data appears to be that there are five outlier genes, as shown clearly by the BCV plot. To my mind, the appropriate next step would be to identify those genes, to look at the counts for these genes, to try to identify why they are outliers, and to note which samples they are outliers in. In other words, the analysis has identified features of the biology and the experimental procedure that require more thought and attention. It is these outliers that cause the slight nonlinearity on the far right on your gof plots. We have some methods under development that can automatically identify and downweight outlier genes but, with the current software, an alternative strategy might be to simply filter out the 5 outlier transcripts and repeat the analysis without them. You might also consider a smaller value for prior.df, although the default value might be fine with the outliers gone. Best wishes Gordon PS. I feel that I have already answered your third question, see: https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-October/0485 04.html If you want me to comment more on this, you would need to respond to my earlier comments. --------------------------------------------- Professor Gordon K Smyth, Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, 1G Royal Parade, Parkville, Vic 3052, Australia. http://www.statsci.org/smyth On Fri, 19 Oct 2012, Thomas Frederick Willems wrote: > Hi Gordon, > > I've been trying to use edgeR to analyze some mRNA-seq data. I've begun > looking at a data set where the cells are either unstimulated or > stimulated with one single treatment and I have three replicates for > each class. When I use the GLM framework with tagwise dispersion to fit > a model for the effects of the single treatment and plot the GOF, it is > distinctly non-linear. When I remove one replicate from each class by > looking at the replicate most distant from its counterparts in the > plotMDS graph, I obtain a plot that appears more linear but still > deviates from the diagonal. > I've attached these gof plot as well as the results from plotBCV for the > situation in which 3 replicates were used. I have a few questions > regarding this plot: > 1) Do the combination of gof plots suggest that there are gross batch > differences between the replicates, or have you seen other RNA-seq > datasets with a fit similar to the first gof plot? > 2) Is the fit sufficiently poor that any conclusions drawn from the GLM > coefficients are likely of little use? > 3) In another dataset I'm analyzing, the line also lies below the y=x > plot but is linear and has the same slope. What sort of issue might this > suggest? > > > My sequencing depths for the samples, as given by samples.libsize, are: > 26693672 61963337 64385230 65093734 79193403 57877925 > > My code is as follows: > > library(edgeR) > # File containing file names where sample counts can be read > targets <- read.delim("/home/twillems/Desktop/IL6-paired- end/GLM/files.txt", stringsAsFactors = FALSE) > > treatment <- c(0,0,0,1,1,1) > design <- model.matrix(~treatment) > > y <- readDGE(targets) > colnames(y) <- y$samples[['description']] > min_counts_per_million <- 0.2 > min_samples <- 3 > y <- y[rowSums(cpm(y) > min_counts_per_million) >= nrow(y$samples), ] > y$samples$lib.size <- colSums(y$counts) > y <- calcNormFactors(y) > > y <- estimateGLMCommonDisp(y, design) > y <- estimateGLMTrendedDisp(y, design) > y <- estimateGLMTagwiseDisp(y,design) > plotBCV(y) > > plotMDS(y) > > fit1 <- glmFit(y, design) > g <- gof(fit1, plot=TRUE) > > And the result of sessionInfo(): > R version 2.14.1 (2011-12-22) > Platform: i686-pc-linux-gnu (32-bit) > > locale: > [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C > [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 > [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 > [7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 > [9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8 > [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8 > > attached base packages: > [1] splines stats graphics grDevices utils datasets methods > [8] base > > other attached packages: > [1] edgeR_2.6.12 limma_3.10.3 rkward_0.5.6 > > loaded via a namespace (and not attached): > [1] tools_2.14.1 > > Thanks for your help > > Thomas Willems > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:4}}