Question

How does edgeR control outliers?

Yuan Tian ▴ 60

@yuan-tian-5034

Dear all,

I am using edgeR for differential expression analysis of an RNA-seq dataset, and I have found that edgeR is very sensitive to outlier samples. For example, suppose that for one gene the overall expression pattern is similar between the control group and the experimental group, but a single sample behaves very differently from the others. That gene is then very likely to be falsely detected as differentially expressed. Can anyone tell me whether there is an option in the algorithm that controls the impact of outliers?

I am thinking of using the median read count instead of the mean read count to fit the NB distribution and to estimate the dispersions. Is there such an option in edgeR? Or is there another RNA-seq DE analysis package that is less sensitive to outliers?

The outlier sample may differ from gene to gene, so we cannot remove a whole sample from the analysis.

Yuan

RNASeq edgeR • 3.2k views

updated 12.2 years ago by Gordon Smyth 50k • written 12.2 years ago by Yuan Tian ▴ 60

score 0 · Answer 1 · 2012-01-27

Dear Yuan

On 2012-01-27 06:19, Yuan Tian wrote:
> I use edgeR for differential expression analysis on a RNAseq dataset.
> But I found that edgeR is very sensitive to outlier samples. For
> example, for one gene, overall the expression pattern is similar
> between control group and experimental group, but there is one single
> sample which behaves very differently from the others, then this gene
> is very likely to be falsely detected as differentially expressed. So
[...]
> Or is there any other RNAseq DE analysis package which is less
> sensitive to outliers?

You may want to give our DESeq package a try, as we made some major changes in the previous release to address precisely this issue.

Simon

score 0 · Answer 2 · 2012-01-27

hi,

On Thu, 2012-01-26 at 21:19 -0800, Yuan Tian wrote:
> Dear all,
>
> I use edgeR for differential expression analysis on a RNAseq dataset. But I found
> that edgeR is very sensitive to outlier samples. For example, for one gene, overall
> the expression pattern is similar between control group and experimental group, but
> there is one single sample which behaves very differently from the others, then this
> gene is very likely to be falsely detected as differentially expressed. So can anyone
> please tell me if there's any option in the algorithm that can control the outlier impact?
>
> I'm thinking to use median read count value instead of mean read count value to fit the
> NB distribution, and to estimate the dispersions. Just wondering if there's an option
> available in edgeR? Or is there any other RNAseq DE analysis package which is less
> sensitive to outliers?

i think what you're referring to is illustrated in figure 2 of the vignette of the tweeDEseq package, whose underlying statistical model can address this kind of situation.

> The outlier sample might be different when you look at different genes, so we can't take
> the whole sample out in the analysis.

there are a number of reasons why "outlier" count values might show up, but a sensible one is simply biological variability (Hansen et al. Nat. Biotech., 29:572-573, 2011, doi:10.1038/nbt.1910). so not only can you not take the sample out, but the count value in that sample might be true biology. if your experimental conditions involve a lot of biological variability, you may need to work with more biological replicates.

cheers,
robert.

score 0 · Answer 3 · 2012-01-28

Dear Yuan,

The edgeR empirical-Bayes algorithm is actually somewhat resistant to outliers, because it allows for gene-specific variability, unlike algorithms that treat the variance as a function of the mean. However, the edgeR algorithm is designed to deal more with routine gene-specific biological variation than with true outliers. We prefer to detect and remove true outliers in the data checking steps rather than accommodate them as part of the dispersion estimation algorithm.

Let me say that the second paragraph of your email is hard to understand, because you cannot give either median or mean counts to edgeR. You must give actual read counts. I wonder whether your problems are caused by inputting inappropriate data into edgeR.

edgeR has a great many options, and it would certainly help in writing a response to know which ones you are using already. For the purposes of this email, I am going to assume that you are doing a valid analysis using true read counts, and that you have used either estimateTagwiseDisp() or estimateGLMTagwiseDisp() with default settings.

If you do have some substantial outliers, here are some options:

1. First, filter genes before analysis as in the edgeR User's Guide case studies that deal with RNA-Seq data. Suppose that you have 4 control libraries and 4 experimental: then keep genes only if they satisfy a minimum count-per-million (cpm>1 say) in at least four samples. This eliminates genes affected by RNA-Seq artifacts, such as genes whose counts are zero except in one or two samples.
2. Plot trended and tagwise dispersion estimates against abundance to look for outliers.
3. Test for outliers using the gof() function.
4. Reduce the prior.n setting to a smaller value.

If none of this solves your problems, you might try the voom() function in the limma package instead. (See the limma User's Guide.) This approach is more flexible in adapting automatically to gene-specific variability in RNA-Seq data than the edgeR algorithm, and has proved successful on some high-variability datasets.

Best wishes
Gordon

> Date: Thu, 26 Jan 2012 21:19:55 -0800
> From: Yuan Tian <ytianidyll at ucla.edu>
> To: Bioconductor mailing list <bioconductor at r-project.org>
> Subject: [BioC] how edgeR control the outliers?
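
The filtering rule in option 1 of the answer above (keep a gene only if it exceeds 1 count per million in at least four samples) can be sketched in a few lines. This is a minimal, language-neutral illustration written in plain Python, not edgeR code: the helper names `counts_per_million` and `keep_expressed` and the toy count table are invented here, and the sketch assumes library size is simply the column total, ignoring any normalization factors.

```python
# Sketch only: helper names and toy data are invented for illustration,
# and this is not how edgeR itself is implemented.

def counts_per_million(counts):
    """Scale a genes x samples table of raw counts to counts per million.

    Each column is divided by that sample's total count (its library size).
    Normalization factors are deliberately ignored here.
    """
    n_samples = len(counts[0])
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [[row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)] for row in counts]


def keep_expressed(counts, min_cpm=1.0, min_samples=4):
    """Flag genes whose CPM exceeds min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return [sum(value > min_cpm for value in row) >= min_samples for row in cpm]


# Toy table: 3 genes x 8 samples (4 control + 4 experimental).
toy_counts = [
    [1_000_000] * 8,               # highly expressed everywhere
    [0, 0, 0, 0, 0, 0, 900, 0],    # zero except in a single sample
    [5] * 8,                       # low but consistently above 1 CPM
]

print(keep_expressed(toy_counts))  # [True, False, True]
```

The second gene is dropped because it passes the threshold in only one sample, which is the kind of single-sample artifact the filter is meant to remove. The first and third genes are kept because they exceed 1 CPM in all eight samples.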