A metric to determine best filtration in the limma package
Mark Lawson ▴ 10
@mark-lawson-5484
Last seen 9.7 years ago
Hello Bioconductor Gurus! (Apologies if this goes through more than once.)

We are currently using limma (through the voom() function) to analyze RNA-seq data, represented as RSEM counts. We currently have 246 samples (including replicates), and our design matrix has 65 columns.

My question concerns how much we should filter our data before running it through the analysis pipeline. Our current approach is to require a CPM greater than 2 in at least half of the samples:

keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge)/2)

This brings our transcript count down from 73,761 to fewer than 20,000. While we do see the groupings and batch effects we expect in the MDS plots, we are afraid we might be filtering too severely.

So finally, my question: what is a good metric for determining how well we have filtered the data?

Thank you,
Mark Lawson, PhD
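For concreteness, a minimal sketch of this filtering step (editor's illustration: the matrix counts of RSEM expected counts is an assumed input, not from the original post):

    library(edgeR)

    # 'counts': assumed transcripts x samples matrix of RSEM expected
    # counts, rounded to integers for count-based modelling
    dge <- DGEList(counts = round(counts))

    # the filter described above: CPM > 2 in at least half of the samples
    keep <- rowSums(cpm(dge) > 2) >= round(ncol(dge) / 2)
    dge <- dge[keep, ]
    dge$samples$lib.size <- colSums(dge$counts)  # recompute library sizes

    table(keep)  # how many of the 73,761 transcripts survive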
limma • 2.2k views
@steve-lianoglou-2771
Last seen 14 months ago
United States
Hi,

On Thu, Sep 6, 2012 at 1:52 PM, Mark Lawson wrote:
> [original question quoted above; snipped]

I'm guessing you are using "normal" RNA-seq data (i.e., it's not tag sequencing or something similar), so just a quick thought (apologies in advance if I am misunderstanding your setup): if you are filtering by counts per million without normalizing for the approximate length of your transcripts (via an R/FPKM-like measure), aren't you biasing your filter (and, therefore, your data)?

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University
Contact info: http://cbio.mskcc.org/~lianos/contact
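To make the length-bias concern concrete, a sketch comparing the CPM filter with a length-corrected filter via edgeR's rpkm(); the transcript-length vector tx.len and the RPKM cutoff of 1 are illustrative assumptions, not part of the original exchange:

    library(edgeR)

    # tx.len: assumed vector of effective transcript lengths (bp), one per row
    keep.cpm  <- rowSums(cpm(dge) > 2) >= round(ncol(dge) / 2)
    keep.fpkm <- rowSums(rpkm(dge, gene.length = tx.len) > 1) >= round(ncol(dge) / 2)

    # transcripts kept by one rule but not the other, to gauge the length bias
    sum(keep.fpkm & !keep.cpm)  # short transcripts rescued by length correction
    sum(keep.cpm & !keep.fpkm)  # long transcripts that pass only on raw CPM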
As Gordon Smyth and others have commented, such FPKM normalization is not necessary or useful if differential expression across treatments is the goal. Yes, this particular filter will have some bias against smaller transcripts, which naturally have fewer total counts, but one major intent of the filter is not only to remove very low-abundance transcripts, for which significant differences will be hard to detect, but also to remove transcripts with an unnaturally high number of zero counts (those being the elements that most strongly violate the negative binomial dispersion assumptions). This particular "cpm(dge) > X" recipe comes right out of the edgeR user guide.

But the question remains: how much filtering is "enough" vs. "too much"? Anecdotally, I've seen other mailing-list postings about raising the filtering until edgeR stops crashing (a rare event that seems to have been fixed in the recent dev version, at some sacrifice in parallel-CPU speed). Using voom(), I imagine something like tuning the filtering until the mean-variance profile stabilizes -- but by what metric (other than by eye) could you measure this?

Thoughts?

-Aaron

On Thu, Sep 6, 2012 at 2:20 PM, Steve Lianoglou wrote:
> [Steve's reply quoted above; snipped]
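One rough way to act on the "tune until the mean-variance profile stabilizes" idea, sketched under the assumption that dge and design are already built; voom(..., plot = TRUE) draws the fitted mean-variance trend, whose upturn at low counts should flatten as the filter tightens:

    library(edgeR)
    library(limma)

    # compare the voom mean-variance trend at several CPM cutoffs (illustrative)
    par(mfrow = c(2, 2))
    for (k in c(0.5, 1, 2, 5)) {
      keep <- rowSums(cpm(dge) > k) >= round(ncol(dge) / 2)
      v <- voom(dge[keep, ], design, plot = TRUE)
      legend("topright", legend = paste0("CPM > ", k), bty = "n")
    }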
@gordon-smyth
Last seen 5 hours ago
WEHI, Melbourne, Australia
Dear Mark,

I think that voom() should be pretty tolerant of the amount of filtering that is done, so you can feel free to be more inclusive.

Note that our recommended filtering is

keep <- rowSums(cpm(dge) > k) >= X

where X is the sample size of the smallest group. Since X is usually smaller than half the number of arrays, our recommended filtering is usually more inclusive than the filter you give.

You are also free to vary k, depending on your sequencing depth. The idea is to filter low counts.

Best wishes
Gordon

-------------- original message -------------
[Mark's original question, quoted above; snipped]
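In code, this recommendation might look like the following sketch, where group is an assumed factor giving each sample's experimental group (the cutoff k = 1 is only a placeholder; scale it to your sequencing depth):

    library(edgeR)

    X <- min(table(group))  # size of the smallest experimental group
    k <- 1                  # CPM cutoff; at a 10M-read library, CPM 1 is ~10 reads

    keep <- rowSums(cpm(dge) > k) >= X
    dge <- dge[keep, ]
    dge$samples$lib.size <- colSums(dge$counts)  # recompute library sizes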
Thanks, Gordon. Would your advice about X still hold if the smallest group size is 5 out of 500 total samples (i.e., an experimental design profiling 100 treatments with 5 replicates each)? 5 out of 5000 samples? It seems to me that the available df should somehow impact the choice of X, and that for very large df, you'd wish to raise X somewhat.

What I'm trying to get at is that, at least to my intuition, the desire to keep "at least X nonzero" stems from the possibility of a transcript being entirely "off" (zero counts) in all but the single, smallest group where it is found "on" (non-zero counts) -- but the models we use don't really test or allow for such "binary" switching behavior (i.e., infinite fold changes). So instead we're really just trying to weed out those transcripts that tend to have zero counts more often than others (which may not be due to expression at all, but to other artifacts of the experiment, including vagaries of read mapping) and that negatively impact the statistics of what we can model (non-infinite fold changes).

Thus, we'd like a metric or decision criterion for which the "rule of thumb" is a good proxy in small or modest-sized experiments but which can also be extended to larger, many-factorial experiments -- i.e., something akin to getPriorN(). I imagine some tuning strategy (much like lasso/ridge) where the decrease in error associated with more aggressive filtering is offset by a loss in power, until a balance/crossing point is found; in an edgeR (not voom) scenario, this might be some aspect of the BCV distribution/loess fit vs. some multivariate F-statistic of the experimental design?

Thanks to all for brainstorming this,

-Aaron

On Mon, Sep 10, 2012 at 7:59 PM, Gordon K Smyth wrote:
> [Gordon's answer quoted above; snipped]
On Tue, Sep 11, 2012 at 10:47 AM, Aaron Mackey wrote:
> the models we use don't really test/allow for such "binary" switching behavior (i.e. infinite fold changes)

By the way, has anyone considered or tried composing a mixture model (i.e., a zero-inflated negative binomial) to deal with this? I don't mean to handle the overdispersion in counts, but to explicitly model the difference between "no counts because there is no RNA" and "no counts because the RNA is too rare to sample, but let's estimate the abundance anyway". cf. http://www.ats.ucla.edu/stat/r/dae/zinbreg.htm
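For reference, a zero-inflated negative binomial of the kind described on the linked UCLA page can be fit per transcript with the pscl package; this is an illustrative sketch outside the edgeR/limma pipeline, with y, group, and lib.size as assumed inputs:

    library(pscl)

    # y: counts for one transcript across samples; group: treatment factor;
    # lib.size: per-sample library sizes, entering as an offset for depth.
    # The formula part after "|" models the zero-inflation (intercept-only here).
    dat <- data.frame(y = y, group = group, lib.size = lib.size)
    fit <- zeroinfl(y ~ group + offset(log(lib.size)) | 1,
                    data = dat, dist = "negbin")
    summary(fit)  # count-model coefficients plus the zero-inflation estimate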
Dear Aaron,

On Tue, 11 Sep 2012, Aaron Mackey wrote:

> Thanks Gordon. Would your advice about X still hold if the smallest group size is 5 out of 500 total samples (i.e. an experimental design profiling 100 treatments with 5 replicates each)? 5 out of 5000 samples?

If you want to be able to detect genes that are expressed only in the group with the smallest size, then yes.

> It seems to me that the available df should somehow impact your choice of X, and that for very large df, you'd wish to raise X somewhat.

I prefer to use a filtering rule that relates to your scientific questions. How many libraries would a gene have to be expressed in to be scientifically interesting to you? That should drive X, not a purely numerical rule that knows nothing about your experimental design.

> What I'm trying to get at is that ... the desire to keep "at least X nonzero" stems from a possibility of a transcript being entirely "off" (zero counts) in all but the single, smallest group where it is found "on" (non-zero counts)

Yes.

> -- but the models we use don't really test/allow for such "binary" switching behavior ... so instead we're really just trying to weed out those transcripts that tend to have zero counts more often than others ... that negatively impact the statistics of what we can model (non-infinite fold changes).

I understand the issue about wanting to remove transcripts with aberrant observations, but I think this is data dependent.

You originally posted that you were worried that you were doing too much filtering. I have advised you that there is no technical reason why you need to do so much filtering. If you nevertheless want to keep the filtering just as stringent, then it's your judgement based on your understanding of your experiment! I can't comment further.

Note that edgeR (and voom to a lesser extent) uses a shrinkage strategy that protects against infinite fold changes.

> Thus, we have a desire to come up with a metric or decision criterion for which the "rule of thumb" is a good proxy ... I imagine some tuning strategy (much like lasso/ridge) where the decrease in error associated with more aggressive filtering is offset by a loss in power ... this might be some aspect of the BCV distribution/loess fit vs. some multivariate F-statistic of the experimental design?

I don't really follow your thinking here. If you have a huge experiment with too many DE genes to be conveniently interpreted, then I would suggest using the magnitude of the fold changes, and the predictive (i.e., shrunk) fold changes from edgeR in particular, to prioritize.

Best wishes
Gordon
Dear Aaron,

On Wed, 12 Sep 2012, Gordon K Smyth wrote:

> You originally posted that you were worried that you were doing too much filtering. I have advised you that there is no technical reason why you need to do so much filtering. If you nevertheless want to keep the filtering just as stringent, then it's your judgement based on your understanding of your experiment! I can't comment further.

Sorry, I replied to you as if you were the original poster of this thread, which you weren't. I temporarily forgot that it was the original poster who was worried about doing too much filtering, not yourself.

Best wishes
Gordon
Thanks again for your intuition. I agree that if I wish to optimize my sensitivity to detect genes that are "all off" in all but one experimental group, I should simply filter low counts up to the sample size of that smallest group. But for a very large experimental design with hundreds of contrasts (say, a drug-panel profiling experiment), I'm willing to sacrifice a few of those genes in return for better (more stable) experiment-wide statistics and sensitivity across all of the contrasts.

To test my hypothesis, we did a simple experiment, increasing X (the minimum number of samples with counts above the threshold) from 1 (the least stringent, requiring counts in at least 1 sample), through 2, 3, 4, 5, 10, etc., up to more than 50% of the total sample size. At each value of X, we ran the voom pipeline (edgeR calcNormFactors, voom transformation, limma lmFit + eBayes) and recorded the number of isoforms found DE at an FDR of 10% for each of ~50 independent contrasts. Because some contrasts (treatments) generate hundreds or thousands of DE isoforms, and others generate only a handful, we chose to monitor the rank (rather than the total count) of each filter X within each contrast as a comparable metric for the "yield" (i.e., for every contrast, some particular value of X yields the most DE isoforms and is the "best"; thus for each value of X we have ~50 rankings, one per contrast). We break ties using ranks <- rank(..., ties.method="max") and then normalize with ranks <- ranks/max(ranks).

This yields the attached plot, where we also plot the median line to highlight how the distribution of ranks reaches a maximum somewhere around X=50 (in this case, about 1/5 of the sample size, and 10 times the smallest experimental grouping, which is 5). In another similar dataset, we get a peak at X=100, so I agree with Gordon's statement that such an approach will be data-dependent; in retrospect, we should probably perform such "parameter tuning" using some variety of cross-validation so that our final results are unbiased by this search.

Other comments or suggestions? Are we misinterpreting our results?

Thanks again,

-Aaron

On Tue, Sep 11, 2012 at 9:47 PM, Gordon K Smyth wrote:
> [Gordon's point-by-point reply quoted above; snipped]

[Attachment: filterRank.png -- normalized rank of each filter threshold X across the ~50 contrasts, with the median line overlaid]
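A condensed sketch of the tuning loop described above, assuming dge, design, and a contrast.matrix (one column per contrast) are in hand; the grid of X values and the CPM cutoff of 2 follow the post, everything else is illustrative:

    library(edgeR)
    library(limma)

    Xs <- c(1, 2, 3, 4, 5, 10, 25, 50, 100)

    de.counts <- sapply(Xs, function(X) {
      keep <- rowSums(cpm(dge) > 2) >= X
      d    <- calcNormFactors(dge[keep, ])
      v    <- voom(d, design)
      fit  <- eBayes(contrasts.fit(lmFit(v, design), contrast.matrix))
      # number of DE isoforms per contrast at FDR 10%
      colSums(abs(decideTests(fit, adjust.method = "BH", p.value = 0.10)))
    })

    # de.counts: contrasts x thresholds; rank each X within each contrast
    ranks <- apply(de.counts, 1, rank, ties.method = "max")
    ranks <- ranks / max(ranks)  # normalize ranks to [0, 1]
    boxplot(t(ranks), names = Xs, xlab = "X (min. samples above CPM cutoff)",
            ylab = "normalized rank of DE yield")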