[Bioc] RNAseq less sensitive than microarrays? Is it a statistical issue?

Thomas Girke ★ 1.7k

@thomas-girke-993

United States

Hi Simon,

Because of these complications, I am sometimes wondering whether one couldn't use for many RNA-Seq use cases coverage values (e.g. mean coverage) as raw expression measure instead of read counts. Has anyone systematically tested whether this would be a suitable approach for the downstream DEG analysis? Right now everyone believes RNA-Seq analysis requires read counting, but honestly I don't see why that is. Perhaps the benefits of this are so minor that it is not worth dealing with a different type of expression measure.

Thomas

On Mon, May 20, 2013 at 11:15:04PM +0000, Simon Anders wrote:
> Dear Lucia and list
>
> On second reading, I noticed that my previous post sounded quite
> aggressive, which was not my intention. Sorry. I shouldn't write e-mails
> that late at night.
>
> Anyway: We had a lot of discussion on this list and others recently
> about how to correctly obtain a count table for differential expression
> analysis from aligned RNA-Seq reads. From these discussions, it has
> become clear that this is a task with many more pitfalls than one might
> expect at first. In microarray analysis, there is no need to do this,
> and so read counting is a likely culprit when such discrepancies are
> noted. This is why exact details on the procedure are so important.
>
> Simon

Microarray Coverage • 1.3k views

ADD COMMENT • link 11.0 years ago Thomas Girke ★ 1.7k

Ryan C. Thompson ★ 7.9k

@ryan-c-thompson-5618

Scripps Research, La Jolla, CA

Hi Thomas,

Gordon Smyth has noted previously on this list that limma's voom method is happy to accept raw counts, CPM, FPKM, and base counts (read counts times read length, which allows splitting reads across exons). My understanding is that voom doesn't depend on or exploit the discrete nature of the count data that is fed to it, and can handle any data for which it can properly model the mean-variance relationship (heteroskedasticity). I'm sure Gordon could elaborate on this if I've missed anything.

Also, note that "mean coverage" is more or less just another way to spell FPKM. And even if you want to calculate some other expression measure such as FPKM, you still need to properly assign reads to genes or transcripts, so using an alternate expression measure doesn't get around the difficulties associated with read counting, unless I'm missing something.

-Ryan Thompson

On Tue May 21 08:49:43 2013, Thomas Girke wrote:
> [...]
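The remark that mean coverage is roughly FPKM under another name can be checked with a little arithmetic. The following is only a minimal Python sketch, assuming single-end reads of fixed length that lie entirely inside the gene; the function names, read length, library size and gene values are all made up for illustration.

```python
def mean_coverage(count, read_len, gene_len):
    """Average per-base depth over a gene, assuming every read lies fully inside it."""
    return count * read_len / gene_len


def fpkm(count, gene_len, lib_size):
    """Fragments per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_len * lib_size)


READ_LEN = 100          # hypothetical read length (bp)
LIB_SIZE = 20_000_000   # hypothetical number of mapped reads in the sample

# gene name -> (read count, gene length in bp); hypothetical values
genes = {"geneA": (500, 2_000), "geneB": (40, 800), "geneC": (9_000, 12_000)}

ratios = []
for name, (count, length) in genes.items():
    cov = mean_coverage(count, READ_LEN, length)
    f = fpkm(count, length, LIB_SIZE)
    ratios.append(cov / f)
    print(f"{name}: mean coverage = {cov:.2f}, FPKM = {f:.2f}, ratio = {cov / f:.3f}")
```

Under these assumptions the ratio is the same for every gene in a sample (read length times library size divided by 1e9), so within one sample the two measures differ only by a constant factor. Reads that overlap a gene only partially would break the exact proportionality.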

ADD COMMENT • link 11.0 years ago Ryan C. Thompson ★ 7.9k

Hi Ryan,

It is true that voom doesn't depend on the discrete nature of count data, but it still needs to know the relative magnitude of the counts for different observations, otherwise there is no way to estimate the mean-variance relationship.

In my opinion, any high efficiency statistical method for RNA-seq data needs to accommodate the fact that larger counts are relatively more precise than smaller counts. This is done by estimating the mean-variance relationship for the counts in some way. See the voom preprint for some discussion: http://www.statsci.org/smyth/pubs/VoomPreprint.pdf

Voom doesn't need actual counts, but it still needs a quantity that preserves the ordering of the counts. So you could indeed count a paired-end fragment as 1/2 if one end maps and the other doesn't, or split reads across exons, and input the fractional counts to voom. (I'm not recommending this as routine practice, just saying it would be statistically feasible.)

voom() can work with CPM or FPKM, but it needs to compute these quantities internally. It can't accept FPKM as the primary input because FPKM does not preserve the ordering of the counts. In my opinion, no general purpose high efficiency statistical analysis of RNA-seq data is possible using FPKM as primary input. (Unless of course one also provides the library sizes and gene lengths from which the FPKM was computed, so that the software can map back to count size from the FPKM.)

If the sequencing depth is the same for all libraries, then the CPM are sufficient for statistical modelling, because in that case the CPM map back to count size through the library size. In that case one could simply compute log-CPM and input it into limma using eBayes with trend=TRUE, and all would be fine. That would be very similar to voom.

Best wishes
Gordon

On Tue, 21 May 2013, Ryan C. Thompson wrote:
> Gordon Smyth has noted previously on this list that limma's voom method
> is happy to accept raw counts, CPM, FPKM, and base counts (read counts
> times read length, allows splitting reads across exons). My
> understanding is that voom doesn't depend or exploit the discrete nature
> of count data that is fed to it, and can handle any data for which it
> can properly model the mean-variance relationship (heteroskedasticity).
> I'm sure Gordon could elaborate on this if I've missed anything.
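The point that FPKM on its own loses the count size can be illustrated with a toy calculation. This is only a sketch in Python with invented numbers and hypothetical helper names; it is not limma code. Two observations share the same FPKM although one rests on 10 reads and the other on 1000, and the count can be recovered only when the library size and gene length are supplied as well.

```python
def fpkm(count, gene_len, lib_size):
    """Fragments per kilobase of gene per million mapped reads."""
    return count * 1e9 / (gene_len * lib_size)


def count_from_fpkm(value, gene_len, lib_size):
    """Invert fpkm(); only possible when gene length and library size are known."""
    return value * gene_len * lib_size / 1e9


# (count, gene length in bp, library size in reads); hypothetical values
short_gene = (10, 500, 10_000_000)
long_gene = (1000, 50_000, 10_000_000)

f_short = fpkm(*short_gene)
f_long = fpkm(*long_gene)
print("FPKM, short gene:", f_short)
print("FPKM, long gene: ", f_long)

# Identical FPKM values, very different amounts of evidence behind them.
print("recovered counts:",
      count_from_fpkm(f_short, short_gene[1], short_gene[2]),
      count_from_fpkm(f_long, long_gene[1], long_gene[2]))
```

A method that sees only the two FPKM values has no way to tell that the second one is far more precisely measured than the first, which is the information a mean-variance model needs.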

ADD REPLY • link 11.0 years ago Gordon Smyth 50k

Thomas Girke ★ 1.7k

@thomas-girke-993

United States

Ryan and Wolfgang,

Agreed, it will not be suitable for certain RNA-Seq applications, such as splice variant discovery, but it could be a good approximation for problems related to double counting of reads across multiple range features. In general I am just raising this question to understand whether there is any fundamental reason, which I have missed, not to consider coverage values (whether total sum or averaged) instead of read counts. Using limma's voom function for this situation makes sense, but there must be someone who has already performed some testing on this and can perhaps report some results, or share a reference to a publication addressing this?

Thomas

On Tue, May 21, 2013 at 06:07:38PM +0000, Wolfgang Huber wrote:
> [...]

ADD COMMENT • link 11.0 years ago Thomas Girke ★ 1.7k

Thomas Girke ★ 1.7k

@thomas-girke-993

United States

Hi Simon,

I totally agree with your statements on the importance of generating appropriate read counting tables. In my email I should have used "sum of coverage" per feature as an example rather than mean coverage, since I don't want to direct the discussion into an area that has been discussed many times. Is it really so unreasonable to use this type of discrete raw expression values (sum of cov/feature) instead of read counts, and if so why?

Thomas

On Tue, May 21, 2013 at 07:01:39PM +0000, Simon Anders wrote:
> [...]

ADD COMMENT • link 11.0 years ago Thomas Girke ★ 1.7k

Hi Thomas

> Is it really so unreasonable to use this type of discrete
> raw expression values (sum of cov/feature) instead of read counts, and
> if so why?

First, the reverse question: What would be the advantage of using the coverage per feature over reads? Once you have decided which reads to use and where to map them, there is no real difference in difficulty between (a) counting how many reads overlap with a given feature and (b) adding up the numbers of bases of the feature that are overlapped by each read.

Of course, some people might find (b) easier to do than (a) because they happen to have a script for (b) lying around and not for (a), but it could just as well be the other way round, because writing a script is no more difficult for (a) than (b). (And actually, neither is trivial: The detection and resolution of ambiguities is not as easy as it sounds, especially if features overlap or if paired-end reads are involved.)

BTW: I assume (b) is what you mean by coverage. If not, correct me.

The value (b) may sound slightly nicer as it counts reads only fractionally if they overlap the feature only partially. I am not sure whether this is really an advantage, though: Conceptually, a read either stems from a given gene or it does not. It cannot be that only a part of the read derives from a gene, and the other part from some other gene.

The advantage of (a) is that it counts "units of evidence". Specifically, we know that the variance of a read count is at least as large as the expected value of the count. This is because, conditioned on the feature's actual concentration in the sample, read counts are always Poisson distributed. Once you marginalize over the within-sample-group distribution of concentration, you get some kind of overdispersed Poisson, whose variance is strictly larger than the expectation.

This gives you for free a lower bound on the variance, which is useful to improve specificity of inferential methods. If you do not count reads but something else, you do not get this automatic lower bound -- and this is the actual reason why so many methods work on read counts rather than coverage.

Simon
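The variance argument above can be tried out in a small simulation. What follows is only a sketch in Python using the standard library; the gamma distribution for the between-sample variation, the sampler, and all numbers (mean count, shape, read and gene length) are assumptions chosen for illustration, not something taken from any of the packages mentioned in this thread.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Draw a Poisson variate with Knuth's multiplication method (fine for modest means)."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def var_over_mean(xs):
    return statistics.pvariance(xs) / statistics.fmean(xs)


rng = random.Random(1)
N = 20_000
MEAN = 10.0   # hypothetical expected read count
SHAPE = 2.0   # hypothetical gamma shape for between-sample variation

# Fixed concentration: counts are plain Poisson, so variance is close to the mean.
technical = [poisson(MEAN, rng) for _ in range(N)]

# Concentration varies between samples (gamma with mean MEAN): the marginal
# counts are an overdispersed Poisson mixture, so variance exceeds the mean.
biological = [poisson(rng.gammavariate(SHAPE, MEAN / SHAPE), rng) for _ in range(N)]

# Rescaling to a mean coverage changes the variance/mean ratio by the scale
# factor, so "variance >= mean" is no longer guaranteed for the rescaled value.
READ_LEN, GENE_LEN = 100, 2_000   # hypothetical lengths in bp
mean_cov = [c * READ_LEN / GENE_LEN for c in biological]

print("variance/mean, Poisson counts:        ", round(var_over_mean(technical), 2))
print("variance/mean, overdispersed counts:  ", round(var_over_mean(biological), 2))
print("variance/mean, rescaled mean coverage:", round(var_over_mean(mean_cov), 2))
```

With counts, the ratio stays at or above 1 in both scenarios. For the rescaled values the ratio depends on read length and gene length, so a method cannot assume the lower bound without knowing those factors.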

ADD REPLY • link 11.0 years ago Simon Anders ★ 3.7k

Hi Simon,

You made the point very clearly. This makes me think about something related that I am not quite sure about myself. In practice, we sometimes assign a fraction of a read to each of multiple features it maps to, especially in the case of transposons. If a read maps to m different features, we count 1/m for each single feature. This somehow breaks your 'units of evidence' rule. If we would still like to preserve the advantage of a smaller variance, do you think it is reasonable to always normalize the counts based on unique mappers, even when the counts originate from multiple mappers?

Cheers,
Yuan

On May 21, 2013, at 3:47 PM, Simon Anders <anders at embl.de> wrote:
> [...]
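To make the 1/m scheme concrete, here is a toy Python sketch. The reads, the feature names and the data layout are hypothetical, and the code only illustrates the bookkeeping described in the question; it does not say which scheme is statistically preferable.

```python
from collections import defaultdict

# Each read is listed with the features it maps to (hypothetical toy data).
reads = [
    ["TE1"],
    ["TE1", "TE2"],
    ["TE1", "TE2", "TE3"],
    ["TE2"],
    ["TE3"],
    ["TE3"],
]

unique_counts = defaultdict(float)      # reads mapping to exactly one feature
fractional_counts = defaultdict(float)  # every read contributes 1/m to each of its m features

for hits in reads:
    share = 1.0 / len(hits)
    for feature in hits:
        fractional_counts[feature] += share
    if len(hits) == 1:
        unique_counts[hits[0]] += 1

for feature in sorted(fractional_counts):
    print(f"{feature}: unique = {unique_counts[feature]:.0f}, "
          f"fractional = {fractional_counts[feature]:.3f}")
```

The fractional counts add up to the total number of reads, but an individual fractional value is no longer a count of whole reads, which is where the "units of evidence" concern comes from.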

ADD REPLY • link 11.0 years ago Yuan Hao ▴ 240

Thomas Girke ★ 1.7k

@thomas-girke-993

United States

Thanks. Your explanation makes sense. I really had to bring this up (perhaps I should have used a new email thread) since it appears to be such a basic question, and one to which I didn't have a convincing answer. Thanks to Simon and others for taking the time to respond to this almost "philosophical" question: "the meaning of read counting" :). I appreciate it.

Thomas

On Tue, May 21, 2013 at 07:47:36PM +0000, Simon Anders wrote:
> [...]

ADD COMMENT • link 11.0 years ago Thomas Girke ★ 1.7k

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

EMBL European Molecular Biology Laborat…

Dear Thomas

you raise a good point. Working on the actual counts and modelling the discreteness of the data matters a lot when the number of samples is small, and when there are genes with small counts: e.g. in an experiment on a cell line or model organism 'treated vs untreated'. For large studies, where dozens or hundreds of samples are compared between balanced groups, it seems to matter less, and the good results of VST/voom + limma in such benchmarks support that view.

However, it is not clear that the latter is really everything that people will want from RNA-Seq data. One may also want to detect what small groups of samples do among the big set; or what smaller-than-gene features (e.g. exons, like in DEXSeq) do, where one would like the explicit count modelling back. What do you think?

PS - whether some sort of average coverage per gene would really be less confusing for users to compute than total coverage I am not so sure; there'll just be different confusions.

Best wishes
Wolfgang

On 21 May 2013, at 17:49, Thomas Girke <thomas.girke at ucr.edu> wrote:
> [...]

ADD COMMENT • link 11.0 years ago Wolfgang Huber ★ 13k

Simon Anders ★ 3.7k

@simon-anders-3855

Zentrum für Molekularbiologie, Universi…

Hi Thomas

On 21/05/13 17:49, Thomas Girke wrote:
> Because of these complications, I am sometimes wondering whether one
> couldn't use for many RNA-Seq use cases coverage values (e.g. mean
> coverage) as raw expression measure instead of read counts. Has anyone
> systematically tested whether this would be a suitable approach for the
> downstream DEG analysis? Right now everyone believes RNA-Seq analysis
> requires read counting, but honestly I don't see why that is. Perhaps
> the benefits of this are so minor that it is not worth dealing with a
> different type of expression measure.

The "complications" I had in mind apply to mean coverage as well as to reads.

Actually, at least in my personal opinion, it is quite clear which rules one should stick to when obtaining the count table, and therefore, obtaining a correct count table is not at all complicated if one thinks carefully enough about it.

The main issue is this: Imagine a number of reads can be mapped to two distinct genes, either because the genes overlap, or, more commonly, because the genes share repetitive sequence. If you count your reads for both genes then both genes will appear differentially expressed even if only one gene actually is. Hence, one must discard reads that cannot be unambiguously assigned to one gene. Of course, genes which lose reads in this manner will have too low expression estimates, i.e., you incur a negative bias. However, this bias cancels out when comparing the same gene across samples, i.e., it is not a reason for concern in differential expression testing.

Consequently, a method which aims at getting _unbiased_ point estimates of expression strength is typically unsuitable as input for differential expression testing. The point that counting for expression estimation and for DE testing are different tasks is subtle and often overlooked. Mistakes arising from this can cause strange effects.

A particularly common mistake is to map the reads to transcripts, not genes, or to count overlaps not with annotated genes but with annotated transcripts. Of course, most reads will map to several transcripts because most transcript isoforms of a given gene overlap heavily. If one counts reads mapping to multiple transcripts for each of these, one gets severe artifacts in downstream analysis, and if one discards multiply mapping reads one loses most of the genes.

I do not know what the original poster meant when she said that she counted for UCSC transcripts (rather than genes) but if she took care to avoid the common pitfalls just described, she must have employed a quite sophisticated construction. This is why my first question was how she obtained the counts.

> Right now everyone believes RNA-Seq analysis
> requires read counting, but honestly I don't see why that is.

Just to clarify this: I don't think that this is universally believed. It is only that a method is typically designed for a specific type of data, and several commonly used Bioconductor packages for RNA-Seq analysis, namely edgeR, DESeq, BaySeq, DSS, expect count data, because the specific statistical methods used in these packages are built on the assumption that they get read counts. Obviously, they will not give correct results if they are given any other kind of data. This is stated quite clearly in the vignettes, but nevertheless people keep asking whether they can supply arbitrary other kinds of data such as rounded coverage values. And this insistence on using methods in a manner clearly violating instructions is, quite frankly, a bit frustrating. So when you observed that people like me seem to be quite insistent on the importance of using counts, this may have been with respect to these very common questions, which are of course specific to certain methods.

Simon
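The argument about ambiguous reads can be followed with a noise-free toy example. This is only a Python sketch under invented assumptions: two genes A and B share some sequence, library sizes are equal, gene A doubles under treatment while gene B does not change, and all names and numbers are hypothetical.

```python
# Expected reads per condition (hypothetical, equal library sizes, no noise).
# "unique" reads map to one gene only; "shared" reads fall in sequence common
# to both genes, so their gene of origin cannot be determined from the alignment.
truth = {
    "control": {"A_unique": 100, "B_unique": 100, "A_shared": 50, "B_shared": 50},
    "treated": {"A_unique": 200, "B_unique": 100, "A_shared": 100, "B_shared": 50},
}


def count_table(strategy):
    """Build a gene-by-condition table under one of two counting strategies."""
    table = {}
    for cond, r in truth.items():
        shared = r["A_shared"] + r["B_shared"]
        extra = shared if strategy == "count_for_both" else 0
        table[cond] = {"A": r["A_unique"] + extra, "B": r["B_unique"] + extra}
    return table


def fold_change(table, gene):
    return table["treated"][gene] / table["control"][gene]


both = count_table("count_for_both")
discard = count_table("discard_ambiguous")

for label, table in (("count for both", both), ("discard ambiguous", discard)):
    print(f"{label}: fold change A = {fold_change(table, 'A'):.2f}, "
          f"B = {fold_change(table, 'B'):.2f}")
```

In this toy setting, counting shared reads for both genes makes the unchanged gene B look changed and dilutes the change in A. Discarding them underestimates the absolute level of both genes but leaves the fold changes at their true values, which is the cancellation described above.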

ADD COMMENT • link 11.0 years ago Simon Anders ★ 3.7k

Hi Thomas

Forgot to add a sentence to make clear what I'm getting at:

On 21/05/13 21:01, Simon Anders wrote:
> The "complications" I had in mind apply to mean coverage as well as to
> reads.
[...]
> Consequently, a method which aims at getting _unbiased_ point estimates
> of expression strength is typically unsuitable as input for
> differential expression testing. The point that counting for expression
> estimation and for DE testing are different tasks is subtle and often
> overlooked. Mistakes arising from this can cause strange effects.

I wanted to add: The issue I described is about which reads to use and which to discard when counting (or how to resolve ambiguities about which gene or possible mapping a read should be assigned to when there are several options). And this issue arises no matter whether you want to obtain counts or mean coverages.

Simon

ADD REPLY • link 11.0 years ago Simon Anders ★ 3.7k
