RNASeq: normalization issues


João Moura ▴ 160

@joao-moura-4505

Last seen 9.6 years ago

Dear all,

Until now I have been doing RNA-seq DE analysis, and for that I understand that normalization issues only matter inside samples, because one can assume the length/content biases will cancel out when comparing the same genes in different samples. However, I am now trying to compare the correlation of different genes, so these biases should be taken into account. For this, is there any better method than RPKM?

My main doubt is whether I should also take into account the biases inside samples, and if so, is there any better approach than TMM by Robinson and Oshlack [2010]?

Thank you all,
-- João Moura
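For readers unfamiliar with the measure, the RPKM calculation referred to in the question can be sketched as follows. This is a minimal illustration of the standard formula (reads per kilobase of transcript per million mapped reads), not the implementation of any particular package; the gene names, counts, and lengths are hypothetical.

```python
def rpkm(counts, lengths_bp):
    """RPKM for one sample.

    counts:     dict gene -> number of reads mapped to the gene
    lengths_bp: dict gene -> gene (or transcript) length in base pairs
    """
    total_mapped = sum(counts.values())
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return {
        gene: counts[gene] * 1e9 / (lengths_bp[gene] * total_mapped)
        for gene in counts
    }


# Hypothetical data, for illustration only.
counts = {"geneA": 500, "geneB": 500, "geneC": 1000}
lengths_bp = {"geneA": 1000, "geneB": 4000, "geneC": 2000}

values = rpkm(counts, lengths_bp)
for gene, value in values.items():
    print(gene, round(value, 1))
```

Dividing a gene by its own length rescales that gene by the same constant in every sample, so this step alone does not change a Pearson correlation computed across samples. The per-sample term (total mapped reads) is the part that differs between samples.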

RNASeq RNASeq • 1.7k views

ADD COMMENT • link updated 13.0 years ago by Wei Shi ★ 3.6k • written 13.0 years ago by João Moura ▴ 160


Vince S. Buffalo ▴ 140

@vince-s-buffalo-4618

Last seen 9.6 years ago

United States

Hi João,

I have been curious about this issue too. Even in a differential expression analysis of RNA-seq data, a test has more power for longer transcripts than for shorter ones (Oshlack and Wakefield, 2009). There appears to be no way to remedy this. Looking at correlation between genes across samples is difficult too because of differing transcript lengths, as you mention.

Another concern when looking at correlations of different genes using RNA-seq data is the issue of multireads (reads that do not map uniquely). Suppose there's a paralogous region shared by three genes (or a common exon between transcripts, if one is mapping to a transcript set). An increase in the expression of any of these genes would increase the coverage across the paralogous region. Mapping programs like BWA deal with multireads by assigning them at random among the regions of equal mapping quality. Consequently, an increase in the expression of one of these three genes also increases the reads mapped to the other two. Thus, I suspect there is rampant artifactual correlation between genes with paralogous regions or transcripts with common exons. I haven't seen any mention of this issue with gene correlation and RNA-seq data yet.

Running programs that try to handle multireads becomes much more important if one is looking at correlation across different genes. RSEM is such a program (http://deweylab.biostat.wisc.edu/rsem/), and there are others out there. Also, as a diagnostic, one could cluster the sequences of the transcript set and then see whether highly correlated transcripts are overwhelmingly those with similar sequences. There could be biological reasons for this too, though.

Vince

-- Vince Buffalo, Statistical Programmer, Bioinformatics Core, UC Davis Genome Center, University of California, Davis
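The multiread argument above can be illustrated with a small deterministic toy model. It is only a sketch under stated assumptions: three hypothetical genes share one region, the reads from that region are split evenly between them (the expected outcome of random assignment), and the read yields per unit of expression are invented constants. It does not model BWA or any real aligner.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical true expression of three genes in four samples.
# Gene A is constructed to be uncorrelated with genes B and C.
true_a = [10, 20, 30, 40]
true_b = [5, 7, 7, 5]
true_c = [7, 5, 5, 7]

UNIQUE_READS_PER_UNIT = 100  # assumed reads from the gene-specific part
SHARED_READS_PER_UNIT = 50   # assumed reads from the shared region


def observed_counts(own):
    """Unique reads of one gene plus one third of the shared-region reads."""
    counts = []
    for o, a, b, c in zip(own, true_a, true_b, true_c):
        shared_pool = SHARED_READS_PER_UNIT * (a + b + c)
        counts.append(UNIQUE_READS_PER_UNIT * o + shared_pool / 3)
    return counts


true_r = pearson(true_a, true_b)
observed_r = pearson(observed_counts(true_a), observed_counts(true_b))
print("true correlation A vs B:    ", round(true_r, 3))
print("observed correlation A vs B:", round(observed_r, 3))
```

In this toy setting the two genes are uncorrelated by construction, yet their observed counts correlate positively, because every sample's shared-region reads carry the signal of gene A into the counts of gene B.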

ADD COMMENT • link 13.0 years ago Vince S. Buffalo ▴ 140


Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 10 days ago

Australia/Melbourne/Olivia Newton-John …

Hi João:

Maybe you can try different normalization methods on your data to see which one looks better. How best to normalize RNA-seq data is still much debated at this stage.

You can try scaling methods like TMM, RPKM, or the 75th percentile, which, as you said, normalize data within samples. Or you can try quantile between-sample normalization (read counts should be adjusted by gene length first), which performs normalization across samples. You can try all of these in the edgeR package.

In my experience, the quantile method actually performed better for my RNA-seq data. I used a generalized linear model and the likelihood ratio test in edgeR for my analysis.

Hope this helps.

Cheers,
Wei
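As a rough illustration of the quantile between-sample approach described above, the sketch below first adjusts counts by gene length and then gives every sample the same distribution of values. It is a simplified stand-alone version that ignores ties and is not the edgeR or limma implementation; the counts and lengths are hypothetical.

```python
def quantile_normalize(columns):
    """Quantile-normalize samples so they share one value distribution.

    columns: list of per-sample lists, with genes in the same order
             in every sample. Ties are not treated specially.
    """
    n_genes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value over samples.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns)
        for k in range(n_genes)
    ]
    normalized = []
    for col in columns:
        order = sorted(range(n_genes), key=lambda g: col[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]  # keep each gene's rank in its sample
        normalized.append(out)
    return normalized


# Hypothetical read counts: three samples, four genes each.
counts = [
    [50, 40, 30, 160],
    [40, 20, 45, 80],
    [30, 80, 60, 320],
]
lengths_kb = [10, 20, 10, 40]  # hypothetical gene lengths in kilobases

# Adjust counts by gene length before normalizing across samples.
adjusted = [[c / l for c, l in zip(col, lengths_kb)] for col in counts]
normalized = quantile_normalize(adjusted)
for col in normalized:
    print([round(v, 3) for v in col])
```

After this step every sample contains exactly the same set of values, and only the assignment of values to genes differs. That is a much stronger constraint than multiplying each sample by a single scaling factor.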

ADD COMMENT • link 13.0 years ago Wei Shi ★ 3.6k


Dr. Wei,

May I ask what criteria you use to find out which normalization suits your data better?

Thanks,
Fernando

ADD REPLY • link 13.0 years ago Biase, Fernando ▴ 150


Hi Fernando:

We had some positive control genes that we know, from previous RT-PCR experiments, should be up- or down-regulated in one cell type compared to the other. The quantile method successfully detected all of these control genes and ranked them higher in the list of differentially expressed genes than the other normalization methods did. You could certainly argue that this is a biased comparison, but when you do not know which method works best, the one that gives results closer to your expectation is often preferred.

My belief in the quantile method actually came mainly from an evaluation study using the RNA-seq data from the MAQC project, in which the expression levels of ~1000 genes were validated by RT-PCR. What I found was that the quantile-normalized data correlated better with the PCR data than data from the other normalization methods. This work hasn't been published yet, but I am working on that.

Cheers,
Wei
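The kind of benchmark described above, comparing normalized expression values against RT-PCR measurements for the same genes, can be sketched as below. The choice of Spearman rank correlation is an assumption (the post does not say which correlation was used), ties are not handled, and the method names and all numbers are hypothetical placeholders rather than results.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation (formula valid when there are no ties)."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical RT-PCR reference values for five validated genes.
pcr = [1.0, 2.5, 3.0, 4.2, 5.1]

# Hypothetical expression values for the same genes after two
# unnamed normalization methods.
normalized_by_method = {
    "method_A": [10, 30, 25, 60, 90],
    "method_B": [12, 20, 28, 55, 80],
}

scores = {
    name: spearman(values, pcr)
    for name, values in normalized_by_method.items()
}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(name, round(score, 3))
print("closest agreement with RT-PCR:", best)
```

As noted in the post, choosing the method that best matches prior expectation is a biased criterion, so a comparison like this is most informative when the reference genes were not used to tune the analysis.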

ADD REPLY • link 13.0 years ago Wei Shi ★ 3.6k


Hi Wei,

Quantile normalisation is different from (and potentially better than) linear scaling methods if there is a non-linear relationship between the true abundance of a molecule and its read counts in different samples. I can imagine that this can happen, but I have no idea how important the effect is, or how it interplays with QA/QC. Besides your bottom-line benchmark, do you have any 'mechanistic' intuition (or examples from your MAQC analysis) for what such non-linearities look like?

Best wishes
Wolfgang

-- Wolfgang Huber, EMBL, http://www.embl.de/research/units/genome_biology/huber
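One simple way to look for the kind of non-linearity discussed above is to check whether the log-ratio between two samples drifts with overall abundance, as in an MA plot. The sketch below is only an illustration on invented values: a pure scaling difference gives a constant log-ratio (slope zero), whereas a power-law distortion, chosen here as an arbitrary example of non-linearity, gives a non-zero slope that no single scaling factor can remove.

```python
from math import log2


def ma_slope(x, y):
    """Least-squares slope of M = log2(y/x) against A = mean log2 abundance."""
    m = [log2(b / a) for a, b in zip(x, y)]
    a_vals = [0.5 * (log2(a) + log2(b)) for a, b in zip(x, y)]
    mean_a = sum(a_vals) / len(a_vals)
    mean_m = sum(m) / len(m)
    sxx = sum((a - mean_a) ** 2 for a in a_vals)
    sxy = sum((a - mean_a) * (v - mean_m) for a, v in zip(a_vals, m))
    return sxy / sxx


# Hypothetical abundances of five genes in a reference sample.
reference = [2.0, 8.0, 32.0, 128.0, 512.0]

# Second sample differs only by a scaling factor (linear relationship).
scaled = [3.0 * v for v in reference]

# Second sample is also compressed at high abundance (non-linear).
compressed = [3.0 * v ** 0.9 for v in reference]

slope_scaled = ma_slope(reference, scaled)
slope_compressed = ma_slope(reference, compressed)
print("slope, scaling only:", round(slope_scaled, 4))
print("slope, compressed:  ", round(slope_compressed, 4))
```

With real data the picture is noisier, and genuine differential expression also moves points away from the trend, so a slope like this is at best a rough diagnostic and not a test.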

ADD REPLY • link 13.0 years ago Wolfgang Huber ★ 13k


Hi Wolfgang:

I do not have any examples that show whether or not there are non-linearities in the data. But it would be very interesting to look at this if there is a good way to do so.

Cheers,
Wei

ADD REPLY • link 13.0 years ago Wei Shi ★ 3.6k


Wei Shi ★ 3.6k

@wei-shi-2183

Last seen 10 days ago

Australia/Melbourne/Olivia Newton-John …

Hi Stefano:

To get access to the RNA-seq data produced by the MAQC project, you have to be a member of the MAQC Consortium. Have a look at the MAQC website for details (http://www.fda.gov/ScienceResearch/BioinformaticsTools/MicroarrayQualityControlProject/default.htm).

The TaqMan RT-PCR data is publicly available and can be downloaded from GEO (GSE5350).

Cheers,
Wei

On Apr 29, 2011, at 5:41 PM, Stefano Calza wrote:
> Dr Wei
>
> are these data on RNA-seq and RT-PCR already available?
>
> Regards
>
> Stefano

ADD COMMENT • link 13.0 years ago Wei Shi ★ 3.6k
