RNASeq: normalization issues
@ywchenjimmyharvardedu-3454
Last seen 7.1 years ago
Hi Wei,

Could you elaborate on how to appropriately do gene-length-adjusted quantile normalization in edgeR? The "quantile normalization" option in the calcNormFactors function does not seem to take gene length into account.

Thanks.
Yiwen

> Hi João:
>
> Maybe you can try different normalization methods for your data to see
> which one looks better. How best to normalize RNA-seq data is still
> much debated at this stage.
>
> You can try scaling methods like TMM, RPKM, or 75th percentile, which, as
> you said, normalize data within samples. Or you can try quantile
> between-sample normalization (read counts should be adjusted by gene
> length first), which performs normalization across samples. You can try
> all of these in the edgeR package.
>
> From my experience, I actually found the quantile method performed better
> for my RNA-seq data. I used a generalized linear model and likelihood ratio
> test in edgeR in my analysis.
>
> Hope this helps.
>
> Cheers,
> Wei
>
> On Apr 28, 2011, at 7:36 PM, João Moura wrote:
>
>> Dear all,
>>
>> Until now I was doing RNAseq DE analysis, and for that I understand that
>> normalization issues only matter inside samples, because one can assume
>> the length/content biases will cancel out when comparing the same genes
>> in different samples.
>> However, I'm now trying to compare correlation of different genes, and so
>> these biases should be taken into account. For this, is there any better
>> method than RPKM?
>>
>> My main doubt is whether I should also take into account the biases inside
>> samples, and to do that, is there any better approach than TMM by Robinson
>> and Oshlack [2010]?
>>
>> Thank you all,
>> --
>> João Moura
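For readers unfamiliar with the RPKM scaling mentioned in the quoted reply, it can be written in a few lines of base R. This is only a minimal sketch on made-up numbers: the matrix `counts`, the vector `gene_length`, and the helper `rpkm_by_hand` are hypothetical names chosen for illustration, and library size is assumed here to be the column sum of the count matrix.

```r
# Toy count matrix: rows are genes, columns are samples (made-up numbers).
counts <- matrix(c(100,  200,
                   300,  600,
                   600, 1200),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# Hypothetical total exon length per gene, in bases, in the same row order.
gene_length <- c(g1 = 1000, g2 = 3000, g3 = 2000)

# Reads per kilobase of exon per million mapped reads, computed by hand.
rpkm_by_hand <- function(counts, gene_length) {
  lib_size <- colSums(counts)                           # reads per sample
  per_million <- sweep(counts, 2, lib_size / 1e6, "/")  # adjust for depth
  per_million / (gene_length / 1e3)                     # adjust for length (row-wise)
}

rpkm <- rpkm_by_hand(counts, gene_length)
print(rpkm)
```

In this toy data the second sample is simply sequenced twice as deeply as the first, so the two RPKM columns coincide, and g1 and g2 end up with the same value once their lengths are accounted for.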
RNASeq Normalization edgeR • 1.1k views
@wolfgang-huber-3550
Last seen 8 weeks ago
EMBL European Molecular Biology Laborat…
Hi Yiwen,

Gene length adjustment usually does not make sense at this stage of the analysis (before assessing significance of differential expression), as it eliminates the information on count numbers, which is important for assessing significance in the low count range.

It may or may not make sense at a later point of the analysis (after assessing significance).

Best wishes
Wolfgang

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
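A small numeric illustration of the point about losing count information, as a sketch only. It assumes a simple Poisson model for the counts (an assumption made here for illustration, not something stated in the thread), and the gene names and numbers are hypothetical.

```r
# Two hypothetical genes with very different raw counts and lengths.
raw_count   <- c(short_gene = 5,   long_gene = 500)
gene_length <- c(short_gene = 100, long_gene = 10000)  # bases

# After length adjustment the two genes look identical.
reads_per_kb <- raw_count * 1000 / gene_length

# Under a Poisson model the variance equals the mean, so the relative
# standard error of a count is roughly 1 / sqrt(count).
rel_se <- 1 / sqrt(raw_count)

print(reads_per_kb)
print(round(rel_se, 3))
```

Both genes come out at the same reads-per-kilobase value, yet the value for the short gene rests on only 5 reads and is far less precise. That difference in precision is no longer visible once only the length-adjusted numbers are kept.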
Hi Wolfgang,

Thanks for the note. I understand it is more statistically sound to work on the count-level data before assessing DGE. However, Wei's original email to João says:

"Or you can try quantile between-sample normalization (read counts should be adjusted by gene length first), which performs normalization across samples. You can try all these in edgeR package. From my experience, I actually found the quantile method performed better for my RNA-seq data."

Wei also mentioned some empirical evidence of the superiority of quantile normalization on the data from the MAQC project (calibrated using gold-standard qPCR data). I guess I may have misunderstood the message there.

Yiwen
@davis-mccarthy-4138
Last seen 7.1 years ago
Hi Yiwen,

The "quantile normalization" option in calcNormFactors in edgeR does something very different from the quantile normalization (microarray-style) that Wei has been discussing.

The quantile normalization in calcNormFactors computes an offset for sequencing library depth, following Bullard et al. (2010) [1]. This is an approach in the same vein as TMM normalization [2] or scaled median [3].

I believe that the approach Wei is suggesting is more similar to the quantile normalization approach that has been taken with microarray data, adjusting the data so that the response follows the same distribution across (in this context) sequenced libraries. Adjusting counts in this way will typically produce non-integer data, but count-based methods could still be used if this quantile normalization were treated as an offset for each observation in (e.g.) a generalized linear model.

Cheers
Davis

[1] http://www.biomedcentral.com/1471-2105/11/94
[2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed
[3] http://genomebiology.com/2010/11/10/R106#B13

--------------------------------------------------
Davis J McCarthy
Research Technician
Bioinformatics Division
Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Vic 3052, Australia.
dmccarthy at wehi.edu.au
http://www.wehi.edu.au
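The distinction drawn above can be sketched in base R. This is a hand-rolled illustration on simulated data, not edgeR's or limma's implementation: `quantile_normalize` is a hypothetical helper written for this example, ties are broken by order of appearance (a simplification; library implementations may treat ties differently), and the per-library factor is only in the spirit of an upper-quartile scaling.

```r
set.seed(1)
# Simulated counts for two libraries of different depth (made-up data).
counts <- cbind(libA = rpois(200, lambda = 20),
                libB = rpois(200, lambda = 60))

# (a) Scaling approach: ONE number per library.
# Here: the 75th percentile of the counts relative to library size,
# rescaled so that the factors multiply to 1.
lib_size <- colSums(counts)
uq <- apply(counts, 2, quantile, probs = 0.75) / lib_size
scale_factor <- uq / exp(mean(log(uq)))

# (b) Microarray-style quantile normalization: EVERY value is changed so
# that all libraries end up with exactly the same empirical distribution.
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # position of each value
  sorted <- apply(m, 2, sort)                         # each column sorted
  target <- rowMeans(sorted)                          # shared reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks onto the reference
  dimnames(out) <- dimnames(m)
  out
}
qn <- quantile_normalize(counts)

print(scale_factor)
print(summary(qn))
```

Approach (a) leaves the counts themselves untouched and yields one factor per library. Approach (b) returns a matrix of generally non-integer values, which is why it is proposed above as an offset rather than as a replacement for the counts.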
Hi Yiwen:

As Davis said, the "length+quantile" method I mentioned in the previous correspondence is not the "quantile normalization" option in the calcNormFactors function in edgeR. That's the reason why you didn't see gene length adjustment with that function.

Adjusting read counts using gene length (total exon length) will put all genes on the same baseline within the sample (longer transcripts produce more reads), and quantile between-sample normalization will make all samples have the same read count distribution (and library size will become the same as well). This is what I mean by "length+quantile" normalization. The quantile normalization here is the same quantile normalization applied to microarray data; however, it is applied to sequencing data in a different way (used as offsets in the generalized linear model).

Now I will elaborate on how to do this normalization. Suppose you have a read count matrix "x" with rows being genes and columns being samples. Also suppose you have a numeric vector "gene.length" which includes the total exon length for each gene, with the gene order in "gene.length" the same as that in "x". The following line of code yields the number of reads per 1000 bases for each gene:

    x1 <- x*1000/gene.length

Now perform quantile normalization on the gene-length-adjusted data:

    library(limma)
    x2 <- normalizeBetweenArrays(x1,method="quantile")

Suppose x has two columns named "wt" and "ko". Create a design matrix:

    snames <- factor(c("wt","ko"))
    design <- model.matrix(~snames)

Now get the offsets for each gene in each sample. The offsets are the log-scale differences between the raw data and the normalized data.

    library(edgeR)
    y <- DGEList(counts=x,group=colnames(x))
    lowcounts <- rowSums(x)<5
    offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1)
    yf <- y[!lowcounts,]

Fit generalized linear models to the read count data with offsets included:

    y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE,tagwise=TRUE)
    fit <- glmFit(y=y.glm,design=design,dispersion=y.glm$CR.tagwise.dispersion,offset=offset)

Perform likelihood ratio tests to find differentially expressed genes:

    DE <- glmLRT(y.glm,fit)
    dt <- decideTestsDGE(DE)
    summary(dt)

Hope this will work for you!

Cheers,
wei
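To make the role of an offset in a count model concrete, here is a minimal sketch using base R's `glm` with a Poisson family. This is a deliberate simplification of the negative binomial models that edgeR fits, and the toy counts, offsets, and variable names are hypothetical. The offset enters the model as log(mean) = offset + design %*% coefficients, so the counts stay as integers while the normalization is carried by the offset.

```r
# Toy data: two "wt" and two "ko" observations of one gene.
count <- c(10, 10, 40, 40)
group <- factor(c("wt", "wt", "ko", "ko"), levels = c("wt", "ko"))

# Hypothetical per-observation offsets on the log scale, standing in for
# log(raw) - log(normalized): the "ko" observations are inflated by a
# technical factor of 2 that should not count as biological signal.
off <- log(c(1, 1, 2, 2))

# Model with the offset: the technical factor is removed from the comparison.
fit_offset <- glm(count ~ group, family = poisson(), offset = off)

# Model without the offset: the technical factor leaks into the estimate.
fit_plain <- glm(count ~ group, family = poisson())

log_fc_offset <- unname(coef(fit_offset)["groupko"])
log_fc_plain  <- unname(coef(fit_plain)["groupko"])

print(exp(c(with_offset = log_fc_offset, without_offset = log_fc_plain)))
```

With the offset, the estimated ko/wt ratio is 2; without it, the ratio is 4, because the factor of 2 carried by the offset is then attributed to the group effect. The same idea underlies passing a gene-by-sample offset matrix to a count-based GLM.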
Hi Wei and Davis,

Thank you so much for such detailed explanations! Now it is very clear. In the case where you found the benefit of using quantile normalization + GLM + LRT, was it a single-factor experiment with many libraries, or multi-factor data?

Yiwen
Hi Yiwen:

It is a single-factor experiment with six libraries. There were four cell types in this experiment, one of which had three replicates; the others had no replicates.

Cheers,
Wei
>>> dmccarthy at wehi.edu.au >>> http://www.wehi.edu.au >> >> >> ______________________________________________________________________ >> The information in this email is confidential and intended solely for the >> addressee. >> You must not disclose, forward, print or use it without the permission of >> the sender. >> ______________________________________________________________________ >> > > ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}} ADD REPLY 0 Entering edit mode Thanks. > Hi Yiwen: > > It is a single factor experiment with six libraries. There were four cell > types in this experiment, one of which had three replicates and others > did not have replicates. > > Cheers, > Wei > > On May 2, 2011, at 11:21 AM, ywchen at jimmy.harvard.edu wrote: > >> Hi Wei and Davis, >> >> Thank you so much for such detailed explanations! Now it is very clear. >> In your case you found the benefit of using quantile >> normalization+GLM+LRT, >> is it single factor with many libraries or multiple factor data? >> >> Yiwen >> >> >>> Hi Yiwen: >>> >>> As Davis said, the "length+quantile" method I mentioned in the >>> previous >>> correspondences is not the "quantile normalization" option in >>> calcNormFactors function in edgeR. That's the reason why you didn't see >>> gene length adjustment with that function. >>> >>> Adjusting read counts using gene length (total exon length) will put >>> all >>> genes on the same baseline within the sample (longer transcripts >>> produce >>> more reads), and quantile between-sample normalization will make all >>> samples have the same read count distribution (and library size will >>> become the same as well). This is what I mean by "length+quantile" >>> normalization. 
The quantile normalization here is the same quantile >>> normalization applied to microarray data, however it is applied to >>> sequencing data in a different way (used as offsets in the general >>> linear >>> model). >>> >>> Now I elaborate how to do this normalization. Suppose you have a read >>> count matrix of "x" with rows being genes and columns being samples . >>> Also suppose you have a numeric vector "gene.length" which includes >>> total >>> exon length for each gene and gene order in "gene.length" is the same >>> with that in "x". The following line of code yields the number of reads >>> per 1000 bases for each gene: >>> >>> x1 <- x*1000/gene.length >>> >>> Now perform quantile normalization for gene length adjusted data: >>> >>> library(limma) >>> x2 <- normalizeBetweenArrays(x1,method="quantile") >>> >>> Suppose x has two columns named "wt" and "ko". Create a design matrix: >>> >>> snames <- factor(c("wt","ko")) >>> design <- model.matrix(~snames) >>> >>> Now get the offsets for each gene in each sample. The offsets are the >>> intensity differences between raw data and normalized data. >>> >>> library(edgeR) >>> y <- DGEList(counts=x,group=colnames(x)) >>> lowcounts <- rowSums(x)<5 >>> offset <- log(x[!lowcounts,]+0.1)-log(x2[!lowcounts,]+0.1) >>> yf <- y[!lowcounts,] >>> >>> Fit general linear models to read count data with offsets included: >>> >>> y.glm <- estimateCRDisp(y=yf,design=design,offset=offset,trend=TRUE, >>> tagwise=TRUE) >>> fit <- >>> glmFit(y=y.glm,design=design,dispersion=y.glm$CR.tagwise.dispersio n,offset=offset) >>> >>> Perform likelihood ratio tests to find differentially expressed genes: >>> >>> DE <- glmLRT(y.glm,fit) >>> dt <- decideTestsDGE(DE) >>> summary(dt) >>> >>> Hope this will work for you! 
>>> >>> Cheers, >>> wei >>> >>> >>> >>> On May 2, 2011, at 8:52 AM, Davis McCarthy wrote: >>> >>>> Hi Yiwen >>>> >>>> The "quantile normalization" option in calcNormFactors in edgeR does >>>> something very different from the quantile normalization >>>> (microarray-style) that Wei has been discussing. >>>> >>>> The quantile normalization in calcNormFactors computes an offset for >>>> sequencing library depth after Bullard et al (2010) [1]. This is an >>>> approach in the same vein as TMM normalization [2] or scaled median >>>> [3]. >>>> >>>> I believe that the approach that Wei is suggesting is more similar to >>>> the quantile normalization approach that has been taken with >>>> microarray >>>> data, adjusting the data so that the response follows the same >>>> distribution across (in this context) sequenced libraries. This will >>>> typically result in non-integer data from adjusting counts, but >>>> count-based methods could still be used if this quantile normalization >>>> were treated as an offset for each observation in (e.g.) a generalized >>>> linear model. >>>> >>>> Cheers >>>> Davis >>>> >>>> >>>> [1] http://www.biomedcentral.com/1471-2105/11/94 >>>> [2] http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2864565/?tool=pubmed >>>> [3] http://genomebiology.com/2010/11/10/R106#B13 >>>> >>>> >>>>> Hi Wei, >>>>> >>>>> Could you elaborate on how to appropriately do gene-length- adjusted >>>>> quantile normalization in edgeR? The "quantile normalization" option >>>>> in >>>>> calcNormFactors function does not seem to take into account the gene >>>>> length. >>>>> >>>>> Thanks. >>>>> Yiwen >>>>>> Hi Jo?o: >>>>>> >>>>>> Maybe you can try different normalization methods for your data to >>>>>> see >>>>>> which one looks better. How to best normalize RNA-seq data is still >>>>>> of >>>>>> much debate at this stage. >>>>>> >>>>>> You can try scaling methods like TMM, RPKM, or 75th percentile, >>>>>> which >>>>>> as >>>>>> you said normalize data within samples. 
Or you can try quantile >>>>>> between-sample normalization (read counts should be adjusted by gene >>>>>> length first), which performs normalization across samples. You can >>>>>> try >>>>>> all these in edgeR package. >>>>>> >>>>>> From my experience, I actually found the quantile method performed >>>>>> better >>>>>> for my RNA-seq data. I used general linear model and likelihood >>>>>> ratio >>>>>> test in edgeR in my analysis. >>>>>> >>>>>> Hope this helps. >>>>>> >>>>>> Cheers, >>>>>> Wei >>>>>> >>>>>> On Apr 28, 2011, at 7:36 PM, Jo?o Moura wrote: >>>>>> >>>>>>> Dear all, >>>>>>> >>>>>>> >>>>>>> Until now I was doing RNAseq DE analysis and to do that I >>>>>>> understand >>>>>>> that >>>>>>> normalization issues only matter inside samples, because one can >>>>>>> assume >>>>>>> the >>>>>>> length/content biases will cancel out when comparing same genes in >>>>>>> different >>>>>>> samples. >>>>>>> Although, I'm now trying to compare correlation of different genes >>>>>>> and >>>>>>> so, >>>>>>> this biases should be taken into account - for this is there any >>>>>>> better >>>>>>> method than RPKM? >>>>>>> >>>>>>> My main doubt is if I should also take into acount the biases >>>>>>> inside >>>>>>> samples >>>>>>> and to do that is there any better approach then TMM by Robinson >>>>>>> and >>>>>>> Oshlack >>>>>>> [2010]? 
>>>>>>> >>>>>>> Thank you all, >>>>>>> -- >>>>>>> Jo?o Moura >>>>>>> >>>>>>> [[alternative HTML version deleted]] >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioconductor mailing list >>>>>>> Bioconductor at r-project.org >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>>> Search the archives: >>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>>> >>>>>> ______________________________________________________________________ >>>>>> The information in this email is confidential and >>>>>> intend...{{dropped:6}} >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: >>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: >>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> >>>> >>>> >>>> -------------------------------------------------- >>>> Davis J McCarthy >>>> Research Technician >>>> Bioinformatics Division >>>> Walter and Eliza Hall Institute of Medical Research >>>> 1G Royal Parade, Parkville, Vic 3052, Australia. >>>> dmccarthy at wehi.edu.au >>>> http://www.wehi.edu.au >>> >>> >>> ______________________________________________________________________ >>> The information in this email is confidential and intended solely for >>> the >>> addressee. >>> You must not disclose, forward, print or use it without the permission >>> of >>> the sender. 
>>> ______________________________________________________________________ >>> >> >> > > > ______________________________________________________________________ > The information in this email is confidential and inte...{{dropped:8}}
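The "length+quantile" steps in Wei's message above can be sketched in base R without any packages. This is only an illustration under stated assumptions: the toy matrix `x`, the lengths in `gene.length`, and the helper `quantile_normalize` are hypothetical and are not part of edgeR or limma. The helper is a plain quantile normalization with no special handling of ties or missing values, so it stands in for, but is not the same code as, `normalizeBetweenArrays(x1, method="quantile")`.

```r
# Toy data (hypothetical): 5 genes x 3 samples, plus exon lengths in bases.
x <- matrix(c(10, 200, 35, 0, 80,
              14, 260, 30, 2, 95,
               7, 150, 50, 1, 60),
            nrow = 5,
            dimnames = list(paste0("gene", 1:5), c("s1", "s2", "s3")))
gene.length <- c(1500, 4000, 800, 1200, 2500)

# Step 1: reads per 1000 bases. R recycles gene.length down each column,
# so row i is divided by gene.length[i].
x1 <- x * 1000 / gene.length

# Step 2: plain quantile normalization (hypothetical helper, assumes no NAs).
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")  # rank within each sample
  target <- rowMeans(apply(m, 2, sort))               # mean of each order statistic
  out    <- apply(ranks, 2, function(r) target[r])    # map ranks to target values
  dimnames(out) <- dimnames(m)
  out
}
x2 <- quantile_normalize(x1)

# Step 3: per-gene, per-sample offsets on the log scale, as in the message.
offset <- log(x + 0.1) - log(x2 + 0.1)
```

After step 2 every column of `x2` holds the same set of values, which is why the normalized library sizes coincide; the `offset` matrix has one entry per gene and sample and is what would be passed to the GLM fitting functions.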
@alicia-oshlack-2241
Last seen 7.1 years ago
Hi,

Just to get back to the original question: I tend to agree with Wolfgang. If you are looking for correlations between genes, the correlations will be length biased, with pairs of longer genes getting higher correlation values. This is a different test from differential expression. I believe we can correct the correlation estimate itself rather than correcting the expression values using something like RPKM. I believe that using RPKM will not remove length bias in correlations between genes, just as it does not remove length bias in DE testing. We are currently working on a way to correct correlations.

Cheers,
Alicia

> Date: Sun, 1 May 2011 21:34:43 -0400 (EDT)
> From: ywchen at jimmy.harvard.edu
> To: "Wei Shi" <shi at wehi.edu.au>
> Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
> Subject: Re: [BioC] RNASeq: normalization issues
>
> Thanks.

Thank you all for your opinions.

Alicia, can you give me some tips on how you are thinking of doing that?

Best regards,

On Mon, May 2, 2011 at 12:36 PM, Alicia Oshlack <oshlack@wehi.edu.au> wrote:

> Hi,
>
> Just to get back to the original question: I tend to agree with Wolfgang.
> If you are looking for correlations between genes, the correlations will
> be length biased, with pairs of longer genes getting higher correlation
> values. This is a different test from differential expression. I believe
> we can correct the correlation estimate itself rather than correcting
> the expression values using something like RPKM. I believe that using RPKM
> will not remove length bias in correlations between genes, just as it does
> not remove length bias in DE testing. We are currently working on a way
> to correct correlations.
>
> Cheers,
> Alicia
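One way to see part of Alicia's point is that Pearson correlation is unchanged when each gene is multiplied by its own positive constant, so dividing by gene length cannot alter the correlation between two genes across samples. The sketch below uses made-up counts and lengths (`counts`, `gene.length` and `per_kb` are hypothetical names) and covers only the per-gene length division; the library-size part of RPKM is a per-sample factor and is left out. It is not the correction Alicia's group was working on.

```r
# Hypothetical counts for two genes across six samples, with made-up lengths.
counts <- rbind(geneA = c(48, 55, 39, 61, 50, 44),
                geneB = c(410, 395, 372, 450, 401, 388))
gene.length <- c(geneA = 1000, geneB = 8000)

# Reads per 1000 bases: each row is scaled by its own constant.
per_kb <- counts * 1000 / gene.length

# Between-gene Pearson correlation before and after the length adjustment.
r_raw <- cor(counts["geneA", ], counts["geneB", ])
r_kb  <- cor(per_kb["geneA", ], per_kb["geneB", ])

# The two agree up to floating-point error.
same <- isTRUE(all.equal(r_raw, r_kb))
```

Whatever length dependence the raw between-gene correlations have is therefore carried over unchanged by this step, which is consistent with correcting the correlation estimate itself rather than the expression values.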