edgeR normalization factors
王喆 ▴ 60
@-4142
Last seen 10.2 years ago
Hello,

I have a question about using TMM normalization factors. I want to adjust the count for each gene after normalization. Do I just need to divide the count of each gene by the normalization factor for its library? I could then use the normalized data for DE analysis and for further analyses (e.g. clustering).

Thanks a lot,
Zhe
Normalization • 3.2k views
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.6 years ago
United States
Multiply.

And yes, you should use the normalized data for DE and clustering. Otherwise, highly expressed genes in your sample will depress the expression of other genes relative to the size of the library, inducing spurious "differential" expression. I have been simulating data to try to understand this better.

--Naomi
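The composition effect mentioned in this post (a few highly expressed genes depressing everything else relative to library size) can be seen with a small arithmetic sketch. The counts below are made up for illustration, and the scaling shown is plain division by the library total, not TMM and not any edgeR function.

```python
# Hypothetical counts for four genes in two libraries (illustrative numbers only).
sample_a = [100, 100, 100, 100]
sample_b = [100, 100, 100, 700]  # only gene 4 differs between the libraries


def proportions(counts):
    """Scale each count by the library total (no normalization factor applied)."""
    total = sum(counts)
    return [c / total for c in counts]


prop_a = proportions(sample_a)
prop_b = proportions(sample_b)

# Ratio B/A per gene after total-count scaling.
ratios = [b / a for a, b in zip(prop_a, prop_b)]
print(ratios)
```

Genes 1 to 3 have identical raw counts in both libraries, yet after scaling by the total they appear 2.5 times lower in the second library, purely because gene 4 takes up a larger share of the reads.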
(Travelling, so this is a rather quick response.)

I disagree with Naomi.

First, for a differential expression analysis, we prefer to use the counts as is, and use the normalization factors as offsets in the statistical modeling. So, these normalization factors actually DO NOT change the data (this is unlike microarray data normalization).

Second, for clustering, visualization etc. you may want to calculate a normalized expression value. Multiply the normalization factors that you calculate using calcNormFactors() by the library sizes (see Section 6 of the manual); you can then DIVIDE the raw counts for each library by that product. Maybe also multiply by 10M so you have counts per 10M?

I think what Naomi is talking about (highly expressed genes depressing the expression of other genes) is covered in our paper: http://genomebiology.com/2010/11/3/R25

Cheers,
Mark
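A minimal sketch of the arithmetic Mark describes for clustering and visualization: divide each raw count by (library size × normalization factor), then scale to counts per 10M. This is plain Python for illustration only. The function name `counts_per_10m` and the example numbers are hypothetical, and the normalization factor is assumed to have been computed already (for example by calcNormFactors() in edgeR).

```python
def counts_per_10m(counts, lib_size, norm_factor):
    """Normalized expression values for ONE library.

    counts      -- raw counts for the genes in this library
    lib_size    -- total number of reads in this library
    norm_factor -- TMM normalization factor for this library (assumed given)
    """
    effective_lib_size = lib_size * norm_factor
    return [c / effective_lib_size * 1e7 for c in counts]


# Hypothetical library: 2 million reads, normalization factor 0.8.
normalized = counts_per_10m([50, 200], lib_size=2_000_000, norm_factor=0.8)
print(normalized)
```

Note that the raw counts are divided by the product, not by the normalization factor alone, and that these values are meant for exploration only; the DE analysis itself keeps the raw counts.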
Zhe,

for clustering and similar endeavours, transforming the data to a "logarithm-like" variance-stabilised scale is useful. See e.g. chapter 7, "Sample clustering", of the vignette of the DESeq package.

For differential expression, I agree with Mark that you want to use the counts as is, and use the normalization factors as parameters in the statistical modeling.

Wolfgang
Thanks, Mark, and have a good trip.

Zhe
Hi,

For visualization, the normalized values should do the job. For clustering, however, you may still run into problems, because count data, normalized or not, are heteroskedastic. If you feed such data to a typical distance function such as R's 'dist', the result will depend almost entirely on the most strongly expressed genes, as they have the largest variance.

Hence, you should perform a variance-stabilizing transformation (VST) on the data before handing it to dist (or to any other statistical function that is designed for homoskedastic data). Our 'DESeq' package (another tool for the same use case as edgeR, using a different way to estimate variance) has such a function ('getVarianceStabilizedData'), but it assumes that you use DESeq's variance estimation scheme. The vignette explains how to use it, e.g. for clustering.

If you prefer to stick to edgeR: to my knowledge, it does not have this functionality, but you could add it yourself with a one-liner as follows. edgeR's variance-mean relationship is

    variance = mean + common_dispersion * mean^2

and from such a function, the VST is obtained by integrating variance^(-1/2) w.r.t. the mean. According to Wolfram Alpha, this gives

    transformed_data = 2 * asinh( sqrt( common_dispersion * normalized_count ) ) / sqrt( common_dispersion )

but you may want to double-check this.

Simon
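The asinh formula above can be double-checked numerically, as Simon suggests. The sketch below is plain Python, not edgeR or DESeq code; the function name `vst` and the dispersion value 0.1 are assumptions chosen for illustration. It confirms that the derivative of the transform matches variance^(-1/2) for variance = mean + common_dispersion * mean^2, and then computes a Euclidean distance on the transformed scale.

```python
import math


def vst(normalized_count, common_dispersion):
    """Variance-stabilizing transform for variance = mean + dispersion * mean^2."""
    root_phi = math.sqrt(common_dispersion)
    return 2.0 * math.asinh(root_phi * math.sqrt(normalized_count)) / root_phi


phi = 0.1   # hypothetical common dispersion
mu = 100.0  # point at which the derivative is checked
h = 1e-4

# Central-difference derivative of the transform at mu.
numeric_slope = (vst(mu + h, phi) - vst(mu - h, phi)) / (2 * h)
# What the slope should be: variance^(-1/2).
expected_slope = 1.0 / math.sqrt(mu + phi * mu ** 2)
print(numeric_slope, expected_slope)

# Distance between two hypothetical samples after the transform (Python 3.8+).
sample_1 = [vst(x, phi) for x in [5.0, 50.0, 5000.0]]
sample_2 = [vst(x, phi) for x in [8.0, 40.0, 5200.0]]
print(math.dist(sample_1, sample_2))
```

As the dispersion approaches zero the transform approaches 2 * sqrt(count), the usual variance-stabilizing transform for Poisson counts, which is a useful sanity check.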
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.6 years ago
United States
Of course Mark is right.

--Naomi
Naomi Altman ★ 6.0k
@naomi-altman-380
Last seen 3.6 years ago
United States
Of course Mark is correct for DE analysis. What I should have said is that the normalized library size should be used for DE. And this is certainly covered in the paper.

For clustering, I think you probably will need to change the data, but it depends on what you are clustering and on the distance measure.

--Naomi
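To illustrate what "use the normalized library size for DE" means in practice, here is a rough sketch of an offset in a log-linear count model: the raw counts stay untouched, and the effective library size (library size × normalization factor) enters only through the expected value. This shows the general idea, not edgeR's internal code; the function names and numbers are hypothetical.

```python
import math


def offset(lib_size, norm_factor):
    """Log of the effective (normalized) library size, used as a model offset."""
    return math.log(lib_size * norm_factor)


def expected_count(log_abundance, lib_size, norm_factor):
    """Expected raw count for a gene with the given log relative abundance."""
    return math.exp(log_abundance + offset(lib_size, norm_factor))


# Same gene (relative abundance 2e-5) in two hypothetical libraries.
log_abundance = math.log(2e-5)
mu_1 = expected_count(log_abundance, lib_size=1_000_000, norm_factor=1.0)
mu_2 = expected_count(log_abundance, lib_size=2_000_000, norm_factor=0.9)
print(mu_1, mu_2)
```

The model compares the observed raw counts against these expected values, so differences in sequencing depth and composition are absorbed by the offset rather than by rescaling the data.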
Thank you for your suggestions.

Zhe