Single nucleotide based RNAseq normalization with edgeR


Gordon Smyth 53k

@gordon-smyth


WEHI, Melbourne, Australia

Hi Jens,

I don't know what you mean by single nucleotide based normalization, however the following comments may be helpful.

edgeR automatically adjusts for library sizes, whether you include an explicit normalization step or not. Normalization is a separate issue, and is intended to deal with more subtle issues.

Normalization, as edgeR does it, does not require replicates.

Best wishes
Gordon

> Date: Fri, 04 Feb 2011 11:28:15 +0100
> From: Jens Georg <jens.georg at biologie.uni-freiburg.de>
> To: bioconductor at r-project.org
> Subject: [BioC] Single nucleotide based RNAseq normalization with edgeR?
>
> Dear edgeR users and developers,
>
> we used Solexa sequencing in order to detect RNase E processing sites.
> To do so, we split an RNA sample and treated one half with RNase E
> prior to cDNA synthesis and sequencing. The libraries differ in size
> (1,918,953 and 1,208,586 reads respectively), which clearly necessitates
> a normalization step. Furthermore, we expect site-specific differences
> rather than differences in the accumulation of the full-length RNAs.
>
> So I want to ask whether it is appropriate to do a single nucleotide based
> normalization with edgeR, and whether you think a reliable basic
> normalization is possible without replicates?
>
> Thank you for your comments.
>
> Best regards
>
> Jens
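As a rough illustration of what a library-size adjustment amounts to, here is a minimal base-R sketch. It uses the two library sizes from the question; the site names and counts are made up for illustration. edgeR makes this adjustment inside its statistical model rather than by rescaling the counts, so this only shows the idea, not edgeR's internals.

```r
# Library sizes quoted in the original question (reads per library).
lib.size <- c(untreated = 1918953, treated = 1208586)

# Hypothetical raw counts for two sites (made-up numbers).
counts <- rbind(siteA = c(192, 121),
                siteB = c(192, 30))
colnames(counts) <- names(lib.size)

# Divide each column by its library size and rescale to counts per million.
# t() is used so that the division recycles lib.size along the libraries.
per.million <- t(t(counts) / lib.size) * 1e6

# siteA has very different raw counts in the two libraries, but nearly the
# same relative abundance; siteB is lower in the treated library either way.
round(per.million, 1)
```

Comparing raw counts alone would suggest that siteA changes between libraries, whereas the scaled values show that the difference is explained by sequencing depth.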

Sequencing RNASeq Normalization edgeR • 2.3k views

ADD COMMENT • link updated 14.9 years ago by Jens Georg ▴ 20 • written 14.9 years ago by Gordon Smyth 53k


Jens Georg ▴ 20

@jens-georg-4467


Germany

Hi Gordon,

thank you for your reply. The resolution of our ~100 nt Solexa reads is too low to detect individual processing sites, so we want to investigate every single nucleotide individually ("single nucleotide based normalization"). That means we count how often each individual nucleotide is covered by sequence reads. Of course, this approach artificially inflates lib.size by a factor that depends on the length of the Solexa reads. As lib.size is critical for the normalization, I am not sure whether I should use the original read numbers for each library, or the read numbers multiplied by the read length to account for the per-nucleotide counting.

I have two more questions regarding the normalization:

1. Are the norm factors calculated by calcNormFactors() automatically used in further steps such as estimateCommonDisp()?
2. Are the pseudocounts calculated by estimateCommonDisp() the normalized read counts?

Many thanks
Jens
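The inflation of lib.size described here can be checked on a toy example. The sketch below is base R only; the transcript length, read length and read start positions are all hypothetical values chosen for illustration, and every read is assumed to lie fully inside the transcript.

```r
# Hypothetical setup: 100-nt reads on a 300-nt transcript (made-up data).
read.length <- 100
transcript.length <- 300
starts <- c(1, 1, 20, 50, 50, 50, 120, 201)

# Per-nucleotide coverage: how many reads overlap each position.
coverage <- integer(transcript.length)
for (s in starts) {
  idx <- s:min(s + read.length - 1, transcript.length)
  coverage[idx] <- coverage[idx] + 1L
}

n.reads <- length(starts)       # number of reads actually sequenced
total.coverage <- sum(coverage) # "library size" when counting per nucleotide

# Each read contributes one count to every nucleotide it covers, so the
# per-nucleotide total is the read count times the read length.
total.coverage / n.reads
```

Under these assumptions the ratio equals the read length, which is the factor the question refers to.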

ADD COMMENT • link 14.9 years ago Jens Georg ▴ 20


Hi Gordon,

First, I would like to thank Jens for asking the questions that I had asked a few days ago. In addition to Jens's questions, I have one more question about my RNA-seq data:

1. I would like to know whether I can multiply the counts for each gene by the norm.factor (calculated by the calcNormFactors() function).

Thanks
Sridhara

--
Sridhara G Kunjeti
PhD Candidate
University of Delaware
Department of Plant and Soil Science

ADD REPLY • link 14.9 years ago Sridhara Gupta Kunjeti ▴ 320


Hi Jens/Sridhara.

A few thoughts below.

On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote:

> 1. I would like to know if I can multiply the counts for each gene with the
> norm.factor (calculated by "calcNormFactors( )" function)

Sridhara, you've asked this exact question before and I answered (short answer: NO to multiplying; instead, divide by [library size]*[normalization factor]):

https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html

Perhaps you can clarify what you don't understand.

> On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg wrote:
>
>> That means that we count, how often an individual nucleotide is covered by
>> sequence reads. Of course, this approach will virtually increase the
>> lib.size by a factor which depends on length of the solexa reads. As the
>> lib.size is critical for the normalization, I am not sure if I should use
>> the original read numbers for each library or the read numbers multiplicated
>> with the read length to adjust for the single nucleotide investigation.

So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped. While I think this will work out OK (the normalization calculation should be fine), this coverage calculation does impose a (strong?) dependence between adjacent nucleotides. One alternative would be to count only the reads that *begin* at a given nucleotide. Then your library sizes are as normal.

>> I have two more question regarding to the normalization:
>> 1. Are the norm factors calculated by the calcNormFactors( ) function
>> automatically used for further steps like the estimateCommonDisp( )
>> function?

Yes.

>> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized
>> readcounts?

Yes, but this only accounts for overall depth and potential composition biases, not for length biases (or any others). The pseudocounts are intended for comparing a given gene across conditions. The inferences for differential expression are still done on the raw counts.

Hope that helps.
Mark

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
------------------------------
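The alternative of counting only the reads that begin at each nucleotide can be sketched in base R as follows. The start positions and transcript length are hypothetical, and `tabulate()` is used here simply as one convenient way to count; nothing in the thread prescribes a particular implementation.

```r
# Hypothetical read start positions on a 300-nt transcript (made-up data).
transcript.length <- 300
starts <- c(1, 1, 20, 50, 50, 50, 120, 201)

# Count, for every position, the number of reads that begin there.
start.counts <- tabulate(starts, nbins = transcript.length)

# Every read is counted exactly once, so the column total is the number of
# reads and the library size needs no correction for read length.
lib.size <- sum(start.counts)

# Positions with at least one read start, and how many reads start there.
data.frame(position = which(start.counts > 0),
           count = start.counts[start.counts > 0])
```

Because each read contributes to a single position, counts at neighbouring nucleotides no longer share the same reads, which is the dependence issue raised above.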

ADD REPLY • link 14.9 years ago Mark Robinson ★ 1.1k


Hello Mark,

This is a follow-up on the normalization of the counts. Did you mean

(count / library size) * Norm.factor

Can I take the library size and Norm.factor values from edgeR?

Thanks,
Sridhara

ADD REPLY • link 14.9 years ago Sridhara Gupta Kunjeti ▴ 320


Hi Sridhara.

On 2011-02-10, at 4:34 AM, Sridhara Gupta Kunjeti wrote:

> This is in continuation with the normalization of the counts:
> did you mean
>
> (count / library size) * Norm.factor
>
> Can I use the numbers for the library size and Norm.factor can be used from the edgeR?

No. Actually, I mean what I wrote in both previous posts. I'll repeat again and hopefully third time lucky:

    rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6

So, this translates to:

    count / (lib.size*Norm.factor)

... and you may multiply by a factor to put it on a different scale (e.g. multiply by 1M as I've done above). And you should remember all the previous caveats that I've mentioned: there is no need to do this for a differential expression analysis, as edgeR already builds this in, and this doesn't account for other biases such as gene length.

Hope that helps.
Mark
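To make the difference between the two formulas concrete, here is a small base-R sketch. The object `d` is a plain list that only mimics the field names used in the formula above (`counts`, `samples$lib.size`, `samples$norm.factors`); it is not a real edgeR DGEList, and all numbers are invented.

```r
# Toy stand-in for a DGEList (hypothetical data, not produced by edgeR).
d <- list(
  counts = matrix(c(10, 200, 40,
                    30, 500, 90),
                  nrow = 3,
                  dimnames = list(c("g1", "g2", "g3"), c("s1", "s2"))),
  samples = data.frame(lib.size = c(250, 620),
                       norm.factors = c(1.25, 0.8),
                       row.names = c("s1", "s2"))
)

# Effective library size = library size * normalization factor.
eff.lib.size <- d$samples$lib.size * d$samples$norm.factors

# The formula from the post: divide by the effective library size.
rpm <- t(t(d$counts) / eff.lib.size) * 1e6

# The variant asked about, (count / lib.size) * norm.factor, for comparison.
# It moves the values in the opposite direction and is NOT what is meant.
variant <- t(t(d$counts) / d$samples$lib.size * d$samples$norm.factors) * 1e6

rpm
variant
```

With a norm factor above 1 (sample s1 here), dividing by the effective library size lowers the scaled values, whereas the multiplied variant raises them. Recent edgeR releases also provide a `cpm()` function that should compute the same quantity from a real DGEList, although that function is not mentioned in this thread.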

ADD REPLY • link 14.9 years ago Mark Robinson ★ 1.1k


Hello Mark, Yes, Now it is clear to me. Thank you very much for being patient in responding to my questions. Many thanks! Sridhara On Wed, Feb 9, 2011 at 5:16 PM, Mark Robinson <mrobinson@wehi.edu.au> wrote: > Hi Sridhara. > > On 2011-02-10, at 4:34 AM, Sridhara Gupta Kunjeti wrote: > > > Hello Mark, > > This is in continuation with the normalization of the counts: > > did you mean > > > > (count / library size) * Norm.factor > > Can I use the numbers for the library size and Norm.factor can be used > from the edgeR? > > > No. Actually, I mean what I wrote in both previous posts. I'll repeat > again and hopefully third time lucky: > > rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6 > > So, this translates to: > > count / (lib.size*Norm.factor) > > ... and you may multiply by a factor to put it on a different scale (e.g. > multiply by 1M as I've done above). And, you should remember all the > previous caveats that I've mentioned (i.e. there is no need to do this for a > differential expression analysis as edgeR already builds this in + this > doesn't account for other biases such as gene length). > > Hope that helps. > Mark > > > > > > Thanks, > > Sridhara > > > > > > On Mon, Feb 7, 2011 at 5:11 PM, Mark Robinson <mrobinson@wehi.edu.au> > wrote: > > Hi Jens/Sridhara. > > > > A few thoughts below. > > > > On 2011-02-07, at 11:22 PM, Sridhara Gupta Kunjeti wrote: > > > > > Hi Gordon, > > > First I would like to thank Jens for asking the questions that I had > asked > > > few days ago. > > > In additions to the Jens question, I have one more question on my > RNA-seq > > > data > > > 1. I would like to know if I can multiply the counts for each gene with > the > > > norm.factor (calculated by "calcNormFactors( )" function) > > > > > > Sridhara, you've asked this exact question before and I answered (short > answer is: NO to multiplying ... 
instead, divide by [library size]*[normalization factor]):
> >
> > https://stat.ethz.ch/pipermail/bioconductor/2011-January/037564.html
> > https://stat.ethz.ch/pipermail/bioconductor/2011-January/037469.html
> >
> > Perhaps you can clarify what you don't understand.
> >
> > > On Mon, Feb 7, 2011 at 5:46 AM, Jens Georg <jens.georg@biologie.uni-freiburg.de> wrote:
> > >
> > >> Hi Gordon,
> > >> thank you for your reply. The resolution of our ~100nt solexa reads is to small to detect individual processing sites, so we want to investigate every single nucleotide individually ("single nucleotide based normalization"). That means that we count, how often an individual nucleotide is covered by sequence reads. Of course, this approach will virtually increase the lib.size by a factor which depends on length of the solexa reads. As the lib.size is critical for the normalization, I am not sure if I should use the original read numbers for each library or the read numbers multiplicated with the read length to adjust for the single nucleotide investigation.
> >
> > So basically, by counting this way, your library size is ~100x the number of reads you've actually mapped. While I think this will work out ok (normalization calculation be fine), this coverage calculation does impose a (strong?) dependence between adjacent nucleotides. One alternative would be to count the reads that *begin* at a given nucleotide and only consider these. Then your library sizes are as normal.
> >
> > >> I have two more question regarding to the normalization:
> > >> 1. Are the norm factors calculated by the calcNormFactors( ) function automatically used for further steps like the estimateCommonDisp( ) function?
> >
> > Yes.
> >
> > >> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized readcounts?
> >
> > Yes, but this is only accounting for overall depth and potential composition biases, not for length biases (or any others). It is with the intention of making inferences of a given gene across conditions. The inferences for differential expression are still done on the raw counts.
> >
> > Hope that helps.
> > Mark
> >
> > >> Many thanks
> > >>
> > >> Jens
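To make the two counting schemes discussed above concrete, here is a minimal base-R sketch. The read start positions, the 300-nt transcript length and the fixed 100-nt read length are all made-up values for illustration; none of them come from Jens's data. It only shows that per-nucleotide coverage sums to (number of reads) x (read length), while counting read starts keeps the total equal to the number of reads.

```r
# Toy example: hypothetical read start positions on a hypothetical 300-nt transcript.
read_len <- 100L
tx_len   <- 300L
starts   <- c(1L, 1L, 20L, 50L, 50L, 50L, 120L, 150L, 201L)

# (a) Per-nucleotide coverage: every read adds 1 to each base it spans.
coverage <- integer(tx_len)
for (s in starts) {
  idx <- s:min(s + read_len - 1L, tx_len)
  coverage[idx] <- coverage[idx] + 1L
}

# (b) Read-start counts: every read adds 1 to its first base only.
start_counts <- tabulate(starts, nbins = tx_len)

sum(coverage)      # number of reads * read length (all reads fit on the transcript)
sum(start_counts)  # number of reads
```

With scheme (a) neighbouring positions share most of their reads, which is the dependence between adjacent nucleotides mentioned above; with scheme (b) each read contributes to exactly one position.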

ADD REPLY • link 14.9 years ago Sridhara Gupta Kunjeti ▴ 320


Hello Mark,

If I want to include a gene length term in the code quoted below, to make the result comparable to RPKM, how should I add it? I would appreciate your help.

Many thanks in advance!
Sridhara

On Wed, Feb 9, 2011 at 6:19 PM, Sridhara Gupta Kunjeti <sridhara@udel.edu> wrote:
> Hello Mark,
> Yes, now it is clear to me.
> Thank you very much for being patient in responding to my questions.
>
> Many thanks!
> Sridhara
>
> On Wed, Feb 9, 2011 at 5:16 PM, Mark Robinson <mrobinson@wehi.edu.au> wrote:
>> Hi Sridhara.
>>
>> On 2011-02-10, at 4:34 AM, Sridhara Gupta Kunjeti wrote:
>>
>> > Hello Mark,
>> > This is in continuation with the normalization of the counts:
>> > did you mean
>> >
>> > (count / library size) * Norm.factor
>> > Can I use the numbers for the library size and Norm.factor can be used from the edgeR?
>>
>> No. Actually, I mean what I wrote in both previous posts. I'll repeat again and hopefully third time lucky:
>>
>> rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6
>>
>> So, this translates to:
>>
>> count / (lib.size*Norm.factor)
>>
>> ... and you may multiply by a factor to put it on a different scale (e.g. multiply by 1M as I've done above). And, you should remember all the previous caveats that I've mentioned (i.e. there is no need to do this for a differential expression analysis as edgeR already builds this in + this doesn't account for other biases such as gene length).
>>
>> Hope that helps.
>> Mark
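One possible way to extend the quoted `rpm` line with a gene length term is sketched below. This is only an illustration under stated assumptions, not Mark's answer: the toy `d` object is a plain list standing in for a DGEList after calcNormFactors(), and the counts, normalization factors and `gene_length` values (in bp) are invented. The idea is simply to divide the per-million values by gene length in kilobases.

```r
# Toy stand-in for a DGEList; in practice d would come from edgeR
# (DGEList() followed by calcNormFactors()). All numbers are hypothetical.
counts <- matrix(c(100, 200, 50, 400, 300, 150), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("lib1", "lib2")))
d <- list(counts  = counts,
          samples = data.frame(lib.size     = colSums(counts),
                               norm.factors = c(1.05, 1 / 1.05)))

# Hypothetical gene lengths in bp, in the same order as the rows of d$counts.
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 500)

# Effective library size = lib.size * norm.factors, as in the quoted line.
eff_lib <- d$samples$lib.size * d$samples$norm.factors

# Reads per million, exactly as quoted above.
rpm <- t(t(d$counts) / eff_lib) * 1e6

# Divide each row (gene) by its length in kb; the length vector is recycled
# down the columns, so row i is divided by gene_length[i].
rpkm <- rpm / (gene_length / 1e3)
```

As far as I know, current edgeR releases also provide `cpm()` and `rpkm()` (the latter taking a `gene.length` argument), which would be preferable to hand-rolled code on a real DGEList. The caveat from the quoted message still applies: this is for reporting or plotting, not an input for the differential expression test.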

ADD REPLY • link 14.9 years ago Sridhara Gupta Kunjeti ▴ 320


Hi Jens,

On Feb/7/11 11:46 AM, Jens Georg wrote:
> Hi Gordon,
> thank you for your reply. The resolution of our ~100nt solexa reads is to small to detect individual processing sites, so we want to investigate every single nucleotide individually ("single nucleotide based normalization"). That means that we count, how often an individual nucleotide is covered by sequence reads. Of course, this approach will virtually increase the lib.size by a factor which depends on length of the solexa reads. As the lib.size is critical for the normalization, I am not sure if I should use the original read numbers for each library or the read numbers multiplicated with the read length to adjust for the single nucleotide investigation.

Do you have reasons to assume that these options are not essentially equivalent, i.e. that the read length distributions are different in different lanes? If that were the case, more thought would probably be required on what underlying uncontrolled physical/chemical/biological effect causes this, and a suitable 'normalisation' approach should be derived from that.

Best wishes
Wolfgang

> I have two more question regarding to the normalization:
> 1. Are the norm factors calculated by the calcNormFactors( ) function automatically used for further steps like the estimateCommonDisp( ) function?
> 2. Are the pseudocounts calculated by estimateCommonDisp( ) the normalized readcounts?
>
> Many thanks
>
> Jens

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
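The equivalence raised above can be checked with a small sketch. The two library sizes are the read numbers quoted in the thread; the labels `lib1`/`lib2`, the common read length of 100 and the coverage counts at a single nucleotide are assumptions made purely for illustration. If both libraries have the same read length, that length cancels out of any between-library comparison.

```r
# Library sizes as quoted in the thread (number of reads per library).
reads <- c(lib1 = 1918953, lib2 = 1208586)

# Assumption: identical read length in both lanes.
read_len <- 100

# Hypothetical coverage counts at one nucleotide in each library.
cov_counts <- c(lib1 = 400, lib2 = 90)

# Option 1: scale by the number of reads.
per_read <- cov_counts / reads

# Option 2: scale by number of reads * read length.
per_base <- cov_counts / (reads * read_len)

# The between-library ratio is the same under both options.
ratio_per_read <- per_read["lib2"] / per_read["lib1"]
ratio_per_base <- per_base["lib2"] / per_base["lib1"]
all.equal(ratio_per_read, ratio_per_base)
```

The two options would only disagree if the read length (or its distribution) differed between the libraries, which is the case that would need the extra thought mentioned above.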

ADD REPLY • link 14.9 years ago Wolfgang Huber ★ 13k
