RNA-seq

0

Entering edit mode

Sridhara Gupta Kunjeti ▴ 320

@sridhara-gupta-kunjeti-4449

Last seen 10.8 years ago

United States

Hi, > I am a PhD candidate at University of Delaware and area of research is > RNA-seq and functional genomics in Phytophthora phaseoli. I was impressed by > your paper in Genome Biology on "A scaling normalization method for > differential expression analysis of RNA-seq data". I did use the edgeR guide > and user manual and it makes lot of sense when the DGE in a plot graph. > If I am correct the DGE plot (Visualising DGE results) is plotted using the > normalized tag counts for each gene ID, please correct me if I am wrong? > If so, I was wondering if I can generate a list or table or gene ID with > the normalized tag counts? I would appreciate if you let me know if this can > be done. > > Thanks, > Sridhara > > > -- > Sridhara G Kunjeti > PhD Candidate > University of Delaware > Department of Plant and Soil Science > email- sridhara@udel.edu > Ph: 832-566-0011 > [[alternative HTML version deleted]]

Normalization Normalization • 1.7k views

ADD COMMENT • link updated 15.0 years ago by Mark Robinson ★ 1.1k • written 15.0 years ago by Sridhara Gupta Kunjeti ▴ 320

0

Entering edit mode

Sridhara Gupta Kunjeti ▴ 320

@sridhara-gupta-kunjeti-4449

Last seen 10.8 years ago

United States

Hi, I am a PhD candidate at University of Delaware and area of research is RNA-seq and functional genomics in Phytophthora phaseoli. I was impressed by your paper in Genome Biology on "A scaling normalization method for differential expression analysis of RNA-seq data". I did use the edgeR guide and user manual and it makes lot of sense when the DGE in a plot graph. If I am correct the DGE plot (Visualising DGE results) is plotted using the normalized tag counts for each gene ID, please correct me if I am wrong? If so, I was wondering if I can generate a list or table or gene ID with the normalized tag counts? I would appreciate if you let me know if this can be done. Thanks, Sridhara -- Sridhara G Kunjeti PhD Candidate University of Delaware Department of Plant and Soil Science email- sridhara@udel.edu Ph: 832-566-0011 [[alternative HTML version deleted]]

ADD COMMENT • link 15.0 years ago Sridhara Gupta Kunjeti ▴ 320

0

Entering edit mode

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 11.4 years ago

Hi Sridhara. I'm having trouble understanding what you have already done (perhaps show some code?) and what exactly are asking, but I'll take a stab. When you say "the DGE plot", what are you referring to? A plot produced by plotSmear()? If so, then yes, the log-ratios (and 'logConc') will take into account the normalization factors. But, this still doesn't lead to "normalized counts". For this, you could do something like: d <- DGEList(counts=...) d <- calcNormFactors(d) rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6 Here, 'rpm' means reads per million, having used what you might call 'effective' library sizes i.e. adjusted for relative composition differences. Of course, if you are doing RNA-seq, you may want to include a term in there for gene length, which will make it more like an RPKM. It sort of depends what your objective is here, because none of this is really necessary for differential expression analysis (as you will see from edgeR user's guide). Note that the 'rpm' table is not to be used for any statistical analysis of differential expression, since the data has been put on an arbitrary scale. Hope that helps. Cheers, Mark On 2011-01-25, at 11:40 AM, Sridhara Gupta Kunjeti wrote: > Hi, >> I am a PhD candidate at University of Delaware and area of research is >> RNA-seq and functional genomics in Phytophthora phaseoli. I was impressed by >> your paper in Genome Biology on "A scaling normalization method for >> differential expression analysis of RNA-seq data". I did use the edgeR guide >> and user manual and it makes lot of sense when the DGE in a plot graph. >> If I am correct the DGE plot (Visualising DGE results) is plotted using the >> normalized tag counts for each gene ID, please correct me if I am wrong? >> If so, I was wondering if I can generate a list or table or gene ID with >> the normalized tag counts? I would appreciate if you let me know if this can >> be done. >> >> Thanks, >> Sridhara >> >> >> -- >> Sridhara G Kunjeti >> PhD Candidate >> University of Delaware >> Department of Plant and Soil Science >> email- sridhara at udel.edu >> Ph: 832-566-0011 >> > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ Mark Robinson, PhD (Melb) Epigenetics Laboratory, Garvan Bioinformatics Division, WEHI e: mrobinson at wehi.edu.au e: m.robinson at garvan.org.au p: +61 (0)3 9345 2628 f: +61 (0)3 9347 0852 ------------------------------ ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:6}}

ADD COMMENT • link 15.0 years ago Mark Robinson ★ 1.1k

0

Entering edit mode

Mark Robinson ★ 1.1k

@mark-robinson-2171

Last seen 11.4 years ago

Hi Sridhara. In future when asking questions on the mailing list, please Reply-All instead of just replying to the responder (me, in this case), so that others can read the entire thread if they want to. I've inserted some comments below. See "--->". > Hello Mark, > Thank you very much for your reply. I have given an example to as my questions. > > I used Bowtie for mapping the RNA-seq data (raw files) to a reference sequence, and the used the outfile to calculate the abundance of the tags > that match to the unique gene ID (of the reference sequence). > The outfile of PhyP18B1.txt looks something like this: > PITG_01618 | Pi conserved hypothetical protein (773 nt) 73 > PITG_22358 | Pi conserved hypothetical protein (1483 nt) 20 > PITG_05672 | Pi conserved hypothetical protein (1898 nt) 513 > PITG_10862 | Pi cytochrome c oxidase assembly protein COX11 (735 nt) 1346 > >> setwd("C:/Users/SRIDHARA/Documents/test/bowtie/0_mismatch") >> targets <- read.delim(file = "targets.txt", stringsAsFactors = FALSE) d <- readDGE(targets, skip = 5, comment.char = "!") >> d <- d[rowSums(d$counts) > 5, ] >> d <- calcNormFactors(d) >> d$samples > files group description lib.size norm.factors > 1 PhyP18B1.txt PhyP18 Phytophthora phaseoli 2442435 0.7528028 2 PhyP18B2.txt PhyP18 Phytophthora phaseoli 7355562 1.0412687 3 PPLB6dpiB1.txt PPLB6dpi Phytophthora phaseoli 3280812 1.1965512 4 PPLB6dpiB2.txt PPLB6dpi Phytophthora phaseoli 3906611 1.0661656 > > 1) My objective is to study the expression levels of group of genes (for eg., effector genes) in files PhyP18 (one group) and PPLB6dpi (another group). > So, I was wondering if I could multiply each of the tag counts (for each geneID) in all the files with their respective norm.factor values. For example in PhyP18B1 multiplying each geneID tag counts with 0.7528028 > (norm.factors) > PITG_01618 | Pi conserved hypothetical protein (773 nt) 73 * 0.7528028 > = > 54.9546044 ---> No, do NOT multiply by the normalization factor. Did you notice the formula in my last email? You'll see there you divide your counts by the product of the library size and normalization factor and then (optionally?) multiple by a constant to put the normalized values on a scale of interest. > > and similarly each tag counts in file PPLB6dpiB1 with 1.1965512 and so on... > > later can I use the value 54.954 as the normalized tag count for that gene > ID in that library (PhyP18B1)? > > 2) Second question is about the plotSmear(). > red dots (in plot) are the gene ID > orange dots (in plot) are the genes that are unique to one of the > library > black dots - this where I had trouble in understand. Are these set of > prominent genes that are largely attributable for the overall bias in log-fold-changes > ---> Again, I'm taking a stab because you haven't told us what commands you are using, but if you are using something like: [...] plotSmear(d, de.tags=de.tags) ... then the red, orange, black dots are the DE, unique-to-sample and remaining genes, respectively. See also the help docs at ?plotSmear and ?maPlot. Hope that helps. Mark > > Sorry for such a long email. Any suggestions will be highly appreciated. Thanks for taking a stab on this. > > Sridhara > > > > On Mon, Jan 24, 2011 at 10:38 PM, Mark Robinson > <mrobinson at="" wehi.edu.au="">wrote: > >> Hi Sridhara. >> I'm having trouble understanding what you have already done (perhaps show >> some code?) and what exactly are asking, but I'll take a stab. >> When you say "the DGE plot", what are you referring to? A plot produced >> by >> plotSmear()? If so, then yes, the log-ratios (and 'logConc') will take into >> account the normalization factors. But, this still doesn't lead to "normalized counts". For this, you could do something like: >> d <- DGEList(counts=...) >> d <- calcNormFactors(d) >> rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * 1e6 >> Here, 'rpm' means reads per million, having used what you might call 'effective' library sizes i.e. adjusted for relative composition differences. Of course, if you are doing RNA-seq, you may want to include a >> term in there for gene length, which will make it more like an RPKM. It >> sort of depends what your objective is here, because none of this is really >> necessary for differential expression analysis (as you will see from edgeR >> user's guide). Note that the 'rpm' table is not to be used for any statistical analysis of differential expression, since the data has been >> put >> on an arbitrary scale. >> Hope that helps. >> Cheers, >> Mark >> On 2011-01-25, at 11:40 AM, Sridhara Gupta Kunjeti wrote: >> > Hi, >> >> I am a PhD candidate at University of Delaware and area of research >> is >> >> RNA-seq and functional genomics in Phytophthora phaseoli. I was >> impressed by >> >> your paper in Genome Biology on "A scaling normalization method for differential expression analysis of RNA-seq data". I did use the >> edgeR >> guide >> >> and user manual and it makes lot of sense when the DGE in a plot >> graph. >> >> If I am correct the DGE plot (Visualising DGE results) is plotted >> using >> the >> >> normalized tag counts for each gene ID, please correct me if I am >> wrong? >> >> If so, I was wondering if I can generate a list or table or gene ID >> with >> >> the normalized tag counts? I would appreciate if you let me know if >> this >> can >> >> be done. >> >> >> >> Thanks, >> >> Sridhara >> >> >> >> >> >> -- >> >> Sridhara G Kunjeti >> >> PhD Candidate >> >> University of Delaware >> >> Department of Plant and Soil Science >> >> email- sridhara at udel.edu >> >> Ph: 832-566-0011 >> >> >> > >> > [[alternative HTML version deleted]] >> > >> > _______________________________________________ >> > Bioconductor mailing list >> > Bioconductor at r-project.org >> > https://stat.ethz.ch/mailman/listinfo/bioconductor >> > Search the archives: >> http://news.gmane.org/gmane.science.biology.informatics.conductor ------------------------------ >> Mark Robinson, PhD (Melb) >> Epigenetics Laboratory, Garvan >> Bioinformatics Division, WEHI >> e: mrobinson at wehi.edu.au >> e: m.robinson at garvan.org.au >> p: +61 (0)3 9345 2628 >> f: +61 (0)3 9347 0852 >> ------------------------------ >> ______________________________________________________________________ The information in this email is confidential and intend...{{dropped:27}}

ADD COMMENT • link 15.0 years ago Mark Robinson ★ 1.1k

0

Entering edit mode

Hi Mark, Sure, in future I will reply all as you suggested. Yes, I did notice the formula in your previous email. But, could not completely understand when you said to "include term in there for gene length, which will make it more like an RPKM" for RNA-seq data. Many thanks! Sridhara On Wed, Jan 26, 2011 at 4:59 AM, Mark Robinson <mrobinson@wehi.edu.au>wrote: > Hi Sridhara. > > In future when asking questions on the mailing list, please Reply- All > instead of just replying to the responder (me, in this case), so that > others can read the entire thread if they want to. > > I've inserted some comments below. See "--->". > > > Hello Mark, > > Thank you very much for your reply. I have given an example to as my > questions. > > > > I used Bowtie for mapping the RNA-seq data (raw files) to a reference > sequence, and the used the outfile to calculate the abundance of the > tags > > that match to the unique gene ID (of the reference sequence). > > The outfile of PhyP18B1.txt looks something like this: > > PITG_01618 | Pi conserved hypothetical protein (773 nt) 73 > > PITG_22358 | Pi conserved hypothetical protein (1483 nt) 20 > > PITG_05672 | Pi conserved hypothetical protein (1898 nt) 513 > > PITG_10862 | Pi cytochrome c oxidase assembly protein COX11 (735 nt) > 1346 > > > >> setwd("C:/Users/SRIDHARA/Documents/test/bowtie/0_mismatch") > >> targets <- read.delim(file = "targets.txt", stringsAsFactors = FALSE) d > <- readDGE(targets, skip = 5, comment.char = "!") > >> d <- d[rowSums(d$counts) > 5, ] > >> d <- calcNormFactors(d) > >> d$samples > > files group description lib.size norm.factors > > 1 PhyP18B1.txt PhyP18 Phytophthora phaseoli 2442435 0.7528028 2 > PhyP18B2.txt PhyP18 Phytophthora phaseoli 7355562 1.0412687 3 > PPLB6dpiB1.txt PPLB6dpi Phytophthora phaseoli 3280812 1.1965512 4 > PPLB6dpiB2.txt PPLB6dpi Phytophthora phaseoli 3906611 1.0661656 > > > > 1) My objective is to study the expression levels of group of genes (for > eg., effector genes) in files PhyP18 (one group) and PPLB6dpi (another > group). > > So, I was wondering if I could multiply each of the tag counts (for each > geneID) in all the files with their respective norm.factor values. For > example in PhyP18B1 multiplying each geneID tag counts with > 0.7528028 > > (norm.factors) > > PITG_01618 | Pi conserved hypothetical protein (773 nt) 73 * > 0.7528028 > > = > > 54.9546044 > > ---> No, do NOT multiply by the normalization factor. Did you notice the > formula in my last email? You'll see there you divide your counts by the > product of the library size and normalization factor and then > (optionally?) multiple by a constant to put the normalized values on a > scale of interest. > > > > > > and similarly each tag counts in file PPLB6dpiB1 with 1.1965512 and so > on... > > > > later can I use the value 54.954 as the normalized tag count for that > gene > > ID in that library (PhyP18B1)? > > > > 2) Second question is about the plotSmear(). > > red dots (in plot) are the gene ID > > orange dots (in plot) are the genes that are unique to one of the > > library > > black dots - this where I had trouble in understand. Are these set > of > > prominent genes that are largely attributable for the overall bias in > log-fold-changes > > > > ---> Again, I'm taking a stab because you haven't told us what commands > you are using, but if you are using something like: > > [...] > plotSmear(d, de.tags=de.tags) > > ... then the red, orange, black dots are the DE, unique-to-sample and > remaining genes, respectively. See also the help docs at ?plotSmear and > ?maPlot. > > Hope that helps. > > Mark > > > > > > > > Sorry for such a long email. Any suggestions will be highly appreciated. > Thanks for taking a stab on this. > > > > Sridhara > > > > > > > > On Mon, Jan 24, 2011 at 10:38 PM, Mark Robinson > > <mrobinson@wehi.edu.au>wrote: > > > >> Hi Sridhara. > >> I'm having trouble understanding what you have already done (perhaps > show > >> some code?) and what exactly are asking, but I'll take a stab. > >> When you say "the DGE plot", what are you referring to? A plot > produced > >> by > >> plotSmear()? If so, then yes, the log-ratios (and 'logConc') will take > into > >> account the normalization factors. But, this still doesn't lead to > "normalized counts". For this, you could do something like: > >> d <- DGEList(counts=...) > >> d <- calcNormFactors(d) > >> rpm <- t(t(d$counts) / (d$samples$lib.size*d$samples$norm.factors)) * > 1e6 > >> Here, 'rpm' means reads per million, having used what you might call > 'effective' library sizes i.e. adjusted for relative composition > differences. Of course, if you are doing RNA-seq, you may want to > include a > >> term in there for gene length, which will make it more like an RPKM. > It > >> sort of depends what your objective is here, because none of this is > really > >> necessary for differential expression analysis (as you will see from > edgeR > >> user's guide). Note that the 'rpm' table is not to be used for any > statistical analysis of differential expression, since the data has > been > >> put > >> on an arbitrary scale. > >> Hope that helps. > >> Cheers, > >> Mark > >> On 2011-01-25, at 11:40 AM, Sridhara Gupta Kunjeti wrote: > >> > Hi, > >> >> I am a PhD candidate at University of Delaware and area of research > >> is > >> >> RNA-seq and functional genomics in Phytophthora phaseoli. I was > >> impressed by > >> >> your paper in Genome Biology on "A scaling normalization method for > differential expression analysis of RNA-seq data". I did use the > >> edgeR > >> guide > >> >> and user manual and it makes lot of sense when the DGE in a plot > >> graph. > >> >> If I am correct the DGE plot (Visualising DGE results) is plotted > >> using > >> the > >> >> normalized tag counts for each gene ID, please correct me if I am > >> wrong? > >> >> If so, I was wondering if I can generate a list or table or gene ID > >> with > >> >> the normalized tag counts? I would appreciate if you let me know if > >> this > >> can > >> >> be done. > >> >> > >> >> Thanks, > >> >> Sridhara > >> >> > >> >> > >> >> -- > >> >> Sridhara G Kunjeti > >> >> PhD Candidate > >> >> University of Delaware > >> >> Department of Plant and Soil Science > >> >> email- sridhara@udel.edu > >> >> Ph: 832-566-0011 > >> >> > >> > > >> > [[alternative HTML version deleted]] > >> > > >> > _______________________________________________ > >> > Bioconductor mailing list > >> > Bioconductor@r-project.org > >> > https://stat.ethz.ch/mailman/listinfo/bioconductor > >> > Search the archives: > >> http://news.gmane.org/gmane.science.biology.informatics.conductor > ------------------------------ > >> Mark Robinson, PhD (Melb) > >> Epigenetics Laboratory, Garvan > >> Bioinformatics Division, WEHI > >> e: mrobinson@wehi.edu.au > >> e: m.robinson@garvan.org.au > >> p: +61 (0)3 9345 2628 > >> f: +61 (0)3 9347 0852 > >> ------------------------------ > >> ______________________________________________________________________ > The information in this email is confidential and intended solely for > the > >> addressee. > >> You must not disclose, forward, print or use it without the permission > of > >> the sender. > >> ______________________________________________________________________ > > > > > > > > -- > > Sridhara G Kunjeti > > PhD Candidate > > University of Delaware > > Department of Plant and Soil Science > > email- sridhara@udel.edu > > Ph: 832-566-0011 > > > > > > > > ______________________________________________________________________ > The information in this email is confidential and intended solely for the > addressee. > You must not disclose, forward, print or use it without the permission of > the sender. > ______________________________________________________________________ > -- Sridhara G Kunjeti PhD Candidate University of Delaware Department of Plant and Soil Science email- sridhara@udel.edu Ph: 832-566-0011 [[alternative HTML version deleted]]

ADD REPLY • link 15.0 years ago Sridhara Gupta Kunjeti ▴ 320

Login before adding your answer.