EDASeq within normalization


Catarina Almeida ▴ 30


Dear all,

I'm using EDASeq to normalize my RNA-seq data, but I'm having some trouble understanding how to normalize for GC content and for length. I got the idea that I needed to do it separately, like this:

# within- and between-lane normalization for GC
dataWithinGC2 <- withinLaneNormalization(data, "gc", which="full")
dataNormGC2 <- betweenLaneNormalization(dataWithinGC2, which="full")

# within- and between-lane normalization for length
dataWithinLength <- withinLaneNormalization(data, "length", which="full")
dataNormLength <- betweenLaneNormalization(dataWithinLength, which="full")

Am I thinking right? Or should I within-normalize my data for both GC and length, like this:

dataWithin <- withinLaneNormalization(data, "length", which="full")
dataWithin <- withinLaneNormalization(dataWithin, "gc", which="full")
dataNorm <- betweenLaneNormalization(dataWithin, which="full")

Any help is much appreciated!

C

Normalization EDASeq • 2.5k views

ADD COMMENT • link updated 10.5 years ago by davide risso ▴ 950 • written 10.6 years ago by Catarina Almeida ▴ 30


davide risso ▴ 950


University of Padova

Hi Catarina,

our within-sample normalization is meant to normalize for one factor at a time. In our paper (http://www.biomedcentral.com/1471-2105/12/480/) we showed that, in our data, GC-content effects may be library-specific and can bias differential expression, while we did not see such a library-specific effect for gene length. Hence, we propose normalizing for GC-content and not for length.

If you want to normalize for both GC-content and length, I suggest having a look at the cqn normalization (http://bioconductor.org/packages/release/bioc/html/cqn.html), which, if I remember correctly, accounts for both effects.

I also suggest carefully "looking" at the data, e.g. with the EDASeq functions biasPlot and biasBoxplot, to see whether you need to normalize for GC-content and/or length effects, because this may vary a lot across datasets.

Best regards,
Davide

On Thu, Oct 10, 2013 at 11:05 AM, Catarina Almeida wrote:
> [...]

--
Davide Risso, PhD
Post Doctoral Scholar
Department of Statistics
University of California, Berkeley
344 Li Ka Shing Center, #3370
Berkeley, CA 94720-3370
E-mail: davide.risso at berkeley.edu
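A minimal sketch of the "look at the data first, then normalize for GC only" advice in this answer. The counts, GC values, sample names and bias slopes below are simulated and purely hypothetical; they stand in for a real dataset so that the example is self-contained. Only the EDASeq calls (newSeqExpressionSet, biasPlot, biasBoxplot, withinLaneNormalization, betweenLaneNormalization, normCounts) come from the package, and the choice of which="full" is just the one used in the question, not a recommendation.

```r
library(EDASeq)

set.seed(1)
n_genes <- 2000

# Hypothetical data: per-gene GC content and a library-specific GC effect.
gc <- runif(n_genes, 0.3, 0.7)
base_mean <- 20 + rgamma(n_genes, shape = 2, scale = 100)
gc_slope <- c(-2, -1, 1, 2)  # a different GC bias in each library (assumption)
counts <- sapply(gc_slope, function(b) {
  rpois(n_genes, base_mean * exp(b * (gc - 0.5)))
})
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- paste0("sample", seq_len(ncol(counts)))

feature <- data.frame(gc = gc, row.names = rownames(counts))
pheno <- data.frame(condition = factor(c("A", "A", "B", "B")),
                    row.names = colnames(counts))

data <- newSeqExpressionSet(counts = counts,
                            featureData = feature,
                            phenoData = pheno)

# Exploratory step: is there a GC effect, and does it differ between libraries?
biasPlot(data, "gc", log = TRUE)

# Does the log fold-change between two libraries depend on GC content?
lfc <- log(counts(data)[, 3] + 0.1) - log(counts(data)[, 1] + 0.1)
biasBoxplot(lfc, fData(data)$gc)

# Normalize for one factor only (GC), then between lanes.
dataWithin <- withinLaneNormalization(data, "gc", which = "full")
dataNorm <- betweenLaneNormalization(dataWithin, which = "full")

# Re-draw the diagnostic on the normalized object to see what changed.
biasPlot(dataNorm, "gc", log = TRUE)
```

With real data, `data` would be built from the actual count matrix and a feature table with a `gc` column (and a `length` column if the length effect is to be inspected in the same way).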

ADD COMMENT • link 10.5 years ago davide risso ▴ 950


Got it, thanks for clarifying and for the suggestion.

I do have another question, though. Hopefully I can make it clear.

For within-lane normalization, how do we choose among "loess", "median", "upper" and "full" for the which= argument when normalizing?

I understand how they work, and I understand that "full" seems a much more accurate way to normalize. What I fail to understand is the criterion used to choose between full, upper, median and loess. Does it depend on my experience? Is it a question of which method gives the best normalized plots?

I've read your article and two tutorials I found on normalizing data (and also Bullard's 2010 paper, on the between-lane normalization approaches), but I'm afraid I am still confused about this.

Thanks in advance!
Catarina

2013/10/16 davide risso <risso.davide@gmail.com>
> [...]

ADD REPLY • link 10.5 years ago Catarina Almeida ▴ 30


Hi Catarina,

the answer to the question "what is the best normalization method?" really depends on the specific dataset that you are looking at.

We suggest performing a careful exploratory data analysis; the plots in the EDASeq package are a good starting point, but you may also want to look at how each normalization affects the downstream analysis, e.g. by looking at the distribution of the differential expression p-values.

In our yeast data the full-quantile normalization seemed to perform slightly better, but this might be different in other datasets.

So, I'm afraid that you need to "look" carefully at your data after each normalization and pick the method that leads to the most satisfying results, in terms of absence of bias, a more uniform distribution of the p-values, etc.

Best,
davide

On Mon, Oct 21, 2013 at 7:55 AM, Catarina Almeida wrote:
> [...]
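One possible way to make the "look at your data after each normalization" advice concrete is to run every within-lane option and compare a simple residual-bias summary. The sketch below does this on simulated, hypothetical data; the summary statistic (absolute Spearman correlation between GC content and the log fold-change of two libraries from the same condition) is an illustrative choice of mine, not something prescribed by EDASeq or by the answer above, and it does not replace the plots or the p-value check.

```r
library(EDASeq)

set.seed(2)
n_genes <- 2000

# Hypothetical data with a library-specific GC effect (assumption).
gc <- runif(n_genes, 0.3, 0.7)
base_mean <- 20 + rgamma(n_genes, shape = 2, scale = 100)
gc_slope <- c(-2, -1, 1, 2)
counts <- sapply(gc_slope, function(b) {
  rpois(n_genes, base_mean * exp(b * (gc - 0.5)))
})
rownames(counts) <- paste0("gene", seq_len(n_genes))
colnames(counts) <- paste0("sample", seq_len(ncol(counts)))

data <- newSeqExpressionSet(
  counts = counts,
  featureData = data.frame(gc = gc, row.names = rownames(counts)),
  phenoData = data.frame(condition = factor(c("A", "A", "B", "B")),
                         row.names = colnames(counts))
)

# Illustrative summary: how strongly does the log fold-change between
# samples 1 and 2 (same condition) still track GC content?
gc_bias <- function(m) {
  lfc <- log(m[, 2] + 0.1) - log(m[, 1] + 0.1)
  abs(cor(lfc, gc, method = "spearman"))
}

methods <- c("loess", "median", "upper", "full")

residual_bias <- sapply(methods, function(m) {
  within <- withinLaneNormalization(data, "gc", which = m)
  norm <- betweenLaneNormalization(within, which = "full")
  gc_bias(normCounts(norm))
})

# Compare against the unnormalized counts; smaller means less residual bias.
print(round(c(raw = gc_bias(counts(data)), residual_bias), 3))
```

A number like this is only a first filter. The choice should still be confirmed with biasPlot and biasBoxplot on each normalized object and, as suggested above, by checking how the differential expression p-values are distributed downstream.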

ADD REPLY • link 10.5 years ago davide risso ▴ 950
