RNAseq machine learning classifier


Michael Breen ▴ 370

@michael-breen-5999

Last seen 9.6 years ago

Hi all,

We have a large RNAseq data set. Apart from identifying differentially expressed genes, we are also interested in classification, specifically in developing a prognostic and diagnostic classifier.

Normally, our approach would use a machine learning classifier, such as an SVM, and typically proceed with nested cross-validation.

The vast majority of these programs and packages were designed for microarray data.

Are there any biases one should consider before applying such published approaches to RNAseq data?

Do the distributions of the different data types matter at all?

If so, is there an SVM application that takes RNAseq raw counts into consideration?

Thanks,
Michael

RNASeq Microarray • 2.3k views

ADD COMMENT • link updated 10.8 years ago by jhua@tgen.org ▴ 60 • written 10.8 years ago by Michael Breen ▴ 370


Steve Lianoglou ★ 13k

@steve-lianoglou-2771

Last seen 14 months ago

United States

Hi,

On Mon, Jul 15, 2013 at 2:42 PM, Michael Breen <breenbioinformatics at gmail.com> wrote:

> If so, does an application exist using an SVM taking into consideration
> RNAseq raw counts?

One approach would be to take the output from one of the variance stabilizing transformations in DESeq2 as the input to your machine learning approach.

See:

    R> library(DESeq2)
    R> ?varianceStabilizingTransformation

and Section 7 of the DESeq2 vignette (count data transformations):

http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf

HTH,
-steve

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech
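As a rough sketch of what this suggestion could look like in practice, the snippet below runs the VST on simulated counts and hands the transformed matrix to an SVM. The simulated data from `makeExampleDESeqDataSet`, the use of the e1071 package for the SVM, the linear kernel, and the zero-count filter are all assumptions made for illustration; none of them comes from the post above.

```r
library(DESeq2)   # Bioconductor
library(e1071)    # CRAN; provides svm() (assumed choice of SVM implementation)

set.seed(1)

# Simulated stand-in for a real experiment: 1000 genes, 12 samples, two conditions
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Drop genes with no reads at all (illustrative filter, not required by the VST)
dds <- dds[rowSums(counts(dds)) > 0, ]

# blind = TRUE ignores the design, so class labels are not used by the transformation
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Classifiers usually expect samples in rows and features in columns
x <- t(assay(vsd))
y <- colData(dds)$condition

fit  <- svm(x, y, kernel = "linear")
pred <- predict(fit, x)

# Resubstitution table only; this says nothing about generalisation
print(table(truth = y, predicted = pred))
```

With real data, the transformation and the classifier would both be fitted on training samples only; the question of how to treat later samples is taken up further down the thread.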

ADD COMMENT • link 10.8 years ago Steve Lianoglou ★ 13k


Steve!

I was thinking along these same lines: estimating dispersions, then using a variance stabilizing transformation. However, I am not sure how proper this approach is.

Can anyone confirm the validity of this approach?

Michael

On Mon, Jul 15, 2013 at 3:58 PM, Steve Lianoglou <lianoglou.steve@gene.com> wrote:

> One approach would be to take the output from one of the variance
> stabilizing transformations in DESeq2 as the input to your machine
> learning approach.

ADD REPLY • link 10.8 years ago Michael Breen ▴ 370


jhua@tgen.org ▴ 60

@jhuatgenorg-5011

Last seen 9.6 years ago

This sounds like an OK approach to me.

One thing you might take into consideration is that classifier design usually involves independent validation data. If you are going to validate your classifier with the same type of RNAseq data, in general you need to normalize/variance-stabilize all of the samples together as one cohort. But sometimes the validation data are not collected until I have reported really positive results on the training data alone, which ends up requiring another round of full normalization, training, and testing...

Jianping Hua, Ph. D.
Research Assistant Professor
Computational Biology Division
Translational Genomics Research Institute (TGen)

> Steve!
>
> I was thinking along these same lines: estimating dispersions then using a
> variance stabilizing transformation. However, I am not sure how proper this
> approach is?
>
> Can anyone confirm the validity of this approach?
>
> Michael

ADD COMMENT • link 10.8 years ago jhua@tgen.org ▴ 60


Hi Jianping,

good point about the parameter-dependence (i.e. dataset-dependence) of the variance stabilising transformations (VST) in DESeq2. However, once the typical coverage and noise characteristics of the RNA-Seq assay used are established, one can 'freeze' the VST parameters and then just use them for future samples.

As always, QC of new data for compliance with the expectations from the learned ('frozen') characteristics will be needed.

Best wishes
Wolfgang

On 17 Jul 2013, at 20:16, <jhua at tgen.org> wrote:

> If you are going to validate your classifier with the same type of RNAseq data, in general you need to normalize/variance stabilize all of them in one cohort.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

ADD REPLY • link 10.8 years ago Wolfgang Huber ★ 13k


Hi, Wolfgang:

Thanks for pointing this out. This sounds really convenient. I'll definitely look into how to freeze the parameters.

Also, how about normalization? Is there a similar procedure by which I can freeze the profile for normalizing future samples? Right now I do it with my own simple routine, but it would be wonderful if this could be done internally. Thanks.

Jianping

On Jul 17, 2013, at 11:49 AM, Wolfgang Huber wrote:

> However, once the typical coverage and noise characteristics of the RNA-Seq assay used are established, one can 'freeze' the VST parameters and then just use that for future samples.

ADD REPLY • link 10.8 years ago jhua@tgen.org ▴ 60


hi Jianping,

I thought the discussion was about normalization using the VST; I'm not sure what is meant by normalization otherwise.

By the way, the VST parameters can be re-assigned in DESeq2 like so:

    dispersionFunction(ddsNew) <- dispersionFunction(ddsOld)

The call to the transformation should then specify blind=FALSE, so as to bypass the internal re-estimation of dispersions. At the moment, you also need to have some dispersions estimated for ddsNew (or to set the dispersions to any numeric values) to avoid re-estimation of the dispersions internally, although I will fix this so that the VST only checks for an existing dispersion function.

Mike

On Wed, Jul 17, 2013 at 9:03 PM, <jhua at tgen.org> wrote:

> Also how about normalization? Is there a similar procedure that I can freeze the profile for future sample normalization?
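Put together, the frozen-VST recipe described in this reply might look like the sketch below. It assumes a DESeq2 version in which the fix mentioned above is in place, so that the VST only needs a dispersion function on the new object. The two simulated data sets from `makeExampleDESeqDataSet` and the intercept-only design are stand-ins chosen for illustration; `ddsOld` and `ddsNew` follow the names used in the post.

```r
library(DESeq2)

set.seed(1)

# Stand-ins for a training cohort and for samples that arrive later
ddsOld <- makeExampleDESeqDataSet(n = 1000, m = 12)
ddsNew <- makeExampleDESeqDataSet(n = 1000, m = 4)

# Learn the dispersion-mean trend on the training cohort only.
# An intercept-only design keeps class labels out of the fit (assumed choice).
design(ddsOld) <- ~ 1
ddsOld <- estimateSizeFactors(ddsOld)
ddsOld <- estimateDispersions(ddsOld)

# Freeze: copy the fitted trend to the new samples instead of re-estimating it
ddsNew <- estimateSizeFactors(ddsNew)
dispersionFunction(ddsNew) <- dispersionFunction(ddsOld)

# blind = FALSE makes the VST use the assigned dispersion function
vsdNew <- varianceStabilizingTransformation(ddsNew, blind = FALSE)

print(dim(assay(vsdNew)))
```

Note that the size factors of `ddsNew` are still estimated from the new samples alone in this sketch, which is the point raised in the next message.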

ADD REPLY • link 10.8 years ago Michael Love 41k


Hi, Mike:

Thanks for the explanation. It is really helpful.

My concern is about the size factor estimation. I'm not familiar with the details of the VST, so my understanding might be wrong.

My understanding is that the VST is applied to the count-normalized data, i.e., size factors must be estimated from the data. I believe that they are also data-dependent. So I'm wondering: when one uses the frozen parameters for future data, how are the size factors computed? Does DESeq2 use the loggeomeans of the future data to estimate the size factors, or does it keep them in the frozen parameters for reuse? Or does the choice matter to the VST?

And in the case I encountered, somehow the VST has little effect on the data (the model might fit our data poorly). So we decided to just stick with the count-normalized data from counts(dds, normalized = TRUE). Hence which loggeomeans to use does matter. And for future data, which is usually a small testing set, we plan to use the loggeomeans of our large training data to calculate the size factors.

Jianping

On Jul 17, 2013, at 2:54 PM, Michael Love wrote:

> by the way, the VST parameters can be re-assigned in DESeq2 like so:
>
> dispersionFunction(ddsNew) <- dispersionFunction(ddsOld)

ADD REPLY • link 10.8 years ago jhua@tgen.org ▴ 60


hi Jianping,

On Thu, Jul 18, 2013 at 1:15 AM, <jhua at tgen.org> wrote:

> My understanding is that the VST is applied on the count normalized data, i.e., size factors must be estimated from the data. I believe that they are also data-dependent. So I'm wondering when one use the frozen parameters for the future data, how are the size factors being computed? Does DESeq2 use the loggeomeans of the future data to estimate the size factor, or does it have them in the frozen parameters for reuse? Or does the choice matter to VST?

Now I see your point. You are correct that the size factors would be computed using the log geometric means of the new data.

If you want to generate size factors for a new dataset that are commensurate with the old dataset, you could do:

    allCounts <- cbind(counts(ddsNew), counts(ddsOld))
    allSF <- estimateSizeFactorsForMatrix(allCounts)
    sizeFactors(ddsNew) <- allSF[1:ncol(ddsNew)]

> And in the case I encountered, somehow VST has little effects to the data (there might be a fitness problem to the model for our data). So we decided that we just stick to count normalized data by counts(dds, normalized = TRUE). Hence which loggeomeans to use does matter. And for the future data, which is usually a small testing set, we plan to use the loggeomeans of our large training data to calculate the size factors.

Are you then taking the log plus a pseudocount of the size-factor-normalized data? It might be good to compare the data with and without the VST using meanSdPlot, as we do in the section "Effects of transformations on the variance" in the vignette. If you have elevated variance at low counts without using the VST, this could be detrimental to the performance of a classifier.

Mike
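The concern about elevated variance at low counts can be illustrated without any plotting. The sketch below is a dependency-free numeric stand-in for the kind of comparison meanSdPlot shows graphically: it simulates negative binomial counts, applies log2 with a pseudocount of 1, and summarises the per-gene standard deviation in bins of increasing mean. The simulation settings (dispersion 0.1, equal sequencing depth, five bins) are assumptions picked for illustration.

```r
set.seed(1)

n_genes   <- 2000
n_samples <- 12

# Hypothetical per-gene means spanning roughly 1 to 4096 reads
mu <- 2^runif(n_genes, min = 0, max = 12)

# Negative binomial counts with an assumed common dispersion of 0.1 (size = 1 / 0.1).
# The matrix is filled column by column, so row g has mean mu[g] in every sample.
counts <- matrix(rnbinom(n_genes * n_samples, mu = rep(mu, n_samples), size = 10),
                 nrow = n_genes)

# Log plus pseudocount; no size factors needed because all samples have equal depth here
log_expr <- log2(counts + 1)

gene_mean <- rowMeans(log_expr)
gene_sd   <- apply(log_expr, 1, sd)

# Bin genes by the rank of their mean and take the median SD per bin
bins      <- cut(rank(gene_mean), breaks = 5, labels = FALSE)
sd_by_bin <- tapply(gene_sd, bins, median)

print(round(sd_by_bin, 2))
```

In this simulation the lowest-expression bin shows a larger median standard deviation than the highest one, which is the pattern a variance stabilizing transformation is meant to flatten.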

ADD REPLY • link 10.8 years ago Michael Love 41k


Hi,

>> My understanding is that the VST is applied on the count normalized
>> data, i.e., size factors must be estimated from the data. I
>> believe that they are also data-dependent. So I'm wondering when
>> one use the frozen parameters for the future data, how are the size
>> factors being computed? Does DESeq2 use the loggeomeans of the
>> future data to estimate the size factor, or does it have them in
>> the frozen parameters for reuse? Or does the choice matter to VST?

Size factors are estimated as follows. First, we construct a "virtual reference" sample: for each gene, the geometric mean of its counts across all samples:

    geomeans <- exp( rowMeans( log( counts ) ) )

Then, for each sample j, the size factor is the median of the ratios of this sample's counts to the reference values:

    sf[j] <- median( counts[,j] / geomeans )

If you want to apply frozen parameters from an old data set to a new data set, then, to be on the safe side, you might also want to use the log geometric means of the old data set to calculate the size factors for the new one. The following code (untested) should do the trick:

    loggeomeansOld <- rowMeans( log( counts(ddsOld) ) )
    sizeFactors( ddsNew ) <- apply( counts(ddsNew), 2, function(cnts)
       exp( median( (log(cnts) - loggeomeansOld)[ is.finite(loggeomeansOld) & (cnts>0) ] ) ) )

Even though the difference might not matter in practice, this may in fact be cleaner than recalculating the size factors in the usual way.

Simon
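For anyone who wants to see the arithmetic, here is a minimal base-R sketch of the frozen-reference idea on a toy matrix. It does not use DESeq2: `size_factors_frozen`, `old_counts` and `new_counts` are made-up names for illustration, and the toy numbers are invented.

```r
# Median-of-ratios size factors against a fixed ("frozen") reference.
# Hypothetical helper, not part of DESeq2.
size_factors_frozen <- function(new_counts, log_geomeans_ref) {
  stopifnot(nrow(new_counts) == length(log_geomeans_ref))
  apply(new_counts, 2, function(cnts) {
    # Skip genes whose reference is -Inf (a zero in the old data)
    # and genes with a zero count in this sample.
    keep <- is.finite(log_geomeans_ref) & cnts > 0
    exp(median(log(cnts[keep]) - log_geomeans_ref[keep]))
  })
}

# Toy "training" data: 4 genes (rows) x 3 samples (columns).
old_counts <- matrix(c( 10,  20,  40,
                       100, 200, 400,
                         5,  10,  20,
                         0,   3,   6),
                     nrow = 4, byrow = TRUE)

# Per-gene log geometric means; -Inf for the gene containing a zero.
log_geomeans_old <- rowMeans(log(old_counts))

# Size factors of the old samples against their own reference.
sf_old <- size_factors_frozen(old_counts, log_geomeans_old)
print(sf_old)  # approximately 0.5, 1, 2

# A new sample with twice the depth of the first old sample,
# scaled against the OLD reference rather than its own.
new_counts <- cbind(s_new = 2 * old_counts[, 1])
sf_new <- size_factors_frozen(new_counts, log_geomeans_old)
print(sf_new)  # approximately 1
```

With a single new sample, a reference computed from the new data alone would just be that sample itself, so reusing the training reference is what keeps the new size factor on the same scale as the old ones.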

ADD REPLY • link 10.8 years ago Simon Anders ★ 3.7k


Hi, Mike:

My approach is the same as Simon Anders's.

For our data, the variance is low in the low-count region. It increases with the counts, yet somehow starts to dip suddenly in the very-high-count region.

Do you have any experience with this type of data? Is this normal?

Jianping

On Jul 18, 2013, at 2:44 AM, Michael Love wrote:

> hi Jianping,
>
> On Thu, Jul 18, 2013 at 1:15 AM, <jhua at tgen.org> wrote:
>> Hi, Mike:
>>
>> Thanks for the explanation. They are really helpful.
>>
>> My concern is about the size factors estimation. I'm not familiar with the details of VST so my understanding might be wrong.
>>
>> My understanding is that the VST is applied on the count normalized data, i.e., size factors must be estimated from the data. I believe that they are also data-dependent. So I'm wondering when one use the frozen parameters for the future data, how are the size factors being computed? Does DESeq2 use the loggeomeans of the future data to estimate the size factor, or does it have them in the frozen parameters for reuse? Or does the choice matter to VST?
>
> Now I see your point. You are correct that the size factors would be
> computed using the log geometric means of the new data.
>
> If you want to generate size factors for a new dataset that are
> commensurate with the old dataset, you could do:
>
> allCounts <- cbind(counts(ddsNew), counts(ddsOld))
> allSF <- estimateSizeFactorsForMatrix(allCounts)
> sizeFactors(ddsNew) <- allSF[1:ncol(ddsNew)]
>
>> And in the case I encountered, somehow VST has little effects to the data (there might be a fitness problem to the model for our data). So we decided that we just stick to count normalized data by counts(ads, normalized = TRUE). Hence which loggeomeans to use does matter. And for the future data, which is usually a small testing set, we plan to use the loggeomeans of our large training data to calculate the size factors.
>
> Are you then taking the log plus a pseudocount of the
> size-factor-normalized data? It might be good to examine with and
> without VST using meanSdPlot as we have in the section "Effects of
> transformations on the variance" in the vignette. If you have
> elevated variance at low counts without using the VST, this could be
> detrimental to the performance of a classifier.
>
> Mike

ADD REPLY • link 10.8 years ago jhua@tgen.org ▴ 60


On Thu, Jul 18, 2013 at 4:45 PM, <jhua at tgen.org> wrote:

> Hi, Mike:
>
> My approach is the same as Simon Anders's.
>
> For our data, the variance is low in low counts region. The variance increases with counts, yet somehow start to dip suddenly at very high counts region.
>
> Do you have any experience with such type of data? Is this normal?

This sounds somewhat like a description of the meanSdPlot for the log2(n + 1) transformation in the vignette (or are you referring to the VST data?). We do see an increase and then a decrease in the variance/sd of log-transformed data, though the location of the turning points will depend a lot on the experiment and on the range of the mean counts. That is a good argument for using the VST.

Mike
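To get a feel for this kind of mean/sd trend without plotting, a rough base-R sketch on simulated counts is below. It bins genes by the rank of their mean and reports the median sd per bin, which is only loosely what meanSdPlot (from the vsn package) displays. The negative binomial means and the dispersion trend are invented assumptions, so the shape of the trend on real data may well differ.

```r
set.seed(1)

n_genes   <- 2000
n_samples <- 12

# Hypothetical per-gene mean counts spanning a wide dynamic range.
mu <- 2^runif(n_genes, min = -2, max = 14)

# Hypothetical dispersion trend: larger dispersion at low means.
disp <- 0.05 + 2 / mu

# Simulate a genes x samples matrix of negative binomial counts.
# The matrix is filled column by column, so rep() gives every
# sample the same per-gene mean and dispersion.
counts <- matrix(
  rnbinom(n_genes * n_samples,
          mu   = rep(mu, n_samples),
          size = rep(1 / disp, n_samples)),
  nrow = n_genes
)

# Per-gene mean and sd after the log2(n + 1) transformation.
log_counts <- log2(counts + 1)
gene_mean  <- rowMeans(log_counts)
gene_sd    <- apply(log_counts, 1, sd)

# Ten bins of equal size along the rank of the mean, then the
# median sd within each bin.
bins  <- cut(rank(gene_mean, ties.method = "first"),
             breaks = 10, labels = FALSE)
trend <- tapply(gene_sd, bins, median)
print(round(trend, 3))
```

Running the same summary on the size-factor-normalized log counts and on the VST output of a real data set would give two vectors that can be compared bin by bin.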

ADD REPLY • link 10.8 years ago Michael Love 41k


Hi, Mike:

Actually, both log2(n+1) and the VST give roughly the same meanSdPlot, although the VST one is slightly flatter.

Jianping

ADD REPLY • link 10.8 years ago jhua@tgen.org ▴ 60
