DESeq2 - regularised log transformation blind or not?


Mike Stubbington ▴ 30

@mike-stubbington-6418

Last seen 9.6 years ago

Hi,

I have just been reading the updated vignette for DESeq2 in the Bioconductor devel branch (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf) and was interested by the comments in section 2.1.1 about the appropriateness of setting the blind argument when performing the regularised log transformation. Specifically, the comment that

"...blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design..."

Given this, I would really appreciate some further advice about when one should set blind=FALSE.

For example, I am performing gene clustering using RNA-seq data for six different cell types. I would certainly expect a lot of genes to vary between the samples. Is this a case when blind=FALSE might be appropriate?

Thank you for your help,

Mike

Clustering DESeq2

updated 10.2 years ago by Wolfgang Huber ★ 13k • written 10.2 years ago by Mike Stubbington ▴ 30

Wolfgang Huber ★ 13k

@wolfgang-huber-3550

Last seen 10 days ago

EMBL European Molecular Biology Laborat…

Hi Mike,

Thanks. The other Mike (Love) will chime in on the theoretical considerations behind the two choices (blind=FALSE or TRUE). What I'd be interested in is whether the two make any significant difference to the clustering result (e.g. the PCA/MDS plot) for your data.

Best wishes,

Wolfgang

written 10.2 years ago by Wolfgang Huber ★ 13k

hi Mike,

On 24 Feb 2014, at 15:21, Mike Stubbington <mstubb@ebi.ac.uk> wrote:

> Is this a case when blind=FALSE might be appropriate?

Yes, I think it would be appropriate to use blind=FALSE here. I added this note to the vignette after this discussion a month ago:

https://stat.ethz.ch/pipermail/bioconductor/2014-January/057293.html

By default, the VST and rlog transformations use blind=TRUE, so that if people are using these transformations for quality assessment, the experimental design has absolutely no influence on the transformations (i.e. it is an unsupervised method).

When blind=FALSE, the experimental design is used by the VST and rlog transformations only in calculating the gene-wise dispersion estimates, in order to fit a trend line through the dispersions over the mean. Only the trend line is then used by the transformations, not the gene-wise estimates. Therefore, for visualization, clustering, or machine learning applications I tend to recommend blind=FALSE.

The downside of setting blind=TRUE is that large differences due to the experimental design (e.g., cell types in your case, or different water columns in the linked discussion above) will inflate the gene-wise dispersion estimates. When most of the genes contain such large differences across conditions, this will raise the trend line, and then the transformed values will be greatly shrunken toward each other for most genes, which is an undesirable loss of signal.

Mike
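The comparison described above can be tried out with a small sketch. It is only an illustration under stated assumptions: the simulated dataset from makeExampleDESeqDataSet (with a large betaSD to mimic many design-driven differences), the seed, and the object names are not from the thread, and how much the two settings differ will depend on the data.

```r
library(DESeq2)

# Assumption: simulated counts stand in for real data (1000 genes, 12 samples,
# two conditions, large true log fold changes via betaSD = 2).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)

# blind = TRUE (the default): dispersions are estimated with design ~ 1,
# so the experimental design has no influence on the transformation.
rld_blind <- rlog(dds, blind = TRUE)

# blind = FALSE: the design stored in dds (~ condition) is used when
# estimating the gene-wise dispersions that feed the trend fit.
rld_design <- rlog(dds, blind = FALSE)

# Per-gene spread of the transformed values under each setting.
sd_blind  <- apply(assay(rld_blind), 1, sd)
sd_design <- apply(assay(rld_design), 1, sd)

# A summary of the differences shows whether one setting shrinks
# the values toward each other more than the other.
summary(sd_design - sd_blind)
```

On real data, the same two calls can be applied to an existing DESeqDataSet and the resulting matrices compared in a PCA or clustering.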

written 10.2 years ago by Michael Love 41k

Hi Mike S,

Mike L.'s explanation is consistent with the impression that the replicates seem slightly closer to each other (compared to the between-cell-type distances) in the blind=FALSE plot. That is, you might be picking up somewhat more noise in the blind=TRUE case.

It might also be worth exploring the "ntop" argument of plotPCA.

Best wishes,

Wolfgang
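To explore the ntop suggestion, a minimal sketch along the following lines may help. It assumes a recent DESeq2 release in which plotPCA accepts returnData = TRUE; the simulated dataset, the seed, and the chosen ntop values are illustrative assumptions only.

```r
library(DESeq2)

# Assumption: simulated data stands in for the real experiment.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)
rld <- rlog(dds, blind = FALSE)

# plotPCA uses only the ntop rows with the highest variance (default 500).
# returnData = TRUE returns the coordinates instead of drawing the plot.
pca_500 <- plotPCA(rld, intgroup = "condition", ntop = 500, returnData = TRUE)
pca_100 <- plotPCA(rld, intgroup = "condition", ntop = 100, returnData = TRUE)

# Fraction of variance explained by the plotted components in each case.
attr(pca_500, "percentVar")
attr(pca_100, "percentVar")
```

Dropping returnData = TRUE draws the usual plot, so the two ntop settings can also be compared by eye.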

written 10.2 years ago by Wolfgang Huber ★ 13k

Dear Wolfgang,

Thank you for your reply. Attached are the PCA plots generated by plotPCA() from the rlog-transformed data with blind set to TRUE or FALSE. Each cell type has two replicates. I would appreciate your thoughts on them.

If it helps in framing my question, I am more interested in how the genes cluster within a cell type than how the cell types cluster.

Yours,

Mike

written 10.2 years ago by Mike Stubbington ▴ 30

Hi Mike,

> If it helps in framing my question, I am more interested in how the genes cluster within a cell-type than how the cell types cluster.

I am not sure I can reconcile this question with the data (experimental design) you presented. If the aim is to cluster genes within cell type, would you not want more than two replicates per cell type? (Clustering genes based on only two samples seems the equivalent of "underpowered".)

And does it mean you are interested in doing six different clusterings of genes, and comparing them?

I suppose these are not single-cell data? I ask because the default workflow of DESeq2 may have difficulties with such data, due to their greater sampling noise.

Wolfgang

written 10.2 years ago by Wolfgang Huber ★ 13k

Dear Wolfgang and Mike,

These are not single-cell data.

I agree that this is not a powerful approach! Rest assured that no major conclusions will be drawn from the gene clusters; it was more for my own interest whilst making sure that I understand the correct use of the rlog transform. I should probably have said "I am more interested *at the moment* in how the genes cluster..." since I'm also interested in the way that the cell types cluster. The change from FALSE to TRUE seemed to have a greater effect upon the gene clusters than the cell clusters, and it was that which piqued my interest.

---------

Mike,

Thank you very much for your reply. It was enormously helpful.

If I may, I would like to ask one more question: I would like to look at differential gene expression between the cell types using contrasts. Would you recommend

1) Using DESeq2 v 1.2.10 with betaPrior=FALSE as an argument when calling DESeq

or

2) Using the development version where expanded model matrices have been implemented?

Thank you again for your help,

Mike

written 10.2 years ago by Mike Stubbington ▴ 30

Hi Mike,

Sorry, I missed your second question.

On Feb 24, 2014 10:48 AM, "Mike Stubbington" <mstubb@ebi.ac.uk> wrote:

> If I may, I would like to ask one more question: I would like to look at differential gene expression between the cell types using contrasts. Would you recommend
>
> 1) Using DESeq2 v 1.2.10 with betaPrior=FALSE as an argument when calling DESeq
>
> or
>
> 2) Using the development version where expanded model matrices have been implemented?

Either approach is statistically sound. The major difference is that (1) does not involve moderation of the effect size estimates, i.e. the log fold change estimates. The p-values are often similar.

(2) will be available with the release of Bioconductor 2.14 on April 14. Installing the development branch involves installing the development version of R, which can potentially lead to complications depending on your system. So (1) might be easiest for now.

Mike
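Option (1) can be sketched roughly as follows. This is an illustration under assumptions rather than a recipe from the thread: the simulated dataset, the made-up celltype factor with three levels, and the choice of contrast are all hypothetical, and the calls are written against a recent DESeq2 release rather than v1.2.10.

```r
library(DESeq2)

# Assumption: simulated counts with a hypothetical three-level cell type
# factor (two replicates each) stand in for the six real cell types.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 6)
dds$celltype <- factor(rep(c("A", "B", "C"), each = 2))
design(dds) <- ~ celltype

# betaPrior = FALSE: no moderation (shrinkage) of the log fold changes.
dds <- DESeq(dds, betaPrior = FALSE)

# Contrast of two cell types: factor name, numerator level, denominator level.
res_CB <- results(dds, contrast = c("celltype", "C", "B"))
head(res_CB)
```

Other pairs of cell types are obtained by changing the last two elements of the contrast vector.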

written 10.1 years ago by Michael Love 41k