normalizing RNAseq with batch/block-level bulk DE
Aaron Mackey ▴ 170
@aaron-mackey-4358
Last seen 9.6 years ago
Is VST-normalization (a la DESeq) considered the right way to deal with large-scale differences in mean "baseline" expression across experimental blocks? Is there a normalization method that can take the design matrix (or at least the batch/block columns) into account? I don't want to remove the batch/block effects, but TMM and friends all assume near-constant expression across the design, which is violated by our (nuisance) block-level differences in composition.

We see this when we compare edgeR TMM-normalized log(cpm) to qRT-PCR data: the TMM normalization has smoothed out the block differences that the Ct values still exhibit (cpm and Ct are still strongly correlated, but there is a Ct "shift" for each block that is not seen in the cpm).

Thanks in advance for any insights/thoughts on the issue,
-Aaron
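For reference, a minimal sketch of the edgeR calculation being compared against the qRT-PCR data; the object names (`counts`, `ct_values`, the gene label) are illustrative, not from the original post:

```r
library(edgeR)

## 'counts' is a gene-by-sample matrix of raw counts (illustrative name).
y <- DGEList(counts = counts)
y <- calcNormFactors(y, method = "TMM")        # TMM scaling factors
logcpm <- cpm(y, log = TRUE, prior.count = 2)  # TMM-normalized log2-CPM

## Ct is inversely related to log2 expression, so for a given gene the
## logCPM values and the qRT-PCR Ct values should correlate negatively;
## a per-block offset in Ct that is absent from logCPM is the "shift"
## described above.
## cor(logcpm["GENE", ], ct_values)   # illustrative comparison
```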
Normalization edgeR
@wolfgang-huber-3550
Last seen 15 days ago
EMBL European Molecular Biology Laborat…
Dear Aaron,

DESeq's variance-stabilising transformation does not do normalisation.

By "deal with large scale differences in mean 'baseline' expression across experimental blocks", do you mean that you are considering a comparison between different biological conditions where you expect a lot of gene expression levels to change? The best approach here is to work with a set of negative control genes: these can either be spike-ins or a category of genes that you know shouldn't change too much. Then call 'estimateSizeFactors' only on the data for these genes, but apply the resulting factors to all of the data (using the assignment function 'sizeFactors<-').

Best wishes
Wolfgang
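A minimal sketch of this control-gene approach with the (original) DESeq package; the count matrix, sample annotation, and control-gene list are illustrative names, not objects from the thread:

```r
library(DESeq)

## 'counts' is the full count matrix, 'conditions' the sample annotation,
## and 'control_genes' a vector of genes presumed stable across blocks
## (all three names are illustrative).
cds <- newCountDataSet(counts, conditions)

## Estimate size factors from the control genes only ...
sf <- estimateSizeFactorsForMatrix(counts[control_genes, ])

## ... then apply them to the whole dataset via the assignment function.
sizeFactors(cds) <- sf
```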
I meant that the experimental design contains both factors of interest and nuisance factors, both of which contribute to variation across samples. While the factors of interest may relate to a small number of gene/isoform changes (with potentially large magnitudes), the nuisance-factor differences are far more abundant, though usually of much smaller magnitude.

I wish we had spike-ins, but we'll consider coming up with a category of "constant" genes. What do you think about identifying such a list via an automated, iterative bootstrap: select 500 genes with minimal coefficient of variation across the experiment (ignoring the design), run estimateSizeFactors with these, then recalculate cpm; re-select the best 500 control genes and keep iterating until the selected genes and/or the size factors stabilize?

Thanks,
-Aaron
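A rough sketch of how that iterative selection might look. Assumptions not stated in the thread: 'counts' has rownames, all-zero genes have been filtered out, the per-gene CV is computed on size-factor-scaled counts rather than cpm, and the function name and cutoff of 500 genes are illustrative:

```r
library(DESeq)  # for estimateSizeFactorsForMatrix()

## Iteratively refine a set of "constant" control genes: take the genes
## with the lowest coefficient of variation, derive size factors from
## them, renormalize, and reselect until the gene set stabilizes.
## Assumes 'counts' has rownames and all-zero genes have been removed.
select_controls <- function(counts, n = 500, max_iter = 20) {
  sf <- rep(1, ncol(counts))
  controls <- character(0)
  for (i in seq_len(max_iter)) {
    norm <- sweep(counts, 2, sf, "/")            # size-factor-scaled counts
    cv <- apply(norm, 1, sd) / rowMeans(norm)    # per-gene coefficient of variation
    new_controls <- names(sort(cv))[seq_len(n)]  # n lowest-CV genes
    if (identical(sort(new_controls), sort(controls))) break
    controls <- new_controls
    sf <- estimateSizeFactorsForMatrix(counts[controls, ])
  }
  list(controls = controls, size.factors = sf)
}
```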
Hi Aaron,

I would try this, although whether it works and makes sense you will only know afterwards (or never :).

Best wishes
Wolfgang