Opinions on array design, normalization, and linear modeling with LIMMA


Jianping Jin ▴ 890

@jianping-jin-1212

Last seen 11.4 years ago

Yong,

What is your reference sample(s) for this test run? It looks like the experiment and reference samples are quite different.

JJ-

--On Wednesday, October 31, 2007 4:40 PM -0500 Yong Yin <yyin at watson.wustl.edu> wrote:

> Dear list,
>
> I am new to BioConductor, so please forgive me if my questions are naive to you.
>
> We designed an Agilent 4x44k array, with the same 44K probes printed 4 times in the 4 blocks. These 44K probes are designed based on a low-coverage genome sequencing project for a parasitic nematode. Our purpose is to investigate gene expression during early embryogenesis of the nematode.
>
> We have received results from a test run to evaluate the array quality. Samples applied on the chip were from two time points during the nematode embryogenesis. As an experiment, I have been following the LIMMA manual step-by-step, treating the results as a simple two-sample comparison with both technical and biological replication. I have uploaded 3 images in the following location and would love to hear what you folks think:
>
> ftp://genome.wustl.edu/private/272205387781472/yong_data.071031/
>
> The general quality of the array is very good; I can't find any indication of a quality problem. The file "MA_RGLW1.pdf" is an MA plot of raw RG data for one of the 4 blocks. After background correction with "normexp" and within-array normalization with global loess, its MA plot is shown in "MA_MALWC1.pdf".
>
> Given that we are studying early embryogenesis, we should expect that a lot of genes are differentially expressed at these two time points. In the MA plots, I think we indeed see lots of DE. However, according to what I read, the underlying assumption for such normalization is that the majority of the genes under investigation should not be differentially expressed. I also read from other people's posts that I should keep the normalization as simple as possible and the "good" data will always be good.
>
> From my MA plots, do you think my normalization is reasonable with this data? If not, are there suggestions for what to do? A different normalization method? Or even changing the design of the array to include a set of spike-in control probes to use for normalization?
>
> The two time points in this test run are actually the beginning and the ending points of the developmental stages that we are planning to investigate. We are considering using a pooled sample as a common reference. We hope a pooled reference like this will decrease the degree of differential expression between any two samples of our study. Does this sound like a good idea?
>
> After normalization with loess, I went ahead to the step of linear modeling with eBayes and got the following QQ plot: "QQPlot_fitLWC2eBayes.pdf".
>
> Does the modeling look reasonable, according to your experience?
>
> Any opinions and advice are greatly appreciated.
>
> Best,
>
> Yong Yin, Ph.D.
> Senior Scientist
> Genome Sequencing Center
> Washington University School of Medicine, Campus box 8501
> 4444 Forest Park
> Saint Louis, MO 63108
> Tel: (314) 286-1415
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

##################################
Jianping Jin Ph.D.
Bioinformatics scientist
Center for Bioinformatics
Room 3133 Bioinformatics building
CB# 7104
University of Chapel Hill
Chapel Hill, NC 27599
Phone: (919)843-6105
FAX: (919)843-3103
E-Mail: jjin at email.unc.edu
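The workflow described in the quoted message (normexp background correction, global loess, then a linear model with eBayes and a QQ plot of the moderated t-statistics) can be sketched on simulated data. This is only an illustration under stated assumptions: the probe count, intensities and fold changes are invented, the `RGList` is built by hand instead of being read from Agilent files, and the one-column design treats the four blocks as plain replicates with no dye swaps. The object names `RGLW`, `RGLWC` and `MALWC` follow the commands quoted further down the thread.

```r
library(limma)

set.seed(1)
n_probes <- 2000  # hypothetical; the real array has about 44K probes per block
n_arrays <- 4     # the four blocks of one 4x44k slide

# Invented log2 fold changes: 10% of probes differ between the two time points.
true_lfc <- c(rnorm(200, sd = 3), rep(0, n_probes - 200))
signal <- 2^runif(n_probes, 6, 14)

# Simulate one channel for all arrays: signal, a fold-change share, plus background.
sim_channel <- function(direction) {
  sapply(seq_len(n_arrays), function(j) {
    signal * 2^(direction * true_lfc / 2 + rnorm(n_probes, sd = 0.2)) +
      rnorm(n_probes, mean = 100, sd = 10)
  })
}
RGLW <- new("RGList", list(
  R  = sim_channel(+1),
  G  = sim_channel(-1),
  Rb = matrix(100, n_probes, n_arrays),
  Gb = matrix(100, n_probes, n_arrays)
))

# The two commands from the thread.
RGLWC <- backgroundCorrect(RGLW, method = "normexp", offset = 50)
MALWC <- normalizeWithinArrays(RGLWC, method = "loess")

# Assumed design: every array measures the same late-vs-early log-ratio.
design <- matrix(1, n_arrays, 1, dimnames = list(NULL, "LateVsEarly"))
fit <- eBayes(lmFit(MALWC, design))

# QQ plot of moderated t-statistics against their t-distribution.
qqt(fit$t, df = fit$df.prior + fit$df.residual, pch = 16, cex = 0.2)
abline(0, 1)
print(topTable(fit, coef = 1, number = 5))
```

With real data the `RGList` would come from `read.maimages()`, and a design with dye swaps or a common reference would need a different design matrix.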

Sequencing Coverage Normalization limma • 1.5k views

ADD COMMENT • link updated 18.2 years ago by Yong Yin ▴ 60 • written 18.2 years ago by Jianping Jin ▴ 890


Yong Yin ▴ 60

@yong-yin-2457

Last seen 11.4 years ago

An embedded and charset-unspecified text was scrubbed...
Name: not available
Url: https://stat.ethz.ch/pipermail/bioconductor/attachments/20071101/4335ad2a/attachment.pl

ADD COMMENT • link 18.2 years ago Yong Yin ▴ 60


Hi Yong,

I have never seen an MA plot with such widely spread spots. It may be caused by real biology or by technical artifacts. My suggestion is to do more data quality assessment, such as "plotDensities". Dye-swap labeling or using a common reference RNA may help to confirm the difference or reveal problems.

JJ-

--On Thursday, November 01, 2007 11:13 AM -0500 Yong Yin <yyin at watson.wustl.edu> wrote:

> Dear list,
>
> I think I need to simplify my question.
>
> I have two samples, each from a time point of embryogenesis. They are applied on a two-color Agilent array to compare against each other.
>
> The raw data has an MA plot like this:
>
> ftp://genome.wustl.edu/private/272205387781472/yong_data.071031/MA_RGLW1.pdf
>
> After "normexp" and global loess, the MA plot does change its shape, as seen here:
>
> ftp://genome.wustl.edu/private/272205387781472/yong_data.071031/MA_MALWC1.pdf
>
> My 1st question:
>
> Does my data have too much differential expression, according to your experience?
>
> Apparently, Jianping thinks so.
>
> Then my 2nd question:
>
> Is it still ok to use global loess for normalization?
>
> Thanks so much, I need your opinions.
>
> I am running the latest R and all packages. Commands I used are:
>
> > RGLWC <- backgroundCorrect(RGLW, method="normexp", offset=50)
> > MALWC <- normalizeWithinArrays(RGLWC, method="loess")
>
> Best,
>
> Yong
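The dye-swap suggestion can be illustrated with a toy model in base R. It is a sketch, not an analysis of the poster's data: it assumes a probe-specific dye effect that is additive on the log-ratio scale and identical in both labeling orientations, and every number in it (`dye_bias`, the noise levels, the share of DE probes) is invented. Under that assumption, half the difference of a swapped pair estimates the biological log-ratio, and half the sum estimates the dye effect.

```r
set.seed(2)
n <- 5000

# Invented truth: 20% of probes change between the two time points.
true_lfc <- c(rnorm(1000, sd = 3), rep(0, n - 1000))

# Hypothetical probe-specific dye effect, the same in both orientations.
dye_bias <- rnorm(n, mean = 0.5, sd = 0.5)

# Log-ratios (Cy5 over Cy3) for the two orientations of a dye-swap pair.
M_forward <- true_lfc + dye_bias + rnorm(n, sd = 0.2)   # late sample in Cy5
M_swapped <- -true_lfc + dye_bias + rnorm(n, sd = 0.2)  # late sample in Cy3

lfc_est  <- (M_forward - M_swapped) / 2  # dye effect cancels
bias_est <- (M_forward + M_swapped) / 2  # biology cancels

cor_single <- cor(M_forward, true_lfc)
cor_swap   <- cor(lfc_est, true_lfc)
cat("correlation with truth, single array:", round(cor_single, 3), "\n")
cat("correlation with truth, dye-swap pair:", round(cor_swap, 3), "\n")
cat("mean estimated dye effect:", round(mean(bias_est), 3), "\n")
```

If large M values survive the swap with their sign reversed, that points to biology; if they keep the same sign in both orientations, that points to a dye or probe artifact.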

ADD REPLY • link 18.2 years ago Jianping Jin ▴ 890


I agree here, the scale on the y-axis is quite dramatic. Note that we are not necessarily saying that too many genes are DE, but that some of them have dramatic fold changes.

Most of the normalization techniques are derived under the assumption that not too many genes are DE. Facing your problem of many DE genes, some people would say "clearly the assumptions are not correct". I would say that you should use the method that gives you the best inference. Sometimes people have observed that applying the "standard" normalization techniques actually improves their calls, even on datasets with many DE genes. You will probably need some control spots on the array to really quantify this.

I think most of us need more time with the data in order to really give you any recommendations. You should seek out a local expert.

Kasper
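One way to picture what control spots buy you is a small base R simulation. Everything here is hypothetical: the 500 control probes, the linear `dye_bias`, and the choice of 60% of the remaining probes going up. It uses `lowess()` from base R as a stand-in for a loess normalization curve, fitted once through all spots and once through the controls only. limma's `normalizeWithinArrays` has a `controlspots` argument for this kind of use; the sketch does not call it.

```r
set.seed(3)
n <- 20000
A <- runif(n, 6, 14)

# Hypothetical intensity-dependent dye bias that normalization should remove.
dye_bias <- function(a) 1.5 - 0.15 * a

# Hypothetical spike-in controls: known to be equal in both channels.
is_control <- seq_len(n) <= 500
# Deliberately hostile case: 60% of the other probes are up-regulated.
is_de <- !is_control & runif(n) < 0.6
effect <- ifelse(is_de, runif(n, 1, 3), 0)
M <- dye_bias(A) + effect + rnorm(n, sd = 0.3)

# Curve through all spots versus curve through controls only.
fit_all <- lowess(A, M, f = 0.3)
fit_ctl <- lowess(A[is_control], M[is_control], f = 0.3)

# Evaluate each curve at every probe's A value by linear interpolation.
curve_all <- approx(fit_all$x, fit_all$y, xout = A, rule = 2, ties = mean)$y
curve_ctl <- approx(fit_ctl$x, fit_ctl$y, xout = A, rule = 2, ties = mean)$y

# After a good normalization, unchanged probes should sit near M = 0.
unchanged <- !is_de & !is_control
med_all <- median((M - curve_all)[unchanged])
med_ctl <- median((M - curve_ctl)[unchanged])
cat("median M of unchanged probes, all-spot curve:", round(med_all, 2), "\n")
cat("median M of unchanged probes, control curve: ", round(med_ctl, 2), "\n")
```

The controls only help if they span the intensity range and really are unchanged, which is an assumption about the array design, not something the code can check.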

ADD REPLY • link 18.2 years ago Kasper Daniel Hansen ★ 6.5k


Quoting Kasper Daniel Hansen <khansen at stat.berkeley.edu>:

> I agree here, the scale on the y-axis is quite dramatic. Note that we are not necessarily saying that too many genes are DE, but that some of them have dramatic fold changes.

It really depends on the biology of the experiment, and since embryogenesis involves quite dramatic changes, I don't think the range of the M values is something to worry about... at least not without checking the biology first. The original poster seemed to expect a lot of variation between the time points compared. I have seen similar MA plots when comparing, for instance, two cell lines that are supposedly derived from the same tissue... (a totally different problem, I know...)

> Most of the normalization techniques are derived under the assumption that not too many genes are DE. Facing your problem of many DE genes, some people would say "clearly the assumptions are not correct". I would say that you should use the method that gives you the best inference. Sometimes people have observed that applying the "standard" normalization techniques actually improves their calls, even on datasets with many DE genes.

I don't think that's entirely correct. I don't think the assumption is that not too many genes are DE, but that *most* genes are not DE, or that they're evenly spread between up- and downregulation across the range of raw intensities measured. It's a fine distinction.

Imagine an MA plot (raw data) where everything lies around the M=0 line, very tightly, with just a few genes straying up to higher |M| values. Then imagine another MA plot where you have the same situation, plus another few thousand spots, evenly distributed up or down, with values as extreme as you like... Normalisation methods like loess simply try to determine what is "not changed": fit a regression curve and it will neatly follow along the M=0 line... It will do so in both of the cases above.

The question really is not simply whether there are many DE genes... if the % of DE genes is low, of course that makes things easier, as their contribution to the regression curve using all of the spots will be small. But you can have many DE genes and still be able to use loess perfectly happily. You really have to look at the data, and have an idea of the biology of the experiment, to know what you are expecting (whether the bulk of the data is really not DE). This is why it's so hard to recommend any way to normalise data just by looking at a plot...

I'd say that in most experiments a loess regression curve is good enough as a normalisation aid, and that's why people often use it with good results even when the assumptions are not all perfectly met, especially that of not having many DE genes.

The only sure way to normalise any set of data is to have a good set of control spots whose behaviour is known a priori. But one can often do without it and get reasonable results. Most of us do :)

> I think most of us need more time with the data in order to really give you any recommendations. You should seek out a local expert.

Good suggestion, and don't forget to explain the biology behind the experiment (i.e. the behaviour you expect, if known).

Jose

--
Dr. Jose I. de las Heras
Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology
Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology
Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
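The thought experiment above can be tried numerically. The following base R sketch is an illustration under invented settings, not a reproduction of limma's fit: it uses `lowess()` with a span of 0.3, a made-up linear `dye_bias`, and hypothetical DE fractions and effect sizes. The helper `max_fit_error()` reports how far the fitted curve strays from the true bias curve in three scenarios: few DE spots, many DE spots split evenly up and down, and a majority of DE spots all going up.

```r
set.seed(1)
n <- 20000
A <- runif(n, 6, 14)

# Hypothetical intensity-dependent dye bias: the curve normalization should find.
dye_bias <- function(a) 1.5 - 0.15 * a

# Largest gap between the fitted lowess curve and the true bias curve.
max_fit_error <- function(frac_de, symmetric) {
  n_de <- round(frac_de * n)
  size <- runif(n_de, 1, 3)  # invented |log2 fold change| of DE spots
  direction <- if (symmetric) {
    sample(c(-1, 1), n_de, replace = TRUE)
  } else {
    rep(1, n_de)
  }
  effect <- numeric(n)
  effect[seq_len(n_de)] <- direction * size
  M <- dye_bias(A) + effect + rnorm(n, sd = 0.3)
  fit <- lowess(A, M, f = 0.3)
  max(abs(fit$y - dye_bias(fit$x)))
}

err_few <- max_fit_error(0.02, symmetric = FALSE)  # a few DE spots
err_sym <- max_fit_error(0.60, symmetric = TRUE)   # many DE, evenly up and down
err_up  <- max_fit_error(0.60, symmetric = FALSE)  # many DE, all up

cat("few DE spots:           ", round(err_few, 2), "\n")
cat("many DE, symmetric:     ", round(err_sym, 2), "\n")
cat("many DE, all one-sided: ", round(err_up, 2), "\n")
```

In this toy setting the curve stays close to the true bias whenever the DE spots are few or balanced, and is pulled away only when most spots move in the same direction, which is the distinction drawn in the reply.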

ADD REPLY • link 18.2 years ago J.delasHeras@ed.ac.uk ★ 1.9k
