Limma: background correction. Use or ignore?
@jdelasherasedacuk-1189
Hi Jim, many thanks for your reply.

> I have never been a big fan of subtracting background, especially if the
> background of the slide is low and relatively consistent. I have two
> main reasons for this.
>
> First, the portion of the slide used to estimate background doesn't have
> any cDNA bound, so you are estimating the background binding of the spot
> by using a portion of the slide that might not be very similar. When we
> were doing more spotted arrays, we would always spot unrelated cDNA on
> the slides as well (e.g., A.thaliana and salmon sperm DNA). These spots
> almost always had a negative intensity if you subtracted the local
> background, which indicates to me that cDNA does a better job of
> blocking the slide than BSA or other blocking agents.

That is true. In the arrays I use at the moment there is no unrelated cDNA spotted, but the previous ones I used carried some bacterial spots, and I noticed that effect sometimes. The background I was getting was usually very low, but on the few occasions when it was higher, I often *saw* a negative effect on those bacterial spots (and indeed on some others that just didn't hybridise). On the normally clean slides, this becomes apparent too when looking at the data.

> Second, you *are* adding more noise to the data. When you subtract, the
> variances are additive. However, if you don't subtract then you take the
> chance that you are biasing your expression values, especially if the
> background from chip to chip isn't relatively consistent. So the
> tradeoff is higher variance vs possible bias. If the background was
> consistent I usually took a chance on the bias in order to reduce the
> variance. As you note, the data usually look 'cleaner' if you don't
> adjust the background.

I just never imagined the effect could be so pronounced. In my new arrays I notice it a lot, I think because there are a lot of spots with mid-to-low intensities...
> Note that these points are directed towards simple subtraction of a
> local background estimate. Other more sophisticated methods may help
> address these shortcomings.

I'm at the moment exploring different methods to estimate the background, and also to subtract it, before I decide whether I should ignore background altogether (I think my slides are generally clean enough, and I'd rather repeat one experiment than risk my overall results if one slide turns out dirtier). Your comments and those of Naomi and Gordon were very helpful. Thanks a lot. I was very surprised to learn how much background subtraction can affect the final data. I thought I was safe because my slides were quite uniform and clean...

> As for references, have you looked at the references that Gordon gives
> on the man page for backgroundCorrect()? That would probably be a good
> place to start.

I have, thanks... the problem I find is that they get too deep into the actual algorithms, and I am looking at a more basic issue: not so much *how* to remove background, but *why* we have to, and how it can affect our analysis. But I did get some useful info there too, of course.

Thanks again!

Jose

--
Dr. Jose I. de las Heras                     Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology   Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology       Fax:   +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
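Jim's observation about unrelated-cDNA spots going negative is easy to sketch numerically. A minimal illustration (in Python rather than R, with made-up intensities; none of these numbers come from the thread): when a control spot blocks the glass better than the blocking agent around it, its foreground falls below the local background estimate, and simple subtraction goes negative.

```python
import numpy as np

# Hypothetical median intensities for three unrelated-cDNA (control) spots.
local_bg = np.array([120.0, 130.0, 125.0])  # local background ring around each spot
blank_fg = np.array([100.0, 115.0, 110.0])  # spot foreground: bound cDNA blocks binding

# Simple subtraction, as in limma's backgroundCorrect(method="subtract").
corrected = blank_fg - local_bg
print(corrected)  # [-20. -15. -15.] -> negative, so log-intensities are undefined
```

Negative corrected intensities are exactly what makes the subsequent log-transform drop these spots from the analysis.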
@jdelasherasedacuk-1189
Quoting Naomi Altman <naomi at stat.psu.edu>:

> I have investigated this (somewhat) experimentally. Background
> correction increases the variability of low-expression genes and
> reduces it for high expression. This corresponds to the RMA noise
> model, since background correction would double the additive variance
> but not affect the multiplicative variance (which is the dominant
> source of variance for highly expressed genes).
>
> --Naomi

Thank you Naomi,

I think this is one of the main reasons my new arrays suffered more noticeably after background subtraction. They contain a large number of low and mid-low intensity spots.

I was even considering making two subsets of my spots (higher and lower intensity), taking those that consistently have lower signal, and analysing them separately (in fact, scanning the slides at higher power to obtain stronger signals from those). I am not sure if I will be able to get more consistent results that way. I can scan at higher power and get reasonable intensities from the weaker spots, whilst keeping a low background... but then I have a lot of spots that are totally saturated and useless... that's why I thought of dividing the data set and doing a double scan. But I haven't tried it yet.

Jose
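Naomi's point about the additive variance doubling can be checked with a small simulation (a Python sketch with made-up noise levels, not code from the thread): at low intensities, where additive noise dominates, subtracting an equally noisy background estimate roughly doubles the variance of the corrected signal, because variances add under subtraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
sigma_add = 10.0                # sd of the additive (background-level) noise

true_signal, true_bg = 100.0, 50.0                        # a low-intensity spot
fg = true_signal + true_bg + rng.normal(0, sigma_add, n)  # measured foreground
bg = true_bg + rng.normal(0, sigma_add, n)                # measured local background

corrected = fg - bg             # Var(fg - bg) = Var(fg) + Var(bg) for independent noise
ratio = corrected.var() / fg.var()
print(round(ratio, 2))          # ~2.0: additive variance doubles
```

For bright spots, where multiplicative noise dominates, the same subtraction barely changes the total variance, which matches the RMA noise model Naomi describes.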
We have done up to 4 scans of the same arrays at different intensities with little apparent degradation of the signal. We found a setting of the scanner that preserved the full optical range, and did not try to combine data across scanner levels, although there is some literature on this.

--Naomi

Naomi S. Altman                 814-865-3791 (voice)
Associate Professor
Dept. of Statistics             814-863-7114 (fax)
Penn State University           814-865-1348 (Statistics)
University Park, PA 16802-2111
Quoting Naomi Altman <naomi at stat.psu.edu>:

> We have done up to 4 scans of the same arrays at different
> intensities with little apparent degradation of the signal. We found
> a setting of the scanner that preserved the full optical range, and
> did not try to combine data across scanner levels, although there is
> some literature on this.
>
> --Naomi

Thanks Naomi.

I came across a Perl script called Masliner. It's described in Dudley et al., PNAS (2002) 99:7554-7559. I haven't tested it in any detail yet (so much to do, so little time...) but I had it set up to run under Cygwin by our local computer guy. I also asked him to write a wrapper so that I only need to input my two .gpr files (it's only set up for GenePix files) and the program will produce a "fake" output .gpr that can be used straight away (the Perl script alone forces you to do some manual changes). I mean to check it soon.

Jose
On 4/5/06, J.delasHeras at ed.ac.uk <j.delasheras at ed.ac.uk> wrote:
> I came across a Perl script called Masliner. It's described in Dudley
> et al., PNAS (2002) 99:7554-7559. [...] I mean to check it soon.

Hi again,

the multiscan calibration method in Bengtsson, H.; Jönsson, G. & Vallon-Christersson, J., "Calibration and assessment of channel-specific biases in microarray data with extended dynamical range", BMC Bioinformatics, 2004, is implemented in the aroma package (http://www.braju.com/R/). First, the method is applied to each channel separately, which means it can also be used for Affymetrix scans (if you can change the PMT). Second, each array has to be scanned at at least two different PMT settings. Example:

  gpr <- GenePixData$read(pattern="scan[0-9].gpr")
  rg <- as.RGData(gpr)
  calibrateMultiscan(rg)

(You can set the fields of the gpr object and then save it to file, but I would not recommend modifying a GPR file, because its different fields will become inconsistent with each other.)

If you prefer to work with data in a plain matrix, use the aroma.light package instead. Also, the help pages of that package are much more up to date. Consider one channel at a time. Assume that the signals from three scans are stored in vectors X1, X2 and X3. Then:

  X <- cbind(X1, X2, X3)
  Xest <- calibrateMultiscan(X)

To get the actual parameter estimates, for instance the scanner offset, see the attributes of Xest. If you use BASE, the aroma.Base package has a plugin utilizing the aroma.light method.

The advantages of multiscan calibration are several: 1) you remove the offset of the scanner; 2) the effect of the scanner noise is smaller, because each feature is measured multiple times (contrary to Masliner); 3) the dynamical range of your scanner is increased; and 4) you do not have to worry about exact PMT settings and saturation (the estimation methods are rather robust against this, and in the worst case you can always downweight such signals yourself).

Note that the above applies to changing the PMT. We haven't tried the same thing modifying the laser power. If one identifies an offset by changing the laser power this way, it is a different type of offset than the one we talk about in the paper. The one found in the paper is most likely due to the PMT (photo-multiplier tube) or the A/D converters that follow it.

Finally, I would like to collect scanner-offset estimates from as many sources as possible, so if you try the above method I would be happy if you could forward your estimates, together with the scanner and scanner settings (power and PMT) you used.

Best,

Henrik
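For intuition, the core of the multiscan idea can be sketched outside aroma with ordinary least squares on simulated data (a deliberate simplification of the paper's more robust estimator, with the extra assumption that the offset is shared across scans): under the model y_i = a_i * x + b, regressing one scan on another identifies the offset.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1000, 5000)                 # true feature signals (simulated)
b = 20.0                                       # scanner offset (~20 units in the paper)
y1 = 1.0 * x + b + rng.normal(0, 2, x.size)    # low-PMT scan  (gain a1 = 1)
y2 = 4.0 * x + b + rng.normal(0, 2, x.size)    # high-PMT scan (gain a2 = 4)

# y2 = (a2/a1)*y1 + b*(1 - a2/a1), so the regression of y2 on y1 gives
# slope ~ a2/a1 and intercept ~ b*(1 - slope).
slope, intercept = np.polyfit(y1, y2, 1)
b_hat = intercept / (1.0 - slope)              # solve for the shared offset
x_hat = y1 - b_hat                             # calibrated signal, on scan 1's scale

print(round(b_hat, 1))                         # close to 20
```

The published method fits all scans jointly and robustly rather than pairwise by least squares, but the recovered quantity is the same: the additive offset that biases low-intensity log-ratios.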
@henrik-bengtsson-4333
Hi,

I'm jumping in to this thread here. I will try to comment on most of the message, so there will be more following. People will probably disagree with me in some cases.

On 3/31/06, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> Hi Jose,
>
> J.delasHeras at ed.ac.uk wrote:
> > I have been using limmaGUI for a while to analyse my cDNA microarrays.
> > I have always used "subtract" as a method for background correction.
> > Why? Not sure. Intuitively it made sense, and I didn't observe any
> > obvious problems. Once I played with the different methods for
> > background correction available in limmaGUI, and when looking at the
> > MA plots I decided I preferred to subtract.
> >
> > However, I have recently had problems with the statistics being quite
> > poor in my analyses (see my post a week ago or so about low B
> > values)... and whilst checking the data, I noticed that at least in my
> > current experiments, if I do no background correction at all the stats
> > look a lot better, the MA plots look better, and everything looks
> > better in general. The actual list of genes doesn't change a lot, but
> > the values seem a lot tighter.

A leading question: what do you mean by "MA plots look better"? They can look better in many ways, depending on what you are trying to answer. To simplify things very much, we have two cases of questions:

1) Find differentially expressed genes; that is, we are trying to test the null hypothesis H0: mu = 0 against H1: mu != 0, where mu is the unknown log-ratio of the gene (in two samples).

2) Estimate the unknown log-ratio of the gene (in two samples), i.e. estimate mu. This may for instance be of interest in copy-number analysis.

In Case 1, it does not matter much if the *absolute* values of our mu estimates are biased or not - we are still trying to identify those away from zero. In other words, if we rescale the estimates we will, in theory, still be able to identify differentially expressed genes. This is what the variance-stabilizing (VS) methods (Huber, and Rocke & Durbin) make use of.

In Case 2, the unbiased estimates are by definition the quantities of interest. For this reason we cannot use, for instance, VS methods in this case. An exception is if you change your objective to identifying, say, genes with copy numbers 0, 1, 2, 3, ...; then we can turn it into a classification problem, and VS methods may still be valid.

> > This makes me question whether we should background correct at all. My
> > slides are pretty clean, low background. Am I not adding more noise to
> > the data by removing background?
>
> I have never been a big fan of subtracting background, especially if the
> background of the slide is low and relatively consistent. [...]
>
> Second, you *are* adding more noise to the data. When you subtract, the
> variances are additive. [...] As you note, the data usually look
> 'cleaner' if you don't adjust the background.

I totally agree with you: it [the log-ratio vs log-intensity scatter plot] "looks" cleaner, but that does not necessarily mean it is better. Especially if one deals with Case 2 above, I normally say that if you do *not* see large variance in log-ratios at lower intensities, you are doing something wrong. This is of course not the full story - it depends on what methods you use downstream. However, I don't really trust someone who compares two log-ratio vs log-intensity plots, points at one of them, and says "I used this one because there is less spread".

Hopefully not being too self-oriented, I would like to refer to Bengtsson & Hössjer, "Methodological study of affine transformations of gene expression data with proposed robust non-parametric multi-dimensional normalization method", BMC Bioinformatics, 2006, for more details. I also have quite a few talks on the topic at http://www.maths.lth.se/bioinformatics/. The VS papers address this too, but much less explicitly.

> Note that these points are directed towards simple subtraction of a
> local background estimate. Other more sophisticated methods may help
> address these shortcomings.

It is important to differentiate between true background and background methods. It is even more important to differentiate between all the types of background that can be introduced in the microarray process. Background can be introduced at many stages, e.g. labelling, cross-hybridization, dust, scanning, image analysis and so on. There is no single method that addresses all of them, and that is important to understand/accept. For instance, the paper Yang et al., "Comparison of methods for image analysis on cDNA microarray data", J Comp Graph Stat, 2002, shows that different image-analysis methods estimate background differently. Thus, when we choose a method, we introduce a bias (unless you're lucky enough to hit the right one). Similar conclusions can be drawn from Bengtsson & Bengtsson, "Microarray image analysis: background estimation using quantile and morphological filters", BMC Bioinformatics, 2006.

Another example is scanner bias. We found that both Axon and Agilent scanners introduce a substantial offset in signals. See Bengtsson et al., "Calibration and assessment of channel-specific biases in microarray data with extended dynamical range", BMC Bioinformatics, 2004. The offset in both scanners was/is about 20 units on the range [0, 65535]. It does not sound like much, but 20 is definitely enough to bias your log-ratios. We have seen similar effects in Affymetrix scanners. Afterwards, we identified some models of the same brands that do not have such a strong offset. Thus, when we choose a scanner we introduce bias. I'll reply in another message on how to estimate and correct for this. It is easy.

We can of course argue that classical image-analysis background-correction methods correct for scanner bias too, i.e.

  y_fg = y + y_scanner + eps   and   y_bg = y_scanner + xi
  =>  y_est = y_fg - y_bg = y + eps'

However, xi will probably introduce unnecessary variance, but also bias; see the Bengtsson & Bengtsson paper for the latter. If we believe that features on the arrays can be contaminated by unwanted fluorescent molecules, then we have yet another source of background, and so on.

Finally, consider the following rhetorical argument. If we accept that there are scanner offsets, which I believe we have been able to prove in the above paper, then it is very hard to argue that you should not correct for background. If we still argue that we should not subtract, then consider using two different scanner brands/models for the same array: they introduce different offsets, and we get a contradiction. In this way one can argue that it is very strange if we do not need to correct for additive background, whatever its origin.

/Henrik

> As for references, have you looked at the references that Gordon gives
> on the man page for backgroundCorrect()? That would probably be a good
> place to start.
>
> Best,
>
> Jim
>
> > Can anybody point me to a good reference to learn about the effects of
> > background correction, pros and cons? I'm just a molecular biologist,
> > not a statistician, but I need to understand a bit better these issues
> > or there'll be no molecular biology to work on from my experiments!
> >
> > Jose
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
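The effect Henrik describes is plain arithmetic. With hypothetical two-channel intensities (numbers invented for illustration), an additive offset of about 20 units pulls log-ratios towards zero, most strongly for dim features:

```python
import math

r, g = 200.0, 100.0          # hypothetical two-channel signals; true log2-ratio = 1
offset = 20.0                # additive scanner offset on the [0, 65535] scale

true_m = math.log2(r / g)
biased_m = math.log2((r + offset) / (g + offset))

print(true_m)    # 1.0
print(biased_m)  # ~0.87: the offset compresses the log-ratio towards zero
```

At high intensities (say r = 20000, g = 10000) the same 20-unit offset is negligible, which is why the bias shows up as the characteristic fanning-in of M values at the low-intensity end of an MA plot.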
Quoting Henrik Bengtsson <hb at maths.lth.se>:

> A leading question: what do you mean by "MA plots look better"? They
> can look better in many ways, depending on what you are trying to
> answer. To simplify things very much, we have two cases of questions:
>
> 1) Find differentially expressed genes; that is, we are trying to test
> the null hypothesis H0: mu = 0 against H1: mu != 0, where mu is the
> unknown log-ratio of the gene (in two samples).
>
> 2) Estimate the unknown log-ratio of the gene (in two samples), i.e.
> estimate mu. This may for instance be of interest in copy-number
> analysis.
>
> In Case 1, it does not matter much if the *absolute* values of our mu
> estimates are biased or not - we are still trying to identify those
> away from zero. In other words, if we rescale the estimates we will,
> in theory, still be able to identify differentially expressed genes.
> This is what the variance-stabilizing (VS) methods (Huber, and Rocke &
> Durbin) make use of.

This is my case. I am looking for genes that are differentially expressed. In fact, a lot of the time I am looking for genes that are NOT expressed (or have minimal expression) in either of the samples, and for these cases the values of M are irrelevant (I just care that they're high, in absolute terms).

When I said the MA plots looked better, I was referring to the general distribution of the genes, and the way known genes were located in the plot. Multiple spots for a given gene often clustered better.

> Hopefully not being too self-oriented, I would like to refer to
> Bengtsson & Hössjer, "Methodological study of affine transformations
> of gene expression data with proposed robust non-parametric
> multi-dimensional normalization method", BMC Bioinformatics, 2006, for
> more details. I also have quite a few talks on the topic at
> http://www.maths.lth.se/bioinformatics/. The VS papers address this
> too, but much less explicitly.

Thanks for that. I will take a look!

> Another example is scanner bias. We found that both Axon and Agilent
> scanners introduce a substantial offset in signals. See Bengtsson et
> al., "Calibration and assessment of channel-specific biases in
> microarray data with extended dynamical range", BMC Bioinformatics,
> 2004. The offset in both scanners was/is about 20 units on the range
> [0, 65535]. It does not sound like much, but 20 is definitely enough
> to bias your log-ratios. We have seen similar effects in Affymetrix
> scanners. Afterwards, we identified some models of the same brands
> that do not have such a strong offset. Thus, when we choose a scanner
> we introduce bias. I'll reply in another message on how to estimate
> and correct for this. It is easy.

When looking at my data, I have observed that while the foreground signals were pretty much comparable between slides scanned with an Axon scanner or an ArrayWoRx one (the latter using white light rather than lasers), the background was very different between the two. As it turns out, the Axon ones looked cleaner (using imageplot). In this case, I get the best stats if I do not correct for background (instead of subtracting it). When the slides were scanned with the ArrayWoRx scanner (higher background, and highly variable between channels), I get better results if I subtract the background. The list of genes is not very different. But the stats are.

Jose