fold-change when no expression to high expression


Matthew McCormack ▴ 180

@matthew-mccormack-2021


United States

Transcripts that are not expressed in the control but are highly expressed in the treatment theoretically have an infinite fold-change. Preprocessing algorithms will still report a fold-change for these genes, but doing so seems to assume that all genes are expressed to some small degree at all times and that the chip can reliably detect this. If that assumption does not hold, the fold-change reported for genes that go from no expression to expression would be very unreliable, and could not be compared with fold-changes for genes that have an appreciable signal intensity in both control and treatment. These off-on genes are biologically very important to identify. Failing to identify them because of low or absent control signal would give a misleading picture from a biological viewpoint. Is there any algorithm in Bioconductor that addresses this problem?

Matthew McCormack

GO Preprocessing • 3.5k views

updated 16.7 years ago by Wolfgang Huber ★ 13k • written 16.7 years ago by Matthew McCormack ▴ 180


Laurent Gautier ★ 2.3k

@laurent-gautier-29


If I understand your question correctly, the issue is fold-changes for which the denominator is very small or zero.

You may consider adding a small offset to the signal (a "fudge factor") so that the denominator leaves the "danger zone", or using a generalized-log transform (I think that the function glog() is in the package "vsn").

L.
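The offset idea can be illustrated with a minimal sketch. It is written in plain Python rather than with any Bioconductor function, and the intensities, the offset values, and the helper name `log2_fold_change` are all hypothetical choices made for illustration only.

```python
import math

def log2_fold_change(treatment, control, offset=0.0):
    """log2(treatment / control) after adding the same offset to both signals.

    Hypothetical helper for illustration; not a Bioconductor function.
    """
    num = treatment + offset
    den = control + offset
    if num < 0 or den < 0 or (num == 0 and den == 0):
        return math.nan   # ratio is undefined
    if den == 0:
        return math.inf   # "off" in control: infinite fold-change
    if num == 0:
        return -math.inf  # "off" in treatment
    return math.log2(num / den)

# Hypothetical background-corrected intensities (treatment, control).
off_on_gene = (2000.0, 0.0)       # no signal in control
expressed_gene = (2000.0, 500.0)  # appreciable signal in both

raw = log2_fold_change(*off_on_gene)                    # infinite
shifted_32 = log2_fold_change(*off_on_gene, offset=32)  # finite
shifted_8 = log2_fold_change(*off_on_gene, offset=8)    # finite, but larger
well_measured = log2_fold_change(*expressed_gene, offset=32)

print(raw, shifted_32, shifted_8, well_measured)
```

The off-on gene gets a finite value once an offset is added, but that value is driven mostly by the choice of offset, whereas the gene with signal in both conditions is barely affected. This is one way to see why such numbers are hard to compare across genes.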

16.7 years ago • Laurent Gautier ★ 2.3k


I don't think there is a standard way to deal with these genes, and you are right: eliminating them would mean missing some of the potentially most interesting data.

If you look at fold changes, there will always be some very large ones, although never actually infinite, because there is always some background intensity. Depending on how you deal with background intensities, the fold difference can be larger or smaller, but in any case larger than for the rest of the genes. What I do is use that information (fold change) as is, as a ranking device if you wish. I don't like to call it fold change, though, because when I verify my data by RT-PCR, the fold differences I measure differ from those of the microarray (and unless you have control spots to calibrate the array, this will be the rule). I don't mean to tell you to change what you call those ratios, just to highlight that if you want to talk about real fold changes, microarray data are unlikely to be accurate enough. With this premise, I don't worry about whether a gene that goes from zero in the control to X level of expression shows a "fold change" value of 40 or 400. Once you identify it as a possible off/on gene, the actual value is irrelevant.

Things may look a bit better with one-colour arrays like Affymetrix or Nimblegen, where the oligos are synthesised on the array itself. There it is easier to identify genes that are not expressed (although there will always be a grey area), and you can express the data as log(intensity) rather than as log(ratio). Again, the expression values may not be as accurate as what you'll see with RT, but it makes (in my opinion) dealing with "on/off" genes a bit more reasonable.

In my work, a lot of what I do is actually based on looking for genes that become either silenced or activated after a given treatment. I generally look for a signal threshold below which I can be confident that a gene is not expressed, and another threshold above which I can be quite confident that a gene is expressed. Then I compare both. I only use log(ratios) as a ranking parameter, something that gives me an idea of which genes show a larger change. There is a grey area between the two thresholds... I am aware of that, and I know I will miss things, probably some interesting ones too... but what I find is usually solid, and I would rather get a "cleaner" list by using two thresholds.

It's not a complicated issue, intellectually. I think everybody deals with it in their own way. It is very important to know the limitations of any method and, when you present the data, to present meaningful results. In my opinion, reporting a 1000-fold change when you are really saying "I can see this gene expressed at about 1000 times the level of the control background" is not very nice. There is a point after which we are not talking about fold-changes anymore, so I prefer to call them ratios or log(ratios) or whatever. Maybe it's just semantics, but it's the way I like to deal with the situation you describe.

Adding a "fudge factor" is another way to deal with things like this. When background correction was performed simply by subtracting a local background, one often obtained negative signals... to solve that, adding a small factor was commonly done. But I personally don't like to do that. At the end of the day, you can pick out those "on/off" genes with any (reasonable) method, and their ratios will never be meaningful as "fold differences"...

Maybe I went on for too long, sorry :-)

Jose

Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology, Institute for Cell & Molecular Biology, University of Edinburgh
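The two-threshold approach described in this reply could be sketched roughly as follows. This is only an illustration in plain Python: the threshold values, the gene names, the log2 intensities, and the function names are all hypothetical, and in practice the thresholds would have to be chosen from the data at hand.

```python
def call_expression(signal, off_below, on_above):
    """Label one log-intensity as 'off', 'on', or 'unclear' (the grey area)."""
    if signal < off_below:
        return "off"
    if signal > on_above:
        return "on"
    return "unclear"

def classify_switch(control, treatment, off_below, on_above):
    """Flag genes that move from confidently off to confidently on, or back."""
    c = call_expression(control, off_below, on_above)
    t = call_expression(treatment, off_below, on_above)
    if c == "off" and t == "on":
        return "activated"
    if c == "on" and t == "off":
        return "silenced"
    return "other"  # includes anything touching the grey area

# Hypothetical thresholds on the log2(intensity) scale.
OFF_BELOW, ON_ABOVE = 7.0, 9.0

# Hypothetical log2 intensities: gene -> (control, treatment).
genes = {
    "geneA": (6.1, 11.4),  # off -> on
    "geneB": (10.8, 6.5),  # on -> off
    "geneC": (8.0, 11.0),  # control falls in the grey area
    "geneD": (10.0, 12.0), # expressed in both
}

calls = {g: classify_switch(c, t, OFF_BELOW, ON_ABOVE)
         for g, (c, t) in genes.items()}

# Rank the flagged genes by the size of the change, as a ranking device only.
switched = sorted(
    (g for g, call in calls.items() if call != "other"),
    key=lambda g: abs(genes[g][1] - genes[g][0]),
    reverse=True,
)
print(calls, switched)
```

Genes such as the hypothetical geneC are deliberately left out of the list: the price of a cleaner list is that some real off-on genes in the grey area will be missed.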

16.7 years ago • J.delasHeras@ed.ac.uk ★ 1.9k


Wolfgang Huber ★ 13k

@wolfgang-huber-3550


EMBL European Molecular Biology Laborat…

Hi Matthew,

There is a discussion on this topic in chapter 5 of our "case studies" book [1]. It is treated more technically in [2], and very briefly in Section 12 of the vignette of the vsn package.

Basically: these genes are of course very important. The variance stabilisation trick still allows one to report reproducible "generalised log-ratios" in these cases. These are estimators of the true log-ratios that are shrunken towards 0 (from +/- infinity), and the amount of shrinkage depends on the sensitivity of the array, as estimated from the "background" component of the noise. Note the word *estimator*: it is useful to distinguish your data-based estimate from the unknown, true value, and to know what stochastic and systematic effects might occur between them.

You are also right that the (log-)ratio is a compression of the data that loses information. If you do not want this information loss, you can always go back and look at the (glog) intensities in control and treatment.

[1] Bioconductor Case Studies. http://www.springer.com/statistics/stats+life+sci/book/978-0-387-77239-4
[2] Huber W., Von Heydebreck A. and Vingron M. (2004) Error models for microarray intensities. http://www.ebi.ac.uk/huber/docs/huber_vingron_2004.pdf

Best wishes
Wolfgang

Wolfgang Huber, EMBL, http://www.ebi.ac.uk/huber
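To make the shrinkage concrete, here is a minimal numerical sketch in plain Python. It is not the vsn implementation: it uses one common textbook form of the generalised logarithm, and the constant `c` is a hypothetical stand-in for the background noise level that vsn would estimate from the data. The intensities are made up.

```python
import math

def glog2(x, c):
    """One common form of the generalised log (base 2).

    Behaves like log2(x) when x is much larger than c, but stays finite
    for x = 0 and even for negative x. `c` is an assumed, hand-picked
    constant here; it is not estimated from data as vsn would do.
    """
    return math.log2((x + math.sqrt(x * x + c * c)) / 2.0)

def generalised_log_ratio(treatment, control, c):
    """Difference of glog2 values: a shrunken estimate of the log2 ratio."""
    return glog2(treatment, c) - glog2(control, c)

# Hypothetical background-corrected intensities.
high_both = generalised_log_ratio(8000.0, 2000.0, c=100.0)    # close to log2(4)
off_on_sensitive = generalised_log_ratio(2000.0, 0.0, c=100.0)
off_on_noisy = generalised_log_ratio(2000.0, 0.0, c=400.0)    # shrunken more

print(high_both, off_on_sensitive, off_on_noisy)
```

For the gene with strong signal in both conditions, the result is almost the ordinary log-ratio. For the off-on gene, the result is finite, and it moves closer to 0 as the assumed noise constant grows, which mirrors the statement that the amount of shrinkage depends on the sensitivity of the array.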

16.7 years ago • Wolfgang Huber ★ 13k
