Duplicate probes


Ed Siefker ▴ 230


I am analyzing Affymetrix hgu133plus2 arrays with limma. These arrays sometimes contain multiple probes for a single gene. I would like to combine the readings so that I get exactly one estimate of fold change per (Entrez) gene.

I looked at the duplicateCorrelation() function, but that doesn't seem to apply. If I understand correctly, it's for averaging duplicate spots per probe, not duplicate probes per gene. It requires the same number of duplicates across the chip anyway, which I don't have.

Just for illustration, here's a sample of some normalized expression data:

        control average   test average   test-control   Linear fold change
GENE1   2.38127           4.00571         1.62444       3.08322
GENE1   12.1182           13.5405         1.42224       2.68001
GENE1   9.85812           11.4534         1.59533       3.02163
GENE2   12.9662           12.7992        -0.1670        0.89070
GENE3   12.9649           12.9777         0.01275       1.00887
GENE3   2.23400           2.22957        -0.0044        0.99693
GENE3   11.8682           11.6099        -0.2583        0.83606

So it's pretty obvious that I can't just average the expression values, as they range from around 2 to around 12 for the same gene. It's also clear that I can't just filter out the probes with the least fold change, because that would lead to GENE1 and GENE3 both being called as differentially expressed, when the data appears to support differential expression of GENE1 much more strongly than it does GENE3.

For GENE1, 3 of 3 probes show a fold change near 3. For GENE3, 2 of 3 probes show no fold change at all. How do I use this information to adjust the estimation of confidence in differential expression?
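For anyone checking the numbers in the table: the averages are on the log2 scale, so the last column is 2 raised to the test-control difference. A minimal Python sketch (the `linear_fold_change` helper and variable names are illustrative, not from the post) reproduces the fold-change column from the posted averages:

```python
# Probe-level averages copied from the post (log2 scale):
# (gene, control average, test average)
probes = [
    ("GENE1", 2.38127, 4.00571),
    ("GENE1", 12.1182, 13.5405),
    ("GENE1", 9.85812, 11.4534),
    ("GENE2", 12.9662, 12.7992),
    ("GENE3", 12.9649, 12.9777),
    ("GENE3", 2.23400, 2.22957),
    ("GENE3", 11.8682, 11.6099),
]


def linear_fold_change(control, test):
    """Convert a log2 difference (test - control) to a linear fold change."""
    return 2.0 ** (test - control)


fold_changes = [(gene, linear_fold_change(c, t)) for gene, c, t in probes]
for gene, fc in fold_changes:
    print(f"{gene}  {fc:.3f}")
```

The three GENE1 probes sit at very different intensity levels (roughly 2, 12 and 10 in the control) yet give similar differences, which is why the comparison is made on the test-control column rather than on averaged intensities.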

hgu133plus2 probe limma



James W. MacDonald 65k


Hi Ed,

On 3/21/2012 12:20 PM, Ed Siefker wrote:
> [...]
> How do I use this information to adjust the estimation of confidence in differential expression?

Depends on what assumptions you want to make.

You could assume that some of the probesets don't do a good job of measuring the transcript of interest, and just select the one probeset with the largest difference in a given comparison. See findLargest() in the genefilter package.

You could assume that some of the probes/probesets don't really measure the transcript of interest and use an alternative probe-to-probeset mapping that only uses those probes shown to actually be complementary to the transcript of interest. See e.g. http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/15.0.0/entrezg.asp. These re-mapped CDFs can be installed using biocLite("hgu133plus2hsentrezgcdf") and then used with e.g. ReadAffy(cdfname="hgu133plus2hsentrezgcdf").

Or you could assume that some of the duplicate probesets measure differentially spliced transcripts, and leave them all in, and deal with duplicates on the back end (validation, etc.).

I don't know of any other readily accessible ways to deal with these probesets, but others may chime in with suggestions.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
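The first option, keeping the one probeset per gene with the largest difference, can be illustrated on Ed's sample table. This is a plain-Python sketch of the idea only: it is not the genefilter implementation, the `pick_largest` helper is hypothetical, and it ranks on the absolute test-control difference where a real analysis would more likely rank on a test statistic from the fitted model.

```python
# (gene, log2 test-control difference) pairs copied from the sample table.
log2_diffs = [
    ("GENE1", 1.62444), ("GENE1", 1.42224), ("GENE1", 1.59533),
    ("GENE2", -0.1670),
    ("GENE3", 0.01275), ("GENE3", -0.0044), ("GENE3", -0.2583),
]


def pick_largest(rows):
    """Keep, for each gene, the probe with the largest absolute difference."""
    best = {}
    for gene, diff in rows:
        if gene not in best or abs(diff) > abs(best[gene]):
            best[gene] = diff
    return best


selected = pick_largest(log2_diffs)
print(selected)
```

On this table the rule keeps the 1.62444 probe for GENE1 and the -0.2583 probe for GENE3, so each gene ends up represented by its most extreme probe.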



J.delasHeras@ed.ac.uk ★ 1.9k


Quoting Ed Siefker <ebs15242 at gmail.com> on Wed, 21 Mar 2012 11:20:17 -0500:
> [...]
> How do I use this information to adjust the estimation of confidence in differential expression?

Hi Ed,

it's not trivial to decide what to do with multiple probes. There are methods to summarise probeset data, using some kind of weighted median algorithm and other ways. But the truth is that sometimes probes "misbehave": they do not provide the signal we expect. Perhaps they cross-hybridise with other RNAs that we do not know about in principle, for instance transcript variants that are not annotated.

I personally have decided to keep each probe separate in my analyses. When I look at my list of DE genes, I would expect to find that if a transcript is represented in my arrays 3 times, by 3 different probes, I would get all three in my DE list. If I get only 2, I can then ask why the third probe did not behave the same way... that information is sometimes interesting, as you have the sequence information and can check where it matches. Sometimes you just can't figure it out... but I think 2 out of 3 is decent, so I keep it in my list. Even if you only get one hit, it can be a good hit...

The bottom line is it will be hard to decide which to discount and which to trust without detailed investigation... and possibly experimentation. That's ok if you decide to focus on a handful of transcripts after seeing your results, but not practical large-scale.

So if you must provide just one number, I would choose one probe and display that information, but I would never average across probes. How to choose which one... it's up to you ;) If they behave similarly... pick one randomly. If there are two different behaviours... you can display a representative of each, or pick the most common one, or the one that displays a behaviour most interesting for your purposes. There is no general rule. I favour showing a representative for each behaviour: the fact that I do not understand why I get different behaviours does not necessarily mean one of them is an artifact, so I like to avoid discarding any information I might find useful later as I learn more about the system.

When you then summarise and count genes/transcripts/probes, just state what it is that you are counting. I don't think there is anything wrong with saying you identified 10 genes, and showing a table with 12 rows, where two of the genes have two entries each.

But it all really depends on your goal.

Jose

--
Dr. Jose I. de las Heras
The Wellcome Trust Centre for Cell Biology
Institute for Cell & Molecular Biology
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK
Email: J.delasHeras at ed.ac.uk
Phone: +44 (0)131 6507090
Fax: +44 (0)131 6507360
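Keeping probes separate still allows a per-gene view of how consistently the probes behave. The sketch below is a Python illustration on Ed's sample table, assuming a hypothetical cutoff of |log2 difference| >= 1 as a stand-in for a real DE call (in practice that call would come from the limma results); the `tally_probes` helper is likewise made up for this example. It counts, for each gene, how many of its probes pass.

```python
from collections import defaultdict

# (gene, log2 test-control difference) pairs copied from the sample table.
log2_diffs = [
    ("GENE1", 1.62444), ("GENE1", 1.42224), ("GENE1", 1.59533),
    ("GENE2", -0.1670),
    ("GENE3", 0.01275), ("GENE3", -0.0044), ("GENE3", -0.2583),
]

CUTOFF = 1.0  # hypothetical threshold, for illustration only


def tally_probes(rows, cutoff=CUTOFF):
    """Return {gene: (probes passing the cutoff, total probes)}."""
    counts = defaultdict(lambda: [0, 0])
    for gene, diff in rows:
        counts[gene][1] += 1          # every probe counts toward the total
        if abs(diff) >= cutoff:
            counts[gene][0] += 1      # probe passes the illustrative cutoff
    return {gene: tuple(c) for gene, c in counts.items()}


tally = tally_probes(log2_diffs)
for gene, (hits, total) in tally.items():
    print(f"{gene}: {hits} of {total} probes pass")
```

A table like this makes explicit what is being counted, probes or genes, and flags genes whose probes disagree so they can be inspected individually.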

