How to deal with multiplicate Affy probes?

Johnnidis, Jonathan ▴ 50

Hi Bioconductor,

I'm a new list member and am not quite sure if this question is appropriate for the list, but will shoot anyway. I'm analyzing a bunch of data from Affy MgU74Av2 chips and am a bit perplexed as to how to treat conflicting expression data from multiplicate probe sets, that is, a gene that has >1 probe set designed against it (for example, 97569_r_at and 97658_r_at are both probe sets for the Insulin gene).

Specifically, if probe set #1 for geneX indicates a significant fold change for that gene, but probe set #2 indicates something else (no fold change, or even a fold change in the opposite direction, which is rare but possible), how can the expression status of geneX be properly evaluated? Can one probe set's measurement be considered more reliable than another's, so that you toss the one you suspect is wrong (although this could introduce experimental bias)? Or is it most appropriate to average the signal values for multiplicate probe sets together? Or is there some other method?

On the MgU74Av2 chip at least, by my calculations there are at least 1079 genes that have >1 probe set against them (2323 probe sets total that are 'multiplicates'), so the numbers are great enough to potentially impact my analysis. Any ideas/suggestions/criticisms will be much appreciated.

with thanks,
Jonathan Johnnidis

mgu74av2 probe affy

updated 21.8 years ago by Ron Ophir ▴ 80 • written 21.8 years ago by Johnnidis, Jonathan ▴ 50

Arne.Muller@aventis.com ▴ 620

Hi Jonathan,

interesting question. Basically, if I'm just interested in the set of differentially regulated genes, I ignore redundant Affy probe sets. I.e. if at least one probe set for a given gene fulfills the selection criteria (fold change, p-value ...), I include the *gene* in my list. I usually convert all Affy probe set accession codes into their corresponding LocusLink IDs, from which I then remove duplicates. You could also use UniGene cluster accession codes. Most of this info is provided by NetAffx. However, not all probe sets can be mapped to UniGene or LocusLink (I consider these orphans and treat each of them as a single gene).

Calculating a fold change for a gene that has >1 probe set is a nasty problem. Alternative splicing may play a role, too! I suggest keeping the most extreme fold change of the corresponding probe sets, since fold changes of the probe sets within a gene can be very different (also with different significance for differential expression).

regards,
Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
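The collapsing step described in this reply can be sketched roughly as follows. This is only an illustration in Python, not code from the thread: the function name `collapse_to_genes`, the gene label `insulin`, the probe set `orphan_at`, and all fold-change values are made up for the example. The probe-set-to-gene mapping is assumed to come from an annotation source such as NetAffx.

```python
def collapse_to_genes(log_fold_changes, probe_to_gene):
    """For each gene, keep the probe set with the largest absolute log fold change.

    Probe sets without a gene mapping ("orphans") are kept under their own ID,
    i.e. each orphan is treated as a single gene.
    """
    best = {}
    for probe_set, lfc in log_fold_changes.items():
        # Fall back to the probe set ID itself when no gene is known.
        gene = probe_to_gene.get(probe_set, probe_set)
        if gene not in best or abs(lfc) > abs(best[gene][1]):
            best[gene] = (probe_set, lfc)
    return best


# Hypothetical log2 fold changes; only the first two probe set IDs come from the thread.
log_fold_changes = {"97569_r_at": 1.8, "97658_r_at": -0.2, "orphan_at": -0.9}
# Hypothetical annotation: both probe sets map to the same gene, the third is unmapped.
probe_to_gene = {"97569_r_at": "insulin", "97658_r_at": "insulin"}

per_gene = collapse_to_genes(log_fold_changes, probe_to_gene)
for gene, (probe_set, lfc) in per_gene.items():
    print(gene, probe_set, lfc)
```

Keeping the most extreme value per gene is a selection step, so the retained fold changes will tend to look larger than an average over the probe sets would; that trade-off is worth keeping in mind when comparing genes with one probe set against genes with several.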

Jeff Gentry ★ 3.9k

Laurent was having trouble mailing this, so I forwarded it for him:

--------------------------------------------------------------

Jonathan,

You bring up several issues (which I see as almost separate ones).

One is the probe intensities. Until very recently there was apparently a consensus in the Affy world: the individual probe intensities in a probe set were 'summarized' by one value. The trimmed average of probe intensities first suggested by the manufacturer was soon followed by more sophisticated approaches that try to provide robustness by isolating probes with obviously erratic behavior. Note that the hybridization signal must depend on the nature of the probe (binding energy, for example) and on the nature of the target(s) (cross-hybridization, self-hybridization, ...), and this is not accounted for in most of the methods proposed to "summarize" probe intensities into an "expression value". Methods for "averaging" the probe intensities in a probe set are only beginning to consider the physico-chemical properties of the respective probes (in Bioconductor, check the packages 'gcrma' and 'affypdnn').

What is needed to make the "summary expression value" paradigm more reliable is a transformation of individual probe intensities that discards the probe-specific signal and keeps only the experiment-specific signal. One can see an analogy with the "between-chips" normalization step, but this time at the probe set level.

While waiting for this "between-probes normalization", some have taken the approach of considering the individual probes in a probe set as separate measures of differential expression. One can see each probe as a member of a jury, each voting (or not) for differential expression. The package 'affy' offers facilities to explore this approach; check the function 'ppsetApply'.

Another issue is the association of probes in probe sets. The probes were designed to match subsequences in target RNA. For any given RNA chosen to be addressed by a chip type, 20 (or 16, or whatever number) probes were chosen to match this RNA. The problem is that in many cases the gene sequences in databases are not 100% certain (and eventually get corrected). This could be called the Dorian Gray syndrome: currently the probes on any given Affymetrix chip, the association of probes in probe sets, and the association of the probe set with an RNA (and its functional annotation) remain frozen in an apparent eternal youth, while their ability to monitor biological phenomena... degrades. I have built an alternative mapping using very recent sets of RNA reference sequences (NCBI's RefSeqs) and modern Affymetrix chips (HG-U133A), and observed significant differences between the two mappings. Those differences were enough to cause quite some difference in the outcome of an analysis. The (new) package 'altcdfenvs' contains tools to help build one's own mapping.

These two issues are not completely independent, since a "bad mapping" will certainly result in probe sets with probes showing erratic behavior.

From my recent experience in analyzing Affymetrix data, I would consider carrying out several analyses (same data, different mappings, summary and non-summary approaches), especially before going for long and expensive complementary experiments.

Hopin' it helps,

Laurent
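The "jury" idea from this message, where each probe in a probe set votes for or against differential expression, might look roughly like the sketch below. It is written in Python for illustration and is not the interface of 'ppsetApply' in the 'affy' package. The Welch t statistic, the threshold of 3.0, the function names, and the intensity values are all assumptions chosen for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for mean(b) - mean(a); assumes both groups have some variance."""
    return (mean(b) - mean(a)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def probe_votes(control, treated, threshold=3.0):
    """Let each probe vote 'up', 'down', or abstain.

    control and treated hold one list of replicate log intensities per probe,
    in the same probe order. The threshold is an arbitrary illustrative cut-off.
    """
    up = down = 0
    for ctrl_values, trt_values in zip(control, treated):
        stat = welch_t(ctrl_values, trt_values)
        if stat > threshold:
            up += 1
        elif stat < -threshold:
            down += 1
    return {"up": up, "down": down, "abstain": len(control) - up - down}


# Hypothetical log intensities: 3 probes of one probe set, 3 replicate chips per condition.
control = [[7.0, 7.2, 7.1], [8.0, 8.1, 7.9], [6.5, 6.4, 6.6]]
treated = [[9.0, 9.1, 8.9], [8.0, 8.2, 7.9], [6.6, 6.5, 6.4]]

votes = probe_votes(control, treated)
print(votes)
```

In this toy data only the first probe changes clearly, so a summary value and a probe-level vote could lead to different conclusions about the probe set, which is the kind of disagreement the message suggests exploring.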

Michael Seewald ▴ 130

As a rule of thumb: if statistics based on the data for a given probe set tell you that a transcript is significantly deregulated, you can usually trust it and discard every other probe set for that transcript!

The thing to look at is the probe design itself: download the probe set from NetAffx and BLAST the single probes against the genome (e.g. in Ensembl). You will be surprised how many probes match introns or genomic regions that do not correspond to any cDNA!

Two examples: there are 4 probe sets for human Wnt6 (HG-U133AB); 2 match the sense (!) strand and have to be discarded. Out of >12 probe sets for human CD44, only 4 have probes that completely match the transcripts; >8 can be discarded.

Best,
Michael

PS: www.ensembl.org is always a good place to check probe sets. Their mapping of probe sets does not show the location of single probes, though...

PPS: On affymetrix.com you can check out the "Details" view for a probe set. There you can discover that 2 probe sets of Wnt6 map to the (-) strand, which is bad. It doesn't tell you, however, that many probe sets match intron regions.

Lawrence Paul Petalidis ▴ 130

Hello,

As a note following on from Michael Seewald's message, I totally agree that there is a STRONG need to BLAST probe set sequences. I tend to use the probe set target sequence instead of the individual probe sequences, however. You will be surprised to see the inconsistency of the Affy annotation: in many cases _at probe sets are really not unique at all. So if you are really interested in a transcript, BLAST it to make sure you are actually seeing what you think you are.

Best regards to all,
Lawrence

______________________________
Lawrence Paul Petalidis
Ph.D. Candidate
University of Cambridge
Department of Pathology
______________________________

Johnnidis, Jonathan ▴ 50

Thank you for your suggestions. However, in this instance I'm not interested in particular transcripts but rather in an entire range of transcripts (several hundred), so I'm not sure it would be feasible to individually look up and verify every single probe set...

Jonathan

Arne.Muller@aventis.com ▴ 620

Hi,

you may be able to automate this by BLASTing all the target sequences (as Lawrence suggested) against the Ensembl confirmed or predicted genes (i.e. not the complete genome but just the genes). Then only look at matches with >95% sequence identity (I'm not sure about this cut-off). In your analysis, ignore all probe sets that do not have a confident match (<95% sequence identity). One could say that what remains is the actually informative subset of probe sets on the chip.

Note that you can still get >1 match with >95% identity per probe set! In my opinion the corresponding *single* expression measure is meaningless, since you cannot measure the expression of >1 gene with the same probe set ... For cases with >1 gene per probe set (as determined by BLAST using the target sequence) you may need to fall back to the single-probe level, where you may find that one of the genes has >95% sequence identity in many probes whereas the other doesn't.

As I said above, you could automate this, but it's not an easy task .. :-(

regards,
Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com
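Once the BLAST results are in hand, the filtering step proposed here could be sketched as follows. This is an illustrative Python outline only: the `(probe_set, gene, percent_identity)` tuple layout is an assumed, simplified stand-in for parsed BLAST output, and the probe set and gene names are invented. The 95% cut-off is the tentative value from the message.

```python
from collections import defaultdict


def classify_probe_sets(hits, all_probe_sets, min_identity=95.0):
    """Sort probe sets into 'unique', 'ambiguous' or 'no_match'.

    hits is an iterable of (probe_set, gene, percent_identity) tuples, one per
    target-sequence match. A probe set is 'unique' if exactly one gene matches
    above the cut-off, 'ambiguous' if several do, and 'no_match' otherwise.
    """
    confident_genes = defaultdict(set)
    for probe_set, gene, identity in hits:
        if identity > min_identity:
            confident_genes[probe_set].add(gene)

    status = {}
    for probe_set in all_probe_sets:
        n_genes = len(confident_genes.get(probe_set, ()))
        if n_genes == 0:
            status[probe_set] = "no_match"
        elif n_genes == 1:
            status[probe_set] = "unique"
        else:
            status[probe_set] = "ambiguous"
    return status


# Hypothetical hits and probe set names, for illustration only.
hits = [
    ("ps1_at", "geneA", 99.0),
    ("ps2_at", "geneA", 97.5),
    ("ps2_at", "geneB", 96.0),
    ("ps3_at", "geneC", 88.0),
]
all_probe_sets = ["ps1_at", "ps2_at", "ps3_at", "ps4_at"]

status = classify_probe_sets(hits, all_probe_sets)
print(status)
```

The 'ambiguous' group is the one for which the message suggests going down to the single-probe level; the 'no_match' group would be dropped from the analysis.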

Lawrence Paul Petalidis ▴ 130

Hello Laurent,

Thank you for your message. Yes, I do use the target sequence for BLASTing, as opposed to performing multiple BLASTs with each of the probe sequences, as I find this is faster. I am not suggesting that BLASTing all probe set target sequences is an option, not yet at least - and I am not performing this exercise for all genes. However, if there are genes that one is particularly interested in a priori, it is worth it. For example, I have genomic annotation for 5 genes (i.e. I know if they have amplifications or deletions in my tumour samples) and am looking at these genes on the Affy U133 chips very carefully (multiple probe set issues, specificity) to assess correlations of expression with genomic status.

For the MDM2 gene, for example, a probe set that was labelled _at really seems to pick up a sea of different transcript variants when assessed by BLAST (and verified by subsequent multiple sequence alignment of the target sequence against the transcript variant sequences found). Indeed, I agree, this is most likely because the information available at the time did not include these novel variants, but I had the impression that NetAffx was routinely updated against the current UniGene version. I am somewhat perplexed, however, as it seems that although NetAffx is updated, some of the information is still based on the UniGene 133 version, and in some cases the probe set display tool is not sufficiently up to date.

In any case, my initial message did not refer to a widespread issue with the technology but aimed to raise a discussion on the issue of _at probe set uniqueness, an issue that I believe could have been handled slightly better in NetAffx.

Many thanks for your attention,
Lawrence

______________________________
Lawrence Paul Petalidis
Ph.D. Candidate
University of Cambridge
Department of Pathology
______________________________

-----Original Message-----
From: Laurent Gautier [mailto:lgautier@altern.org]
Sent: 26 March 2004 15:06
To: Lawrence Paul Petalidis
Cc: Michael Seewald; Johnnidis, Jonathan; bioconductor@stat.math.ethz.ch; maechler@stat.math.ethz.ch; jgentry@jimmy.harvard.edu
Subject: RE: [BioC] how deal with multiplicate affy probes?

On Thu, 2004-03-25 at 18:34, Lawrence Paul Petalidis wrote:
> Hello,
> As a note following on from Michael Seewald's message, I totally agree that
> there is a STRONG need to BLAST probe set sequences.

Do we really need to use BLAST (and then how would we decide on cut-off values)? The probes are short oligonucleotides, so I think perfect string matches are likely to be enough in many cases.

> I tend to use the probe set target sequence instead of the individual probe
> sequences however.

At the risk of looking silly, may I ask you to elaborate a bit? (I am not certain I understand... do you mean that you prefer working with the target sequence a given probe set is supposed to match, and then you BLAST it against the rest of the world?)

> You will be surprised to see the inconsistency of the Affy annotation, in many
> cases _at probes are really not unique at all.

I have spent some time damaging my sight by looking at how Affymetrix probes match reference sequences, and I would not be so fast to throw stones at them. What is there is not perfect (there are obvious problems), but:

1) it was done some time ago (the Dorian Gray syndrome referred to in a previous mail)... and your very own "BLASTs" (or whatever else) could suffer from the same problem in some time

2) in some cases I suspect that the people at Affymetrix combined different sources of information to create the probes in a probe set (e.g. a gene with tentatively 2 different isoforms, and two independent GenBank entries, can lead to a unique probe set by setting the probes at appropriate locations... whether it is appropriate to merge two different isoforms into one can then be discussed, but that is a different matter)

> So if you are really interested in a transcript, BLAST it to make sure you are
> actually seeing what you think you are.

The notion of "alternative mappings" implemented in the package 'altcdfenvs' is worth a look. Staring at probe matches is probably not many people's idea of fun, but apparently some are starting to do it for their favorite genes. I believe that a community-based mapping could benefit... well... the community...

L.
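Laurent's suggestion that perfect string matches may be enough for short probes can be illustrated with a small Python sketch. It checks whether each probe occurs in a reference sequence as given or as its reverse complement, which also touches on the strand problem Michael raised. The function names, the toy reference sequence, and the 12-base probes are invented for the example (real probes on these chips are 25-mers), and which orientation counts as "correct" depends on how the probe sequences were published, so that judgment is left to the reader.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def probe_orientation(probe, reference):
    """Report how a probe matches a reference sequence by exact string search."""
    if probe in reference:
        return "forward"
    if reverse_complement(probe) in reference:
        return "reverse_complement"
    return "no_match"


def summarize_probe_set(probes, reference):
    """Count exact-match orientations over all probes of a probe set."""
    return dict(Counter(probe_orientation(p, reference) for p in probes))


# Toy reference transcript and probes, invented for illustration.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
probes = [
    reference[3:15],                        # taken directly from the reference
    reverse_complement(reference[14:26]),   # matches only the opposite strand
    "AAAAAAAAAAAA",                         # matches neither strand
]

summary = summarize_probe_set(probes, reference)
print(summary)
```

Running the same check against an up-to-date set of reference transcripts, rather than the sequences available when the chip was designed, is one simple way to see the "Dorian Gray" effect for a favorite gene.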

Ron Ophir ▴ 80

Hi All, There is a project called GeneAnnot from the people of GeneCards that implement this idea by "blatting" each probe to many RNA annotation resources and integrate the results into two scores that define the specificity and the sensitivity of the whole probe set. It was done on U95 sets and please encourage that to do it on other chips. You can fine the papers at: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&d opt=Abstract&list_uids=14725348 and at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&d opt=Abstract&list_uids=14962929 and please have a look at http://genecards.weizmann.ac.il/geneannot/ regards, Ron >>> <arne.muller@aventis.com> 03/26/04 11:26 AM >>> Hi, you may be able to automate this by blasting all the target sequences (as Lawrence suggested) against the ENSEMBLE confirmed or predicted genes (i.e. not the complete genome but just the genes). Then only look at matches with >95% sequence identity (not sure about this cut off). In your analysis ignore all probe sets that do not have a confident match (<95% sequence id). One could say this is the actual informative subset of probe sets on the chip. Note that you can still get >1 match with >95% id per probe set! In my opinion the correspnding *single* expression measure is meaningless, since you cannot measure the >1 exprssion measures with the same probe set ... For cases with >1 gene per probe set (as determined by blast using ther target sequence) you may need to fall back to the single probe level where you may find that one of the genes has >95% sequence id in many probes whereas the other doesn't. As I said above you could automate this, but I it's not an easy task .. :-( regards, Arne -- Arne Muller, Ph.D. 
Toxicogenomics, Aventis Pharma arne dot muller domain=aventis com > -----Original Message----- > From: bioconductor-bounces@stat.math.ethz.ch > [mailto:bioconductor-bounces@stat.math.ethz.ch]On Behalf Of Johnnidis, > Jonathan > Sent: 26 March 2004 00:53 > To: Lawrence Paul Petalidis; Michael Seewald > Cc: bioconductor@stat.math.ethz.ch > Subject: RE: [BioC] how deal with multiplicate affy probes? > > > thank you for your suggestions. However, in this instance > I'm not interested in particular transcripts but rather an > entire range of transcripts (several hundred)--so I'm not > sure it would be feasible to individually lookup and verify > every single probe set... > Jonathan > > -----Original Message----- > From: Lawrence Paul Petalidis [mailto:lpp22@cam.ac.uk] > Sent: Thursday, March 25, 2004 6:35 PM > To: Michael Seewald; Johnnidis, Jonathan > Cc: bioconductor@stat.math.ethz.ch > Subject: RE: [BioC] how deal with multiplicate affy probes? > > > Hello, > As a note following on from Michael Seewald's message, I > totally agree that > there is a STRONG need to BLAST probe set sequences. I tend > to use the probe > set target sequence instead of the indicidual probe sequences > however. You > will be surprised to see the inconsistency of the Affy > annotation, in many > cases _at probes are really not unique at all. So if you are really > interested in a transcript, BLAST it to make sure you are > actually seeing > what you think you are. > > Best regards to all, Lawrence > > ______________________________ > Lawrence Paul Petalidis > Ph.D. Candidate > > University of Cambridge > Department of Pathology > ______________________________ > > -----Original Message----- > From: bioconductor-bounces@stat.math.ethz.ch > [mailto:bioconductor-bounces@stat.math.ethz.ch]On Behalf Of Michael > Seewald > Sent: 25 March 2004 20:48 > To: Johnnidis, Jonathan > Cc: bioconductor@stat.math.ethz.ch > Subject: Re: [BioC] how deal with multiplicate affy probes? 
>
> As a rule of thumb: if statistics based on a given probe set's data tell you that a transcript is significantly deregulated, you can usually trust it and discard every other probe set for that transcript!
>
> The thing to look at is the probe design itself: download the probe set from NetAffx and blast the single probes against the genome (e.g. in Ensembl). You will be surprised how many probes match up with introns or genomic regions that do not correspond to any cDNA!
>
> 2 examples: There are 4 probe sets for human Wnt6 (HG-U133AB); 2 match the sense (!) strand and have to be discarded. Out of >12 probe sets for human CD44, only 4 have probes that completely match the transcripts. >8 can be discarded.
>
> Best,
> Michael
>
> PS: www.ensembl.org is always a good place to check probe sets. Their mapping of probe sets does not show the location of single probes, though...
>
> PPS: On affymetrix.com you can check out the "Details" view for a probe set. There you can discover that 2 probe sets of Wnt6 map to the (-) strand, which is bad. It doesn't tell you, however, that many probe sets match intron regions.
>
> On Sat, 20 Mar 2004, Johnnidis, Jonathan wrote:
> > I'm a new list member and am not quite sure if this question is appropriate for the list, but will shoot anyway. I'm analyzing a bunch of data from Affy MgU74Av2 chips and am a bit perplexed as to how to treat conflicting expression data from multiplicate probe sets (that is, a gene that has >1 probe set designed against it; for example, 97569_r_at and 97658_r_at are both probes for the Insulin gene).
_______________________________________________
Bioconductor mailing list
Bioconductor@stat.math.ethz.ch
https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor

ADD COMMENT • link 21.8 years ago Ron Ophir ▴ 80
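
Arne's quoted suggestion (keep only BLAST matches above an identity cutoff, then separate probe sets with no confident match from those matching more than one gene) could be sketched roughly as follows. This is a minimal illustration, not anyone's actual pipeline: it assumes BLAST tabular output (`-outfmt 6`) with the probe set ID as query, the gene ID as subject and percent identity in column 3. The function name `classify_probe_sets`, the status labels and all IDs in the toy input are made up, and the 95% cutoff is the tentative value from the thread.

```python
import csv
import io
from collections import defaultdict

IDENTITY_CUTOFF = 95.0  # percent; assumed, the thread itself is unsure of this value


def classify_probe_sets(blast_tabular, all_probe_sets, cutoff=IDENTITY_CUTOFF):
    """Label each probe set 'unmatched', 'unique' or 'ambiguous'.

    blast_tabular: file-like object with tab-separated BLAST hits, where
    column 1 is the probe set ID, column 2 the gene ID and column 3 the
    percent identity. Remaining columns are ignored.
    """
    genes_by_probe_set = defaultdict(set)
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        probe_set, gene, identity = row[0], row[1], float(row[2])
        if identity >= cutoff:
            genes_by_probe_set[probe_set].add(gene)

    status = {}
    for probe_set in all_probe_sets:
        n_genes = len(genes_by_probe_set.get(probe_set, ()))
        if n_genes == 0:
            status[probe_set] = "unmatched"   # no confident match: ignore
        elif n_genes == 1:
            status[probe_set] = "unique"      # informative probe set
        else:
            status[probe_set] = "ambiguous"   # >1 gene: inspect single probes
    return status, dict(genes_by_probe_set)


# Toy input with hypothetical IDs; only the first three columns are present.
sample = io.StringIO(
    "ps1_at\tGENE_A\t99.1\n"
    "ps1_at\tGENE_A\t97.0\n"
    "ps2_at\tGENE_B\t98.0\n"
    "ps2_at\tGENE_C\t96.5\n"
    "ps3_at\tGENE_D\t80.0\n"
)
status, genes = classify_probe_sets(sample, ["ps1_at", "ps2_at", "ps3_at", "ps4_at"])
```

Identity alone says nothing about how much of the target sequence was aligned, so a real filter would probably also require a minimum alignment length (column 4 of the tabular format).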

Just to add two cents to the discussion: in our lab, we have a similar problem. We have oligo arrays (oligos slightly longer than Affy's) designed against a set of transcripts as defined a couple of years ago. Of course, over time, the annotations and predicted transcripts have changed, so we have resorted to blatting (probably not good to use blat for short oligos) all of the oligos against Ensembl transcripts, RefSeq, and GenBank ESTs (and then mapping to UniGene).

Determining meaningful blat (or blast) cutoffs is difficult, if not impossible, because hybridization may not be directly related to a score or even to %identity (some probes hybridize better than others). So, for each new build of the transcripts from the different annotators (NCBI, Ensembl, etc.), we construct a database of the blat hits, so that one can examine the characteristics of a suspect probe or set of probes in the context of the expression data (e.g., 2 probes that hit the same transcript may or may not behave the same way, and if they don't, it is useful to have quick access to blat information that might explain the effect).

In summary, the process of blat -> assign probe to transcript -> interpret based on this single assignment may not be adequate in some situations. Having all results on hand in database form has seemed useful to us. Finally, as noted above, blatting or blasting against the genome does not get you the same information.

Sean

On 3/26/04 4:59 AM, "Ron Ophir" <lsophir@wisemail.weizmann.ac.il> wrote:

> Hi All,
> There is a project called GeneAnnot from the people of GeneCards that implements this idea by "blatting" each probe to many RNA annotation resources and integrating the results into two scores that define the specificity and the sensitivity of the whole probe set. It was done on U95 sets and please encourage them to do it on other chips.

ADD REPLY • link 21.8 years ago Sean Davis 21k
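
The "keep every hit in a database, per annotation build" idea from Sean's message might look something like the sketch below. It uses Python's built-in `sqlite3` module; the table layout, the function names (`open_hit_db`, `add_hits`, `hits_for_probe`, `probes_sharing_transcript`) and the build labels and IDs in the usage note are all assumptions made for illustration, not a description of the database Sean's lab actually built.

```python
import sqlite3


def open_hit_db(path=":memory:"):
    """Open (or create) a small database of alignment hits."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS hits (
               build       TEXT    NOT NULL,  -- annotation release label
               source      TEXT    NOT NULL,  -- annotator, e.g. 'ensembl'
               probe       TEXT    NOT NULL,
               transcript  TEXT    NOT NULL,
               identity    REAL    NOT NULL,  -- percent identity
               aligned_len INTEGER NOT NULL   -- aligned bases
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS hits_probe ON hits (build, probe)")
    return conn


def add_hits(conn, rows):
    """Insert (build, source, probe, transcript, identity, aligned_len) tuples."""
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def hits_for_probe(conn, build, probe):
    """All stored hits for one probe in one build, best identity first."""
    cur = conn.execute(
        "SELECT source, transcript, identity, aligned_len FROM hits "
        "WHERE build = ? AND probe = ? ORDER BY identity DESC, transcript",
        (build, probe),
    )
    return cur.fetchall()


def probes_sharing_transcript(conn, build, transcript):
    """Probes that hit the same transcript, e.g. to compare their behaviour."""
    cur = conn.execute(
        "SELECT DISTINCT probe FROM hits "
        "WHERE build = ? AND transcript = ? ORDER BY probe",
        (build, transcript),
    )
    return [row[0] for row in cur.fetchall()]
```

No cutoff is applied at load time, which matches the point made above: because hybridization does not track score or %identity reliably, all hits are kept and only examined when a probe looks suspect. For example, after `add_hits(conn, [("2004_03", "ensembl", "oligo_1", "TX_A", 100.0, 60)])`, a call to `hits_for_probe(conn, "2004_03", "oligo_1")` returns that probe's stored hits.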

On Fri, 26 Mar 2004, Sean Davis wrote:
> Finally, as noted above, blatting or blasting against the genome does not
> get you the same information.

Sorry, I didn't get your point: if *everything* is mapped to the genome, both probes and transcripts, what do you miss? Shouldn't the curated genome be the reference everything else is linked to (in 2004)? I don't think the transcript-based mapping as done in GeneAnnot is the way to do it. It complicates things unnecessarily.

Best wishes,
Michael

ADD REPLY • link 21.8 years ago Michael Seewald ▴ 130

Michael,

Reading my statement out of context, I see that I should clarify a bit. The problem is that the space in which one is searching for blast or blat "hits" is larger (unnecessarily large) in the genomic case as compared to the transcript case. That is, for expression analysis, one does not need or even want to know if a probe hits some anonymous piece of DNA that is not represented as a transcript (or, for some researchers, as an annotated gene in some curated gene effort).

In practice, what can happen is that a probe may align to multiple places in the genome, only one of which represents a "true" gene, the others representing either common repeat elements (yes, I think there are probably probes in production arrays that for one reason or another have many hits in the genome) or pseudogenes. One can argue about the meaning of these hits, but unless there is a way of determining which of the multiple hits is against an annotated gene, the probe is not particularly useful for expression analysis. Yes, in 2004, it is fairly easy to determine if a hit is against an annotated stretch of DNA, but this is an added step (and not entirely trivial--think splice sites --> gaps) as compared to just looking for similarity between the probes and a library of transcripts.

For CGH, the opposite is true. To what transcripts or genes a set of oligos aligns is less interesting than the genomic DNA that they align to.

Hope that clarifies a bit.

Sean

On 3/30/04 4:08 AM, "Michael Seewald" <mseewald@gmx.de> wrote:

> On Fri, 26 Mar 2004, Sean Davis wrote:
>> Finally, as noted above, blatting or blasting against the genome does not
>> get you the same information.
>
> Sorry, I didn't get your point: if *everything* is mapped to the genome, both probes and transcripts, what do you miss? Shouldn't the curated genome be the reference everything else is linked to (in 2004)? I don't think the transcript-based mapping as done in GeneAnnot is the way to do it. It complicates things unnecessarily.
>
> Best wishes,
> Michael

ADD REPLY • link 21.8 years ago Sean Davis 21k
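
The extra step Sean describes for genome-based mapping (working out which of a probe's several genomic hits falls on an annotated gene) amounts to an interval overlap test. The sketch below is only an illustration under stated assumptions: coordinates are 1-based and inclusive, exon annotation is available as a simple list, and the names `Interval`, `Exon`, `genes_hit` and `usable_for_expression` are hypothetical.

```python
from collections import namedtuple

# 1-based, inclusive coordinates (an assumption of this sketch)
Interval = namedtuple("Interval", "chrom start end")
Exon = namedtuple("Exon", "gene chrom start end")


def overlaps(hit, exon):
    """True when a genomic hit shares at least one base with an exon."""
    return (
        hit.chrom == exon.chrom
        and hit.start <= exon.end
        and exon.start <= hit.end
    )


def genes_hit(hits, exons):
    """Set of annotated genes whose exons overlap any of the probe's hits."""
    return {exon.gene for hit in hits for exon in exons if overlaps(hit, exon)}


def usable_for_expression(hits, exons):
    """Keep a probe only when its hits point to exactly one annotated gene.

    Hits on unannotated DNA (repeats, anonymous sequence) are ignored here;
    hits on two different genes make the probe ambiguous.
    """
    return len(genes_hit(hits, exons)) == 1
```

A probe that spans a splice junction aligns to the genome with a gap, so each aligned block would have to be passed in as its own `Interval`; this is the "splice sites --> gaps" complication mentioned above, and it does not arise when probes are aligned directly to transcripts.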

On Tue, 30 Mar 2004, Sean Davis wrote:
> In reading my statement out of context, I should clarify a bit. The problem
> is that the space in which one is searching for blast or blat "hits" is
> larger (unnecessarily large) in the genomic case as compared to the
> transcript case.

OK, now I see what you mean - and I disagree. ;)

I am not so much concerned about probes hitting multiple spots in the genome (you can work that out) as about probes hitting introns as opposed to 3' UTR regions. As far as introns are concerned, you are right: the transcript databases are good enough to tell us whether a probe hits an intron or an exon. However, I can remember a couple of cases where I saw a probe hitting a UTR region - and it gave excellent signal - and at the same time it did *NOT* map to the transcript. In your analysis, you would miss that probe set. My personal conclusion was that a) the probe was OK and b) some transcript sequences are 3'-truncated. I didn't dive into the literature to check for reasons, though. Of course, at some point there must have been a transcript displaying that very probe sequence, otherwise Affy wouldn't have used it for the design.

There are many more reasons to map everything to the genome - SNP analysis, epigenetics and CGH are only some of them.

Best wishes,
Michael

ADD REPLY • link 21.8 years ago Michael Seewald ▴ 130
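
Michael's case (a probe that gives good signal but lies just past the annotated 3' end of a possibly truncated transcript) can only be caught when hits are placed on the genome. A rough sketch of such a classification follows. The function name `classify_hit`, the category labels and the 1000 bp `max_extension` window are arbitrary assumptions for illustration; coordinates are assumed to be 1-based, inclusive and on the same chromosome as the transcript.

```python
def classify_hit(hit_start, hit_end, strand, exons, max_extension=1000):
    """Place one genomic probe hit relative to one annotated transcript.

    exons: list of (start, end) tuples for the transcript.
    strand: '+' or '-' for the transcript.
    Returns 'exon', 'intron', 'downstream_of_3prime_end' or 'unrelated'.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")

    # Any shared base with an exon counts as an exonic hit.
    if any(hit_start <= e_end and e_start <= hit_end for e_start, e_end in exons):
        return "exon"

    tx_start = min(start for start, _ in exons)
    tx_end = max(end for _, end in exons)

    # Inside the transcript span but outside every exon.
    if tx_start < hit_start and hit_end < tx_end:
        return "intron"

    # Distance past the annotated 3' end, measured in transcript direction.
    if strand == "+":
        distance = hit_start - tx_end
    else:
        distance = tx_start - hit_end

    if 0 < distance <= max_extension:
        return "downstream_of_3prime_end"  # candidate for a truncated 3' UTR
    return "unrelated"
```

A transcript-only search would report nothing for the `downstream_of_3prime_end` case, which is the probe set Michael says would be missed. Whether such a hit really reflects a 3'-truncated annotation still has to be judged from the expression data.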
