redundant probe sets in Affymetrix HG-U219


Andreas Heider ▴ 340

@andreas-heider-4538

Last seen 9.2 years ago

Dear Bioconductor mailing list,

is there a sensible way to deal with redundant probesets on Affymetrix chips like the HG-U219? For example:

Probe Set ID     RefSeq Transcript ID
11715100_at      NM_003534
11715101_s_at    NM_003534
11715102_x_at    NM_003534

Should I take the median/mean of the expression intensities? Or select the highest? And what would be the procedure in R to do it? I mean, how do I tell R to return the median of the expression values if there is more than one probeset for a single RefSeq ID?

I hope you can help me,
Andreas

probe probe • 1.3k views

updated 13.0 years ago by James W. MacDonald 65k • written 13.0 years ago by Andreas Heider ▴ 340


James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

Hi Andreas,

On 4/14/2011 5:27 AM, Andreas Heider wrote:
> Dear Bioconductor mailing list,
> is there a sensible way to deal with redundant probesets on Affymetrix chips
> like the HG-U219?

Define sensible. There are some things you can do, but each comes with its own assumptions.

There is the findLargest() function in genefilter that will select the probeset with the largest value of a test statistic. This assumes (among other things) that all of the redundant probesets measure the same thing. But note that the _x_ and _s_ in the probesets you list indicate that when Affy designed that chip the probesets cross-hybridized with unrelated or related transcripts, respectively.

You can use the MBNI re-mapped cdfs, which take current versions of the genome and filter out probes that don't uniquely hybridize to the genome, and then map probes to probesets based on, e.g., Entrez Gene IDs. This eliminates the problem of multiple probesets, but you then have to contend with probesets that vary from ~3 probes up to 100 or more. As you can imagine, the probesets with 3 probes will have much larger standard errors than those with, say, 100 probes. This makes downstream analyses more difficult unless you choose to simply ignore that fact.

You could ignore the fact that you have multiple probesets that may or may not be measuring the same thing, and assume independence (which, of course, isn't even true when you have no redundant probesets).

No really satisfying alternatives, IMO, so you have to pick your poison.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan Department of Human Genetics
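On the "how do I tell R" part of the question, here is a minimal base-R sketch of two of the options discussed above: taking the per-sample median over probesets that share a RefSeq ID, and keeping the single probeset with the largest statistic per ID. Everything here is an assumption for illustration: the expression values are made up, the fourth probeset and its transcript ID are hypothetical, the helper names `collapse_median()` and `pick_largest()` are invented for this sketch, and the row mean is used as a stand-in statistic. The second helper only mimics the idea behind findLargest(); it is not that function.

```r
# Toy expression matrix: rows are probesets, columns are samples.
# Values are invented; the first three IDs are the ones from the question,
# the fourth row is a hypothetical non-redundant probeset.
expr <- matrix(
  c(7.1, 7.3, 6.9,
    8.0, 8.4, 8.2,
    5.5, 5.9, 5.7,
    9.0, 9.2, 9.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("11715100_at", "11715101_s_at", "11715102_x_at", "hypothetical_at"),
    c("s1", "s2", "s3")
  )
)

# One RefSeq ID per row of expr, in the same order as the rows.
refseq <- c("NM_003534", "NM_003534", "NM_003534", "NM_hypothetical")

# Option 1: per-sample median across all probesets sharing an ID.
# Returns a matrix with one row per unique ID.
collapse_median <- function(expr, ids) {
  groups <- split(seq_len(nrow(expr)), ids)
  do.call(rbind, lapply(groups, function(i) {
    apply(expr[i, , drop = FALSE], 2, median)
  }))
}

# Option 2: keep the probeset with the largest statistic per ID.
# stat must be a named numeric vector, one value per probeset.
pick_largest <- function(stat, ids) {
  vapply(split(stat, ids),
         function(s) names(s)[which.max(s)],
         character(1))
}

med  <- collapse_median(expr, refseq)
best <- pick_largest(rowMeans(expr), refseq)

print(med)
print(expr[best, , drop = FALSE])
```

With a real ExpressionSet you would pull the matrix out with exprs() and build the ID vector from the chip's annotation; both steps are left out here because they depend on which annotation package you have installed. Keep in mind the caveat above: either option treats the _s_ and _x_ probesets as if they measured the same transcript as the _at probeset.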

13.0 years ago • James W. MacDonald 65k
