Duplicate probe coordinates with pd.hugene.2.1.st and oligo


Stephen Piccolo ▴ 590

@stephen-piccolo-6761


United States

I'm trying to process some CEL files from the Affy HuGene 2.1 ST platform, but it seems there may be a problem with the pd.hugene.2.1.st package or with the way oligo is handling them (or with something I am doing). Below is the code that I am using and the output I'm getting.

    affyExpressionFS <- read.celfiles(celFilePath)
    xCoord = getX(affyExpressionFS, type="pm")
    yCoord = getY(affyExpressionFS, type="pm")

    pmSeq = pmSequence(affyExpressionFS)

    print(length(xCoord))
    print(length(yCoord))
    print(length(pmSeq))
    print(length(shouldUseProbes))

    [1] 1022045
    [1] 1022045
    [1] 1025088
    [1] 1025088

Shouldn't the lengths of these all be identical? Also, I am seeing duplicate values for the x_y coordinates. For example, it is saying there are 8 probes with x_y coordinates of 1000_198, and the intensity values are different for each probe.

Is there something I am missing? Or could this be due to a bug?

    > sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-apple-darwin13.1.0 (64-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] parallel  stats     graphics  grDevices utils     datasets  methods
    [8] base

    other attached packages:
     [1] SCAN.UPC_2.6.0      sva_3.10.0          mgcv_1.7-29
     [4] nlme_3.1-117        corpcor_1.6.6       foreach_1.4.2
     [7] affyio_1.32.0       affy_1.42.2         GEOquery_2.30.0
    [10] oligo_1.28.2        Biostrings_2.32.0   XVector_0.4.0
    [13] IRanges_1.22.8      oligoClasses_1.26.0 Biobase_2.24.0
    [16] BiocGenerics_0.10.0

    loaded via a namespace (and not attached):
     [1] affxparser_1.36.0     BiocInstaller_1.14.2  bit_1.1-12
     [4] codetools_0.2-8       DBI_0.2-7             ff_2.2-13
     [7] GenomeInfoDb_1.0.2    GenomicRanges_1.16.3  grid_3.1.0
    [10] iterators_1.0.7       lattice_0.20-29       MASS_7.3-33
    [13] Matrix_1.1-3          preprocessCore_1.26.1 RCurl_1.95-4.1
    [16] splines_3.1.0         stats4_3.1.0          tools_3.1.0
    [19] XML_3.98-1.1          zlibbioc_1.10.0

Thanks,
-Steve

Stephen Piccolo, Ph.D.
Postdoctoral Research Associate

Affiliations:
Department of Pharmacology and Toxicology, University of Utah
Division of Computational Biomedicine, Boston University School of Medicine

probe affy PROcess oligo • 1.2k views

updated 9.9 years ago by James W. MacDonald 65k • written 9.9 years ago by Stephen Piccolo ▴ 590


James W. MacDonald 65k

@james-w-macdonald-5106


United States

Hi Steve,

On 6/11/2014 5:17 PM, Steve Piccolo wrote:
> I'm trying to process some CEL files from Affy HuGene 2.1st platform. But
> it seems there may be a problem with the pd.hugene.2.1.st package or with
> the way oligo is handling them (or with something I am doing). Below is
> the code that I am using and the output I'm getting.
>
> affyExpressionFS <- read.celfiles(celFilePath)
> xCoord = getX(affyExpressionFS, type="pm")
> yCoord = getY(affyExpressionFS, type="pm")
>
> pmSeq = pmSequence(affyExpressionFS)
>
> print(length(xCoord))
> print(length(yCoord))
> print(length(pmSeq))
> print(length(shouldUseProbes))
>
> [1] 1022045
> [1] 1022045
> [1] 1025088
> [1] 1025088
>
> Shouldn't the lengths of these all be identical? Also, I am seeing
> duplicate values for the x_y coordinates. For example, it is saying there
> are 8 probes with x_y coordinates of 1000_198, and the intensity values
> are different for each probe.

I think you might be conflating probe with probeset. If we look at the pmfeature table for the (x,y) coordinate you mention, we see this:

              fid   fsetid   atom    x   y
    719881 236621 17016826 719881 1000 198
    739683 236621 17026617 739683 1000 198
    744589 236621 17028715 744589 1000 198
    750333 236621 17031494 750333 1000 198
    755872 236621 17033950 755872 1000 198
    761063 236621 17036233 761063 1000 198
    766702 236621 17038992 766702 1000 198
    772172 236621 17041577 772172 1000 198

So you are correct that this probe is in the pmfeature table 8 times. That is because it belongs to eight different probesets (the fsetid column) when you summarize at the probeset level. In other words, this single probe (fid 236621) is used eight different times when you summarize using target = "probeset". If you summarize at the transcript level (target = "core"), this particular probe (fid) is also distributed into eight different probesets.

You don't show how you are getting the intensity values, so I can't comment on the different values.
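To make the probe-versus-probeset distinction concrete, here is a minimal base-R sketch. It rebuilds the eight pmfeature rows quoted above as a plain data frame (a toy copy typed in by hand, not a query against the real pd.hugene.2.1.st database) and compares the number of table rows with the number of physical probes. The variable names are illustrative assumptions.

```r
# Toy copy of the eight pmfeature rows quoted above (values taken from the post).
pmfeature <- data.frame(
  fid    = rep(236621L, 8),
  fsetid = c(17016826L, 17026617L, 17028715L, 17031494L,
             17033950L, 17036233L, 17038992L, 17041577L),
  atom   = c(719881L, 739683L, 744589L, 750333L,
             755872L, 761063L, 766702L, 772172L),
  x      = rep(1000L, 8),
  y      = rep(198L, 8)
)

# Rows per physical probe (fid): a count above 1 means the probe is shared
# between several probesets.
rows_per_fid <- table(pmfeature$fid)

# Collapsing to one row per physical probe removes the apparent duplicates.
unique_probes <- unique(pmfeature[, c("fid", "x", "y")])

n_rows   <- nrow(pmfeature)                   # rows in the table
n_probes <- nrow(unique_probes)               # distinct physical probes
n_fsets  <- length(unique(pmfeature$fsetid))  # distinct probesets

print(c(rows = n_rows, probes = n_probes, probesets = n_fsets))
```

In this toy table there is one physical probe at (1000, 198) but eight rows, one per fsetid, which is the pattern that looks like "8 probes at the same coordinates" when the table is read row by row.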
I would bet, however, that you are looking at eight different probesets after a summarization step, rather than the same probe intensity eight times.

Having explained that part, note that getX() and getY() by default get data at the 'probeset' level, which includes all the duplicated probes. The actual call will end up being

    SELECT fid, x FROM pmfeature;

and the structure of the pmfeature table is as you see above, so in essence you are just getting the fid and x columns.

On the other hand, pmSequence() can get sequences at either the probeset or the transcript (or 'core') level, depending on how you are summarizing. So if you had done:

    > z <- pmSequence(pd.hugene.2.1.st, target = "probeset")
    > length(z)
    [1] 1022045

you would get comparable lengths.

Now why are there more sequences at the 'core' level? It's because there is even more sharing of the probes at that level. In other words, a given probe may be in even more probesets at the 'core' level than it was if you summarized at the 'probeset' level.

Best,

Jim
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
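The length mismatch discussed in the answer (1022045 versus 1025088) can be reproduced in miniature with base R. The sketch below is a toy model under stated assumptions: the data frames `probe_seq`, `map_probeset` and `map_core` and the helper `sequences_for()` are hypothetical stand-ins, not the real tables or functions of pd.hugene.2.1.st or oligo. It only illustrates that returning one entry per (probe, group) row yields different lengths when probes are shared differently at the two levels.

```r
# Toy stand-ins (hypothetical names and values, not the real annotation tables).
# One sequence per physical probe, keyed by fid.
probe_seq <- data.frame(
  fid = 1:3,
  sequence = c("ACGTACGTACGTACGTACGTACGTA",
               "TTGCATTGCATTGCATTGCATTGCA",
               "GGCCAAGGCCAAGGCCAAGGCCAAG"),
  stringsAsFactors = FALSE
)

# Probe-to-group membership at two summarization levels: probe 1 is shared
# by two groups at the "probeset" level and by three at the "core" level.
map_probeset <- data.frame(fid   = c(1L, 1L, 2L, 3L),
                           group = c("ps1", "ps2", "ps2", "ps3"))
map_core     <- data.frame(fid   = c(1L, 1L, 1L, 2L, 3L),
                           group = c("tc1", "tc2", "tc3", "tc2", "tc3"))

# Hypothetical helper: return one sequence per (probe, group) row, so a
# shared probe is repeated once for every group it belongs to.
sequences_for <- function(map) {
  merged <- merge(map, probe_seq, by = "fid")
  merged$sequence
}

len_coords   <- nrow(map_probeset)                   # coordinates follow the probeset-level rows
len_probeset <- length(sequences_for(map_probeset))  # sequences at the probeset level
len_core     <- length(sequences_for(map_core))      # sequences at the core level

print(c(coords = len_coords, probeset = len_probeset, core = len_core))
```

In the toy data the coordinate vector and the probeset-level sequences have the same length, while the core-level sequences are longer because probe 1 is repeated once more. That mirrors why comparing lengths only makes sense when both accessors are evaluated at the same target level.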

9.9 years ago James W. MacDonald 65k
