Pd info package affy 10K array
@henrik-bengtsson-4333
United States
Hi,

FYI and related to this one, I've posted a 'Request for more consistent
filenames for chip type files' to the "General" forum of the Affymetrix
Developers Network, cf.
http://www.affymetrix.com/community/forums/thread.jspa?threadID=6481.

/Henrik

On Mon, Jun 30, 2008 at 11:49 AM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> I can confirm that the probe sequence file for Mapping10K_Xba142
> [http://www.affymetrix.com/Auth/analysis/downloads/data/Mapping10Kv2_probe_tab.zip]
> linked to at the 'Mapping 10K 2.0 Array - Support Materials' page
> [http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20]
> does indeed look like it is for Mapping10K_Xba131, e.g. the available
> X and Y positions are in [1,710] and [1,707] which is clearly outside
> the dimension of the Mapping10K_Xba142 chip type 658x658.
>
> Did you post this in the Affymetrix Forum
>
> https://www.affymetrix.com/community/forums/index.jspa
>
> or directly to the support? Is there a thread where I can post a follow up?
>
> -Henrik
>
>
> On Thu, Jun 26, 2008 at 2:25 PM, Michael Gormley
> <michael.gormley at gmail.com> wrote:
>> This is the same source where I obtained the files originally. I have
>> brought this issue to the attention of affy technical support. Hoping they
>> can get me the correct probe sequence file.
>>
>> On Thu, Jun 26, 2008 at 2:26 PM, James W. MacDonald <jmacdon at med.umich.edu>
>> wrote:
>>>
>>> Interesting.
>>>
>>> To test the problems Michael was having, I simply went to Affy's product
>>> support page and downloaded the library file, annotation file, and sequence
>>> file.
>>> So it appears they have things mixed up on that page, and there isn't
>>> anything obvious about the sequence file that would inform anybody it is
>>> wrong:
>>>
>>> > dir(pattern = "^Mapping")
>>> [1] "Mapping10K_probe_tab"             "Mapping10K_Xba142.CDF"
>>> [3] "Mapping10K_Xba142.na25.annot.csv"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> Henrik Bengtsson wrote:
>>>>
>>>> Note that there are two different Affymetrix 10K chip types, namely
>>>> Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
>>>> 'Mapping 10K Array 2.0'). The probe sequence file you refer to seems
>>>> to be for the former, which is a larger chip. Details on the official
>>>> Affymetrix CDFs (converted to binary though):
>>>>
>>>>> library(aroma.affymetrix)
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba142
>>>> Filename: Mapping10K_Xba142.cdf
>>>> Filesize: 9.53MB
>>>> Chip type: Mapping10K_Xba142
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 658x658
>>>> Number of cells: 432964
>>>> Number of units: 10208
>>>> Cells per unit: 42.41
>>>> Number of QC units: 9
>>>>
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba131
>>>> Filename: Mapping10K_Xba131.cdf
>>>> Filesize: 10.79MB
>>>> Chip type: Mapping10K_Xba131
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 712x712
>>>> Number of cells: 506944
>>>> Number of units: 11564
>>>> Cells per unit: 43.84
>>>> Number of QC units: 9
>>>>
>>>> FYI: I try to collect information about various Affymetrix chip types at:
>>>>
>>>> http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
>>>>
>>>> Final comment: I would like to emphasize the difference between 'chip
>>>> type' and 'CDF'; a chip type refers to a unique product coming out of
>>>> Affymetrix, whereas a CDF refers
>>>> to an annotation of a chip type.
>>>> There can be many different CDFs for each chip type, but only one chip
>>>> type per CDF.
>>>>
>>>> Cheers
>>>>
>>>> Henrik
>>>>
>>>> On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
>>>> <jmacdon at med.umich.edu> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Michael Gormley wrote:
>>>>>>
>>>>>> I get an error when running the makePdInfoPackage function to make a
>>>>>> PdInfo package for the 10K mapping array. The output from the function
>>>>>> reads:
>>>>>>
>>>>>>> makePdInfoPackage(pkg,destDir=".")
>>>>>>
>>>>>> Creating package in ./pd.mapping10k.xba142
>>>>>> loadUnitsByBatch took 22.86 sec
>>>>>> loadAffyCsv took 2.79 sec
>>>>>> Error in sqliteExecStatement(con, statement, bind.data) :
>>>>>> RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be unique)
>>>>>> In addition: Warning messages:
>>>>>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> Timing stopped at: 0.36 0.01 0.44
>>>>>
>>>>> I have spent some time looking at this, and it appears that the problem is
>>>>> due to inconsistencies between the cdf and probe sequence files. As far as I
>>>>> can tell there are many probe locations ((x, y) coordinates) in the cdf that
>>>>> don't exist in the probe sequence file, and vice versa.
>>>>>
>>>>> The function loadAffySeqCsv() reads in a chunk of data from the probe
>>>>> sequence file, then matches the indices (computed from the (x, y)
>>>>> coordinates) of these data with the indices that were generated using the
>>>>> cdf data. In the first chunk of 10000 probesets, there are only 8223
>>>>> probesets that match between the two data sources. I don't think this would
>>>>> normally be a problem, except for the fact that 10000 probesets from the
>>>>> sequence file should *exactly* line up with what we got from the cdf.
>>>>>
>>>>> But the real problem that arises is this:
>>>>>
>>>>> The computation of indices is based on the dimensions of the chip. If we
>>>>> query the cdf to find what the dimensions are we get this:
>>>>>
>>>>> readCdfHeader(cdfFile)
>>>>> $ncols
>>>>> [1] 658
>>>>>
>>>>> $nrows
>>>>> [1] 658
>>>>>
>>>>> So we compute the indices thus:
>>>>>
>>>>> index <- x + 1 + y * ncols
>>>>>
>>>>> This will give unique indices for all (x, y) coordinates on the chip,
>>>>> assuming we agree that the dimensions of the chip are 658 x 658. However,
>>>>> the sequence file doesn't agree:
>>>>>
>>>>> pmdf[pmdf$fid == 9264,]
>>>>>          fset.name   x  y offset                       seq tstrand type tallele  fid
>>>>> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
>>>>>
>>>>> The above is one line from the first 10000 probesets. Note that the (x, y)
>>>>> coordinates are (709, 13)! When we calculate the index (fid) we get 9264.
>>>>> Unfortunately, if we use (51, 14) we also get 9264. Because the sequence
>>>>> file isn't playing by the rules, we end up with a total of 25 duplicate
>>>>> indices. Since the index values are the primary key for the table we are
>>>>> trying to populate we get an error because you can't have duplicated primary
>>>>> keys.
>>>>>
>>>>> So long story short, the sequence file for this chip is broken - the
>>>>> apparent maximum (x, y) coordinate is (710, 707) which is well beyond what
>>>>> the cdf claims. Or maybe the cdf is broken - I don't really know. The end
>>>>> result is that this will never work until Affy comes up with some consistent
>>>>> information for the chip.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> traceback()
>>>>>>
>>>>>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE = .SQLitePkgName)
>>>>>> 11: sqliteExecStatement(con, statement, bind.data)
>>>>>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>>>>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>>>>>> 6: eval(expr, envir, enclos)
>>>>>> 5: eval(expr, envir = loc.frame)
>>>>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>>>>>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>>>>>>        dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>>>>>> 2: makePdInfoPackage(pkg, destDir = ".")
>>>>>> 1: makePdInfoPackage(pkg, destDir = ".")
>>>>>>
>>>>>> I noticed a prior post that suggested that this may be due to entering a
>>>>>> record into a table with a Feature ID that is already in the table. Is this
>>>>>> the case? Is there a work-around here?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Gormley
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Affymetrix and cDNA Microarray Core
>>>>> University of Michigan Cancer Center
>>>>> 1500 E. Medical Center Drive
>>>>> 7410 CCGC
>>>>> Ann Arbor MI 48109
>>>>> 734-647-5623
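
For anyone following the thread, the index collision Jim describes can be reproduced in a few lines of R. This is only a minimal sketch, not code from pdInfoBuilder: the helper names `xy_to_index()` and `out_of_bounds()` are made up for illustration, the toy `probes` table contains just the two coordinates quoted in the thread plus one invented row, and the bounds check assumes 0-based (x, y) coordinates, which is what the `x + 1 + y * ncols` formula suggests but is not confirmed in the thread.

```r
# Feature index as quoted in the thread: index <- x + 1 + y * ncols
# (hypothetical helper name; x and y are assumed to be 0-based).
xy_to_index <- function(x, y, ncols) {
  x + 1 + y * ncols
}

ncols <- 658  # Mapping10K_Xba142 width reported by readCdfHeader()
nrows <- 658  # Mapping10K_Xba142 height reported by readCdfHeader()

# The two coordinates mentioned in the thread map to the same index.
idx_bad  <- xy_to_index(709, 13, ncols)  # x lies outside a 658-wide chip
idx_good <- xy_to_index(51, 14, ncols)   # a legitimate cell on the chip

# Hypothetical helper: flag coordinates that cannot belong to a chip of
# the given size (assumes 0-based coordinates).
out_of_bounds <- function(x, y, ncols, nrows) {
  x < 0 | x >= ncols | y < 0 | y >= nrows
}

# Toy probe table: rows 1 and 2 are from the thread, row 3 is invented.
probes <- data.frame(x = c(709, 51, 10), y = c(13, 14, 0))
probes$fid <- xy_to_index(probes$x, probes$y, ncols)
probes$oob <- out_of_bounds(probes$x, probes$y, ncols, nrows)

# Mark every row whose fid occurs more than once; such rows are what
# would violate a PRIMARY KEY constraint on fid.
probes$dup <- duplicated(probes$fid) | duplicated(probes$fid, fromLast = TRUE)

print(probes)
```

Both `idx_bad` and `idx_good` evaluate to 9264, the fid shown in Jim's output. If the same two coordinates are indexed with `ncols = 712` (the Mapping10K_Xba131 dimension from Henrik's listing), they get distinct indices, which is at least consistent with the suggestion that the probe sequence file belongs to the larger chip. Running a check like `out_of_bounds()` on the probe tab file before calling makePdInfoPackage() might catch this kind of mismatch early.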
Microarray Annotation Network Cancer cdf probe affy