FYI/follow up: Affymetrix has now put up the correct probe sequence
file for Mapping10K_Xba142:
http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20
/Henrik
On Mon, Jun 30, 2008 at 11:49 AM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> I can confirm that the probe sequence file for Mapping10K_Xba142
> [http://www.affymetrix.com/Auth/analysis/downloads/data/Mapping10Kv2_probe_tab.zip]
> linked to at the 'Mapping 10K 2.0 Array - Support Materials' page
> [http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20]
> does indeed look like it is for Mapping10K_Xba131, e.g. the available
> X and Y positions are in [1,710] and [1,707], which is clearly outside
> the dimension of the Mapping10K_Xba142 chip type (658x658).
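
The range check described above can be sketched in a few lines of base R. This is only an illustration, not code from the thread: `coords_fit_chip()` is a hypothetical helper, the coordinate values are made up to span the reported ranges, and it assumes zero-based (x, y) coordinates, so valid values run from 0 to dimension - 1.

```r
# Hypothetical helper (not from any package): TRUE when all zero-based
# (x, y) coordinates fall inside a chip of ncols x nrows cells.
coords_fit_chip <- function(x, y, ncols, nrows) {
  all(x >= 0 & x < ncols) && all(y >= 0 & y < nrows)
}

# Illustrative coordinates spanning the ranges reported for the posted file.
x <- c(1, 350, 710)
y <- c(1, 350, 707)

fits_xba142 <- coords_fit_chip(x, y, ncols = 658, nrows = 658)  # 658x658 chip
fits_xba131 <- coords_fit_chip(x, y, ncols = 712, nrows = 712)  # 712x712 chip
print(c(Xba142 = fits_xba142, Xba131 = fits_xba131))
```

In a real check, `x` and `y` would presumably come from the columns of the probe tab file and the dimensions from the CDF header.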
>
> Did you post this in the Affymetrix Forum
>
> https://www.affymetrix.com/community/forums/index.jspa
>
> or directly to the support? Is there a thread where I can post a
> follow up?
>
> -Henrik
>
>
> On Thu, Jun 26, 2008 at 2:25 PM, Michael Gormley
> <michael.gormley at gmail.com> wrote:
>> This is the same source where I obtained the files originally. I have
>> brought this issue to the attention of affy technical support. Hoping they
>> can get me the correct probe sequence file.
>>
>> On Thu, Jun 26, 2008 at 2:26 PM, James W. MacDonald <jmacdon at med.umich.edu>
>> wrote:
>>>
>>> Interesting.
>>>
>>> To test the problems Michael was having, I simply went to Affy's product
>>> support page and downloaded the library file, annotation file, and sequence
>>> file. So it appears they have things mixed up on that page, and there isn't
>>> anything obvious about the sequence file that would inform anybody it is
>>> wrong:
>>>
>>> > dir(pattern = "^Mapping")
>>> [1] "Mapping10K_probe_tab" "Mapping10K_Xba142.CDF"
>>> [3] "Mapping10K_Xba142.na25.annot.csv"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> Henrik Bengtsson wrote:
>>>>
>>>> Note that there are two different Affymetrix 10K chip types, namely
>>>> Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
>>>> 'Mapping 10K Array 2.0'). The probe sequence file you refer to seems
>>>> to be for the former, which is a larger chip. Details on the official
>>>> Affymetrix CDFs (converted to binary though):
>>>>
>>>>> library(aroma.affymetrix)
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba142
>>>> Filename: Mapping10K_Xba142.cdf
>>>> Filesize: 9.53MB
>>>> Chip type: Mapping10K_Xba142
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 658x658
>>>> Number of cells: 432964
>>>> Number of units: 10208
>>>> Cells per unit: 42.41
>>>> Number of QC units: 9
>>>>
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba131
>>>> Filename: Mapping10K_Xba131.cdf
>>>> Filesize: 10.79MB
>>>> Chip type: Mapping10K_Xba131
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 712x712
>>>> Number of cells: 506944
>>>> Number of units: 11564
>>>> Cells per unit: 43.84
>>>> Number of QC units: 9
>>>>
>>>> FYI: I try to collect information about various Affymetrix chip types at:
>>>>
>>>>
>>>> http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
>>>>
>>>> Final comment: I would like to emphasize the difference between 'chip
>>>> type' and 'CDF'; a chip type refers to a unique product coming out of
>>>> Affymetrix, whereas a CDF refers to an annotation of a chip type.
>>>> There can be many different CDFs for each chip type, but only one chip
>>>> type per CDF.
>>>>
>>>> Cheers
>>>>
>>>> Henrik
>>>>
>>>> On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
>>>> <jmacdon at med.umich.edu> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Michael Gormley wrote:
>>>>>>
>>>>>> I get an error when running the makePdInfoPackage function to make a
>>>>>> PdInfo package for the 10K mapping array. The output from the
>>>>>> function reads:
>>>>>>
>>>>>>> makePdInfoPackage(pkg,destDir=".")
>>>>>>
>>>>>> Creating package in ./pd.mapping10k.xba142
>>>>>> loadUnitsByBatch took 22.86 sec
>>>>>> loadAffyCsv took 2.79 sec
>>>>>> Error in sqliteExecStatement(con, statement, bind.data) :
>>>>>> RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be
>>>>>> unique)
>>>>>> In addition: Warning messages:
>>>>>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> Timing stopped at: 0.36 0.01 0.44
>>>>>
>>>>> I have spent some time looking at this, and it appears that the problem is
>>>>> due to inconsistencies between the cdf and probe sequence files. As far as I
>>>>> can tell there are many probe locations ((x, y) coordinates) in the cdf that
>>>>> don't exist in the probe sequence file, and vice versa.
>>>>>
>>>>> The function loadAffySeqCsv() reads in a chunk of data from the probe
>>>>> sequence file, then matches the indices (computed from the (x, y)
>>>>> coordinates) of these data with the indices that were generated using the
>>>>> cdf data. In the first chunk of 1000 probesets, there are only 8223
>>>>> probesets that match between the two data sources. I don't think this would
>>>>> normally be a problem, except for the fact that 1000 probesets from the
>>>>> sequence file should *exactly* line up with what we got from the cdf.
>>>>>
>>>>> But the real problem that arises is this:
>>>>>
>>>>> The computation of indices is based on the dimensions of the chip. If we
>>>>> query the cdf to find what the dimensions are we get this:
>>>>>
>>>>> readCdfHeader(cdfFile)
>>>>> $ncols
>>>>> [1] 658
>>>>>
>>>>> $nrows
>>>>> [1] 658
>>>>>
>>>>> So we compute the indices thus:
>>>>>
>>>>> index <- x + 1 + y * ncols
>>>>>
>>>>> This will give unique indices for all (x, y) coordinates on the chip,
>>>>> assuming we agree that the dimensions of the chip are 658 x 658. However,
>>>>> the sequence file doesn't agree:
>>>>>
>>>>> pmdf[pmdf$fid == 9264,]
>>>>>          fset.name   x  y offset                       seq tstrand type tallele  fid
>>>>> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
>>>>>
>>>>> The above is one line from the first 1000 probesets. Note that the (x, y)
>>>>> coordinates are (709, 13)! When we calculate the index (fid) we get 9264.
>>>>> Unfortunately, if we use (51, 14) we also get 9264. Because the sequence
>>>>> file isn't playing by the rules, we end up with a total of 25 duplicate
>>>>> indices. Since the index values are the primary key for the table we are
>>>>> trying to populate we get an error because you can't have duplicated primary
>>>>> keys.
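
The collision described above can be reproduced with plain arithmetic. The following is a minimal sketch in base R, not code from pdInfoBuilder: `cell_index()` is a hypothetical name for the formula quoted in the message, and the 712-column comparison assumes the Mapping10K_Xba131 dimension reported elsewhere in this thread.

```r
# Cell index as given in the message: zero-based (x, y) -> one-based index.
# cell_index() is a hypothetical name used only for this illustration.
cell_index <- function(x, y, ncols) x + 1 + y * ncols

# The two coordinates quoted above, evaluated with the CDF's 658 columns.
idx_a <- cell_index(709, 13, ncols = 658)
idx_b <- cell_index(51, 14, ncols = 658)
print(c(idx_a, idx_b))

# With 712 columns (the assumed Mapping10K_Xba131 width) they no longer collide.
idx_a_131 <- cell_index(709, 13, ncols = 712)
idx_b_131 <- cell_index(51, 14, ncols = 712)
print(c(idx_a_131, idx_b_131))

# duplicated() flags repeated indices, i.e. would-be primary-key violations.
dup <- duplicated(c(idx_a, idx_b))
print(dup)
```

The index is only unique when every x is smaller than `ncols`, which is why coordinates taken from the larger chip wrap around onto cells of the following row.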
>>>>>
>>>>> So long story short, the sequence file for this chip is broken - the
>>>>> apparent maximum (x, y) coordinate is (710, 707) which is well beyond what
>>>>> the cdf claims. Or maybe the cdf is broken - I don't really know. The end
>>>>> result is that this will never work until Affy comes up with some consistent
>>>>> information for the chip.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> traceback()
>>>>>>
>>>>>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
>>>>>> .SQLitePkgName)
>>>>>> 11: sqliteExecStatement(con, statement, bind.data)
>>>>>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>>>>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>>>>>> 6: eval(expr, envir, enclos)
>>>>>> 5: eval(expr, envir = loc.frame)
>>>>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>>>>>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>>>>>> dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>>>>>> 2: makePdInfoPackage(pkg, destDir = ".")
>>>>>> 1: makePdInfoPackage(pkg, destDir = ".")
>>>>>>
>>>>>> I noticed a prior post that suggested that this may be due to entering a
>>>>>> record into a table with a Feature ID that is already in the table. Is this
>>>>>> the case? Is there a work-around here?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Gormley
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Affymetrix and cDNA Microarray Core
>>>>> University of Michigan Cancer Center
>>>>> 1500 E. Medical Center Drive
>>>>> 7410 CCGC
>>>>> Ann Arbor MI 48109
>>>>> 734-647-5623
>>>>>
>>>
>>> --
>>> James W. MacDonald, M.S.
>>> Biostatistician
>>> Affymetrix and cDNA Microarray Core
>>> University of Michigan Cancer Center
>>> 1500 E. Medical Center Drive
>>> 7410 CCGC
>>> Ann Arbor MI 48109
>>> 734-647-5623
>>
>>
>