Question

Can "Glue Grant" arrays be processed with pd.hta.2.0?

0

Entering edit mode

Robert Castelo ★ 3.3k

@rcastelo

Last seen 3 days ago

Barcelona/Universitat Pompeu Fabra

hi all,

I'm interested in analyzing microarray data from the article http://www.ncbi.nlm.nih.gov/pubmed/26052715 deposited at GEO here:

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69686

even though the data was recently deposited, they were generated from "Glue Grant" Affymetrix chips described in the article http://www.ncbi.nlm.nih.gov/pubmed/21317363

as part of that article, they put available a number of files to facilitate processing of these kind of data here:

http://gluegrant1.stanford.edu/~DIC/GGHarray

and I thought I could build the corresponding pd.* package but unfortunately the *.probeset.csv file that goes on the 'probeFile' slot of the 'AffyGenePDInfoPkgSeed' object is missing and have not been able to find it after googling.

during my googling I found the following BioC post:

http://grokbase.com/p/r/bioconductor/139as9419c/bioc-affymetrix-hta-2-array-analysis

that briefly mentions these "Glue Grant" arrays as predecessors of the commercial HTA-2.0 Affymetrix arrays. so i thought i could try to process them using the available pd.hta.2.0 package. however, the CEL files have as chip name 'hGlue_3_0_v1'. This is the header of one of them:

library(affyio)
read.celfile.header("GSM1707422_6699_51A_hGlue_3_0_v1_.CEL.gz")
$cdfName
[1] "hGlue_3_0_v1"
$`CEL dimensions`
Cols Rows
2680 2572

so, i decided to download the source of pd.hta.2.0_*.tar.gz, uncompress it, rename the hta.2.0 bits to hGlue_3_0_v1 and package it again as pd.hGlue_3_0_v1_*.tar.gz and try the oligo pipeline.

so far it seems everything to work, which is nice, but i would like to know if 1. anybody has an idea of possible pitfalls with this approach since in principle they are different chips; and 2. if anybody knows where could i access the *.probeset.csv file that goes into the 'probeFile' slot of the 'AffyGenePDInfoPkgSeed' object for these Glue Grant chips, so that i could build the corresponding pd.* package without worrying whether the pd.hta-2.0_*.tar.gz package is the right solution.

thanks!! and happy x-mas everyone!

robert.

pd.hta.2.0 ggh3 oligo glue grant • 2.4k views

ADD COMMENT • link updated 8.3 years ago by James W. MacDonald 65k • written 8.3 years ago by Robert Castelo ★ 3.3k

score 2 · Accepted Answer · 2015-12-23

2

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 3 hours ago

United States

Hi Robert,

A couple of points. First, I recently checked in a fix to pdInfoBuilder for the HTA2.0 arrays, as well as an updated pd.hta.2.0 package. Make sure you have version 3.12.1.

Second, Benilton added some functionality to pdInfoBuilder last release that allows you to build pdInfoPackages using a generic format. Right now there is only functionality to help people take CDF files and then make a pdInfoPackage. But in principle it can be done with an arbitrary chip type, and this won't require the probeset.csv file. However it will take some work on your part; I have been meaning to do a workflow showing how to use just the pgf, clf, mps files to generate a pdInfoPackage, but haven't found the time just yet.

Anyway, the basic idea is to generate a data.frame, where each row has information about each probe, and how it maps to various probesets, as well as x,y coordinates, genomic position and sequence. Not all of those fields are mandatory. As an example, this is the data.frame for the ag.cdf file:

  fs1_man_fsetid   fs1_type fs1_direction   x  y  fid atom probe_type
 1 AFFX-MurIL2_at expression     antisense 481 11 6356    0         pm
 2 AFFX-MurIL2_at expression     antisense 481 12 6890    0         mm
 3 AFFX-MurIL2_at expression     antisense 482 11 6357    1         pm
 4 AFFX-MurIL2_at expression     antisense 482 12 6891    1         mm
 5 AFFX-MurIL2_at expression     antisense 483 11 6358    2         pm
 6 AFFX-MurIL2_at expression     antisense 483 12 6892    2         mm

The fid is the individual probe ID, so if this probe were used in multiple probesets there would be additional columns like fs2_man_fsetid, fs2_type, fs2_direction, etc. There is more information in the sources for the generic array. It's functional, but not well fleshed out just yet.

You may well be able to get away with using the pd.hta.2.0 package. The test would be to use pdInfoBuilder:::parseClfPgf() on both the pgf and clf files for the glue arrays, as well as for the HTA 2.0 array and make sure they are comparable in some sense (where comparable is somewhere between identical and not similar at all).

ADD COMMENT • link 8.3 years ago James W. MacDonald 65k

0

Entering edit mode

hi James,

thanks for your detailed response. I have done the comparison between the CLF and PGF files from the two chips and to me they look quite different. I find it surprising since I expected that as "predecessor" the "Glue Grant" (GGH3) array would be a subset of the HTA-2.0. This is what I found:

library(pdInfoBuilder)

ggh3 <- pdInfoBuilder:::parsePgfClf(pgfFile="GEO/chip/GPL20292_hGlue_3_0_v1.pgf",
                                    clfFile="GEO/chip/GPL20292_hGlue_3_0_v1.clf")
hta2 <- pdInfoBuilder:::parsePgfClf(pgfFile="NetAffx/Analysis/HTA-2_0.r3.pgf",
                                    clfFile="NetAffx/Analysis/HTA-2_0.r3.clf")

dim(ggh3$probes.table)
## [1] 6892960       9
ggh3$geometry
## [1] "2572;2680"
dim(hta2$probes.table)
## [1] 7577330       9
hta2$geometry
## [1] "2572;2680"

so, initially, the older GGH3 chip has 7577330-6892960=684370 fewer probes than the commercial HTA-2.0 but the same "geometry". The first few rows of each table of probe information look like this:

head(ggh3$probes.table)
##       fid    man_fsetid fsetid           pstype    atom    x    y ptype                  sequence
## 1 3226405 JUC0400089039 127219   main->junction  634082 2364 1203 pm:at TAAACATCTGGTTATAACATCCTAG
## 2 1261956 JUC0400089039 127219   main->junction  634081 2355  470 pm:at AAACATCTGGTTATAACATCCTAGC
## 3 5605980 JUC0400089039 127219   main->junction  634079 2099 2091 pm:at ACATCTGGTTATAACATCCTAGCCG
## 4 3834499 JUC0400089039 127219   main->junction  634080 2098 1430 pm:at AACATCTGGTTATAACATCCTAGCC
## 5  837135   Multi_17208 695965      main->multi 3484783  974  312 pm:at TGCAGCCTCCAGGACTGGGGCGTCA
## 6 3098276      PS158258  28336 main->exon->hjay  139645  195 1156 pm:at AACTTTTAGGGTAGTGAATCGTAGT
head(hta2$probes.table)
##       fid man_fsetid   fsetid         pstype atom    x    y ptype                  sequence
## 1    9555 2824539_st 18670000 normgene->exon    1 1514    3 pm:st GGAGAAGGCCCCTGCAACAGACGTC
## 2 6121713 2824540_st 18670001 normgene->exon    2  592 2284 pm:st TTAGCCGACGGTAGGCTCCTCGACT
## 3  725124 2824540_st 18670001 normgene->exon    3 1523  270 pm:st AATAAAGGAATGGTTAGCCGACGGT
## 4 5156822 2824541_st 18670002 normgene->exon    4  501 1924 pm:st ACCGAAGTATTTAGTGAGCGTGACC
## 5 1997636 2824541_st 18670002 normgene->exon    5 1035  745 pm:st GCGGACATGCACAGGAATCAGAAAC
## 6 2794445 2824541_st 18670002 normgene->exon    6 1884 1042 pm:st CCTCCGATCGAGGAACGGGAAATAT

they are already different, and the 'pstype' column reveals a quite different set of types of probes:

table(ggh3$probes.table$pstype)
##
##               control->affx       control->affx->23bpmm           control->affx->5Q
##                        1642                         462                         480
##          control->affx->AFR control->affx->housekeeping         control->affx->rRNA
##                        5420                        1316                         240
##    control->affx->u133plus2   control->bgp->antigenomic control->bgp->antigenomic22
##                        4400                       16943                        8350
##       control->bgp->genomic            control->default               control->rRNA
##                       20282                      327930                        1691
##        control->spike->ERCC control->spike->polyAmutant         control->spike->TAG
##                       22358                       16308                         748
##               control->text        main->antisense->ASK                  main->dmet
##                         560                       80601                      218317
##           main->exon->addon            main->exon->hjay              main->junction
##                       76176                     3092940                     1060703
##                 main->multi                 main->ncRNA          main->siRNA->miRNA
##                      198840                        5869                       11654
##        main->siRNA->piwiRNA             main->snp->exon         main->snp->junction
##                       13234                      921134                       59758
##             main->tag->tag4        main->tag->universal             main->tilingutu
##                          18                      222975                      488581
##              normgene->exon            normgene->intron
##                        2952                       10078
table(hta2$probes.table$pstype)
##
##              control->affx         control->affx->asc   control->affx->bac_spike    
##                        720                      57060                        396
##        control->affx->ercc  control->affx->ercc->step control->affx->polya_spike
##                      41561                       1870                       1126
##  control->bgp->antigenomic                       main             normgene->exon
##                      16943                    7441480                       3043
##           normgene->intron
##                      13131

as well as the 'ptype' column:

table(ggh3$probes.table$ptype)
##
##   blank generic:at generic:st      mm:at      mm:st      pm:at      pm:st  thermo:at  thermo:st
##  318806        186       6074       2207       2206    6280382     279675       1712       1712
table(hta2$probes.table$ptype)
##
##   mm:at   mm:st   pm:at   pm:st
##     410     711   17353 7558856

the feature identifiers 'fid' of GGH3 are nearly all of them part of the 'fid' column in the HTA-2.0 and when they match, they lead to the same values in the 'x' and 'y' columns but not the same probe sequence:

mt <- match(ggh3$probes.table$fid, hta2$probes.table$fid)
sum(!is.na(mt))
## [1] 6745406
all(ggh3$probes.table[!is.na(mt), "x"] == hta2$probes.table[mt[!is.na(mt)], "x"])
## [1] TRUE
all(ggh3$probes.table[!is.na(mt), "y"] == hta2$probes.table[mt[!is.na(mt)], "y"])
## [1] TRUE
table(ggh3$probes.table[!is.na(mt), "sequence"] == hta2$probes.table[mt[!is.na(mt)], "sequence"])
##
##   FALSE    TRUE
## 6744686     720

in fact, only a small subset of the probe sequences match between the two chips:

mt <- match(ggh3$probes.table$sequence, hta2$probes.table$sequence)
sum(!is.na(mt))
## [1] 50118

so, i would say that even though the physical support and geometry may be similar, the two chips are quite differently organized. i'm now surprised that i could use the current pd.hta-2.0 package to run the oligo pipeline on these chips.

anyway, i see now more clearly that i will have to try to build a pd.hglue.3.0.v1 package specific for this GGH3 chips. from what you describe and looking into the source code you indicated, i understand that i have to build a data.frame object from the CLF, PGF and MPS files and this data.frame object will be the only piece of information required to build the package. so, the question is how exactly should i build this data.frame. The bit of the source code that i presume you are pointing out to says the following:

setMethod("makePdInfoPackage", "GenericPDInfoPkgSeed",
          function(object, destDir=".", batch_size=10000, quiet=FALSE, unlink=FALSE) {

              ## Probe input table
              ## - fsK_man_fsetid (... fsK_fieldName) / fid / x / y / sequence / probe_type* [pm/mm/bg/...] / (... fields)

so, i guess this probe table could be built approximately as follows after reading the information as i did above for GGH3:

probetable <- data.frame(fs1_man_fsetid=ggh3$probes.table$man_fsetid,
                         fid=ggh3$probes.table$fid,
                         x=ggh3$probes.table$x,
                         y=ggh3$probes.table$y,
                         sequence=ggh3$probes.table$sequence,
                         probe_type=ggh3$probes.table$ptype)

however, i have not used at this point the MPS file and i guess this may be the bit involving the multiple columns for probes being shared among probesets that you were talking about. Could you give me a more concrete advice about how to integrate the information from the MPS file?

thanks!!

robert.

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

1

Entering edit mode

Hi Robert,

I think the following should work, but no guarantees...

ggh <- pdInfoBuilder:::parsePgfClf("hGlue2_0.r1.pgf","hGlue2_0.r1.clf")

core <- pdInfoBuilder:::mpsParser("hGlue2_0.r1.TC.mps")

probetable <- ggh$probes.table

names(probetable)[c(2,8)] <- c("fs1_man_fsetid","probe_type")

matcher <- match(probetable$fsetid, core$fsetid)

probetable$fs2_meta_fsetid <- core$meta_fsetid[matcher]
probetable <- probetable[,c(2,10,1,4:9)] ## get rid of fsetid column

p <- new("GenericPDInfoPkgSeed", table = probetable, pkgName = "pd.glue")

## you might want a different name

makePdInfoPackage(p)

ADD REPLY • link 8.3 years ago James W. MacDonald 65k

0

Entering edit mode

hi James,

this is great, thanks a lot for crafting a solution. I have found however a couple of errors when trying to run it. For two of them I could find the way forward, but for the third one I would appreciate if you could look into it.

1. The first is straightforward, the builder function tries to use the '%dopar%' function without importing it:

makePdInfoPackage(p)
========================================================================================
Building annotation for Generic Array
========================================================================================
Error in getAllFSetMpsTables(object@table) :
  could not find function "%dopar%"

I fixed this loading the 'doParallel' package first, but I guess you (or Benilton) may want to import this function in the NAMESPACE of the 'pdInfoBuilder' package.

2. The following one:

makePdInfoPackage(p)
========================================================================================
Building annotation for Generic Array
========================================================================================
Error in { :
task 2 failed - "arguments imply differing number of rows: 0, 6892960"

can be fixed by replacing the line above

probetable$fs2_meta_fsetid <- core$meta_fsetid[matcher]

by

probetable$fs2_man_fsetid <- core$meta_fsetid[matcher]

3. The following one:

makePdInfoPackage(p)
========================================================================================
Building annotation for Generic Array
========================================================================================
Error in getKeys(tbl) : Impossible to detect table type

is related to the fact shown in my previous comment by which this Glue Grant chip has more probe types than the HTA-2.0, such as 'blank', 'generic:at', etc. The following two lines should in principle fix is problem:

mask <- probetable$probe_type %in% c("mm:at", "mm:st", "pm:at", "pm:st")
probetable <- probetable[mask ,]

unfortunately, the code still chokes with them. More concretely, in the call to pdInfoBuilder:::getSchemaAndKeys(toSQL) the variable 'tbls' takes the following values:

tbls
 [1] "featureSet1"  "mps1mm:at"    "mps1mm:st"    "mps1pm:at"    "mps1pm:st"   
 [6] "featureSet2"  "mps2mm:at"    "mps2mm:st"    "mps2pm:at"    "mps2pm:st"   
[11] "mm:atfeature" "mm:stfeature" "pm:atfeature" "pm:stfeature"

which are the arguments used to call pdInfoBuilder:::getKeys(tbl) which is the function producing the error. The function gives the error for 'mps1mm:at' because such a token does not match any of these:

    tblTypes <- c('featureSet', 'mps', 'feature')
    isFeatureSet <- length(grep('^featureSet', tblname)) == 1L
    isMPSpm <- length(grep('^mps[[:digit:]]pm$', tblname)) == 1L
    isMPSmm <- length(grep('^mps[[:digit:]]mm$', tblname)) == 1L
    isFeature <- length(grep('feature$', tblname)) == 1L
    if (isFeatureSet + isMPSpm + isMPSmm + isFeature != 1L) stop('Impossible to detect table type')

which imply that when these tokens start with 'mps' they should end with either 'pm' or 'mm' so the ':at' or ':st' suffixes are not allowed. at this point i'm not sure anymore whether something should be done about how the table of probe information is built or something should be added or changed in the pdInfoBuilder package. It would be great if you could give a hand here again.

thanks!!

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

1

Entering edit mode

The HTA arrays just have pm:st and pm:at (and the mm versions). Not sure what the others are intended for. Anyway, the HTA arrays are really an edge case to begin with, and probably more so for these versions. This might be something that needs to be addressed for the general random primer Affy arrays (I don't know for instance how prevalent the pm:at, etc nomenclature is), but I doubt it will ever be tailored specifically to handle particular aspects of the glue grant arrays.

I would do what you have done, and then just truncate down to pm and mm. There should be vanishingly few mm probes, given that this is a PM-only array (and the mm probes aren't used anyway, so you could probably just nuke those as well if you want). In my trial run I just added a column of all pm,

ADD REPLY • link 8.3 years ago James W. MacDonald 65k

0

Entering edit mode

Thanks for the advice, I have moved forward in this direction. While the package builds and installs without problems, errors arise when using it with oligo on the CEL files I want to analyze. Let me know if you think this should be a new thread. Just to recap, I have built the package with the following instructions:

library(pdInfoBuilder)
library(doParallel)

ggh <- pdInfoBuilder:::parsePgfClf("hGlue2_0.r1.core/hGlue2_0.r1.pgf",
                                   "hGlue2_0.r1.core/hGlue2_0.r1.clf")
probetable <- ggh$probes.table
names(probetable)[c(2,8)] <- c("fs1_man_fsetid", "probe_type")

core <- pdInfoBuilder:::mpsParser("hGlue2_0.r1.core/hGlue2_0.r1.TC.mps")
matcher <- match(probetable$fsetid, core$fsetid)
probetable$fs2_man_fsetid <- core$meta_fsetid[matcher]
probetable <- probetable[,c(2,10,1,4:9)] ## get rid of fsetid column

## remove probe types that are not pm:st or pm:at
mask <- probetable$probe_type %in% c("pm:at", "pm:st")
probetable <- probetable[mask ,]

## set probe types 'pm:st' and 'pm:at' to just 'pm'
probetable$probe_type <- substr(probetable$probe_type, 1, 2)

p <- new("GenericPDInfoPkgSeed", table = probetable, pkgName = "pd.hglue.3.0.v1")
makePdInfoPackage(p)

here I put pd.hglue.3.0.v1 as package name since this is what the CEL files I want to analyze expect to use for their processing. Now, I'm going to use it with the CEL files I want to analyze. Initially, there is no problem to read those CEL files (I just use the first 10 of them):

library(oligo)
library(pd.hglue.3.0.v1)

celFilenames <- list.celfiles(path="rawData/GEO",
                              full.names=TRUE,
                              listGzipped=TRUE)
oligoRaw <- read.celfiles(celFilenames[1:10])

however, errors arise when trying to use some of the oligo methods to explore the intensity distributions:

boxplot(oligoRaw)
Error in .local(object, ...) :
  'target' for GenericArray-types must be in the 'mpsK'-format
traceback()
12: stop("'target' for GenericArray-types must be in the 'mpsK'-format")
11: .local(object, ...)
10: pmindex(get(annotation(object)), subset, target)
9: pmindex(get(annotation(object)), subset, target)
8: .local(object, ...)
7: pmindex(x, target = target)
6: pmindex(x, target = target)
5: getProbeIndex(x, which)
4: unique(getProbeIndex(x, which))
3: .local(x, ...)
2: boxplot(oligoRaw)
1: boxplot(oligoRaw)

hist(oligoRaw, which="pm")
Error in .local(object, ...) :
  'target' for GenericArray-types must be in the 'mpsK'-format
traceback()
12: stop("'target' for GenericArray-types must be in the 'mpsK'-format")
11: .local(object, ...)
10: pmindex(get(annotation(object)), subset, target)
9: pmindex(get(annotation(object)), subset, target)
8: .local(object, ...)
7: pmindex(x, target = target)
6: pmindex(x, target = target)
5: getProbeIndex(x, which)
4: unique(getProbeIndex(x, which))
3: .local(x, ...)
2: hist(oligoRaw, which = "pm")
1: hist(oligoRaw, which = "pm")

fit a probe level model:

plm <- fitProbeLevelModel(oligoRaw)
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: no such table: featureSet
traceback()
8: .Call(rsqlite_query_send, con@Id, as.character(statement), bind.data)
7: sqliteSendQuery(con, statement, bind.data)
6: sqliteGetQuery(conn, statement)
5: .local(conn, statement, ...)
4: dbGetQuery(conn, sql)
3: dbGetQuery(conn, sql)
2: getProbeInfo(object, target = target, field = vars, sortBy = "man_fsetid",
       ...)
1: fitProbeLevelModel(oligoRaw)

or try to normalize with RMA:

eset <- rma(oligoRaw, target="core")
Background correcting... OK
Normalizing... OK
Available tables: featureSet1, featureSet2, mps1pm, mps2pm, pmfeature, table_info
Error in getMPSInfo(get(annotation(object)), substr(target, 4, 4), "fid",  :
  Table mpsepm does not exist.

here is my session information:

sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pd.hglue.3.0.v1_1.0.0 RSQLite_1.0.0         DBI_0.3.1             oligo_1.34.0         
 [5] Biostrings_2.38.2     XVector_0.10.0        IRanges_2.4.6         S4Vectors_0.8.5      
 [9] Biobase_2.30.0        oligoClasses_1.32.0   BiocGenerics_0.16.1   vimcom_1.2-5         
[13] setwidth_1.0-4        colorout_1.1-0       

loaded via a namespace (and not attached):
 [1] affxparser_1.42.0          splines_3.2.1              GenomicRanges_1.22.2      
 [4] zlibbioc_1.16.0            bit_1.1-12                 foreach_1.4.3             
 [7] GenomeInfoDb_1.6.1         tools_3.2.1                SummarizedExperiment_1.0.1
[10] ff_2.2-13                  KernSmooth_2.23-15         iterators_1.0.8           
[13] preprocessCore_1.32.0      affyio_1.40.0              codetools_0.2-14          
[16] BiocInstaller_1.20.1

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

2

Entering edit mode

Hi,
First of all a disclaimer: I am not at all such an expert as James :) , but this may be of interest:
Few years ago I struggled once in a while with obtaining suitable CDFs for some of the newer ST arrays that were then release by Affymetrix in order to analyze them in R/BioC. I found some functions / code (see section "Building custom-made annotation data: CDF (Chip Definition File)") from the package aroma.affymetrix (from Henrik Bengtsson) to be very useful to generate CDFs from the PGF and CLF files provided by Affymetrix.
I especially would like to point you to the function flat2Cdf.R; originally created by Mark Robinson and Elizabeth Purdom and then slightly modified by Mike Smith. When using the modified version (from Mike Smith) I am able to generate a CDF for the Glue arrays, which in turn can be used to analyze the Glue arrays in Bioconductor (pseudo-images for QC, normalization, etc). See comment below for some code snippets.
HTH,
Guido

Note1: credits to the original developers of these functions; I just applied them.
Note2: the website hosting Mike Smith's modified function is down; see comment below in which i copied/pasted the code.
Note3: I wonder whether the MPS file is OK: the CDF contains 47960 probesets, which corresponds to the info in the MPS file. Of these 47960 probesets, 34834 begin with the suffix "TC". According to the annotation file (hGlue2_0.r1.TC_Annov.csv), these are supposed to measure transcripts. As far as I understand from the MPS file the remaining 13126 probesets (all have a numeric ID) are various control probesets (e.g. positive, negative, bacterial spike, etc). However, it worries me that these control probesets all contain one single probe only....!! So the MPS file is not correct?
Also, the annotation file (hGlue2_0.r1.TC_Annov.csv) only contains info for 35123 probesets. That does not match with the 34834 "TC probesets" from the MPS.
Lastly, the annotation info availabe at GEO consists of only 20533 probesets...??

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

1

Entering edit mode

Mmm, maybe the most straight-forward advice would be to just use a remapped (custom) CDF.... For the Glue array these are provided in various flavors by Manhong Dai and Fan Meng since release 19 (current is 20). A corresponding PdInfo package is already available... !

http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp#v20

....wonder why i did not check this before....

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

That's cool, I also did not think about looking up there, thanks! I've tried the packages and they work fine, except for the case of the PD package that chokes at the fitProbeLevelModel() of the 'oligo' pipeline with the same error I encounter when building that PD package myself. I have made a patch to fix this and sent it to Benilton, so hopefully this becomes available at some point for everyone who needs this. Something I do not like however from that PD package is that it contains only the transcript level and not the probeset level. I think the probeset level is interesting for QA since it contains fewer probes per feature which makes the PLM analysis faster for the reasons you were also pointing out.

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

Code snippets:

> # start with the same PGF, CLF and MPS files as above.
> ggh <- pdInfoBuilder:::parsePgfClf("hGlue2_0.r1.pgf",
+                                    "hGlue2_0.r1.clf")
Parsing file: hGlue2_0.r1.pgf... OK
Parsing file: hGlue2_0.r1.clf... OK
Creating initial table for probes... OK
> probetable <- ggh$probes.table
> names(probetable)[c(2,8)] <- c("fs1_man_fsetid", "probe_type")
>
> core <- pdInfoBuilder:::mpsParser("hGlue2_0.r1.TC.mps")
Parsing file: hGlue2_0.r1.TC.mps... OK
> matcher <- match(probetable$fsetid, core$fsetid)
> probetable$fs2_man_fsetid <- core$meta_fsetid[matcher]
>
> #check
> head(probetable)
      fid fs1_man_fsetid fsetid           pstype    atom    x    y probe_type
1 3226405  JUC0400089039 127219   main->junction  634082 2364 1203      pm:at
2 1261956  JUC0400089039 127219   main->junction  634081 2355  470      pm:at
3 5605980  JUC0400089039 127219   main->junction  634079 2099 2091      pm:at
4 3834499  JUC0400089039 127219   main->junction  634080 2098 1430      pm:at
5  837135    Multi_17208 695965      main->multi 3484783  974  312      pm:at
6 3098276       PS158258  28336 main->exon->hjay  139645  195 1156      pm:at
                   sequence fs2_man_fsetid
1 TAAACATCTGGTTATAACATCCTAG           <NA>
2 AAACATCTGGTTATAACATCCTAGC           <NA>
3 ACATCTGGTTATAACATCCTAGCCG           <NA>
4 AACATCTGGTTATAACATCCTAGCC           <NA>
5 TGCAGCCTCCAGGACTGGGGCGTCA           <NA>
6 AACTTTTAGGGTAGTGAATCGTAGT      TC0X00383
>
> dim(probetable)
[1] 6892960      10
>
>
> # number of (meta)probesets
> length(unique(core$meta_fsetid))
[1] 47960
>
>
> # now get rid of fid that are note core (thus: fs2_man_fsetid = <NA>) using function complete.cases
> cleaned <-  probetable[complete.cases(probetable[,10]),]
>
> # extract relevant columns for flat2Cdf.R
>
> input <- cleaned[,c(1,6,7,9,10,10)]
> colnames(input) <- c("Probe_ID", "X", "Y", "Probe_Sequence", "Group_ID", "Unit_ID")
> head(input)
   Probe_ID    X    Y            Probe_Sequence  Group_ID   Unit_ID
6   3098276  195 1156 AACTTTTAGGGTAGTGAATCGTAGT TC0X00383 TC0X00383
7    514413 2532  191 CAGTTACACATTTTACGTCAATTGC TC0X00383 TC0X00383
8   2676392 1751  998 GAAGTGACGAAAGTACTCTTAATGT TC0X00383 TC0X00383
9   4369861 1460 1630 GAATCGTAGTTAAAAGTGACCACCT TC0X00383 TC0X00383
10  3443466 2345 1284 TCAAACACAGGAATAAACATAGAAT TC0X00383 TC0X00383
11  5071303  742 1892 CCGTACTCCGTTTGTTTATTAATTT TC0X00383 TC0X00383
>
> dim(input)
[1] 3061408       6
>
>
># now generate CDF file
># chip dimensions are from 1st post in this thread
>
> source("flat2Cdf.R")
> flat2Cdf(file = input, chipType = "hGlue_3_0_v1", tag = "", cols = 2680, rows = 2572)
Splitting TXT file into units ... split into 6219 initial chunks ... unwrapped into 47960 chunks ... Done.
Creating structure for 47960 units (dot=250):
....................(5000)
....................(10000)
<<snip>>
....................(40000)
....................(45000)
...........
Writing CDF file...
  Pathname: hGlue_3_0_v1.cdf
  Writes CDF header and unit names...
  Writes CDF header and unit names...done
  Writes QC units...
    Units left: 0
  Writes QC units...done
  Writes 47960 units...
    Units left: 46960, 45960, 44960<<snip>>, 960, 0
  Writes 47960 units...done
Writing CDF file...done
>
> #note: number of CDF units equals unique meta_fsetid
>
>
> # now create CDF environment
> library(makecdfenv)
> make.cdf.package(file="hGlue_3_0_v1.cdf", author="Guido Hooiveld", maintainer="Guido Hooiveld <guido.hooiveld@wur.nl>", version="0.0.1", species="Homo_sapiens", unlink=TRUE)
Creating package in D:/test/glue/hglue30v1cdf
existing D:/test/glue/hglue30v1cdf was removed.
>
>
> # install (refering to DIR created in step above)
> install.packages(repos=NULL, pkgs="D:\\test\\glue\\hglue30v1cdf", type="source")
* installing *source* package 'hglue30v1cdf' ...
** R
** data
<<snip>>
Warning: replacing previous import by 'utils::tail' when loading 'hglue30v1cdf'
Warning: replacing previous import by 'utils::head' when loading 'hglue30v1cdf'
* DONE (hglue30v1cdf)
>
> # perform test using 1st 10 Glue arrays
> library(affyPLM)
> # Read in data
> affy.data <- ReadAffy()
> #fit probelevel model
> x.norm <- fitPLM(affy.data)

Warning messages:
1: replacing previous import by ‘utils::tail’ when loading ‘hglue30v1cdf’
2: replacing previous import by ‘utils::head’ when loading ‘hglue30v1cdf’
>
> affy.data
AffyBatch object
size of arrays=2572x2680 features (20 kb)
cdf=hGlue_3_0_v1 (47960 affyids)
number of samples=10
number of genes=47960
annotation=hglue30v1
notes=
> x.norm
Probe level linear model (PLMset) object
size of arrays=2572x2680
cdf=hGlue_3_0_v1 (47960 probeset ids)
number of samples=10
number of probesets=47960
number of chip level parameters for each probeset=10
annotation=hglue30v1
PLMset settings
Creating function: fitPLM
Preprocessing
Background Correction=TRUE Method= RMA.2
Normalization=TRUE Method= quantile
>
> note1: # probesets equals unique meta_fsetid
> note2: fitting model took ~ 40m on my Win7 PC, likely because some probesets contain many probes.
>
> #pseudoimage for QC of 2nd array:
> image(x.norm,which=2)
>
> # barplot RLE (rel log expression)
> RLE(x.norm)

Etc

See here why run might take long: A: affyPLM and exon array question

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

content flat2Cdf.R function as modified by Mike Smith.

posted here because website originally hosting this file (http://compbio.sysbiol.cam.ac.uk/Resources/GeneST/) is down.

#example: flat2Cdf(file="hjay.r1.flat",chipType="hjay",tag="r1,TC")
#file: assumes header...better perhaps to have ... that passes to read.table?; requires header X, Y
#ucol: unit column
#gcol: group column
#col.class: column classes of file (see read.table); NOTE: needs check that right number?
#splitn: parameter that controls the number of initial chunks that are unwrapped (number of characters of unit names used to keep units together for initial chunks)
#rows:
#cols:

flat2Cdf<-function(file,chipType,tags=NULL,rows=2560,cols=2560,verbose=10,xynames=c("X","Y"),
    gcol=5,ucol=6,splitn=4,col.class=c("integer","character")[c(1,1,1,2,2,2)],...) {
  split.quick<-
  function(r,ucol,splitn=3,verbose=TRUE) {
    rn3<-substr(r[,ucol],1,splitn)
    split.matrix<-split.data.frame
    rr<-split(r,factor(rn3))
    if (verbose) cat(" split into",length(rr),"initial chunks ...")
    rr<-unlist(lapply(rr,FUN=function(u) split(u,u[,ucol])),recursive=FALSE)
    if (verbose) cat(" unwrapped into",length(rr),"chunks ...")
    names(rr)<-substr(names(rr),splitn+2,nchar(rr))
    rr
##    rr<-unlist(lapply(rr,FUN=function(u) split(u,u[,ucol])),recursive=FALSE,use.names=FALSE)
##    namrr<-sapply(rr,function(u){nam<-unique(u[,ucol]); if(length(nam)>1) stop("Programming Error (units", nam,"). Please report") else return(nam)},USE.NAMES=FALSE)
##    names(rr)<-namrr
####    rr<-unlist(lapply(rr,FUN=function(u) split(u[,-ucol],u[,ucol])),recursive=FALSE)
####    names(rr)<-sapply(strsplit(names(rr),"\\."),.subset2,2)

  }

  ## modified by Mike Smith 22-07-2010
  #if (verbose) cat("Reading TXT file ...")
  #file<-read.table(file,header=TRUE,colClasses=col.class,stringsAsFactors=FALSE,comment.char="",...)
  #if (verbose) cat(" Done.\n")

  if (verbose) cat("Splitting TXT file into units ...")
  gxys<-split.quick(file,ucol,splitn)
  rm(file); gc()
  if (verbose) cat(" Done.\n")

  l<-vector("list",length(gxys))
  if (verbose) cat("Creating structure for",length(gxys),"units (dot=250):\n")
  for(i in  1:length(gxys)) {
    sp<-split(gxys[[i]],factor(gxys[[i]][,gcol]))
    e<-vector("list",length(sp))
    for(j in 1:length(sp)) {
      np<-nrow(sp[[j]])
      e[[j]]<-list(x=sp[[j]][,xynames[1]],y=sp[[j]][,xynames[2]],pbase=rep("A",np),tbase=rep("T",np),atom=0:(np-1),indexpos=0:(np-1),
                   groupdirection="sense",natoms=np,ncellsperatom=1)
    }
    names(e)<-names(sp)
    l[[i]]<-list(unittype=1,unitdirection=1,groups=e,natoms=nrow(gxys[[i]]),ncells=nrow(gxys[[i]]),ncellsperatom=1,unitnumber=i)
    if (verbose) { if(i %% 250==0) cat("."); if(i %% 5000==0) cat("(",i,")\n",sep="") }
  }
  cat("\n")
  names(l)<-names(gxys)
  if(!is.null(tags) && tags!="") filename<-paste(chipType,tags,sep=",")
  else filename<-chipType
  filename<-paste(filename,"cdf",sep=".")
  hdr<-list(probesets=length(l),qcprobesets=0,reference="",chiptype=chipType,filename=filename,
            nqcunits=0,nunits=length(l),rows=rows,cols=cols,refseq="",nrows=rows,ncols=cols)
  require(affxparser)
  writeCdf(hdr$filename, cdfheader=hdr, cdf=l, cdfqc=NULL, overwrite=TRUE, verbose=verbose)
  invisible(list(cdfList=l,cdfHeader=hdr))
}

ADD REPLY • link 8.3 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

Hi Guido, thanks a lot for sharing your experience and the code below to produce a CDF file and enable using the 'affy' pipeline. I will probably use it to double-check my results but I will try to stick to the 'oligo' pipeline since, as far as I have understood from previous post on this subject, the fact that probes are shared across features in these newer Affy arrays enforces the 'affy' pipeline to discard probes.

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

The GenericFeatureSet has different arguments than usual, given the non-specific nature of this class. You might want to check out oligo from svn and look at the methods-GenericArrays.R file.

ADD REPLY • link 8.3 years ago James W. MacDonald 65k

0

Entering edit mode

I see, thanks for the pointer, looking at the code I could figure out that I needed to change the value of the 'target' argument for the GenericFeatureSet::rma() method to 'mps2' for transcript-level summarization and this works fine.

Unfortunately, for the fitProbeLevelModel() function, changing the value of the argument 'target' leads to another error:

plm <- fitProbeLevelModel(oligoRaw, target="mps2")
Error in sqliteSendQuery(con, statement, bind.data) :
  error in statement: no such table: featureSet
traceback()
8: .Call(rsqlite_query_send, con@Id, as.character(statement), bind.data)
7: sqliteSendQuery(con, statement, bind.data)
6: sqliteGetQuery(conn, statement)
5: .local(conn, statement, ...)
4: dbGetQuery(conn, sql)
3: dbGetQuery(conn, sql)
2: getProbeInfo(object, target = target, field = vars, sortBy = "man_fsetid",
       ...)
1: fitProbeLevelModel(oligoRaw, target = "mps2")

.. and the QAish methods hist() and boxplot() have the problem that the 'target' argument is not parametrized and they choke at this call:

            idx <- unique(getProbeIndex(x, which))

in line #126 from methods-FeatureSet.R. For this to work the call should be

idx <- unique(getProbeIndex(oligoRaw, type="pm", target=target))

where the 'target' value should be an argument to hist() and boxplot() that in this case I would set to target='mps2'.

These three errors look to me more like oligo would require some fine tuning on these three functions. Let me know if you have any further suggestion to move forward.

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k

0

Entering edit mode

You will probably not be able to move forward without writing your own functions. You are at the bleeding edge of what oligo is intended to be used for; the code for generic arrays, while intended for arbitrary arrays, and hopefully improved in future to accommodate all such arrays seamlessly, is brand new code, and right now just tested with the MBNI remapped CDF files, which are far simpler than what you are using.

ADD REPLY • link 8.3 years ago James W. MacDonald 65k

0

Entering edit mode

I see, of course, no problem. I have modified the corresponding code in oligo to fix these few problems and sent the corresponding patch to Benilton. This fixed version runs smoothly at least with these 'Glue Grant' arrays in the few tests I have done. Thanks again for your guidance!

ADD REPLY • link 8.3 years ago Robert Castelo ★ 3.3k