hi all,
I'm interested in analyzing microarray data from the article http://www.ncbi.nlm.nih.gov/pubmed/26052715 deposited at GEO here:
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE69686
even though the data was recently deposited, they were generated from "Glue Grant" Affymetrix chips described in the article http://www.ncbi.nlm.nih.gov/pubmed/21317363
as part of that article, they put available a number of files to facilitate processing of these kind of data here:
http://gluegrant1.stanford.edu/~DIC/GGHarray
and I thought I could build the corresponding pd.* package but unfortunately the *.probeset.csv file that goes on the 'probeFile' slot of the 'AffyGenePDInfoPkgSeed' object is missing and have not been able to find it after googling.
during my googling I found the following BioC post:
http://grokbase.com/p/r/bioconductor/139as9419c/bioc-affymetrix-hta-2-array-analysis
that briefly mentions these "Glue Grant" arrays as predecessors of the commercial HTA-2.0 Affymetrix arrays. so i thought i could try to process them using the available pd.hta.2.0 package. however, the CEL files have as chip name 'hGlue_3_0_v1'. This is the header of one of them:
library(affyio) read.celfile.header("GSM1707422_6699_51A_hGlue_3_0_v1_.CEL.gz") $cdfName [1] "hGlue_3_0_v1" $`CEL dimensions` Cols Rows 2680 2572
so, i decided to download the source of pd.hta.2.0_*.tar.gz, uncompress it, rename the hta.2.0 bits to hGlue_3_0_v1 and package it again as pd.hGlue_3_0_v1_*.tar.gz and try the oligo pipeline.
so far it seems everything to work, which is nice, but i would like to know if 1. anybody has an idea of possible pitfalls with this approach since in principle they are different chips; and 2. if anybody knows where could i access the *.probeset.csv file that goes into the 'probeFile' slot of the 'AffyGenePDInfoPkgSeed' object for these Glue Grant chips, so that i could build the corresponding pd.* package without worrying whether the pd.hta-2.0_*.tar.gz package is the right solution.
thanks!! and happy x-mas everyone!
robert.
hi James,
thanks for your detailed response. I have done the comparison between the CLF and PGF files from the two chips and to me they look quite different. I find it surprising since I expected that as "predecessor" the "Glue Grant" (GGH3) array would be a subset of the HTA-2.0. This is what I found:
so, initially, the older GGH3 chip has 7577330-6892960=684370 fewer probes than the commercial HTA-2.0 but the same "geometry". The first few rows of each table of probe information look like this:
they are already different, and the 'pstype' column reveals a quite different set of types of probes:
as well as the 'ptype' column:
the feature identifiers 'fid' of GGH3 are nearly all of them part of the 'fid' column in the HTA-2.0 and when they match, they lead to the same values in the 'x' and 'y' columns but not the same probe sequence:
in fact, only a small subset of the probe sequences match between the two chips:
so, i would say that even though the physical support and geometry may be similar, the two chips are quite differently organized. i'm now surprised that i could use the current pd.hta-2.0 package to run the oligo pipeline on these chips.
anyway, i see now more clearly that i will have to try to build a pd.hglue.3.0.v1 package specific for this GGH3 chips. from what you describe and looking into the source code you indicated, i understand that i have to build a data.frame object from the CLF, PGF and MPS files and this data.frame object will be the only piece of information required to build the package. so, the question is how exactly should i build this data.frame. The bit of the source code that i presume you are pointing out to says the following:
so, i guess this probe table could be built approximately as follows after reading the information as i did above for GGH3:
however, i have not used at this point the MPS file and i guess this may be the bit involving the multiple columns for probes being shared among probesets that you were talking about. Could you give me a more concrete advice about how to integrate the information from the MPS file?
thanks!!
robert.
Hi Robert,
I think the following should work, but no guarantees...
hi James,
this is great, thanks a lot for crafting a solution. I have found however a couple of errors when trying to run it. For two of them I could find the way forward, but for the third one I would appreciate if you could look into it.
1. The first is straightforward, the builder function tries to use the '%dopar%' function without importing it:
I fixed this loading the 'doParallel' package first, but I guess you (or Benilton) may want to import this function in the NAMESPACE of the 'pdInfoBuilder' package.
2. The following one:
makePdInfoPackage(p)
========================================================================================
Building annotation for Generic Array
========================================================================================
Error in { :
task 2 failed - "arguments imply differing number of rows: 0, 6892960"
can be fixed by replacing the line above
by
3. The following one:
is related to the fact shown in my previous comment by which this Glue Grant chip has more probe types than the HTA-2.0, such as 'blank', 'generic:at', etc. The following two lines should in principle fix is problem:
unfortunately, the code still chokes with them. More concretely, in the call to pdInfoBuilder:::getSchemaAndKeys(toSQL) the variable 'tbls' takes the following values:
which are the arguments used to call pdInfoBuilder:::getKeys(tbl) which is the function producing the error. The function gives the error for 'mps1mm:at' because such a token does not match any of these:
which imply that when these tokens start with 'mps' they should end with either 'pm' or 'mm' so the ':at' or ':st' suffixes are not allowed. at this point i'm not sure anymore whether something should be done about how the table of probe information is built or something should be added or changed in the pdInfoBuilder package. It would be great if you could give a hand here again.
thanks!!
The HTA arrays just have pm:st and pm:at (and the mm versions). Not sure what the others are intended for. Anyway, the HTA arrays are really an edge case to begin with, and probably more so for these versions. This might be something that needs to be addressed for the general random primer Affy arrays (I don't know for instance how prevalent the pm:at, etc nomenclature is), but I doubt it will ever be tailored specifically to handle particular aspects of the glue grant arrays.
I would do what you have done, and then just truncate down to pm and mm. There should be vanishingly few mm probes, given that this is a PM-only array (and the mm probes aren't used anyway, so you could probably just nuke those as well if you want). In my trial run I just added a column of all pm,
Thanks for the advice, I have moved forward in this direction. While the package builds and installs without problems, errors arise when using it with oligo on the CEL files I want to analyze. Let me know if you think this should be a new thread. Just to recap, I have built the package with the following instructions:
here I put pd.hglue.3.0.v1 as package name since this is what the CEL files I want to analyze expect to use for their processing. Now, I'm going to use it with the CEL files I want to analyze. Initially, there is no problem to read those CEL files (I just use the first 10 of them):
however, errors arise when trying to use some of the oligo methods to explore the intensity distributions:
fit a probe level model:
or try to normalize with RMA:
here is my session information:
Hi,
First of all a disclaimer: I am not at all such an expert as James :) , but this may be of interest:
Few years ago I struggled once in a while with obtaining suitable CDFs for some of the newer ST arrays that were then release by Affymetrix in order to analyze them in R/BioC. I found some functions / code (see section "Building custom-made annotation data: CDF (Chip Definition File)") from the package aroma.affymetrix (from Henrik Bengtsson) to be very useful to generate CDFs from the PGF and CLF files provided by Affymetrix.
I especially would like to point you to the function flat2Cdf.R; originally created by Mark Robinson and Elizabeth Purdom and then slightly modified by Mike Smith. When using the modified version (from Mike Smith) I am able to generate a CDF for the Glue arrays, which in turn can be used to analyze the Glue arrays in Bioconductor (pseudo-images for QC, normalization, etc). See comment below for some code snippets.
HTH,
Guido
Note1: credits to the original developers of these functions; I just applied them.
Note2: the website hosting Mike Smith's modified function is down; see comment below in which i copied/pasted the code.
Note3: I wonder whether the MPS file is OK: the CDF contains 47960 probesets, which corresponds to the info in the MPS file. Of these 47960 probesets, 34834 begin with the suffix "TC". According to the annotation file (hGlue2_0.r1.TC_Annov.csv), these are supposed to measure transcripts. As far as I understand from the MPS file the remaining 13126 probesets (all have a numeric ID) are various control probesets (e.g. positive, negative, bacterial spike, etc). However, it worries me that these control probesets all contain one single probe only....!! So the MPS file is not correct?
Also, the annotation file (hGlue2_0.r1.TC_Annov.csv) only contains info for 35123 probesets. That does not match with the 34834 "TC probesets" from the MPS.
Lastly, the annotation info availabe at GEO consists of only 20533 probesets...??
Mmm, maybe the most straight-forward advice would be to just use a remapped (custom) CDF.... For the Glue array these are provided in various flavors by Manhong Dai and Fan Meng since release 19 (current is 20). A corresponding PdInfo package is already available... !
http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/CDF_download.asp#v20
....wonder why i did not check this before....
That's cool, I also did not think about looking up there, thanks! I've tried the packages and they work fine, except for the case of the PD package that chokes at the fitProbeLevelModel() of the 'oligo' pipeline with the same error I encounter when building that PD package myself. I have made a patch to fix this and sent it to Benilton, so hopefully this becomes available at some point for everyone who needs this. Something I do not like however from that PD package is that it contains only the transcript level and not the probeset level. I think the probeset level is interesting for QA since it contains fewer probes per feature which makes the PLM analysis faster for the reasons you were also pointing out.
Code snippets:
Etc
See here why run might take long: A: affyPLM and exon array question
content flat2Cdf.R function as modified by Mike Smith.
posted here because website originally hosting this file (http://compbio.sysbiol.cam.ac.uk/Resources/GeneST/) is down.
Hi Guido, thanks a lot for sharing your experience and the code below to produce a CDF file and enable using the 'affy' pipeline. I will probably use it to double-check my results but I will try to stick to the 'oligo' pipeline since, as far as I have understood from previous post on this subject, the fact that probes are shared across features in these newer Affy arrays enforces the 'affy' pipeline to discard probes.
The GenericFeatureSet has different arguments than usual, given the non-specific nature of this class. You might want to check out oligo from svn and look at the methods-GenericArrays.R file.
I see, thanks for the pointer, looking at the code I could figure out that I needed to change the value of the 'target' argument for the GenericFeatureSet::rma() method to 'mps2' for transcript-level summarization and this works fine.
Unfortunately, for the fitProbeLevelModel() function, changing the value of the argument 'target' leads to another error:
.. and the QAish methods hist() and boxplot() have the problem that the 'target' argument is not parametrized and they choke at this call:
in line #126 from methods-FeatureSet.R. For this to work the call should be
where the 'target' value should be an argument to hist() and boxplot() that in this case I would set to target='mps2'.
These three errors look to me more like oligo would require some fine tuning on these three functions. Let me know if you have any further suggestion to move forward.
You will probably not be able to move forward without writing your own functions. You are at the bleeding edge of what oligo is intended to be used for; the code for generic arrays, while intended for arbitrary arrays, and hopefully improved in future to accommodate all such arrays seamlessly, is brand new code, and right now just tested with the MBNI remapped CDF files, which are far simpler than what you are using.
I see, of course, no problem. I have modified the corresponding code in oligo to fix these few problems and sent the corresponding patch to Benilton. This fixed version runs smoothly at least with these 'Glue Grant' arrays in the few tests I have done. Thanks again for your guidance!