On 8/22/06, Amit Bahl <abahl at mail.med.upenn.edu> wrote:
> I have a custom Affy array which allows several applications
> (expression profiling, genotyping, etc...) on a single chip. I want
> to use RLMM to analyze our genotyping data, but have a couple of
> questions:
> 1) Instead of normalizing to the scale of the training set (which I
> don't have), does it make sense to normalize all arrays to each
> other using quantile normalization?
It depends on the type of data, but most likely yes. If you work with
extreme data such as cancer data, there might be too many copy-number
differences for the assumptions behind quantile normalization to
hold.
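For reference, here is a minimal sketch of quantile normalization
across arrays, written in Python with NumPy rather than taken from
any of the packages discussed here. The function name and the toy
intensity matrix are made up for illustration; rows are probes and
columns are arrays.

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns of x (rows = probes, columns = arrays)."""
    x = np.asarray(x, dtype=float)
    # For each array, the row indices that sort its intensities
    # (ties are broken by position).
    order = np.argsort(x, axis=0, kind="stable")
    # Target distribution: the mean of the sorted intensities across arrays.
    target = np.sort(x, axis=0).mean(axis=1)
    out = np.empty_like(x)
    for j in range(x.shape[1]):
        # The k-th smallest value of array j is replaced by the k-th target value.
        out[order[:, j], j] = target
    return out

# Made-up intensities: 4 probes on 3 arrays.
intensities = np.array([[5.0, 4.0, 3.0],
                        [2.0, 1.0, 4.0],
                        [3.0, 4.0, 6.0],
                        [4.0, 2.0, 8.0]])
normalized = quantile_normalize(intensities)
```

After this step every array has the same empirical distribution,
which is exactly the assumption that may be too strong for samples
with many copy-number differences.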
> If I do this, then instead of using a
> raw file intermediate, I could go from an abatch object directly to
> the norm files (what is the format of these files?). This is also
> appealing as gtype_cel_to_pq chokes on my CDF file, probably due to
> the mixed design.
I can't tell you about 'abatch' objects, but I know that Affymetrix'
gtype_cel_to_pq tool is designed for the 100K SNP chips, which have
exactly 20 PM and 20 MM probes per SNP (probeset). This is not the
case for, say, the 500K chips. The simple reason for this assumption
is that it
outputs a tab-delimited ASCII file (*.raw) with a table of rows of
equal lengths. Using tables to store CEL data with SNPs of different
lengths does not work well.
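To illustrate the point, here is a small sketch in Python (not the
actual *.raw format or any Affymetrix tool): a fixed-width table
breaks as soon as probesets differ in size, while a mapping from SNP
name to a variable-length list does not. The SNP names, values and
helper functions are hypothetical.

```python
# Hypothetical probe intensities; SNP names and values are made up.
probesets = {
    "SNP_A": [800.0, 700.0, 600.0, 500.0],                  # 4 probes
    "SNP_B": [100.0, 200.0, 300.0, 100.0, 200.0, 300.0],    # 6 probes
}

def to_fixed_width_rows(probesets, width):
    """Build one table row per SNP; fail if a SNP does not have `width` probes."""
    rows = []
    for name, values in probesets.items():
        if len(values) != width:
            raise ValueError(f"{name} has {len(values)} probes, expected {width}")
        rows.append([name] + list(values))
    return rows

def summarize(probesets):
    """Per-SNP mean intensity; works for probesets of any size."""
    return {name: sum(values) / len(values) for name, values in probesets.items()}

means = summarize(probesets)
```

Calling to_fixed_width_rows(probesets, 4) raises a ValueError for
SNP_B, whereas summarize() handles both probesets without caring
about their lengths.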
> 2) Once I have norm files, I can create the theta file - but is there
> a way to do unsupervised classification from the results in the
> file (that is, how do I avoid the internal regions file altogether
> or make a compatible uninformative one)? Of course, I could always
> define my own conservative decision regions in the unit square.
> 3) My genotyping probe-sets don't all have 20 PM probes, does RLMM
> explicitly require this?
If you mean the RLMM package, the answer is yes. The RLMM
method/algorithm itself works on a SNP-by-SNP basis and does not
require equally sized SNPs.
> 4) I'm also interested in checking how much the various quartet
> offsets contribute to classification results. Are the 20 probes in
> the raw or norm file ordered by offset and strand?
I looked at this many months ago and, if I remember correctly, the
probes are ordered as they are ordered in the CDF file, where all
sense probes come first, followed by the anti-sense probes.
However, just looking in the *.raw file, you do not know how many
sense and anti-sense probes a specific SNP has; it varies and it is
*not* the case that it is always 20-20.
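As a sketch of what this means in practice: if the sense-first
ordering above holds (I have not re-verified it), splitting one SNP's
probe values by strand needs the number of sense probes from the CDF;
it cannot be inferred from the values alone. The function and the
example numbers below are hypothetical, in Python.

```python
def split_by_strand(values, n_sense):
    """Split one SNP's probe values into (sense, anti-sense).

    Assumes sense probes come first, as recalled above; `n_sense` must be
    looked up in the CDF, since it varies from SNP to SNP.
    """
    if not 0 <= n_sense <= len(values):
        raise ValueError(f"n_sense={n_sense} is out of range for {len(values)} values")
    return list(values[:n_sense]), list(values[n_sense:])

# Made-up example: 5 probe values, of which the first 3 are sense probes.
sense, antisense = split_by_strand([1.0, 2.0, 3.0, 4.0, 5.0], n_sense=3)
```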
If you are going to do serious (long-term) investigation of SNP data,
I recommend that you move away from the *.raw file format; it was a
temporary solution and will soon be forgotten. It is also extremely
slow to work with ASCII files - much better to work with binary CEL
files.
For low-level access to CDF and CEL data, I recommend that you look
at the 'affxparser' package, but also the 'affyio' package.
Currently, they complement each other. The latter has been around
longer (hence probably fewer bugs); the former builds on top of
Affymetrix' open-source libraries and also tries to minimize memory
usage by allowing you to work on a subset of probesets across
hundreds to thousands of CEL files. Both will allow you to pull
information from the CDF about probe distributions etc. for each SNP.
In the bigger picture, for doing RLMM and similar, I recommend that
you look at the 'oligo' package, which is under development but is
being designed for doing SNP analysis in R. You might also want to
look at the Affymetrix Power Tools (APT) (not R), which implement
BRLMM, an extension to RLMM that lets SNPs borrow information from
other SNPs in order to get better genotype call regions. See also
CRLMM in 'oligo'.