RLMM questions


Amit Bahl ▴ 20

@amit-bahl-1842


I have a custom Affy array which allows several applications (expression profiling, genotyping, etc.) on a single chip. I want to use RLMM to analyze our genotyping data, but have a couple of questions:

1) Instead of normalizing to the scale of the training set (which I don't have), does it make sense to normalize all arrays to each other using quantile normalization? If I do this, then instead of using a raw file intermediate, I could go from an abatch object directly to the norm files (what is the format of these files?). This is also appealing as gtype_cel_to_pq chokes on my CDF file, probably due to the mixed design.

2) Once I have norm files, I can create the theta file, but is there a way to do unsupervised classification from the results in the theta file (that is, how do I avoid the internal regions file altogether or make a compatible uninformative one)? Of course, I could always define my own conservative decision regions in the unit square.

3) My genotyping probe-sets don't all have 20 PM probes. Does RLMM explicitly require this?

4) I'm also interested in checking how much the various quartet offsets contribute to classification results. Are the 20 probes in the raw or norm file ordered by offset and strand?

-Amit

GO Classification cdf affy RLMM

Updated 18.3 years ago by Henrik Bengtsson ★ 2.4k • written 18.3 years ago by Amit Bahl ▴ 20


Henrik Bengtsson ★ 2.4k

@henrik-bengtsson-4333


Hi.

On 8/22/06, Amit Bahl <abahl at mail.med.upenn.edu> wrote:
> I have a custom Affy array which allows several applications
> (expression profiling, genotyping, etc...) on a single chip. I want
> to use RLMM to analyze our genotyping data, but have a couple of
> questions:
>
> 1) Instead of normalizing to the scale of the training set (which I
> don't have), does it make sense to normalize all arrays to each other
> using quantile normalization?

It depends on the type of data, but most likely yes. If you work with extreme data such as cancer data, there might be too many copy-number differences for the assumptions behind quantile normalization to hold.

> If I do this, then instead of using a
> raw file intermediate, I could go from an abatch object directly to
> the norm files (what is the format of these files?). This is also
> appealing as gtype_cel_to_pq chokes on my CDF file, probably due to
> the mixed design.

I can't tell you about 'abatch' objects, but I know that Affymetrix's gtype_cel_to_pq tool is designed for the 100K SNP chips, which have exactly 20 PM and 20 MM probes per SNP (probeset). This is not the case for, say, the 500K chips. The simple reason for this assumption is that the tool outputs a tab-delimited ASCII file (*.raw) containing a table whose rows all have the same length. Tables do not work well for storing CEL data when SNPs have different numbers of probes.

> 2) Once I have norm files, I can create the theta file - but Is there
> a way to do unsupervised classification from the results in the theta
> file (that is, how do I avoid the internal regions file altogether
> or make a compatible uninformative one)? Of course, I could always
> define my own conservative decision regions in the unit square.
>
> 3) My genotyping probe-sets don't all have 20 PM probes, does RLMM
> explicitly require this?

If you mean the RLMM package, the answer is yes. The RLMM method/algorithm itself works on a SNP-by-SNP basis and does not require equally sized SNPs.
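For readers unfamiliar with the quantile normalization raised under question 1, here is a minimal sketch of the idea in plain Python. It is not the RLMM package's code or any Bioconductor routine; the function name `quantile_normalize` and the toy intensities are made up for illustration, and ties are broken by position rather than averaged. Each array is forced onto a common reference distribution, namely the mean of the sorted intensities across arrays.

```python
from statistics import mean


def quantile_normalize(arrays):
    """Quantile-normalize equal-length intensity vectors (one per array)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")

    # Reference distribution: mean across arrays of the k-th smallest value.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]

    normalized = []
    for a in arrays:
        # Probe indices ordered by intensity (ties broken by position).
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            # The probe with a given rank receives the reference value of that rank.
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy example: three arrays, four probes each (made-up intensities).
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(arrays)
```

After this step every array has exactly the same set of values, only assigned to different probes. That is why the method assumes the true intensity distributions are similar across samples, which is the assumption that may break down for samples with many copy-number differences.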
> 4) I'm also interested in checking how much the various quartet
> offsets contribute to classification results. Are the 20 probes in
> the raw or norm file ordered by offset and strand?

I looked at this many months ago and, if I remember correctly, the probes are ordered as they are in the CDF file, where all sense probes come first, followed by the anti-sense probes. However, just by looking in the *.raw file, you cannot tell how many sense and anti-sense probes a specific SNP has; it varies, and it is *not* the case that it is always 20-20.

If you are going to do serious (long-term) investigation of SNP data, I recommend that you move away from the *.raw file format; it was a temporary solution and will soon be forgotten. It is also extremely slow to work with ASCII files; it is much better to work with binary CEL files directly.

For low-level access to CDF and CEL data, I would recommend that you look at the 'affxparser' package, but also the 'affyio' package. Currently, they complement each other. The latter has been around longer (hence probably fewer bugs); the former builds on top of Affymetrix's open-source libraries and also tries to minimize memory usage by allowing you to work on a subset of probesets across 100-1000s of CEL files. Both will allow you to pull information from the CDF about probe distributions etc. for the SNP.

In the bigger picture, for doing RLMM and similar, I would recommend that you look at the 'oligo' package, which is under development but is being designed for doing SNP analysis in R. You might also want to look at the Affymetrix Power Tools (APT) (non-R), which implement BRLMM, an extension to RLMM that lets SNPs borrow information from other SNPs in order to get better genotype call regions. See also CRLMM of 'oligo'.
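As a rough illustration of how one might check the contribution of individual quartets once strand and offset have been pulled from the CDF, the sketch below groups the PM intensities of one SNP by (strand, offset) and computes an allele contrast per quartet. Everything here is an assumption for illustration: the record layout, the function name `contrast_by_quartet`, and the intensities are invented, and the log2(A/B) contrast is just one simple choice, not what RLMM computes.

```python
import math
from collections import defaultdict

# Hypothetical PM probe records for one SNP on one array:
# (strand, offset, allele, intensity). All values are made up.
probes = [
    ("sense", -2, "A", 1200.0), ("sense", -2, "B", 300.0),
    ("sense", 0, "A", 1600.0), ("sense", 0, "B", 200.0),
    ("antisense", 0, "A", 900.0), ("antisense", 0, "B", 450.0),
    ("antisense", 3, "A", 500.0), ("antisense", 3, "B", 400.0),
]


def contrast_by_quartet(records):
    """Return {(strand, offset): log2(PM_A / PM_B)} for complete quartets."""
    grouped = defaultdict(dict)
    for strand, offset, allele, intensity in records:
        grouped[(strand, offset)][allele] = intensity
    # Keep only quartets where both allele intensities are present.
    return {
        key: math.log2(pm["A"] / pm["B"])
        for key, pm in grouped.items()
        if "A" in pm and "B" in pm
    }


contrasts = contrast_by_quartet(probes)
for (strand, offset), value in sorted(contrasts.items()):
    print(f"{strand:9s} offset {offset:+d}: log2(A/B) = {value:.2f}")
```

Because the grouping is keyed on strand and offset rather than on position in a fixed-width row, it does not care whether a SNP has a 20-20 split between sense and anti-sense probes.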
Cheers

Henrik

> -Amit
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Written 18.3 years ago by Henrik Bengtsson ★ 2.4k


To use CRLMM, you should install oligo, available at:

http://www.bioconductor.org/packages/1.9/bioc/html/oligo.html

You will also need a platform design environment, specific to the array you're using... Some can be downloaded from:

http://www.biostat.jhsph.edu/~bcarvalh/research.html

Best,
b.

Written 18.3 years ago by Benilton Carvalho ★ 4.3k
