Analysis of Affymetrix Mouse Gene 2.0 ST arrays


Kamila Naxerova ▴ 100

@kamila-naxerova-4164

Last seen 9.6 years ago

Dear all,

I am analyzing a set of Affymetrix Mouse Gene 2.0 ST arrays. I am quite familiar with 3'-biased chips, but this is my first time looking at data from WT arrays. I have a few general questions -- any advice would be appreciated to speed up my learning process.

1) I have already read on this mailing list that the good old affy package does not work well with WT arrays (can anybody point me to any literature on why that is?). So I have installed the oligo and xps packages -- what are the advantages/disadvantages of each? Any opinions on which one is the right "starter kit"?

2) I see with some dread that there seems to be no annotation package for the 2.0 array yet. I have never built my own... any quick bullet points on how I would go about doing that for a WT array?

3) It seems that RMA is also used for normalization of WT arrays, so that part I am comfortable with. But are there any differences in preprocessing between 3' and WT arrays that I should watch out for?

Thanks so much!
Kamila

Annotation Normalization GO PROcess oligo • 8.2k views

ADD COMMENT • link updated 11.1 years ago by James W. MacDonald 65k • written 11.1 years ago by Kamila Naxerova ▴ 100


James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 12 hours ago

United States

Hi Kamila,

On 3/5/2013 4:45 PM, Naxerova, Kamila wrote:

> 1) I have already read on this mailing list that the good old affy package does not work well with WT arrays (can anybody point me to any literature on why that is?). So I have installed the oligo and xps packages -- what are the advantages/disadvantages for each? Any opinions on which one is the right "starter kit"?

The affy package was never intended to work with these arrays - it was designed specifically for the 3'-biased arrays, which had pre-defined probesets, and which didn't share probes between probesets. In addition, the makecdfenv package is designed to work with the old style CDF packages, and Affy has never released a CDF for these new chips that they are willing to support in any meaningful way.

There were some changes made to the affy package in order to accommodate the fact that probes could be shared between probesets, and it is possible to use functions in affxparser to re-create conventional CDF packages using the newer pgf and clf files. So hypothetically you could still use the affy package (and hypothetically you could still use an Apple IIe for all your computing needs, but that's crazy, so let's move on).

I don't think you will find much difference between oligo and xps, other than the fact that xps requires the additional installation of ROOT. You might play around with both and see which suits you better.

I should throw in my obligatory cautionary statement about summarizing Gene ST data at the probeset (as compared to the transcript) level. If you look at the number of probes per probeset, there are a huge number of probesets with fewer than 4 probes. So hypothetically you can do this, but I wouldn't.

> 2) I see with some dread that there seems to be no annotation package for the 2.0 array yet. I have never built my own... any quick bullet points on how I would go about doing that for a WT array?

No dread should be required. All you need to do is get the transcript-level annotation file from Affy (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene/MoGene-2_0-st-v1.na33.mm10.transcript.csv.zip) and then the AnnotationForge, mouse.db0, and org.Mm.eg.db packages. Then something like

    library(AnnotationForge)
    library(mouse.db0)
    library(org.Mm.eg.db)
    makeDBPackage("MOUSECHIP_DB",
                  affy = TRUE,
                  prefix = "mogene20sttranscriptcluster",
                  fileName = "MoGene-2_0-st-v1.na33.mm10.transcript.csv",
                  outputDir = ".",
                  version = "2.11.1",
                  manufacturer = "Affymetrix",
                  chipName = "Mouse Gene 2.0 ST Array",
                  manufacturerUrl = "http://www.affymetrix.com",
                  author = "Kamila Naxerova",
                  maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")

should do the trick. You can then install directly from within R by

    install.packages("mogene20sttranscriptcluster.db", repos=NULL, type="source")

And see http://bioconductor.org/packages/2.11/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf

> 3) It seems that RMA is also used for normalization of WT arrays, so that part I am comfortable with. But are there any differences in preprocessing between 3' and WT arrays that I should watch out for?

Not really. I don't use xps, so cannot say for certain how you do things with that package, but with oligo it's a simple

    abatch <- read.celfiles(list.celfiles())
    eset <- rma(abatch)

to normalize and summarize at the transcript level. Note however that the annotation for the resulting ExpressionSet will be the pd.mogene.2.0.st.v1 package, and if you use annotation(eset) in any further calls to do gene annotation, it won't work out. You need to first do

    annotation(eset) <- "mogene20sttranscriptcluster.db"

One further note: the intronic controls (especially) have an irritating habit of popping up in lists of differentially expressed genes. This is IMO likely due to mRNA that has not been fully processed to excise the introns, but regardless, these probesets tend to have no annotation at all, so are not useful without extra work to figure out what they are supposed to be measuring. My usual MO is to just summarily excise them after e.g., the eBayes() step of an analysis using limma. If you are interested, there is a function in the affycoretools package called getMainProbes() that will do this for you.

Best,

Jim

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

ADD COMMENT • link 11.1 years ago James W. MacDonald 65k


Hi Jim,

thank you for your helpful reply. I have a few follow-up questions.

> I should throw in my obligatory cautionary statement about summarizing
> Gene ST data at the probeset (as compared to the transcript) level. If
> you look at the number of probes/probeset, there are a huge number with
> < 4 probes. So hypothetically you can do this, but I wouldn't.

I am a bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me; without your direct link I would never have found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.

Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?

Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, I essentially commit to analyzing my data on the transcript level?

Any thoughts on this error message? I ran your makeDBPackage() call with the chip name changed to the mouse array:

    > makeDBPackage("MOUSECHIP_DB",
    +     affy=TRUE,
    +     prefix="mogene20sttranscriptcluster",
    +     fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
    +     outputDir = ".",
    +     version="2.11.1",
    +     manufacturer = "Affymetrix",
    +     chipName = "Mouse Gene 2.0 ST Array",
    +     manufacturerUrl = "http://www.affymetrix.com",
    +     author = "Kamila Naxerova",
    +     maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
    Error in `[.data.frame`(csvFile, , GenBankIDName) :
      undefined columns selected

    > sessionInfo()
    R version 2.15.3 (2013-03-01)
    Platform: i386-apple-darwin9.8.0/i386 (32-bit)

    locale:
    [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] org.Mm.eg.db_2.8.0 mouse.db0_2.8.0 AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.5 Biobase_2.18.0
    [9] BiocGenerics_0.4.0 BiocInstaller_1.8.3

    loaded via a namespace (and not attached):
    [1] IRanges_1.16.6 parallel_2.15.3 stats4_2.15.3 tools_2.15.3

Many thanks!
Kamila
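When a builder complains about "undefined columns selected", one quick diagnostic is to look at which columns the CSV actually contains. The following is only a minimal sketch in base R: the toy file, its comment line, and the column names are assumptions standing in for the real Affymetrix file, and it assumes the real file's header comments start with `#`.

```r
# Write a tiny stand-in for the Affymetrix CSV: a '#' comment line, a header row,
# and two records. The contents are hypothetical and only mimic the layout.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "#%example-comment=hypothetical header line",
  "\"transcript_cluster_id\",\"probeset_id\",\"gene_assignment\"",
  "\"17210850\",\"17210850\",\"---\"",
  "\"17210855\",\"17210855\",\"NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777\""
), csv_path)

# Skip the comment lines and keep every column as text so IDs are not converted.
annot <- read.csv(csv_path, comment.char = "#",
                  colClasses = "character", stringsAsFactors = FALSE)

# Which columns are there, and how many records carry no assignment ("---")?
print(names(annot))
n_unassigned <- sum(annot$gene_assignment == "---")
print(n_unassigned)
```

Pointing `csv_path` at the downloaded transcript CSV instead of the temporary file should list its real column names, which can then be compared with what the package builder expects.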

ADD REPLY • link 11.1 years ago Kamila Naxerova ▴ 100


Hi Kamila,

On 3/6/2013 10:17 AM, Naxerova, Kamila wrote:

> I am bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me, without your direct link I would have never found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.

Maybe I should use a different terminology, or maybe Affy should be more consistent. ;-D

Anyway, as you note there are two columns in that file, one called transcript cluster and the other is probeset. But note that they are identical. When I say probeset, this is based on the fact that the Gene ST arrays are 'cut down' versions of the Exon ST arrays, which in general have 4 probes per probeset, and each probeset is supposed to interrogate an exon (or portion thereof). So in my terminology, the probeset corresponds to the original probesets from the Exon arrays, and the annotations for that level of summarization can be found in the MoGene-2_0-st-v1.na33.mm10.probeset.csv annotation file.

And yeah, Affy should improve their crappy website. But it's way better than the Illumina or Agilent sites, so maybe we should count ourselves lucky. I found the csv file not by searching on their website, but by letting the googles find it for me, by searching on 'mouse gene 2.0 st annotation'. I then get this page

http://www.affymetrix.com/estore/browse/products.jsp?productId=131476#1_3

and then what I wanted is under the Technical Documentation tab.

> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?

As I note above, the Gene ST arrays were created by taking the 'good' probes from the Exon array. So the notion of a probeset is based on the original construction of the probesets on the Exon array, which were usually 4 probes. But since Affy only took the good probes, tons of the probesets on the Gene ST arrays are made up of 3 or fewer probes.

> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, I essentially commit to analyzing my data on the transcript level?
>
> Any thoughts on this error message?

Yeah, I forgot that this isn't a slam dunk like with the 3'-biased arrays.
Here is the problem in a nutshell:

    [jmacdon at adam2 tmp]$ awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}' MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 | head -n 3
    "17210850","---"
    "17210852","---"
    "17210855","NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// BC052848 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// U89352 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// CT010201 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000150971 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000155020 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000141278 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// AK050549 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// AK167231 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000115529 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000137887 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// AK034851 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000131119 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000119612 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777"

So these are the data for the first three transcript clusters. The first two are mystery clusters with no annotation. The third is Lypla1, but you can see that the data are splayed out in Affy's preferred format of /// between transcripts, with // separating the info for each transcript. What we need is something like

    17210855 NM_008866

without all that other cruft. I thought I had some code floating around that I used to parse these data for the HuGene 2.0 ST array, but I can't find it at the moment.
But suffice it to say that you want to just have the first column, and then one of the RefSeq IDs in the second column. There aren't enough rows here to necessitate using something like Perl or an Awk script to do the parsing; you could just do it in R. Something like

    awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}' MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 > tmp.csv

to get the requisite columns, then you could read in using read.csv() and then something like this,

    dat <- read.csv("tmp.csv", header=FALSE, stringsAsFactors = FALSE)
    dat$fixed <- sapply(
        sapply(strsplit(dat[,2], " /// "),
               function(x) sapply(strsplit(grep("^N.", x, value=TRUE), " // "), "[", 1)[1]),
        function(z) if(is.null(z)) return(NA) else return(z[1]))

which naively takes just the first RefSeq looking thing. There are likely other more sophisticated things that one could do. But if the results look OK, then you could just write out columns 1 and 3 and then use that as input for building the package.

Best,

Jim
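The "take the first RefSeq-looking thing" idea can also be written as a small stand-alone function. This is a minimal sketch, not the code used in the thread: the function name `first_refseq`, the restriction to `NM_`/`NR_` accessions, and the toy strings (shortened from the record above, plus one made-up `NR_` entry) are all assumptions for illustration.

```r
# Toy assignment strings in Affymetrix's layout: records are separated by " /// ",
# fields within a record by " // ". The NR_000001 entry is hypothetical.
assignments <- c(
  "---",
  "NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777",
  "ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// NR_000001 // Hypo1 // hypothetical entry // 1 A1|1 // 0"
)

# Return the first accession that looks like a RefSeq transcript (NM_ or NR_),
# or NA when the string holds none.
first_refseq <- function(x) {
  records <- strsplit(x, " /// ", fixed = TRUE)[[1]]
  # The accession is the first field of each record.
  ids <- trimws(vapply(strsplit(records, " // ", fixed = TRUE), `[`, character(1), 1))
  hits <- grep("^(NM|NR)_[0-9]+", ids, value = TRUE)
  if (length(hits) > 0) hits[1] else NA_character_
}

refseq <- vapply(assignments, first_refseq, character(1), USE.NAMES = FALSE)
print(refseq)
```

Anchoring the pattern on `NM_`/`NR_` is stricter than matching anything that starts with N; whether predicted accessions such as `XM_` should also count is a judgment call that this sketch leaves out.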

ADD REPLY • link 11.1 years ago James W. MacDonald 65k


Dear Christian and Jim,

many thanks to both of you for your explanations.

Your hard work paid off, and I have finally understood everything and managed to build my annotation package!!!! I wrote a little script similar to what Jim was suggesting, namely picking the first RefSeq-like thing I came across. Jim called it "naive" -- but I think there is no downside to this approach, right? I have looked at various examples in the Affy file for a long time, and simply picking the first RefSeq ID seems to be kosher.

    data <- read.csv("MoGene-transcript-noheader.csv", header=TRUE, stringsAsFactors=FALSE, sep=",")
    sdata <- data[, c(1, 9)]

    returnRef <- function(x) {
      # split on "///" and keep the first entry that mentions "RefSeq"
      entries <- strsplit(x, split="///")[[1]]
      refst <- entries[grep("RefSeq", entries)[1]]
      # the accession is the first "//"-separated field of that entry
      refid <- gsub(" ", "", strsplit(refst, split="//")[[1]][1])
      return(refid)
    }

    sdata$refseqids <- sapply(sdata[, 2], returnRef)
    fdata <- sdata[, -2]
    write.table(fdata, "AnnotBuild.txt", sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)

    library(AnnotationForge)
    library(mouse.db0)
    library(org.Mm.eg.db)
    makeDBPackage("MOUSECHIP_DB",
                  affy=FALSE,
                  prefix="mogene20sttranscriptcluster",
                  fileName="AnnotBuild.txt",
                  outputDir = ".",
                  version="2.11.1",
                  baseMapType="refseq",
                  manufacturer = "Affymetrix",
                  chipName = "Mouse Gene 2.0 ST Array",
                  manufacturerUrl = "http://www.affymetrix.com",
                  author = "Kamila Naxerova",
                  maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")

    > install.packages("mogene20sttranscriptcluster.db", repos=NULL, type="source")
    * installing *source* package 'mogene20sttranscriptcluster.db' ...
    ** R
    ** inst
    ** preparing package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** testing if installed package can be loaded
    *** arch - i386
    *** arch - x86_64
    * DONE (mogene20sttranscriptcluster.db)

ADD REPLY • link 11.1 years ago Kamila Naxerova ▴ 100


Dear all,

I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I created it successfully, but when I try to use it to read in cel files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!

    > abatch <- read.celfiles(list.celfiles(), pkgname="mogene20sttranscriptcluster.db")
    Platform design info loaded.
    Reading in : xxx.CEL
    Reading in : xxx.CEL
    Reading in : xxx.CEL
    [... more cel files listed]

    Error in function (classes, fdef, mtable) :
      unable to find an inherited method for function 'kind' for signature '"ChipDb"'

Thanks
Kamila

ADD REPLY • link 11.1 years ago Kamila Naxerova ▴ 100


Hi Kamila,

On 3/7/2013 9:54 AM, Naxerova, Kamila wrote:

> I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I created it successfully, but when I try to use it to read in cel files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!

You don't use the annotation package at this step. There are two packages that are used for the analysis of this chip type. The first is the pd.mogene.2.0.st.v1 package, which is used by oligo to map probes to probesets when doing the normalization/summarization step. This package will be automagically installed if you don't have it, so there is nothing to be done at the first step but

    abatch <- read.celfiles(list.celfiles())
    eset <- rma(abatch)

This will give you the summarized and normalized data at the transcript level. You then will normally fit some model(s) using the modeling package of your choice, and then might want to output a set of significant genes, at which time you will use the mogene20sttranscriptcluster.db package to map probeset IDs to gene information.

Best,

Jim
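To make the division of labour between the two packages concrete, here is a rough sketch of the read, summarize, then annotate sequence. It rests on several assumptions: that the platform design (pd) package for the array is installed or has been built, that the self-built mogene20sttranscriptcluster.db is installed, and that CEL files sit in the working directory. The guard simply skips everything when any of that is missing; the names `cel_files`, `db_pkg` and `have_pkgs` are illustrative.

```r
# CEL files in the working directory (assumption: that is where the data live).
cel_files <- list.files(".", pattern = "\\.CEL$", ignore.case = TRUE, full.names = TRUE)

# Name of the self-built annotation package from earlier in the thread.
db_pkg <- "mogene20sttranscriptcluster.db"

# Only run the pipeline when every required package can be loaded.
have_pkgs <- all(vapply(c("oligo", "Biobase", "AnnotationDbi", db_pkg),
                        requireNamespace, logical(1), quietly = TRUE))

if (have_pkgs && length(cel_files) > 0) {
  # Step 1: oligo uses the pd package (not the .db package) to read and summarize.
  raw  <- oligo::read.celfiles(cel_files)
  eset <- oligo::rma(raw)

  # Step 2: the .db package maps transcript cluster IDs to gene information.
  db    <- getExportedValue(db_pkg, db_pkg)
  annot <- AnnotationDbi::select(db,
                                 keys    = Biobase::featureNames(eset),
                                 columns = c("SYMBOL", "GENENAME"),
                                 keytype = "PROBEID")
  annot <- annot[!duplicated(annot$PROBEID), ]  # keep one row per transcript cluster

  expr <- Biobase::exprs(eset)
  idx  <- match(rownames(expr), annot$PROBEID)
  out  <- data.frame(annot[idx, c("SYMBOL", "GENENAME")], expr,
                     row.names = rownames(expr), check.names = FALSE)
  print(head(out))
}
```

Keeping only the first annotation row per ID is a simplification; transcript clusters that map to several genes would need a more deliberate rule.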

ADD REPLY • link 11.1 years ago James W. MacDonald 65k


Thanks Jim. Of course the annotation package does not contain probe --> probe set information. What was I thinking?!?? What I had not realized was that I needed to build the pd.mogene.2.0.st package myself first, because it also does not exist on Bioconductor. So I just downloaded all the required files from Affy, but again I am stuck with an error message I don't understand... what is the coreMPS file that gives me the error? > library(pdInfoBuilder) > baseDir <- "/Users/naxerova/Documents/xxx" > (pgf <- list.files(baseDir, pattern = ".pgf", + full.names = TRUE)) [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.pgf" > (clf <- list.files(baseDir, pattern = ".clf", + full.names = TRUE)) [1] "/Users/naxerova/Documents/xxx/MoGene-2_0-st.clf" > (prob <- list.files(baseDir, pattern = ".probeset.csv", + full.names = TRUE)) [1] "/Users/naxerova/Documents/xxx/MoGene- 2_0-st-v1.na33.mm10.probeset.csv" > seed <- new("AffyGenePDInfoPkgSeed", + pgfFile = pgf, clfFile = clf, + probeFile = prob, author = "Kamila Naxerova", + email = "naxerova at fas.harvard.edu", + biocViews = "AnnotationData", + organism = "Mouse", species = "Mus Musculus") > makePdInfoPackage(seed, destDir = ".") ====================================================================== ====================================================================== === Building annotation package for Affymetrix Gene ST Array PGF.........: MoGene-2_0-st.pgf CLF.........: MoGene-2_0-st.clf Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv Transcript..: TheTranscriptFile Core MPS....: coreMps ====================================================================== ====================================================================== === Parsing file: MoGene-2_0-st.pgf... OK Parsing file: MoGene-2_0-st.clf... OK Creating initial table for probes... OK Creating dictionaries... OK Parsing file: MoGene-2_0-st-v1.na33.mm10.probeset.csv... OK Parsing file: coreMps... 
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open file 'coreMps': No such file or directory

On Mar 7, 2013, at 10:06 AM, "James W. MacDonald" <jmacdon at uw.edu> wrote:

> Hi Kamila,
>
> On 3/7/2013 9:54 AM, Naxerova, Kamila wrote:
>> Dear all,
>>
>> I am afraid I have to ask for help with the Mouse Gene 2.0 ST annotation package one more time. It looked like I created it successfully, but when I try to use it to read in cel files with the oligo package, I get a cryptic error message. Any suggestions would be much appreciated!
>
> You don't use the annotation package at this step. There are two packages that are used for the analysis of this chip type. The first is the pd.mogene.2.0.st.v1 package, which is used by oligo to map probes to probesets when doing the normalization/summarization step. This package will be automagically installed if you don't have it, so there is nothing to be done at the first step but
>
> abatch <- read.celfiles(list.celfiles())
> eset <- rma(abatch)
>
> This will give you the summarized and normalized data at the transcript level. You then will normally fit some model(s) using the modeling package of your choice, and then might want to output a set of significant genes, at which time you will use the mogene20sttranscriptcluster.db package to map probeset IDs to gene information.
>
> Best,
>
> Jim
>
>> > abatch <- read.celfiles(list.celfiles(), pkgname="mogene20sttranscriptcluster.db")
>> Platform design info loaded.
>> Reading in : xxx.CEL
>> Reading in : xxx.CEL
>> Reading in : xxx.CEL
>> [... more cel files listed]
>>
>> Error in function (classes, fdef, mtable) :
>> unable to find an inherited method for function 'kind' for signature '"ChipDb"'
>>
>> Thanks
>> Kamila
>>
>> On Mar 6, 2013, at 6:16 PM, "Naxerova, Kamila" <naxerova at fas.harvard.edu> wrote:
>>
>>> Dear Christian and Jim,
>>>
>>> many thanks to both of you for your explanations.
>>>
>>> Your hard work paid off, and I have finally understood everything and managed to build my annotation package!!!! I wrote a little script similar to what Jim was suggesting, namely picking the first RefSeq-like thing I came across. Jim called it "naive" -- but I think there is no downside to this approach, right? I have looked at various examples in the Affy file for a long time, and simply picking the first Refseq ID seems to be kosher.
>>>
>>> data <- read.csv("MoGene-transcript-noheader.csv", header=T, stringsAsFactors=F, sep=",")
>>> sdata <- data[,c(1,9)]
>>>
>>> returnRef=function(x){
>>>   refst <- strsplit(x,split="///")[[1]][grep("RefSeq",strsplit(x,split="///")[[1]])[1]]
>>>   refid <- gsub(" ","",strsplit(refst,split="//")[[1]][1])
>>>   return(refid)
>>> }
>>>
>>> sdata$refseqids <- sapply(sdata[,2],returnRef)
>>> fdata <- sdata[,-2]
>>> write.table(fdata,"AnnotBuild.txt", sep="\t",quote=F,row.names=F,col.names=F)
>>>
>>> library(AnnotationForge)
>>> library(mouse.db0)
>>> library(org.Mm.eg.db)
>>> makeDBPackage("MOUSECHIP_DB",
>>>   affy=F,
>>>   prefix="mogene20sttranscriptcluster",
>>>   fileName="AnnotBuild.txt",
>>>   outputDir = ".",
>>>   version="2.11.1",
>>>   baseMapType="refseq",
>>>   manufacturer = "Affymetrix",
>>>   chipName = "Mouse Gene 2.0 ST Array",
>>>   manufacturerUrl = "http://www.affymetrix.com",
>>>   author = "Kamila Naxerova",
>>>   maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
>>>
>>> > install.packages("mogene20sttranscriptcluster.db",repos=NULL, type="source")
>>> * installing *source* package 'mogene20sttranscriptcluster.db' ...
>>> ** R
>>> ** inst
>>> ** preparing package for lazy loading
>>> ** help
>>> *** installing help indices
>>> ** building package indices
>>> ** testing if installed package can be loaded
>>> *** arch - i386
>>> *** arch - x86_64
>>>
>>> * DONE (mogene20sttranscriptcluster.db)
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
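For anyone reusing the "pick the first RefSeq ID" idea quoted above, here is a minimal base-R sketch of the same step. It assumes the assignment column uses "///" between entries and "//" between fields, with the accession in field 1 and the source database in field 2; the function name first_refseq and the toy input string are made up for illustration and are not from the Affymetrix file. Unlike a grep over the whole entry, it only accepts an entry whose source field is exactly "RefSeq", so a description that merely mentions RefSeq would not be picked up.

```r
# Sketch only: field layout (accession // source // description ...) is an assumption.
# Returns the first accession whose source field is exactly "RefSeq", else NA.
first_refseq <- function(x) {
  if (is.na(x) || !nzchar(x)) return(NA_character_)
  # Split the string into individual assignments.
  entries <- strsplit(x, "///", fixed = TRUE)[[1]]
  for (entry in entries) {
    # Split one assignment into its fields and strip surrounding blanks.
    fields <- trimws(strsplit(entry, "//", fixed = TRUE)[[1]])
    if (length(fields) >= 2 && fields[2] == "RefSeq") return(fields[1])
  }
  NA_character_
}

# Toy input with invented accessions, for illustration only.
toy <- c(
  "ENSMUST_FAKE1 // ENSEMBL // some transcript /// NM_000001 // RefSeq // some mRNA",
  "---"
)
refseq_ids <- vapply(toy, first_refseq, character(1), USE.NAMES = FALSE)
```

Rows without any RefSeq entry come back as NA, so it is easy to count how many transcript clusters would be left unannotated before writing the two-column file.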

ADD REPLY • link 11.1 years ago Kamila Naxerova ▴ 100

Wow. This is really an education on the vast unwashed underbelly of BioC, no?

There is a file called MoGene-2_0-st.mps that came in the zip file you downloaded. Add

mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)

and then

coreMps = mps

when you create your AffyGenePDInfoPkgSeed. This file is used to distinguish between the probeset and transcript probe mappings.

Best,

Jim

ADD REPLY • link 11.1 years ago James W. MacDonald 65k

And I should mention that you need the transFile argument as well, which will be the /Users/naxerova/Documents/xxx/MoGene-2_0-st-v1.na33.mm10.transcript.csv file that you used to create the mogene20sttranscriptcluster.db file.

Best,

Jim
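The 'coreMps' error earlier in the thread came from a seed slot that was never given a real path, so it may help to check the input files before building. The following is a rough base-R sketch, not part of pdInfoBuilder: the helper names find_one and collect_pd_inputs are hypothetical, and the regular expressions assume the file naming used in this thread (.pgf, .clf, .probeset.csv, .transcript.csv, .mps). The list element names mirror the seed arguments mentioned above.

```r
# Hypothetical helper: return the single file in `dir` matching `pattern`, or stop.
find_one <- function(dir, pattern) {
  hits <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(hits) != 1L) {
    stop(sprintf("expected exactly one file matching '%s' in %s, found %d",
                 pattern, dir, length(hits)))
  }
  hits
}

# Hypothetical helper: gather the five inputs discussed in this thread.
# Patterns are anchored regular expressions, so "." is escaped and "$" pins the suffix.
collect_pd_inputs <- function(dir) {
  list(
    pgfFile   = find_one(dir, "\\.pgf$"),
    clfFile   = find_one(dir, "\\.clf$"),
    probeFile = find_one(dir, "\\.probeset\\.csv$"),
    transFile = find_one(dir, "\\.transcript\\.csv$"),
    coreMps   = find_one(dir, "\\.mps$")
  )
}
```

Calling collect_pd_inputs(baseDir) would stop with a readable message if, say, the .mps file were missing or if two transcript files sat in the same folder, rather than failing halfway through the package build.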

ADD REPLY • link 11.1 years ago James W. MacDonald 65k

Haha, I don't mind rubbing against that belly. It's like playing a game... trying to figure out how to get to the 40th level to face the "final enemy".

Your strategy worked, thank you! I am including all code up to RMA normalization (yay! I am there!) below, perhaps it will save somebody a few hours of work.

library(pdInfoBuilder)

baseDir <- "/Users/naxerova/Documents/xxx"
(pgf <- list.files(baseDir, pattern = ".pgf", full.names = TRUE))
(clf <- list.files(baseDir, pattern = ".clf", full.names = TRUE))
(prob <- list.files(baseDir, pattern = ".probeset.csv", full.names = TRUE))
mps <- list.files(baseDir, pattern = "mps$", full.names = TRUE)
trans <- list.files(baseDir, pattern = "transcript", full.names = TRUE)

seed <- new("AffyGenePDInfoPkgSeed",
            pgfFile = pgf, clfFile = clf, coreMps = mps, transFile = trans,
            probeFile = prob, author = "Kamila Naxerova",
            email = "naxerova at fas.harvard.edu",
            biocViews = "AnnotationData",
            organism = "Mouse", species = "Mus Musculus")
makePdInfoPackage(seed, destDir = ".")

## This is what the beginning of your output should look like
Building annotation package for Affymetrix Gene ST Array
PGF.........: MoGene-2_0-st.pgf
CLF.........: MoGene-2_0-st.clf
Probeset....: MoGene-2_0-st-v1.na33.mm10.probeset.csv
Transcript..: MoGene-2_0-st-v1.na33.mm10.transcript.csv
Core MPS....: MoGene-2_0-st.mps

install.packages("/Users/naxerova/pd.mogene.2.0.st/", repos=NULL, type="source")

> abatch <- read.celfiles(list.celfiles())
Loading required package: pd.mogene.2.0.st
Platform design info loaded.
Reading in : xxx.CEL
[etc.]
> eset <- rma(abatch)
Background correcting
Normalizing
Calculating Expression
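For completeness, the step that comes after RMA in this thread (mapping probeset IDs to gene information with the mogene20sttranscriptcluster.db package) might look roughly like the sketch below. This is an illustration under assumptions, not code from the thread: the wrapper name annotate_probesets is hypothetical, it assumes the annotation package was built and installed as described above and exposes "PROBEID", "SYMBOL" and "ENTREZID" through AnnotationDbi::select(), and it stops with a message if either package is missing.

```r
# Hypothetical wrapper: look up gene symbols and Entrez IDs for transcript cluster IDs.
# Assumes AnnotationDbi and the chip annotation package named in `pkg` are installed.
annotate_probesets <- function(ids, pkg = "mogene20sttranscriptcluster.db") {
  if (!requireNamespace("AnnotationDbi", quietly = TRUE) ||
      !requireNamespace(pkg, quietly = TRUE)) {
    stop("AnnotationDbi and ", pkg, " must be installed")
  }
  # The annotation object is exported under the same name as its package.
  db <- getExportedValue(pkg, pkg)
  AnnotationDbi::select(db,
                        keys = as.character(ids),
                        columns = c("SYMBOL", "ENTREZID"),
                        keytype = "PROBEID")
}
```

With the expression set from above, something like annotate_probesets(rownames(exprs(eset))) would be the intended call; IDs that map to several genes come back as several rows, so the result usually needs deduplicating before it is merged into a results table.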

ADD REPLY • link 11.1 years ago Kamila Naxerova ▴ 100

I have gene expression data for 100 mice at the beginning of a study, and outcome data (binary: yes or no) at several different time points post treatment. I want to build a predictive model of outcome over time, and since I am interested in selecting features that predict outcome over time, I wanted to use glmnet. I was wondering how to use the outcome data from multiple time points as my outcome variable. My question is: is it possible to fit a multiple (across time points), multivariate (several genes) regression to predict outcome using glmnet? Any help would be greatly appreciated.

Thanks,
Som.

ADD REPLY • link 11.1 years ago fire1976 wyoming ▴ 380

Hi Kamila et al.,

FYI: I'll ensure that the pd packages for the 2.0 versions of the chip are available for the next BioC release.

benilton
>>>>>> ** R >>>>>> ** inst >>>>>> ** preparing package for lazy loading >>>>>> ** help >>>>>> *** installing help indices >>>>>> ** building package indices >>>>>> ** testing if installed package can be loaded >>>>>> *** arch - i386 >>>>>> *** arch - x86_64 >>>>>> >>>>>> * DONE (mogene20sttranscriptcluster.db) >>>>>> >>>>>> _______________________________________________ >>>>>> Bioconductor mailing list >>>>>> Bioconductor at r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>>> _______________________________________________ >>>>> Bioconductor mailing list >>>>> Bioconductor at r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor >>>> -- >>>> James W. MacDonald, M.S. >>>> Biostatistician >>>> University of Washington >>>> Environmental and Occupational Health Sciences >>>> 4225 Roosevelt Way NE, # 100 >>>> Seattle WA 98105-6099 >>>> >> >> -- >> James W. MacDonald, M.S. >> Biostatistician >> University of Washington >> Environmental and Occupational Health Sciences >> 4225 Roosevelt Way NE, # 100 >> Seattle WA 98105-6099 >> > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
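
The "pick the first RefSeq ID" rule in Kamila's script can be written a little more defensively. What follows is only a sketch in base R, not the script used in the thread: the function name first_refseq and the example strings are made up for illustration (they imitate the "accession // source // description /// ..." layout that the script splits on), and entries without any RefSeq record return NA instead of a garbled value.

```r
# Return the first accession whose entry mentions "RefSeq", or NA if none does.
# Entries are separated by "///", fields within an entry by "//".
first_refseq <- function(x) {
  entries <- trimws(strsplit(x, "///", fixed = TRUE)[[1]])
  hit <- entries[grepl("RefSeq", entries, fixed = TRUE)][1]
  if (is.na(hit)) return(NA_character_)
  trimws(strsplit(hit, "//", fixed = TRUE)[[1]][1])
}

# Invented example values, for illustration only
assignments <- c(
  "ENSMUST00000027036 // ENSEMBL // toy description /// NM_008866 // RefSeq // toy description /// NM_000001 // RefSeq // another toy entry",
  "---"
)

ids <- vapply(assignments, first_refseq, character(1), USE.NAMES = FALSE)
print(ids)
```

Rows that come back as NA would have to be dropped or mapped some other way before writing the two-column file that makeDBPackage() reads with baseMapType="refseq".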

ADD REPLY • link 11.1 years ago Benilton Carvalho ★ 4.3k

Dear Kamila,

Here is some history, as far as I remember:

Originally (1999) Affymetrix sold the Hu6800 (HuGeneFL) array, which was an IVT array with most probes (oligos) located at the 3' end. At that time the collection of probes which represented one gene was called a probeset, and there was (and still is) only one annotation file for the IVT arrays. In addition there was one library file, the Hu6800.CDF file. Most IVT arrays still have CDF files, which map the probes to their (x,y) locations on the array.

With the introduction of the HuExon 1.0 ST array as the first exon array, Affymetrix changed a couple of things:
- they replaced the single CDF file with two files, the CLF file and the PGF file.
- in addition they now provide two annotation files, a transcript-cluster annotation file and a probeset annotation file.

A given gene (transcript cluster) typically consists of one or more exons, one exon consists of one or more probesets, and one probeset usually consists of 2-4 probes (oligos). The probeset annotation file lists all probesets with their "probeset_id", as well as the "transcript_cluster_id" and the "exon_id" for each probeset. In contrast, the transcript annotation file lists only the genes with their "transcript_cluster_id".

Some time later Affymetrix introduced a cheaper 'exon' array, the 'Whole Genome' HuGene 1.0 ST array, since many labs were mainly interested in the expression of the genes and not of the individual exons. This cheaper array typically has only one probe per exon. Originally Affymetrix sold this array as an array to measure gene expression, and there was only one annotation file. Later on they decided to treat the HuGene array as an exon array, too. Thus all WT arrays now come with both a probeset annotation file and a transcript annotation file. In principle you could now use WT arrays to measure the expression of single exons, but with the disadvantage that there is usually only one probe per exon.

To understand the distinction between probeset and transcript annotation files, please look at the annotation files themselves, and especially read the README files which Affymetrix usually provides in the annotation zip files.

I hope this history helps you to understand the difference between these two annotation files.

Best regards,
Christian

On 3/6/13 4:17 PM, Naxerova, Kamila wrote:
> Hi Jim,
>
> thank you for your helpful reply. I have a few follow-up questions.
>>
>> I should throw in my obligatory cautionary statement about summarizing
>> Gene ST data at the probeset (as compared to the transcript) level. If
>> you look at the number of probes/probeset, there are a huge number with
>> < 4 probes. So hypothetically you can do this, but I wouldn't.
>
> I am a bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me, without your direct link I would have never found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.
>
> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?
>
> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, I essentially commit to analyzing my data on the transcript level?
>>
>> library(AnnotationForge)
>> library(mouse.db0)
>> library(org.Mm.eg.db)
>> makeDBPackage("MOUSECHIP_DB",
>>     affy=TRUE,
>>     prefix="mogene20sttranscriptcluster",
>>     fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
>>     outputDir = ".",
>>     version="2.11.1",
>>     manufacturer = "Affymetrix",
>>     chipName = "Mouse Gene 2.0 ST Array",
>>     manufacturerUrl = "http://www.affymetrix.com",
>>     author = "Kamila Naxerova",
>>     maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
>
> Any thoughts on this error message?
>
> > makeDBPackage("MOUSECHIP_DB",
> +     affy=TRUE,
> +     prefix="mogene20sttranscriptcluster",
> +     fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> +     outputDir = ".",
> +     version="2.11.1",
> +     manufacturer = "Affymetrix",
> +     chipName = "Mouse Gene 2.0 ST Array",
> +     manufacturerUrl = "http://www.affymetrix.com",
> +     author = "Kamila Naxerova",
> +     maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
> Error in `[.data.frame`(csvFile, , GenBankIDName) :
>   undefined columns selected
>
> > sessionInfo()
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] org.Mm.eg.db_2.8.0 mouse.db0_2.8.0 AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.5 Biobase_2.18.0
> [9] BiocGenerics_0.4.0 BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.16.6 parallel_2.15.3 stats4_2.15.3 tools_2.15.3
>
> Many thanks!
> Kamila
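
To make the probeset versus transcript-cluster hierarchy concrete, here is a small base-R sketch on invented data. The column names probeset_id and transcript_cluster_id are the ones Christian describes for the probeset annotation file; the probe_count column and all of the values are assumptions made up for this example, so check them against the real CSV before relying on them.

```r
# Toy stand-in for a probeset annotation table (values are invented)
probeset_annot <- data.frame(
  probeset_id           = c(101, 102, 103, 104, 105),
  transcript_cluster_id = c(1, 1, 1, 2, 2),
  probe_count           = c(4, 2, 1, 3, 4)
)

# How many probesets are grouped into each transcript cluster?
n_probesets <- table(probeset_annot$transcript_cluster_id)

# How many probes does each transcript cluster get in total?
n_probes <- tapply(probeset_annot$probe_count,
                   probeset_annot$transcript_cluster_id, sum)

# Fraction of probesets with fewer than 4 probes (the case Jim warns about)
frac_small <- mean(probeset_annot$probe_count < 4)

print(n_probesets)
print(n_probes)
print(frac_small)
```

In this toy table most individual probesets have fewer than 4 probes, while each transcript cluster pools 7, which is the reason given in the thread for summarizing at the transcript level.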

ADD REPLY • link 11.1 years ago cstrato ★ 3.9k

Hi James and others,

Since I will soon get my first set of Human Gene ST v2.0 arrays, your remark about creating a BioC-compatible annotation library for such new arrays was of particular interest for me. I therefore tried to generate such a library based on the code you suggested. However, it doesn't work (see below): the CSV annotation file downloaded from Affymetrix has no column labeled "Representative Public ID", as is expected by the function makeBaseMaps. Note that the NetAffx annotation csv file for e.g. the (old) human HGU133plus2 arrays does indeed contain such a column.

Moreover, I also noticed that for the Gene ST arrays target sequences have been derived from multiple databases, including both NCBI and ENSEMBL. As a consequence the NetAffx annotation file for the Gene ST arrays contains multiple ID types in the column "gene_assignment" or "mrna_assignment". Would the library 'AnnotationForge' be able to handle this at all?

Thanks,
Guido

BTW: in case it is relevant: the NetAffx annotation file for the 3'-IVT array (HGU133plus2) was built according to "#%netaffx-annotation-tabular-format-version=1.0", whereas the file for the HuGene ST v2.0 was built according to "#%netaffx-annotation-tabular-format-version=1.1"

> makeDBPackage("HUMANCHIP_DB",
+     affy=TRUE,
+     prefix="hugene20sttranscriptcluster",
+     fileName="HuGene-2_0-st-v1.na33.hg19.probeset.csv",
+     outputDir = ".",
+     version="2.11.1",
+     manufacturer = "Affymetrix",
+     chipName = "Human Gene 2.0 ST Array",
+     manufacturerUrl = "http://www.affymetrix.com", author = "N.N.",
+     maintainer = "me at mail.com")
Error in `[.data.frame`(csvFile, , GenBankIDName) :
  undefined columns selected
>
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] human.db0_2.8.0 AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0
[4] RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.3
[7] Biobase_2.18.0 BiocGenerics_0.4.0

loaded via a namespace (and not attached):
[1] IRanges_1.16.6 parallel_2.15.2 stats4_2.15.2
>

-----Original Message-----
From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces@r-project.org] On Behalf Of James W. MacDonald
Sent: Tuesday, March 05, 2013 23:23
To: Naxerova, Kamila
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Hi Kamila,

On 3/5/2013 4:45 PM, Naxerova, Kamila wrote:
> Dear all,
>
> I am analyzing a set of Affymetrix Mouse Gene 2.0 ST arrays. I am quite familiar with 3'-biased chips, but this is my first time looking at data from WT arrays. I have a few general questions -- any advice would be appreciated to speed up my learning process.
>
> 1) I have already read on this mailing list that the good old affy package does not work well with WT arrays (can anybody point me to any literature on why that is?). So I have installed the oligo and xps packages -- what are the advantages/disadvantages for each? Any opinions on which one is the right "starter kit"?

The affy package was never intended to work with these arrays - it was designed specifically for the 3' biased arrays, which had pre-defined probesets, and which didn't share probes between probesets. In addition, the makecdfenv package is designed to work with the old style CDF packages, and Affy has never released a CDF for these new chips that they are willing to support in any meaningful way.

There were some changes made to the affy package in order to accommodate the fact that probes could be shared between probesets, and it is possible to use functions in affxparser to re-create conventional CDF packages using the newer pgf and clf files. So hypothetically you could still use the affy package (and hypothetically you could still use an Apple IIe for all your computing needs, but that's crazy, so let's move on).

I don't think you will find much difference between oligo and xps, other than the fact that xps requires the additional installation of ROOT. You might play around with both and see which suits you better.

I should throw in my obligatory cautionary statement about summarizing Gene ST data at the probeset (as compared to the transcript) level. If you look at the number of probes/probeset, there are a huge number with < 4 probes. So hypothetically you can do this, but I wouldn't.

> 2) I see with some dread that there seems to be no annotation package for the 2.0 array yet. I have never built my own... any quick bullet points on how I would go about doing that for a WT array?

No dread should be required. All you need to do is get the transcript-level annotation file from Affy (http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene/MoGene-2_0-st-v1.na33.mm10.transcript.csv.zip) and then the AnnotationForge, mouse.db0, and org.Mm.eg.db packages. Then something like

library(AnnotationForge)
library(mouse.db0)
library(org.Mm.eg.db)
makeDBPackage("MOUSECHIP_DB",
    affy=TRUE,
    prefix="mogene20sttranscriptcluster",
    fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
    outputDir = ".",
    version="2.11.1",
    manufacturer = "Affymetrix",
    chipName = "Mouse Gene 2.0 ST Array",
    manufacturerUrl = "http://www.affymetrix.com",
    author = "Kamila Naxerova",
    maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")

should do the trick. You can then install directly from within R by

install.packages("mogene20sttranscriptcluster.db", repos=NULL, type="source")

And see http://bioconductor.org/packages/2.11/bioc/vignettes/AnnotationForge/inst/doc/SQLForge.pdf

> 3) It seems that RMA is also used for normalization of WT arrays, so that part I am comfortable with. But are there any differences in preprocessing between 3' and WT arrays that I should watch out for?

Not really. I don't use xps, so cannot say for certain how you do things with that package, but with oligo it's a simple

abatch <- read.celfiles(list.celfiles())
eset <- rma(abatch)

to normalize and summarize at the transcript level. Note however that the annotation for the resulting ExpressionSet will be the pd.mogene.2.0.st.v1 package, and if you use annotation(eset) in any further calls to do gene annotation, it won't work out. You need to first do

annotation(eset) <- "mogene20sttranscriptcluster.db"

One further note: the intronic controls (especially) have an irritating habit of popping up in lists of differentially expressed genes. This is IMO likely due to mRNA that has not been fully processed to excise the introns, but regardless, these probesets tend to have no annotation at all, so are not useful without extra work to figure out what they are supposed to be measuring. My usual MO is to just summarily excise them after e.g., the eBayes() step of an analysis using limma. If you are interested, there is a function in the affycoretools package called getMainProbes() that will do this for you.

Best,

Jim

> Thanks so much!
> Kamila

--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
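
The "undefined columns selected" error reported in this post can be checked for before calling makeDBPackage(). The snippet below is a hedged sketch in base R: it writes a tiny made-up file that imitates a NetAffx CSV (a "#%" header line followed by a quoted header row), reads it back, and tests for the "Representative Public ID" column that, according to Guido's report, the affy=TRUE code path expects. The file contents and the variable names are illustrative assumptions only.

```r
# Write a tiny stand-in for a NetAffx transcript CSV (contents are invented)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "#%netaffx-annotation-tabular-format-version=1.1",
  '"transcript_cluster_id","probeset_id","seqname","mrna_assignment"',
  '"17210850","17210850","chr1","NM_008866 // RefSeq // toy description"'
), tmp)

# Lines starting with '#' are treated as comments and skipped;
# check.names = FALSE keeps column names exactly as written in the file
annot <- read.csv(tmp, comment.char = "#", stringsAsFactors = FALSE,
                  check.names = FALSE)

# Column reported in the thread as required when affy = TRUE
required <- "Representative Public ID"
missing_cols <- setdiff(required, names(annot))

if (length(missing_cols) > 0) {
  message("Missing column(s): ", paste(missing_cols, collapse = ", "))
}
```

When the column is missing, as it is for the Gene ST files discussed here, the workaround used earlier in the thread is to build a two-column ID file and call makeDBPackage() with affy=FALSE and an explicit baseMapType.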

ADD REPLY • link 11.1 years ago Guido Hooiveld ★ 3.9k

Just an additional thought: would it not be most straightforward and unambiguous to use the info on genome location available in the NetAffx file to create the annotation file? If so, what would be the best approach of doing so? Just using one of the TxDb's for this? I don't have any experience with this.

G

First relevant line from HuGene ST 2.0 file:

probeset_id  seqname  strand  start  stop
16657437     chr1     +       12200  12224

-----Original Message-----
From: bioconductor-bounces@r-project.org [mailto:bioconductor-bounces@r-project.org] On Behalf Of Hooiveld, Guido
Sent: Wednesday, March 06, 2013 16:29
To: 'James W. MacDonald'; Naxerova, Kamila
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Hi James and others,

Since I will soon get my first set of Human Gene ST v2.0 arrays, your remark about creating a BioC-compatible annotation library for such new arrays was of particular interest for me. I therefore tried to generate such library based on the code you suggested. However, it doesn't work (see below): the CSV annotation file downloaded from Affymetrix has no column labeled "Representative Public ID", as is expected by the function makeBaseMaps. Note that the NetAffx annotation csv file for e.g. the (old) human HGU133plus2 arrays does indeed contain such column. Moreover, I also noticed that for the Gene ST arrays target sequences have been derived from multiple databases, including both NCBI and ENSEMBL. As a consequence the NetAffx annotation file for the Gene ST arrays contain in the column "gene_assignment" or "mrna_assignment" multiple ID types. Would the library 'annotationForge' be able to handle this at all?
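
The location-based idea raised in this post can be tried out on toy data before touching a real TxDb. The sketch below assumes the Bioconductor package GenomicRanges is installed; every coordinate, probeset name, and gene ID in it is invented, and the plain GRanges object called genes merely stands in for whatever gene or transcript ranges a TxDb would supply.

```r
library(GenomicRanges)

# Invented probeset locations (seqname, start, stop, strand as in the NetAffx file)
probes <- GRanges(
  seqnames = c("chr1", "chr1", "chr2"),
  ranges   = IRanges(start = c(100, 5000, 300), end = c(200, 5100, 400)),
  strand   = c("+", "+", "-"),
  probeset = c("p1", "p2", "p3")
)

# Invented gene models standing in for ranges taken from a TxDb
genes <- GRanges(
  seqnames = c("chr1", "chr2"),
  ranges   = IRanges(start = c(50, 250), end = c(1000, 900)),
  strand   = c("+", "+"),
  gene_id  = c("g1", "g2")
)

# For each probeset, index of the first overlapping gene (NA if none).
# findOverlaps() is strand-aware by default, so p3 ("-") does not match g2 ("+").
hit <- findOverlaps(probes, genes, select = "first")
mcols(probes)$gene_id <- mcols(genes)$gene_id[hit]

print(as.data.frame(mcols(probes)))
```

Two design points are worth keeping in mind with this approach: select = "first" silently picks one gene when several overlap, and strand-aware matching leaves antisense probesets unannotated unless ignore.strand = TRUE is passed.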

ADD REPLY • link 11.1 years ago Guido Hooiveld ★ 3.9k

Hi Guido,

It wouldn't be difficult. Starting from something like

awk -F, '{if($1 !~ /#|[[:alpha:]]/) print $0}' MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,3-6 > tmp.csv

and then in R

library(Mus.musculus)
dat <- read.csv("tmp.csv", header = FALSE, stringsAsFactors = FALSE)
dat <- dat[dat[,2] != "---",]
gr <- GRanges(dat[,2], IRanges(start = dat[,4], end = dat[,5]), strand=dat[,3], probeset = dat[,1])
## 'tx' was not defined in the original post; it is assumed here to be
## transcripts grouped by Entrez Gene ID
tx <- transcriptsBy(Mus.musculus, by = "gene")
mcols(gr)$egid <- names(tx)[findOverlaps(gr, tx, select="first")]

> head(as.data.frame(mcols(gr)))
  probeset   egid
1 17210850   <NA>
2 17210852   <NA>
3 17210855  18777
4 17210869  21399
5 17210883   <NA>
6 17210887 108664

Best,

Jim

On 3/6/2013 10:55 AM, Hooiveld, Guido wrote:
> Just an additional thought: would it not be most straight-forward and unambiguous to use the info on genome location available in the NetAffx file to create the annotation file? If so, what would be the best approach of doing so? Just using one of the TxDb's for this? I don't have any experience with this.
> G
>
> First relevant line from HuGen ST 2.0 file
> probeset_id  seqname  strand  start  stop
> 16657437     chr1     +       12200  12224
>
> -----Original Message-----
> From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of Hooiveld, Guido
> Sent: Wednesday, March 06, 2013 16:29
> To: 'James W. MacDonald'; Naxerova, Kamila
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays
>
> Hi James and others,
> Since I will soon get my first set of Human Gene ST v2.0 arrays, your remark about creating a BioC-compatible annotation library for such new arrays was of particular interest for me. I therefore tried to generate such library based on the code you suggested.
> --
> James W. MacDonald, M.S.
> Biostatistician > University of Washington > Environmental and Occupational Health Sciences > 4225 Roosevelt Way NE, # 100 > Seattle WA 98105-6099 > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > > > > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
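The "undefined columns selected" error quoted above comes from a column that the annotation CSV does not contain. A quick way to see this before building a package is to inspect the CSV header. The following is only a minimal base-R sketch: the helper name `missing_columns` and the toy CSV contents are made up for illustration, and it assumes the file starts with "#%" metadata lines followed by a quoted, comma-separated header, as the format-version line in the quoted message suggests.

```r
# Toy stand-in for a NetAffx annotation CSV. The values are invented;
# only the layout ("#%" metadata lines, then a quoted header) is assumed.
csv_text <- c(
  "#%netaffx-annotation-tabular-format-version=1.1",
  "\"transcript_cluster_id\",\"probeset_id\",\"gene_assignment\",\"mrna_assignment\"",
  "\"id1\",\"id1\",\"geneA\",\"mrnaA\""
)

# Hypothetical helper (not part of AnnotationForge): return the expected
# column names that are absent from the CSV header.
missing_columns <- function(lines, expected) {
  # comment.char = "#" makes read.csv skip the "#%" metadata lines;
  # check.names = FALSE keeps names with spaces exactly as written.
  hdr <- read.csv(text = lines, comment.char = "#", nrows = 1,
                  check.names = FALSE, stringsAsFactors = FALSE)
  setdiff(expected, colnames(hdr))
}

# Which of the columns we hope to find are not there?
missing_columns(csv_text, c("Representative Public ID", "gene_assignment"))
```

For a real file, the first lines could be read with `readLines()` and passed as `lines`; any column reported here would need to be checked against what the package-building function actually expects.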

ADD REPLY • link 11.1 years ago James W. MacDonald 65k

cstrato ★ 3.9k

@cstrato-908

Last seen 5.5 years ago

Austria

Dear Kamila,

The reason why affy does not work well with WT arrays is that these arrays do not have PM/MM pairs, in contrast to the ivt arrays for which affy was created.

As the author of package xps I can only say that xps does use the original library files and annotation files from Affymetrix. The example script xps/examples/script4schemes.R will show you how to create a scheme file for e.g. MoGene-2_0-ST arrays.

However, if you decide to use xps I must mention that there is currently a problem with the Affymetrix annotation files for the 2.0 and 2.1 arrays, see: https://www.stat.math.ethz.ch/pipermail/bioconductor/2012-August/047755.html

So you either need to adapt the Perl script mentioned there to correct the annotation files, or I can send you the modified Perl script for MoGene-2_0. If you prefer, I could also send you the finished 'mogene20st.root' file; however, it has a size of 35 MB, so you would need to tell me how to send it to you.

BTW, RMA is fine for both ivt arrays and WT arrays. However, if you want to get present calls (P/M/A), then for ivt arrays you need to call mas5.call(), while for WT arrays you need to call dabg.call().

Best regards,
Christian
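To make the P/M/A remark above a little more concrete: both kinds of detection call end with the same step, turning a per-probeset detection p-value into a present, marginal, or absent call. What differs is where the p-value comes from, which is why the PM/MM-based method is not available for WT arrays. The sketch below does not use xps at all; `pma_calls` is a hypothetical helper, and the two cutoffs are illustrative assumptions rather than values taken from this thread.

```r
# Hypothetical helper: convert detection p-values into P/M/A calls.
# alpha1 and alpha2 are illustrative cutoffs (an assumption, not from the thread).
pma_calls <- function(pval, alpha1 = 0.04, alpha2 = 0.06) {
  stopifnot(alpha1 < alpha2)
  # right = FALSE gives the intervals [-Inf, alpha1), [alpha1, alpha2), [alpha2, Inf)
  calls <- cut(pval, breaks = c(-Inf, alpha1, alpha2, Inf),
               labels = c("P", "M", "A"), right = FALSE)
  as.character(calls)
}

# Three made-up p-values, one falling in each interval
pma_calls(c(0.001, 0.05, 0.5))
```

In practice the p-values and thresholds would come from the call function of whichever package is used, so this only shows the final thresholding logic.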

ADD COMMENT • link 11.1 years ago cstrato ★ 3.9k
