Error in Data.frame for mogene10stv1cdf and mogene10sttranscriptcluster.db
1
0
Entering edit mode
Joel Ma ▴ 50
@joel-ma-6558
Last seen 9.6 years ago
Hi I was trying to make a data.frame of geneIDs, symbols and names for my eset, but the cdf and .db did not match up. > library(mogene10sttranscriptcluster.db) > library(mogene10stv1cdf) > geneIDs <- ls(mogene10stv1cdf) > geneSymbols <- unlist(as.list(mogene10sttranscriptclusterSYMBOL)) > geneNames <- unlist(as.list(mogene10sttranscriptclusterGENENAME)) > genelist <- data.frame(GeneID=geneIDs,GeneSymbol=geneSymbols,GeneNam e=geneNames,"",sep="") Error in data.frame(GeneID = geneIDs, GeneSymbol = geneSymbols, GeneName = geneNames, : arguments imply differing number of rows: 34760, 35556, 1 > summary(geneIDs) Length Class Mode 34760 character character > summary(geneNames) Length Class Mode 35556 character character What should I do? Cheers Joel [[alternative HTML version deleted]]
cdf cdf • 1.3k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 34 minutes ago
United States
Hi Joel, On 6/28/2014 7:06 AM, Joel Ma wrote: > Hi > > I was trying to make a data.frame of geneIDs, symbols and names for my eset, but the cdf and .db did not match up. > >> library(mogene10sttranscriptcluster.db) >> library(mogene10stv1cdf) >> geneIDs <- ls(mogene10stv1cdf) >> geneSymbols <- unlist(as.list(mogene10sttranscriptclusterSYMBOL)) >> geneNames <- unlist(as.list(mogene10sttranscriptclusterGENENAME)) >> genelist <- data.frame(GeneID=geneIDs,GeneSymbol=geneSymbols,GeneNa me=geneNames,"",sep="") > > Error in data.frame(GeneID = geneIDs, GeneSymbol = geneSymbols, GeneName = geneNames, : > arguments imply differing number of rows: 34760, 35556, 1 > >> summary(geneIDs) > Length Class Mode > 34760 character character >> summary(geneNames) > Length Class Mode > 35556 character character > > What should I do? First, please note that there is no reason to expect that all of the probesets for this (or any) array will actually measure a known gene, with a Gene ID, symbol, and gene name. There are many different controls on this array (which, being controls, should not be something that you would want to annotate, even if you could). Nor should you expect that all of the things being measured by this array will have a symbol or a gene name. Affy has put all manner of non-translated content on the Gene ST arrays, which you can find out by looking at their marketing materials. Plus, you are using very old methods of accessing data from the annotation package, which I would not recommend using. The methods for accessing data were updated for a good reason. In addition, I wouldn't personally use the affy/makecdfenv pipeline for analyzing these data. The oligo and xps packages are much better suited. I would do something like this: library(oligo) eset <- rma(read.celfiles(list.celfiles())) <shameless plug=""> library(affycoretools) eset <- getMainProbes(eset) </shameless> annot <- select(hugene10sttranscriptcluster.db, featureNames(eset), c("ENTREZID","SYMBOL","GENENAME")) At this point you may get a message informing you that there have been some one-to-many mappings (I haven't analyzed hugene10st arrays for a while, so don't remember if this is true or not). This means that some of the probesets may measure transcript abundance for things that are annotated with more than one Gene ID, symbol or gene name. This would have been masked the way you did things before (you would get an NA for all three instead). How you deal with this duplication is up to you. I tend to be naive about it and just take the first of the dups: annot <- annot[!duplicated(annot),] Others have argued in the past that this is suboptimal, and that you should instead randomly choose the annotation to use. Alternatively you could do something like annotlst <- split(annot[,-1], annot[,1]) annot <- sapply(annot, function(x) if(nrow(x) == 1) return(x) else return(apply(x, 2, paste, collapse = " | ")) Which will cause duplicated IDs or symbols to be concatenated with a pipe symbol. The above is untested; sometimes sapply will return a transposed matrix, so you might have to fiddle with the code to get what you want. Best, Jim > > Cheers > > Joel > > [[alternative HTML version deleted]] > > _______________________________________________ > Bioconductor mailing list > Bioconductor at r-project.org > https://stat.ethz.ch/mailman/listinfo/bioconductor > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor > -- James W. MacDonald, M.S. Biostatistician University of Washington Environmental and Occupational Health Sciences 4225 Roosevelt Way NE, # 100 Seattle WA 98105-6099
ADD COMMENT

Login before adding your answer.

Traffic: 875 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6