How to go from Affymetrix probeset IDs to Ensembl transcript IDs
@peter-robinson-529
Hi all,

sorry if this is a dumb question, but rtfm has not helped so far.

I would like to get the Ensembl transcript IDs that correspond to Affymetrix probeset IDs using biomaRt. As a test case, I am using the ALL data set from Bioconductor. My code:

    library("biomaRt")
    library("ALL")
    data("ALL") ## Note this dataset uses hgu95av2 Affymetrix chip

    dat <- exprs(ALL)
    affyids = rownames(dat)

    ## get mapping data from Ensembl via biomaRt
    ensembl <- useMart("ensembl")
    ensembl = useDataset("hsapiens_gene_ensembl",mart=ensembl)

    mapping <- getBM(attributes = c("affy_hg_u95av2", "ensembl_transcript_id"),
                     filters = "affy_hg_u95av2",
                     values = affyids, mart = ensembl)

Here is where the problem is. The "mapping" seems to be a random collection of transcript IDs.

    > which(mapping=="32337_at")
     [1]     8    46   139   155   203   267   320   327  7385  8701 18769 20533
    [13] 23728 23969 23972 24241 24242 24243 24244 25236 26157 26204 26218 26231
    [25] 26240 26321 26404
    > mapping[which(mapping=="32337_at"),]
          affy_hg_u95av2 ensembl_transcript_id
    8           32337_at       ENST00000404812
    46          32337_at       ENST00000393574
    139         32337_at       ENST00000403842
    155         32337_at       ENST00000397467
    203         32337_at       ENST00000407990
    267         32337_at       ENST00000399007
    320         32337_at       ENST00000404500
    327         32337_at       ENST00000399891
    7385        32337_at       ENST00000396599
    8701        32337_at       ENST00000403916
    18769       32337_at       ENST00000334328
    20533       32337_at       ENST00000377603
    23728       32337_at       ENST00000401418
    23969       32337_at       ENST00000046640
    23972       32337_at       ENST00000381870
    24241       32337_at       ENST00000326092
    24242       32337_at       ENST00000319826
    24243       32337_at       ENST00000272274
    24244       32337_at       ENST00000311549
    25236       32337_at       ENST00000404512
    26157       32337_at       ENST00000404609
    26204       32337_at       ENST00000402713
    26218       32337_at       ENST00000401464
    26231       32337_at       ENST00000407389
    26240       32337_at       ENST00000406161
    26321       32337_at       ENST00000402658
    26404       32337_at       ENST00000401595

At the end of the day, I would like to write the data matrix as a CSV file for further analysis, whereby the affy ID is replaced by an Ensembl ID.

Thanks,
Peter
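A side note on the lookup itself: `which(mapping == "32337_at")` compares every cell of the data frame and returns positions in the resulting logical matrix, which coincide with row numbers only because all matches fall in the first column. A minimal sketch of a more direct lookup follows. `toy_mapping` is a hypothetical stand-in for the real `getBM()` result; its first three rows are copied from the output above and the last row is invented for illustration.

```r
# Toy stand-in for the data frame returned by getBM().
# Rows 1-3 come from the output above; row 4 is an invented example.
toy_mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "32337_at", "32337_at", "fake_probe_at"),
  ensembl_transcript_id = c("ENST00000404812", "ENST00000393574",
                            "ENST00000403842", "ENST_FAKE_1"),
  stringsAsFactors = FALSE
)

# Select rows by testing the probeset column only
hits <- toy_mapping[toy_mapping$affy_hg_u95av2 == "32337_at", ]
print(hits)
```

Testing a single named column keeps the lookup correct even if a later query adds attribute columns to the result.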
hgu95av2 affy biomaRt
@sean-davis-490
On Thu, Apr 9, 2009 at 5:40 PM, Peter Robinson <peter.robinson@t-online.de> wrote:

> Here is where the problem is. The "mapping" seems to be a random collection
> of transcript IDs.
> [...]
> At the end of the day, I would like to write the data matrix as a CSV file
> for further analysis, whereby the affy ID is replaced by an Ensembl ID.

Hi, Peter.

Ensembl does their own mapping of affy probes, and the above is an example of what can happen: a probeset can map to multiple transcripts. In fact, there is no reason to think that a probeset should, in general, map to only one transcript. All that said, I think you have used biomaRt correctly and are faithfully reproducing the results available from Ensembl.

If you want another alternative based more closely on what affy supplies, try the following:

    library(hgu95av2.db)
    dat = toTable(hgu95av2ENSEMBL)
    dat[dat[,1]=="32337_at",]
         probe_id      ensembl_id
    5562 32337_at ENSG00000122026
    dim(dat)
    [1] 12316     2

Hope that helps,
Sean

    sessionInfo()
    R version 2.9.0 Under development (unstable) (2009-02-21 r47969)
    x86_64-unknown-linux-gnu

    locale:
    LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] hgu95av2.db_2.2.11   RSQLite_0.7-1        DBI_0.2-4
    [4] AnnotationDbi_1.5.23 Biobase_2.3.11
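To see how common the one-to-many case is before choosing between the two sources, it can help to count distinct transcripts per probeset. The sketch below uses base R on `toy_mapping`, a hypothetical stand-in for the `getBM()` result (only the 32337_at rows come from this thread; the other rows are invented). Note also that the `hgu95av2ENSEMBL` lookup above returns an Ensembl gene ID (ENSG...), not a transcript ID.

```r
# Toy stand-in for the getBM() result; the "fake_*" rows are invented.
toy_mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "32337_at", "32337_at",
                     "fake_a_at", "fake_b_at"),
  ensembl_transcript_id = c("ENST00000404812", "ENST00000393574",
                            "ENST00000403842", "ENST_FAKE_1", "ENST_FAKE_2"),
  stringsAsFactors = FALSE
)

# Number of distinct transcripts reported for each probeset
counts <- tapply(toy_mapping$ensembl_transcript_id,
                 toy_mapping$affy_hg_u95av2,
                 function(x) length(unique(x)))
print(counts)

# How many probesets have 1, 2, 3, ... transcripts
print(table(counts))

# Probesets that would need special handling
multi <- names(counts)[counts > 1]
print(multi)
```

Run on the real mapping, the `table(counts)` summary would show at a glance how many probesets are ambiguous.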
@steve-lianoglou-2771
Hi Peter,

On Apr 9, 2009, at 5:40 PM, Peter Robinson wrote:

> Here is where the problem is. The "mapping" seems to be a random
> collection of transcript IDs.

Your query is right, so your results are not random. You can double-check by trying the small example in the ?getBM help.

Anyway, that probe looks like a weird one. Even affy maps it to several locations. See:

https://www.affymetrix.com/analysis/netaffx/fullrecord.affx?pk=HG-U95AV2%3A32337_AT#a_ensembl

You will need an Affy NetAffx account to see that.

Some relevant stats from that page: the probe maps to 6 different Ensembl IDs. It even aligns to two different places:

    chr13:26725913-26728689(+)
    chr10:122104175-122104685(-)

You'll probably find this for many probes, so you'll need some policy to deal with that.

Hope that helps,
-steve

--
Steve Lianoglou
Graduate Student: Physiology, Biophysics and Systems Biology
Weill Medical College of Cornell University
http://cbio.mskcc.org/~lianos
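One possible policy, sketched here under stated assumptions rather than as a recommendation from anyone in this thread: keep only probesets that map to exactly one transcript, relabel the expression matrix rows with those transcript IDs, and write the result to CSV. `toy_exprs` and `toy_mapping` are hypothetical stand-ins for `exprs(ALL)` and the `getBM()` result; the expression values and all "fake" IDs are invented.

```r
# Toy expression matrix: probesets in rows, samples in columns (invented values)
toy_exprs <- matrix(c(7.1, 5.3, 6.8, 7.4, 5.0, 6.9),
                    nrow = 3,
                    dimnames = list(c("32337_at", "fake_a_at", "fake_b_at"),
                                    c("sample1", "sample2")))

# Toy probeset-to-transcript mapping; 32337_at is deliberately ambiguous
toy_mapping <- data.frame(
  affy_hg_u95av2 = c("32337_at", "32337_at", "fake_a_at", "fake_b_at"),
  ensembl_transcript_id = c("ENST00000404812", "ENST00000393574",
                            "ENST_FAKE_1", "ENST_FAKE_2"),
  stringsAsFactors = FALSE
)

# Drop exact duplicate rows, then find probesets with a single transcript
toy_mapping <- unique(toy_mapping)
n_per_probe <- table(toy_mapping$affy_hg_u95av2)
unique_probes <- names(n_per_probe)[n_per_probe == 1]
one_to_one <- toy_mapping[toy_mapping$affy_hg_u95av2 %in% unique_probes, ]

# Keep only unambiguous probesets, in the matrix's original row order
keep <- intersect(rownames(toy_exprs), one_to_one$affy_hg_u95av2)
relabelled <- toy_exprs[keep, , drop = FALSE]
rownames(relabelled) <-
  one_to_one$ensembl_transcript_id[match(keep, one_to_one$affy_hg_u95av2)]

# Write to a temporary file; replace with a real path in practice
out_file <- tempfile(fileext = ".csv")
write.csv(relabelled, file = out_file)
print(relabelled)
```

This policy discards ambiguous probesets such as 32337_at entirely. Alternatives include collapsing all transcript IDs for a probeset into one delimited string, or mapping to gene IDs instead. With real data, two probesets may also map to the same transcript, which would produce duplicate row names in the output.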