creating GSEA files using biomart
Juliet Hannah ▴ 360
@juliet-hannah-4531
Last seen 4.9 years ago
United States
All,

I am trying to create the GSEA chip file. This example uses Affy data, and the chip file is already available. I'm doing this as an exercise in preparation for other platforms.

The chip file should look like:

    Probe Set ID    Gene Symbol    Gene Title
    244901_at       ORF25          hypothetical protein
    244902_at       NAD4L          NADH dehydrogenase subunit 4L
    244912_at       CCB382         cytochrome c biogenesis orf382
    244919_at       CCB203         cytochrome c biogenesis orf203
    244925_at       NAD7           NADH dehydrogenase subunit 7

How can I obtain the third column from biomart? I tried searching the attributes, but couldn't find the right name. Is it a matter of trial and error to find the correct attribute, or are there systematic ways to find it? Here is what I have so far:

    library("biomaRt")
    probeSets <- c("219666_at", "220547_s_at", "218034_at")

    ensembl = useMart("ensembl")
    ensembl = useDataset("hsapiens_gene_ensembl", mart = ensembl)

    idens <- getBM(attributes = c("affy_hg_u133a", "hgnc_symbol"),
                   filters = "affy_hg_u133a", values = probeSets, mart = ensembl)

Also, does anyone have any suggestions regarding how to handle the duplicates (seen in this example) with respect to GSEA?

Thanks,

Juliet Hannah
affy biomaRt • 1.5k views
@steffen-durinck-4465
Last seen 9.6 years ago
Hi Juliet,

The third attribute you're looking for is 'description':

    idens <- getBM(attributes = c("affy_hg_u133a", "hgnc_symbol", "description"),
                   filters = "affy_hg_u133a", values = probeSets, mart = ensembl)

Gives:

      affy_hg_u133a hgnc_symbol description
    1 219666_at     MS4A6A      membrane-spanning 4-domains, subfamily A, member 6A [Source:HGNC Symbol;Acc:13375]
    2 220547_s_at   FAM35B      family with sequence similarity 35, member B [Source:HGNC Symbol;Acc:31425]
    3 218034_at     FIS1        fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) [Source:HGNC Symbol;Acc:21689]
    4 220547_s_at   FAM35B2     family with sequence similarity 35, member B2 (pseudogene) [Source:HGNC Symbol;Acc:34038]
    5 220547_s_at   FAM35A      family with sequence similarity 35, member A [Source:HGNC Symbol;Acc:28773]

There is no systematic way to figure out which attribute name you need to use; all you have is the attribute name and a description of the attribute. The more you get used to looking at those, the easier it gets to figure out which one you need. Once you know the attributes you need, you'll often be using a similar set of attributes most of the time.

It is interesting to see in your example that one probeset maps to three different but closely related genes. In the past I thought Ensembl would remove such ambiguous mappers. I think the best thing to do in this case is to remove all probes that map to multiple genes, as there is no way to tell which gene you'll be measuring. I'll report this example to the Ensembl team, as they used to do this for us.

Cheers,
Steffen

In reply to Juliet Hannah, Thu, Sep 13, 2012 at 8:29 AM.
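The suggestion to drop probes that map to more than one gene can be sketched in base R. This is only an illustration, not necessarily how Steffen or the GSEA authors would do it: the data frame below is typed in by hand from the output shown in the answer so that it runs without a BioMart connection, and it assumes that the chip file is tab-delimited and that the trailing "[Source:...]" note is not wanted in the Gene Title column.

```r
# Hand-copied from the getBM() output in the answer (no network needed).
idens <- data.frame(
  affy_hg_u133a = c("219666_at", "220547_s_at", "218034_at",
                    "220547_s_at", "220547_s_at"),
  hgnc_symbol = c("MS4A6A", "FAM35B", "FIS1", "FAM35B2", "FAM35A"),
  description = c(
    "membrane-spanning 4-domains, subfamily A, member 6A [Source:HGNC Symbol;Acc:13375]",
    "family with sequence similarity 35, member B [Source:HGNC Symbol;Acc:31425]",
    "fission 1 (mitochondrial outer membrane) homolog (S. cerevisiae) [Source:HGNC Symbol;Acc:21689]",
    "family with sequence similarity 35, member B2 (pseudogene) [Source:HGNC Symbol;Acc:34038]",
    "family with sequence similarity 35, member A [Source:HGNC Symbol;Acc:28773]"
  ),
  stringsAsFactors = FALSE
)

# Count the distinct gene symbols reported for each probe set.
genesPerProbe <- tapply(idens$hgnc_symbol, idens$affy_hg_u133a,
                        function(x) length(unique(x)))

# Keep only probe sets that map to exactly one gene;
# 220547_s_at maps to three symbols and is dropped.
uniqueProbes <- names(genesPerProbe)[genesPerProbe == 1]
chip <- idens[idens$affy_hg_u133a %in% uniqueProbes, ]

# Optional (an assumption about the desired Gene Title text):
# strip the trailing " [Source:...]" annotation.
chip$description <- sub("[[:space:]]*\\[Source:.*\\]$", "", chip$description)

# Column headers as in the chip file example from the question.
colnames(chip) <- c("Probe Set ID", "Gene Symbol", "Gene Title")

# Write a tab-delimited file (assumed layout); tempfile() is used here
# only so the sketch does not touch the working directory.
chipFile <- tempfile(fileext = ".chip")
write.table(chip, file = chipFile, sep = "\t", quote = FALSE,
            row.names = FALSE)
```

With real data, `idens` would come straight from `getBM()`. Probes that come back with an empty `hgnc_symbol` may need separate handling, which this sketch does not attempt.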
Thanks Steffen for the helpful answers. "description", how embarrassing!

In reply to Steffen Durinck, Thu, Sep 13, 2012 at 11:42 AM.
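Although there is no fully systematic route to the right attribute name, a keyword search over the attribute table can narrow things down. The sketch below assumes the table has the `name` and `description` columns that `listAttributes()` returns. `findAttributes` is a hypothetical helper written for this example, not part of biomaRt, and the small `attrs` table is a made-up stand-in so the code runs offline.

```r
# Hypothetical helper: search a listAttributes()-style table for a keyword,
# case-insensitively, in either the name or the description column.
findAttributes <- function(attrs, pattern) {
  hit <- grepl(pattern, attrs$name, ignore.case = TRUE) |
    grepl(pattern, attrs$description, ignore.case = TRUE)
  attrs[hit, c("name", "description"), drop = FALSE]
}

# Made-up stand-in for a few rows of listAttributes(ensembl); the real
# table is much longer and its description text may differ.
attrs <- data.frame(
  name = c("ensembl_gene_id", "hgnc_symbol", "description", "affy_hg_u133a"),
  description = c("Ensembl Gene ID", "HGNC symbol", "Description",
                  "Affy HG U133A"),
  stringsAsFactors = FALSE
)

# Only the "description" attribute matches the keyword "descr".
hits <- findAttributes(attrs, "descr")

# With a live BioMart connection the same helper could be applied as:
# library("biomaRt")
# ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")
# findAttributes(listAttributes(ensembl), "descr")
```

More recent biomaRt releases also appear to ship a `searchAttributes()` function that does a similar pattern search; whether it is available depends on the installed version.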