GO term enrichment analysis over whole genome, copy number aberration investigation


Nathan Harmston ▴ 100

@nathan-harmston-2904

Last seen 9.6 years ago

Hi,

I currently have a list of HUGO gene IDs for genes in areas of gain over a whole chromosome, and would like to perform GO enrichment analysis on them. I have two problems:

1. Until now I have been defining my gene universe based on Affymetrix arrays, but now I am working over the whole genome. gene_universe = getBM(c("entrezgene"), mart = ensembl) leaves me with a gene universe of 20275 gene IDs. Is this right?

2. How do I move from my HUGO identifiers to Entrez Gene IDs? I can do this using biomaRt:
test = getBM(c("entrezgene"), filters = "hgnc_symbol", values = stGained, mart = ensembl)
However, the result is not the same length as my list of HUGO gene identifiers (in my case 30 are missing). Why is this? Is it an annotation bug that can't be fixed, or is it the way I'm doing it?

Does Bioconductor have the GO information for all genes in the genome, and not just those in the annotation files for the Affymetrix arrays?

Finally, what are the statistical implications of performing GO enrichment (I'm using a conditional test) over a whole genome? Would it be better to run the gene set enrichment analysis on each chromosome (I don't think so)? I'm trying to find evidence that genes relating to certain functions are gained over the whole chromosome (cancer study). I've run a test analysis and found some things which make sense.

Many thanks in advance,
Nathan

Annotation GO • 1.1k views

updated 15.6 years ago by Marc Carlson ★ 7.2k • written 15.6 years ago by Nathan Harmston ▴ 100


Marc Carlson ★ 7.2k

@marc-carlson-2264

Last seen 7.7 years ago

United States

Hi Nathan,

Bioconductor does have packages that try to address the annotations for an entire organism based on Entrez Gene IDs instead of Affymetrix (or other) IDs. These are called the organism packages. They are named in a format like this example: "org.Hs.eg.db", which is the "organism" package for "Homo sapiens" based on "Entrez Gene" IDs.

If you think this will help you, please try it out.

Marc
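For readers following along, here is a minimal sketch of how an organism package might be used for both problems in the question: mapping HGNC symbols to Entrez Gene IDs and building a whole-genome universe. It assumes org.Hs.eg.db is installed and uses the current AnnotationDbi interface (select, keys, mappedkeys), which may differ from what was available when this thread was written. The contents of stGained are a hypothetical stand-in for the poster's gene list.

```r
# Sketch only: assumes the Bioconductor package org.Hs.eg.db is installed.
library(org.Hs.eg.db)   # also attaches AnnotationDbi

# Hypothetical stand-in for the poster's vector of HUGO/HGNC symbols
stGained <- c("TP53", "MYC", "EGFR", "NOT_A_SYMBOL")

# Map symbols to Entrez Gene IDs. Symbols with no match come back with
# NA in the ENTREZID column, so they can be listed instead of vanishing.
sym2eg <- AnnotationDbi::select(org.Hs.eg.db,
                                keys    = stGained,
                                keytype = "SYMBOL",
                                columns = "ENTREZID")

# Symbols that could not be mapped (e.g. outdated or alias symbols)
unmapped <- unique(sym2eg$SYMBOL[is.na(sym2eg$ENTREZID)])

# Entrez IDs for the gained genes, with unmapped entries removed
gained_eg <- unique(sym2eg$ENTREZID[!is.na(sym2eg$ENTREZID)])

# Two candidate universes: every Entrez Gene ID known to the package,
# or only those with at least one GO annotation
universe_all <- keys(org.Hs.eg.db, keytype = "ENTREZID")
universe_go  <- mappedkeys(org.Hs.egGO)
```

Inspecting unmapped is one way to see which identifiers drop out of a mapping. A length mismatch like the 30 missing genes in the question can come from symbols with no Entrez Gene entry, although symbols that map to more than one ID can also change the length.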

15.6 years ago • Marc Carlson ★ 7.2k


michael watson IAH-C ★ 3.4k

@michael-watson-iah-c-378

Last seen 9.6 years ago

I'd say you have two choices of universe. Your list of IDs consists of HUGO gene IDs, so your universe must be based on these.

Choice 1: The list of HUGO IDs in your genome of choice. Of course, the number of HUGO IDs may not match the number of Entrez Gene IDs, or any other IDs, as no two ID systems map perfectly one-to-one.

Choice 2: The list of HUGO IDs in your genome of choice that have at least one GO term.

For some genomes, choices 1 and 2 will be the same; for others they will be radically different. If you choose the second, then you must ensure that your list of "significant" HUGO IDs also contains only IDs with a GO term.

For an analysis per chromosome, you'd have to subset both your "significant" list and your universe by chromosome.

-----Original Message-----
From: bioconductor-bounces@stat.math.ethz.ch [mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of Nathan Harmston
Sent: 05 September 2008 11:11
To: bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] GO term enrichment analysis over whole genome,copy number aberration investigation
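The two universe choices and the per-chromosome subsetting described in this reply can be illustrated with a small base R sketch. Everything here is hypothetical: the annot table is toy data, and enrich_p is an illustrative helper, not a Bioconductor function. It runs a plain one-sided hypergeometric test for a single GO term, which is simpler than the conditional test mentioned in the question.

```r
# Toy annotation table (hypothetical data, for illustration only)
annot <- data.frame(
  gene    = paste0("g", 1:12),
  chr     = rep(c("1", "2"), each = 6),
  has_go  = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
              TRUE, FALSE, TRUE, TRUE, FALSE, TRUE),
  in_term = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE,
              FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)
gained     <- c("g1", "g2", "g4", "g5", "g9")   # "significant" genes
term_genes <- annot$gene[annot$in_term]         # genes carrying the GO term

# Hypothetical helper: one-sided hypergeometric p-value for one term.
# Both gene lists are first restricted to the chosen universe.
enrich_p <- function(selected, universe, term_genes) {
  selected   <- intersect(selected, universe)
  term_genes <- intersect(term_genes, universe)
  k <- length(intersect(selected, term_genes))  # selected genes with the term
  m <- length(term_genes)                       # universe genes with the term
  n <- length(universe) - m                     # universe genes without it
  phyper(k - 1, m, n, length(selected), lower.tail = FALSE)  # P(X >= k)
}

# Choice 1: every gene in the genome
p_all <- enrich_p(gained, annot$gene, term_genes)

# Choice 2: only genes with at least one GO term
universe_go <- annot$gene[annot$has_go]
p_go <- enrich_p(gained, universe_go, term_genes)

# Per-chromosome analysis: subset the universe by chromosome; the helper
# then restricts the significant list to that subset as well
universe_chr1 <- annot$gene[annot$has_go & annot$chr == "1"]
p_chr1 <- enrich_p(gained, universe_chr1, term_genes)
```

Because enrich_p intersects the selected genes with the universe, a gene without any GO term (g4 here) is dropped automatically under choice 2, which is the consistency requirement described above. The per-chromosome test uses a much smaller universe, so the same overlap yields a weaker p-value.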

15.6 years ago • michael watson IAH-C ★ 3.4k
