Hi,
I have a set of sequences blasted against NCBI RefSeq and I want to retrieve the GO-terms for the gene names, similar to what Blast2GO does. Are there any packages that can do this? I am working on a non-model organism.
Jon
Hi,
I have a set of sequences blasted against NCBI RefSeq and I want to retrieve the GO-terms for the gene names, similar to what Blast2GO does. Are there any packages that can do this? I am working on a non-model organism.
Jon
As Laurent noted, it would help to know more about what you are doing. However, note that AnnotationForge has a function called makeOrgDbFromNCBI() that you can use to make an organism-level package for your species. As an example, I have one I made for a fish, O. mykiss. I won't go into making the package because there is a man page and a vignette. But once you have it made and installed, you can do
refseq <- Rkeys(org.Omykiss.egREFSEQ)[sample(1:500, 10)] ## this to fake up some IDs refseq [1] "NM_001124216" "NM_001124301" "NP_001117661" "NM_001124240" "NP_001117695" [6] "NM_001124241" "NM_001124390" "NP_001117750" "NP_001117832" "NM_001124385" ## this is what you would do out <- select(org.Omykiss.eg.db, refseq, "GO", "REFSEQ") head(out, 25) REFSEQ GO EVIDENCE ONTOLOGY 1 NM_001124216 GO:0001574 IEA BP 2 NM_001124216 GO:0003828 IEA MF 3 NM_001124216 GO:0006491 IEA BP 4 NM_001124216 GO:0007162 IEA BP 5 NM_001124216 GO:0007411 IEA BP 6 NM_001124216 GO:0030173 IEA CC 7 NM_001124216 GO:0033691 IEA MF 8 NM_001124301 GO:0005515 IEA MF 9 NP_001117661 GO:0001525 IEA BP 10 NP_001117661 GO:0001541 IEA BP 11 NP_001117661 GO:0001542 IEA BP 12 NP_001117661 GO:0001666 IEA BP 13 NP_001117661 GO:0001955 IEA BP 14 NP_001117661 GO:0001957 IEA BP 15 NP_001117661 GO:0001968 IEA MF 16 NP_001117661 GO:0004222 IEA MF 17 NP_001117661 GO:0005615 IEA CC 18 NP_001117661 GO:0005886 IEA CC 19 NP_001117661 GO:0006508 IEA BP 20 NP_001117661 GO:0007162 IEA BP 21 NP_001117661 GO:0007565 IEA BP 22 NP_001117661 GO:0007567 IEA BP 23 NP_001117661 GO:0007568 IEA BP 24 NP_001117661 GO:0008270 IEA MF 25 NP_001117661 GO:0009314 IEA BP
Jim
EDIT: Sorry, I initially missied that you mentioned working with non-model organism. This might not be relevant, then.
There are several ways. You could use the appropriate annotation packages or biomaRt
. For details, please have a look at the Using Bioconductor for Annotation workflow and the biomaRt
vignette. Below are 2 short example for human.
> gn <- c("BRCA1", "UCKL1") > library("org.Hs.eg.db") > columns(org.Hs.eg.db) ## possible output [1] "ENTREZID" "PFAM" "IPI" "PROSITE" "ACCNUM" [6] "ALIAS" "CHR" "CHRLOC" "CHRLOCEND" "ENZYME" [11] "MAP" "PATH" "PMID" "REFSEQ" "SYMBOL" [16] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME" [21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY" "GOALL" [26] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG" > keytypes(org.Hs.eg.db) ## query terms [1] "ENTREZID" "PFAM" "IPI" "PROSITE" "ACCNUM" [6] "ALIAS" "CHR" "CHRLOC" "CHRLOCEND" "ENZYME" [11] "MAP" "PATH" "PMID" "REFSEQ" "SYMBOL" [16] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME" [21] "UNIPROT" "GO" "EVIDENCE" "ONTOLOGY" "GOALL" [26] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG" > res <- select(org.Hs.eg.db, keys=gn, columns="GO", keytype="SYMBOL") Warning message: In .generateExtraRows(tab, keys, jointype) : 'select' resulted in 1:many mapping between keys and return rows > head(res) SYMBOL GO EVIDENCE ONTOLOGY 1 BRCA1 GO:0000151 NAS CC 2 BRCA1 GO:0000724 IDA BP 3 BRCA1 GO:0000724 TAS BP 4 BRCA1 GO:0001726 IDA CC 5 BRCA1 GO:0003677 TAS MF 6 BRCA1 GO:0003713 NAS MF > dim(res) [1] 76 4
The GO.db
package will allow you to futher extract GO term descriptions.
> library("biomaRt") > ensembl <- useMart("ensembl") > ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl) > ensembl Object of class 'Mart': Using the ensembl BioMart database Using the hsapiens_gene_ensembl dataset > gn <- c("BRCA1", "UCKL1") > (filt <- listFilters(ensembl)[110, 1]) [1] "hgnc_symbol" > (attrs <- listAttributes(ensembl)[28:32, 1]) [1] "go_id" "name_1006" "definition_1006" "go_linkage_type" [5] "namespace_1003" > res <- getBM(attributes = c(filt, attrs), ## I want this info + filters = filt, values = gn[1], ## for these query terms + mart = ensembl) ## from this resource > dim(res) [1] 92 6 > res[1:5, 1:3] hgnc_symbol go_id 1 BRCA1 GO:0071681 2 BRCA1 GO:0006359 3 BRCA1 GO:0071158 4 BRCA1 GO:0045739 5 BRCA1 GO:0031572 name_1006 1 cellular response to indole-3-methanol 2 regulation of transcription from RNA polymerase III promoter 3 positive regulation of cell cycle arrest 4 positive regulation of DNA repair 5 G2 DNA damage checkpoint
Next time, please provide more details in your question, such as species name, some example gene names of interest and what solutions you have explored or consider using.
Hope this helps.
Laurent
Even for non-model organisms the makeOrgPackageFromNCBI() function that Jim mentioned (in the upcoming 3.0 release of Bioconductor) will support making an organism package for a LOT of them. The new version should supports about ~1100 different species. So you might want to try that out and see if there is enough data at NCBI to make a package for you.
Eventually we plan to just host all these as pre-made resources on the AnnotationHub. But that will have to wait for a future (3.1) release.
Marc
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.