Question

Use R/Bioconductor to match BLAST hits to GO terms?

0

Entering edit mode

Jon Bråte ▴ 250

@jon-brate-6263

Last seen 2.6 years ago

Norway

Hi,

I have a set of sequences blasted against NCBI RefSeq and I want to retrieve the GO-terms for the gene names, similar to what Blast2GO does. Are there any packages that can do this? I am working on a non-model organism.

Jon

GO BLAST annotation • 4.8k views

ADD COMMENT • link updated 9.5 years ago by Marc Carlson ★ 7.2k • written 9.6 years ago by Jon Bråte ▴ 250

score 4 · Answer 1 · 2014-10-03

As Laurent noted, it would help to know more about what you are doing. However, note that AnnotationForge has a function called makeOrgDbFromNCBI() that you can use to make an organism-level package for your species. As an example, I have one I made for a fish, O. mykiss. I won't go into making the package because there is a man page and a vignette. But once you have it made and installed, you can do

refseq <- Rkeys(org.Omykiss.egREFSEQ)[sample(1:500, 10)]
## this to fake up some IDs
refseq
 [1] "NM_001124216" "NM_001124301" "NP_001117661" "NM_001124240" "NP_001117695"
 [6] "NM_001124241" "NM_001124390" "NP_001117750" "NP_001117832" "NM_001124385"
## this is what you would do
out <- select(org.Omykiss.eg.db, refseq, "GO", "REFSEQ")
head(out, 25)
          REFSEQ         GO EVIDENCE ONTOLOGY
1   NM_001124216 GO:0001574      IEA       BP
2   NM_001124216 GO:0003828      IEA       MF
3   NM_001124216 GO:0006491      IEA       BP
4   NM_001124216 GO:0007162      IEA       BP
5   NM_001124216 GO:0007411      IEA       BP
6   NM_001124216 GO:0030173      IEA       CC
7   NM_001124216 GO:0033691      IEA       MF
8   NM_001124301 GO:0005515      IEA       MF
9   NP_001117661 GO:0001525      IEA       BP
10  NP_001117661 GO:0001541      IEA       BP
11  NP_001117661 GO:0001542      IEA       BP
12  NP_001117661 GO:0001666      IEA       BP
13  NP_001117661 GO:0001955      IEA       BP
14  NP_001117661 GO:0001957      IEA       BP
15  NP_001117661 GO:0001968      IEA       MF
16  NP_001117661 GO:0004222      IEA       MF
17  NP_001117661 GO:0005615      IEA       CC
18  NP_001117661 GO:0005886      IEA       CC
19  NP_001117661 GO:0006508      IEA       BP
20  NP_001117661 GO:0007162      IEA       BP
21  NP_001117661 GO:0007565      IEA       BP
22  NP_001117661 GO:0007567      IEA       BP
23  NP_001117661 GO:0007568      IEA       BP
24  NP_001117661 GO:0008270      IEA       MF
25  NP_001117661 GO:0009314      IEA       BP

Jim

score 0 · Answer 2 · 2014-10-03

EDIT: Sorry, I initially missied that you mentioned working with non-model organism. This might not be relevant, then.

There are several ways. You could use the appropriate annotation packages or biomaRt. For details, please have a look at the Using Bioconductor for Annotation workflow and the biomaRt vignette. Below are 2 short example for human.

> gn <- c("BRCA1", "UCKL1")
> library("org.Hs.eg.db")
> columns(org.Hs.eg.db) ## possible output
 [1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"      "ACCNUM"      
 [6] "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"      
[11] "MAP"          "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
[16] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
[21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"       
[26] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"      
> keytypes(org.Hs.eg.db) ## query terms
[1] "ENTREZID"     "PFAM"         "IPI"          "PROSITE"      "ACCNUM"      
 [6] "ALIAS"        "CHR"          "CHRLOC"       "CHRLOCEND"    "ENZYME"      
[11] "MAP"          "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
[16] "UNIGENE"      "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "GENENAME"    
[21] "UNIPROT"      "GO"           "EVIDENCE"     "ONTOLOGY"     "GOALL"       
[26] "EVIDENCEALL"  "ONTOLOGYALL"  "OMIM"         "UCSCKG"      
> res <- select(org.Hs.eg.db, keys=gn, columns="GO", keytype="SYMBOL")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and return rows
> head(res)
  SYMBOL         GO EVIDENCE ONTOLOGY
1  BRCA1 GO:0000151      NAS       CC
2  BRCA1 GO:0000724      IDA       BP
3  BRCA1 GO:0000724      TAS       BP
4  BRCA1 GO:0001726      IDA       CC
5  BRCA1 GO:0003677      TAS       MF
6  BRCA1 GO:0003713      NAS       MF
> dim(res)
[1] 76  4

The GO.db package will allow you to futher extract GO term descriptions.

> library("biomaRt")
> ensembl <- useMart("ensembl")
> ensembl <- useDataset("hsapiens_gene_ensembl",mart=ensembl)
> ensembl 
Object of class 'Mart':
 Using the ensembl BioMart database
 Using the hsapiens_gene_ensembl dataset
> gn <- c("BRCA1", "UCKL1")
> (filt <- listFilters(ensembl)[110, 1])
[1] "hgnc_symbol"
> (attrs <- listAttributes(ensembl)[28:32, 1])
[1] "go_id"           "name_1006"       "definition_1006" "go_linkage_type"
[5] "namespace_1003" 
> res <- getBM(attributes = c(filt, attrs), ## I want this info
+              filters = filt, values = gn[1], ## for these query terms
+              mart = ensembl) ## from this resource

> dim(res)
[1] 92  6
> res[1:5, 1:3]
  hgnc_symbol      go_id
1       BRCA1 GO:0071681
2       BRCA1 GO:0006359
3       BRCA1 GO:0071158
4       BRCA1 GO:0045739
5       BRCA1 GO:0031572
                                                     name_1006
1                       cellular response to indole-3-methanol
2 regulation of transcription from RNA polymerase III promoter
3                     positive regulation of cell cycle arrest
4                            positive regulation of DNA repair
5                                     G2 DNA damage checkpoint

Next time, please provide more details in your question, such as species name, some example gene names of interest and what solutions you have explored or consider using.

Hope this helps.

Laurent

score 0 · Answer 3 · 2014-10-13

Even for non-model organisms the makeOrgPackageFromNCBI() function that Jim mentioned (in the upcoming 3.0 release of Bioconductor) will support making an organism package for a LOT of them. The new version should supports about ~1100 different species. So you might want to try that out and see if there is enough data at NCBI to make a package for you.

Eventually we plan to just host all these as pre-made resources on the AnnotationHub. But that will have to wait for a future (3.1) release.

Marc