You can get those data on the AnnotationHub
> library(AnnotationHub)
> hub <- AnnotationHub()
snapshotDate(): 2021-10-20
> query(hub, c("orgdb","glycine"))
AnnotationHub with 3 records
# snapshotDate(): 2021-10-20
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Glycine soja, Glycine max, Glomerella glycines
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
# coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
# rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH96011"]]'
title
AH96011 | org.Glycine_soja.eg.sqlite
AH96012 | org.Glycine_max.eg.sqlite
AH97193 | org.Glomerella_glycines.eg.sqlite
> orgdb <- hub[["AH96012"]] <----------------------- Assuming you want Glycine max
downloading 1 resources
retrieving 1 resource
|======================================================================| 100%
loading from cache
> orgdb
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Glycine max
| SPECIES: Glycine max
| CENTRALID: GID
| Taxonomy ID: 3847
| Db type: OrgDb
| Supporting package: AnnotationDbi
Please see: help('select') for usage information
> columns(orgdb)
[1] "ACCNUM" "ALIAS" "CHR" "ENTREZID" "EVIDENCE"
[6] "EVIDENCEALL" "GENENAME" "GID" "GO" "GOALL"
[11] "ONTOLOGY" "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL"
[16] "UNIGENE"
## You want 'GOALL'
This will work for goseq
, although you will be responsible for generating the bias.data for the nullp
function (median gene length), as well as the gene2cat argument for goseq
, which you can easily get by doing
gene2cat <- select(orgdb, <NCBI Gene IDs go here>, "GOALL", "GID")
To make the bias.data you should probably use makeTxDbFromGFF
from the GenomicFeatures
package, assuming you have a GFF or GTF file for soybean. Something like
txdb <- makeTxDbFromGFF(<GFF file name goes here>)
tx <- transcriptsBy(txdb)
bias.data <- sapply(width(tx), median)
Dear @James W. MacDonald,
Thank you so much for your inputs. I have a question for the: gene2cat <- select(orgdb, <NCBI Gene IDs go here>, "GOALL", "GID"), For the NCBI gene IDS, how I can paste those here? Do I need to provide all the ids? Also I have downloaded my genome fasta and annotation file (gff) from the phytozomes, so my gene ids are probably different than NCBI gene ids. Could you please provide the source of NCBI gene ids. Again thank you so much, this input is being very very helpful.
If the IDs you have look like Glyma.01G000100, then you can use
biomaRt
insteadI downloaded a file from JGI, called Gmax_275_Wm82.a2.v1.annotation_info.txt, which has as its second column the gene IDs I show above.
You will probably have to modify the gene IDs back, maybe doing something like
@James W. MacDonald,
Thank you so much for explaining clearly. I used the GmaxFiskeby_678_v1.1.annotation_info.txt version of assembly so basically the gen IDs are same. I will follow your code. I might need to knock you again if I get stuck in any steps. Again thank you so mcuh, its a great help for me.
Dear James W. MacDonald,
I am getting a warning message while loading the package "biomart". Could you suggest any alternatives? Warning messages: 1: In .inet_warning(msg) : package ‘bioMart’ is not available for Bioconductor version '3.14'
A version of this package for your version of R might be available elsewhere, see the ideas at https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages 2: In .inet_warning(msg) : Perhaps you meant ‘biomaRt’ ?
Please disregard the previous reply. The packae is biomaRT, I could loaded it successfully. Thank you
Dear @ James W. MacDonald, I got ensemble_gene_id exactly like what you given but I am stuck in the gene2cat <- select(orgdb, <What will be in here>, "GOALL", "GID"), I tried to use ensemble_gene_ids there but getting the error like this
Could you suggest which ids will go there? Thank you so much for your help and suggestions.
Also I am getting a warning messgae : z <- read.table("GmaxFiskeby_678_v1.1.annotation_info.txt", header = FALSE, sep = "\t", fill = TRUE) Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string
Dear @ James W. MacDonald, I will be waiting fro your response. In the following code I am getting just 8 Gene ID, Don't know whats going on here...
Where did you obtain the file
GmaxFiskeby_678_v1.1.annotation_info.txt
? Could you provide a link?Note that it is an different file than the file James used in his code. Since you didn't provide
any
info on this, he assumed you used the fileGmax_275_Wm82.a2.v1.annotation_info.txt
from JCI.It looks you don't appreciate yet the fundamental differences between the 2 pieces of code James provided.
Your initial question did not contain much detail; the only thing that became clear was that you have RNA-seq data from a experiment with soy beans, and that you would like to perform a GO analysis using the package
goseq
.You didn't provide any details on which reference genome your provided, nor any code. Please do so!
To use
goseq
you need a) a file that maps you gene id with the GO categories it has been annotated to (object James namedgene2cat
), and b) the length of the transcripts to correct for bias. (object James namedbias.data
).Assuming you used the reference genome and corresponding annotation made available by the NCBI (here), in his first reply James showed you how to obtain all relevant information in R/Bioconductor; the object
gene2cat
can be obtained from the so-calledOrgDb
annotation resources available at theAnnotationHub
, whereas the bias data can be derived from the (to us unknown) GTF file you used.Information in the NCBI-based
OrgDb
packages (databases) can be retrieved using the functionselect
from the packageAnnotationDbi
. The error message above tells you that none of the input ids could be mapped in theOrgDb
package. This makes sense, since it later turned out the ids you have are apparently no NCBI gene ids! In other words, querying theOrgDb
package didn't work for you since you used another type of identifier than the expected NCBI gene id.Later you suggested you obtained the relevant information from another source, may be JGI?? In R/Bioconductor annotation info for JGI type of identifiers can be retrieved from the Ensembl plant database using the package
biomaRt
. Because of the different use of periods, underscores and capitals in the identifiers in the JGI input file and Ensembl database, in his 2nd reply James used some code to take care of that. He finished his post with showing how he made thegene2cat
object needed for the analysis ingoseq
. Note that in thisgene2cat
object the gene ids are JCI/Ensembl ids, and NOT NCBI gene ids.You will thus need to use this object to continue withgoseq
, not the one based on NCBI ids (that is made with the functionselect
).Hi @Guido Hooiveld, there are 6 different version of Soybean (Glycine max) genome assembly aavailable in the phytozomes. James is used just one of them. I used this assembly and annotation file from the phytozomes: https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.0.softmasked.fa.gz, https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.1.gene_exons.gff3.gz (annotation file). Also for the annotation text file I used https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.1.annotation_info.txt.
Why do you think I didn't appreciated the code provided by the James? I am learning the GO enrichment analysis for the first time and I thought there might be some problem in my coding, as I got error message. I exactly followed whatever James provided to get the GO term unless I just used a different version of gene annotation text file, because I used this reference genome assembly version for alingnment.
I'll answer all the questions here.
1.) As Guido noted, you don't use both the
OrgDb
andbiomaRt
. Since you have IDs that can be used with Ensembl, we switched.2.) You should try to interpret code that people give you rather than blindly using it. As an advanced user I tend to make complicated function calls, but it's easy enough to pick things apart. Note that something like
And that is true here as well:
Do you see in that code string why you would only get a few results returned? If not, try ?head
3.) If you ever get an error that mentions a missing EOF (end of file), that means you have a truncated file and it needs to be downloaded again.
4.) If you are posting code, don't just copy/paste. You need to either highlight the code, and click the CODE button above the reply box, or add a triple backtick (top left button on a QWERTY keyboard) before and after your code. Note how readable my code is in comparison to yours.
Thank you for your suggestions.