GO term gene ontology enrichment analysis
1
0
Entering edit mode
Afsana • 0
@1841e218
Last seen 8 months ago
United States

Hi I am doing a soybean RNA-seq analysis of 36 sample and did the DeSEq2 for all pairwise comparison. Now I need to do a GO term enrichment analysis. How may I get the GO term for all of my genes using R package "goseq". I haven't found the database for soybean genome in bioconductor. Also online software like DAVID, shinyGo is not recognizing my gene IDS.Could you suggest regarding this? Code should be placed in three backticks as shown below


# include your problematic code here with any corresponding output
# please also include the results of running the following in an R session

sessionInfo( )

GOTerm • 1.4k views
3
Entering edit mode
@james-w-macdonald-5106
Last seen 1 day ago
United States

You can get those data on the AnnotationHub

> library(AnnotationHub)

> hub <- AnnotationHub()
snapshotDate(): 2021-10-20

> query(hub, c("orgdb","glycine"))
AnnotationHub with 3 records
# snapshotDate(): 2021-10-20
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/ #$species: Glycine soja, Glycine max, Glomerella glycines
# \$rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH96011"]]'

title
AH96011 | org.Glycine_soja.eg.sqlite
AH96012 | org.Glycine_max.eg.sqlite
AH97193 | org.Glomerella_glycines.eg.sqlite
> orgdb <- hub[["AH96012"]]                                <----------------------- Assuming you want Glycine max
retrieving 1 resource
|======================================================================| 100%

> orgdb
OrgDb object:
| DBSCHEMAVERSION: 2.1
| DBSCHEMA: NOSCHEMA_DB
| ORGANISM: Glycine max
| SPECIES: Glycine max
| CENTRALID: GID
| Taxonomy ID: 3847
| Db type: OrgDb
| Supporting package: AnnotationDbi

Please see: help('select') for usage information
> columns(orgdb)
[1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"
[6] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"
[11] "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"
[16] "UNIGENE"

## You want 'GOALL'


This will work for goseq, although you will be responsible for generating the bias.data for the nullp function (median gene length), as well as the gene2cat argument for goseq, which you can easily get by doing

gene2cat <- select(orgdb, <NCBI Gene IDs go here>, "GOALL", "GID")


To make the bias.data you should probably use makeTxDbFromGFF from the GenomicFeatures package, assuming you have a GFF or GTF file for soybean. Something like

txdb <- makeTxDbFromGFF(<GFF file name goes here>)
tx <- transcriptsBy(txdb)
bias.data <- sapply(width(tx), median)

0
Entering edit mode

Dear @James W. MacDonald,

Thank you so much for your inputs. I have a question for the: gene2cat <- select(orgdb, <NCBI Gene IDs go here>, "GOALL", "GID"), For the NCBI gene IDS, how I can paste those here? Do I need to provide all the ids? Also I have downloaded my genome fasta and annotation file (gff) from the phytozomes, so my gene ids are probably different than NCBI gene ids. Could you please provide the source of NCBI gene ids. Again thank you so much, this input is being very very helpful.

1
Entering edit mode

If the IDs you have look like Glyma.01G000100, then you can use biomaRt instead

I downloaded a file from JGI, called Gmax_275_Wm82.a2.v1.annotation_info.txt, which has as its second column the gene IDs I show above.

> z <- read.table("Gmax_275_Wm82.a2.v1.annotation_info.txt", header = FALSE, sep = "\t", fill = TRUE)
> library(bioMart)
> mart <- useEnsembl("plants_mart", "gmax_eg_gene",host ="https://plants.ensembl.org")

## Note that Ensembl uses an underscore rather than a period in the gene name, so we have to convert. Like this:
[1] "Glyma_01G000100" "Glyma_01G000200" "Glyma_01G000300" "Glyma_01G000400"
[5] "Glyma_01G000500" "Glyma_01G000600"

> getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", gsub("\\.", "_", head(z[,2])), mart)
ensembl_gene_id      go_id
1  GLYMA_01G000100
2  GLYMA_01G000200
3  GLYMA_01G000300
4  GLYMA_01G000400 GO:0006355
5  GLYMA_01G000400 GO:0008270
6  GLYMA_01G000400 GO:0005634
7  GLYMA_01G000400 GO:0046872
8  GLYMA_01G000500
9  GLYMA_01G000600 GO:0006355
10 GLYMA_01G000600 GO:0008270
11 GLYMA_01G000600 GO:0005634
12 GLYMA_01G000600 GO:0046872
>


You will probably have to modify the gene IDs back, maybe doing something like

> gene2cat[,1] <- gsub("GLYMA","Glyma", gene2cat[,1])
> gene2cat
ensembl_gene_id      go_id
1  Glyma_01G000100
2  Glyma_01G000200
3  Glyma_01G000300
4  Glyma_01G000400 GO:0006355
5  Glyma_01G000400 GO:0008270
6  Glyma_01G000400 GO:0005634
7  Glyma_01G000400 GO:0046872
8  Glyma_01G000500
9  Glyma_01G000600 GO:0006355
10 Glyma_01G000600 GO:0008270
11 Glyma_01G000600 GO:0005634
12 Glyma_01G000600 GO:0046872

0
Entering edit mode

@James W. MacDonald,

Thank you so much for explaining clearly. I used the GmaxFiskeby_678_v1.1.annotation_info.txt version of assembly so basically the gen IDs are same. I will follow your code. I might need to knock you again if I get stuck in any steps. Again thank you so mcuh, its a great help for me.

0
Entering edit mode

Dear James W. MacDonald,

I am getting a warning message while loading the package "biomart". Could you suggest any alternatives? Warning messages: 1: In .inet_warning(msg) : package ‘bioMart’ is not available for Bioconductor version '3.14'

A version of this package for your version of R might be available elsewhere, see the ideas at https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages 2: In .inet_warning(msg) : Perhaps you meant ‘biomaRt’ ?

0
Entering edit mode

0
Entering edit mode

Dear @ James W. MacDonald, I got ensemble_gene_id exactly like what you given but I am stuck in the gene2cat <- select(orgdb, <What will be in here>, "GOALL", "GID"), I tried to use ensemble_gene_ids there but getting the error like this

gene2cat <- select(orgdb, "ensembl_gene_id", "GOALL", "GID") Error in .testForValidKeys(x, keys, keytype, fks) : None of the keys entered are valid keys for 'GID'. Please use the keys method to see a listing of valid arguments.

Could you suggest which ids will go there? Thank you so much for your help and suggestions.

0
Entering edit mode

Also I am getting a warning messgae : z <- read.table("GmaxFiskeby_678_v1.1.annotation_info.txt", header = FALSE, sep = "\t", fill = TRUE) Warning message: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string

0
Entering edit mode

Dear @ James W. MacDonald, I will be waiting fro your response. In the following code I am getting just 8 Gene ID, Don't know whats going on here...

getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", gsub("a0", "a_0", head(z[,2])), mart) ensembl_gene_id go_id 1 GLYMA_01G000100
2 GLYMA_01G000200
3 GLYMA_01G000300
4 GLYMA_01G000400 GO:0006355 5 GLYMA_01G000400 GO:0008270 6 GLYMA_01G000400 GO:0005634 7 GLYMA_01G000400 GO:0046872 8 GLYMA_01G000500

0
Entering edit mode

Where did you obtain the file GmaxFiskeby_678_v1.1.annotation_info.txt? Could you provide a link?

Note that it is an different file than the file James used in his code. Since you didn't provide any info on this, he assumed you used the file Gmax_275_Wm82.a2.v1.annotation_info.txt from JCI.

0
Entering edit mode

It looks you don't appreciate yet the fundamental differences between the 2 pieces of code James provided.

Your initial question did not contain much detail; the only thing that became clear was that you have RNA-seq data from a experiment with soy beans, and that you would like to perform a GO analysis using the package goseq.

You didn't provide any details on which reference genome your provided, nor any code. Please do so!

To use goseq you need a) a file that maps you gene id with the GO categories it has been annotated to (object James named gene2cat), and b) the length of the transcripts to correct for bias. (object James named bias.data).

Assuming you used the reference genome and corresponding annotation made available by the NCBI (here), in his first reply James showed you how to obtain all relevant information in R/Bioconductor; the object gene2cat can be obtained from the so-called OrgDb annotation resources available at the AnnotationHub, whereas the bias data can be derived from the (to us unknown) GTF file you used.

Information in the NCBI-based OrgDb packages (databases) can be retrieved using the function select from the package AnnotationDbi. The error message above tells you that none of the input ids could be mapped in the OrgDb package. This makes sense, since it later turned out the ids you have are apparently no NCBI gene ids! In other words, querying the OrgDb package didn't work for you since you used another type of identifier than the expected NCBI gene id.

Later you suggested you obtained the relevant information from another source, may be JGI?? In R/Bioconductor annotation info for JGI type of identifiers can be retrieved from the Ensembl plant database using the package biomaRt. Because of the different use of periods, underscores and capitals in the identifiers in the JGI input file and Ensembl database, in his 2nd reply James used some code to take care of that. He finished his post with showing how he made the gene2cat object needed for the analysis in goseq. Note that in this gene2cat object the gene ids are JCI/Ensembl ids, and NOT NCBI gene ids.You will thus need to use this object to continue with goseq, not the one based on NCBI ids (that is made with the function select).

0
Entering edit mode

Hi @Guido Hooiveld, there are 6 different version of Soybean (Glycine max) genome assembly aavailable in the phytozomes. James is used just one of them. I used this assembly and annotation file from the phytozomes: https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.0.softmasked.fa.gz, https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.1.gene_exons.gff3.gz (annotation file). Also for the annotation text file I used https://data.jgi.doe.gov/refine-download/phytozome?q=Glycine+max&expanded=Phytozome-678#:~:text=GmaxFiskeby_678_v1.1.annotation_info.txt.

Why do you think I didn't appreciated the code provided by the James? I am learning the GO enrichment analysis for the first time and I thought there might be some problem in my coding, as I got error message. I exactly followed whatever James provided to get the GO term unless I just used a different version of gene annotation text file, because I used this reference genome assembly version for alingnment.

0
Entering edit mode

I'll answer all the questions here.

1.) As Guido noted, you don't use both the OrgDb and biomaRt. Since you have IDs that can be used with Ensembl, we switched.

2.) You should try to interpret code that people give you rather than blindly using it. As an advanced user I tend to make complicated function calls, but it's easy enough to pick things apart. Note that something like

function1(arg1, arg2, function2(arg3, arg4))
## is the same thing as
newarg <- function(2(arg3, arg4)
function1(arg1, arg2, newarg)


And that is true here as well:

getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", gsub("a0", "a_0", head(z[,2])), mart)


Do you see in that code string why you would only get a few results returned? If not, try ?head

3.) If you ever get an error that mentions a missing EOF (end of file), that means you have a truncated file and it needs to be downloaded again.

4.) If you are posting code, don't just copy/paste. You need to either highlight the code, and click the CODE button above the reply box, or add a triple backtick (top left button on a QWERTY keyboard) before and after your code. Note how readable my code is in comparison to yours.

0
Entering edit mode