Question

Using goseq with unsupported organism

0

Entering edit mode

Mehdi • 0

@mehdi-16509

Last seen 8 months ago

United States

Hi

I am trying to build the organism package for sheep. At the first i used the codes that was in vignette help page. and so run makeOrgPackage.

makeOrgPackage(gene_info=fSym, chromosome=fChr, go=fGO,
version="1.0.0",
maintainer="Some One <so@someplace.org>",
author="Some One <so@someplace.org>",
outputDir = "E:/test org mak/ovis out dir",
tax_id="9940",
genus="Ovis",
species="aries",
goTable="go")

the package was built and just installed by source type. but it dose not worked. i compared with other correct package and find out that other packages has much more files (stylesheet files) in html folder and my package just have 4 (00Index.html, org.Oaries.eg_dbconn.html , ...).

Any helpful idea will be appreciated.

Thanks

goseq • 2.1k views

ADD COMMENT • link updated 5.8 years ago by James W. MacDonald 65k • written 5.8 years ago by Mehdi • 0

0

Entering edit mode

Did you prepare the input files yourselves? I.e. fSym, fChr, and fGO? Could you show the content of these files using head()?

How did you conclude that it didn't work? Could you show some code and sessionInfo()?

As a side note: please be informed that through the AnnotationHub a pre-made OrgDb for sheep (based on NCBI Gene IDs) is available:

> library(AnnotationHub)
> ah <- AnnotationHub()
snapshotDate(): 2017-10-27
> query(ah, c("Ovis aries", "OrgDb"))
AnnotationHub with 1 record
# snapshotDate(): 2017-10-27
# names(): AH59076
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Ovis aries
# $rdataclass: OrgDb
# $rdatadateadded: 2017-11-24
# $title: org.Ovis_aries.eg.sqlite
# $description: NCBI gene ID based annotations about Ovis aries
# $taxonomyid: 9940
# $genome: NCBI genomes
# $sourcetype: NCBI/UniProt
# $sourceurl: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/, ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/id...
# $sourcesize: NA
# $tags: c("NCBI", "Gene", "Annotation")
# retrieve record with 'object[["AH59076"]]'
>
>
> #note that: AH59076 | org.Ovis_aries.eg.sqlite
>
> #fetch the annotation database
> org.Oa.eg.db <- ah[["AH59076"]]
>
> # check
> columns(org.Oa.eg.db) #which annotation data can be retrieved?
 [1] "ACCNUM"      "ALIAS"       "CHR"         "ENSEMBL"     "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"        
[10] "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"      "UNIGENE"    
>
> keytypes(org.Oa.eg.db) #which identifiers can be queried with?
 [1] "ACCNUM"      "ALIAS"       "ENSEMBL"     "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"    "GID"         "GO"         
[10] "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"      "UNIGENE"    
>
> #example using the first 5 identifiers     
> select(org.Oa.eg.db, head(keys(org.Oa.eg.db)), c("SYMBOL", "GENENAME", "GO"))
'select()' returned 1:many mapping between keys and columns
      GID SYMBOL                                                            GENENAME         GO
1  442992  GLUT4 solute carrier family 2 (facilitated glucose transporter), member 4       <NA>
2  442993 UGT1A9                UDP glucuronosyltransferase 1 family, polypeptide A9 GO:0016021
3  442993 UGT1A9                UDP glucuronosyltransferase 1 family, polypeptide A9 GO:0015020
4  442993 UGT1A9                UDP glucuronosyltransferase 1 family, polypeptide A9 GO:0008152
5  442996   GHRL                                 ghrelin and obestatin prepropeptide GO:0005576
6  442996   GHRL                                 ghrelin and obestatin prepropeptide GO:0016608
7  442997 CXCL10                                     C-X-C motif chemokine ligand 10 GO:0009897
<<SNIP>>
84 442999   AQP1                                    aquaporin 1 (Colton blood group) GO:0035377
85 442999   AQP1                                    aquaporin 1 (Colton blood group) GO:0006833
>

Please note that depending on your Bioconductor version the snapshot date may differ. See e.g. A: AnnotationHub, force update snapshot date?.

ADD REPLY • link 5.8 years ago Guido Hooiveld ★ 3.9k

0

Entering edit mode

thanks Dr for your response.

yes, I've made the files. this is the codes:

## Makes an organism package for sheep data.frames:

finchFile <- system.file("extdata","finch_info.txt", package="AnnotationForge")
finch <- read.table(finchFile,sep="\t")

## prepare some data.frames

fSym <- finch[,c(2,3,9)]
fSym <- fSym[fSym[,2]!="-",]
fSym <- fSym[fSym[,3]!="-",]
colnames(fSym) <- c("GID","SYMBOL","GENENAME")

fChr <- finch[,c(2,7)]
fChr <- fChr[fChr[,2]!="-",]
colnames(fChr) <- c("GID","CHROMOSOME")

finchGOFile <- system.file("extdata","GO_finch.txt",package="AnnotationForge")

fGO <- read.table(finchGOFile,sep="\t")
fGO <- fGO[fGO[,2]!="",]
fGO <- fGO[fGO[,3]!="",]
colnames(fGO) <- c("GID","GO","EVIDENCE")

library("AnnotationForge")

## call the function
makeOrgPackage(gene_info=fSym, chromosome=fChr, go=fGO,
version="1.0.0",
maintainer="Some One <so@someplace.org>",
author="Some One <so@someplace.org>",
outputDir = "E:/test org mak/make orge pak/output",
tax_id="9940",
genus="Ovis",
species="aries",
goTable="go")

## install.packages based on the return value
install.packages("org.Oaries.eg.db", repos=NULL , type = "source")

###############################################################

> head(fSym)
GID SYMBOL GENENAME
1 751582 SNCA synuclein, alpha (non A4 component of amyloid precursor)
2 751583 NCALD neurocalcin delta
3 751584 BDNF brain-derived neurotrophic factor
4 751585 CREB1 cAMP responsive element binding protein 1
5 751586 MTNR1A melatonin receptor 1A
6 751588 MTNR1B melatonin receptor 1B
#######################################################

> head(fChr)
GID CHROMOSOME
1 751582 4
2 751583 2
3 751584 5
4 751585 7
5 751586 4
6 751588 1

################################################

> head(fGO)
GID GO EVIDENCE
1 100190152 GO:0008152 IEA
2 100190152 GO:0016310 IEA
3 100190152 GO:0006222 IEA
4 100190152 GO:0015937 IEA
5 100190152 GO:0000166 IEA
6 100190152 GO:0005524 IEA

##############################################

ADD REPLY • link 5.8 years ago Mehdi • 0

0

Entering edit mode

when i ran goseq, found it not work.

> library("org.Oaries.eg.db", lib.loc="C:/Program Files/R/R-3.5.0/library")
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor

Vignettes contain introductory material; view with 'browseVignettes()'.
To cite Bioconductor, see 'citation("Biobase")', and for packages
'citation("pkgname")'.

> GO.hiper = goseq("pwf", "oviAri3", "ensGene",method = "Hypergeometric", test.cats="GO:BP",use_genes_without_cat = TRUE)
Fetching GO annotations...
Error in library(paste(orgstring, "db", sep = "."), character.only = TRUE) :
there is no package called ‘NA.db’

########################

sessionInfo()

R version 3.5.0 (2018-04-23)

Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods
[9] base

other attached packages:
[1] org.Oaries.eg.db_1.0.0 AnnotationDbi_1.42.1 Biobase_2.40.0
[4] rtracklayer_1.40.3 GenomicRanges_1.32.4 GenomeInfoDb_1.16.0
[7] IRanges_2.14.10 S4Vectors_0.18.3 BiocGenerics_0.26.0
[10] goseq_1.32.0 geneLenDataBase_1.16.0 BiasedUrn_1.07

loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 compiler_3.5.0
[3] XVector_0.20.0 prettyunits_1.0.2
[5] GenomicFeatures_1.32.0 bitops_1.0-6
[7] tools_3.5.0 zlibbioc_1.26.0
[9] progress_1.2.0 biomaRt_2.36.1
[11] digest_0.6.15 bit_1.1-14
[13] nlme_3.1-137 RSQLite_2.1.1
[15] memoise_1.1.0 lattice_0.20-35
[17] mgcv_1.8-24 pkgconfig_2.0.1
[19] rlang_0.2.1 Matrix_1.2-14
[21] rstudioapi_0.7 DelayedArray_0.6.1
[23] DBI_1.0.0 yaml_2.1.19
[25] GenomeInfoDbData_1.1.0 httr_1.3.1
[27] stringr_1.3.1 hms_0.4.2
[29] Biostrings_2.48.0 bit64_0.9-7
[31] grid_3.5.0 R6_2.2.2

########################

Thanks for informing me about AnnotationHub. i was planed use it, if i had failed,but i didn't know it contained pre-made OrgDb for sheep.

but i want do it by Annotation Forge too for learning.

thanks a lot for your helps

ADD REPLY • link 5.8 years ago Mehdi • 0

score 0 · Answer 1 · 2018-07-18

0

Entering edit mode

James W. MacDonald 65k

@james-w-macdonald-5106

Last seen 15 hours ago

United States

Your error has nothing to do with the OrgDb package you created. The error is

Error in library(paste(orgstring, "db", sep = "."), character.only = TRUE) : 
there is no package called ‘NA.db’

Which you will note says that there is no package called NA.db!

If you look at ?goseq, you will see:

 genome: A string identifying the genome that 'genes' refer to.  For a
          list of supported organisms run 'supportedGenomes'.

And if you run supportedGenomes, you will see that oviAri3 isn't a supported genome. If you read further into the Details section for ?goseq, it says:

    'goseq' obtains length data from UCSC and GO mappings from the
     organim packages (see 'link{getgo}' and 'getlength' for details).
     If your data is in an unsupported format you will need to obtain
     the GO category mapping and supply them to the 'goseq' function
     using the 'gene2cat' arguement.

     To use your own gene to category mapping with 'goseq', use the
     'gene2cat' arguement.  This arguement takes a data.frame, with one
     column containing gene IDs and the other containing the associated
     categories.  As the mapping from gene <-> category is in general
     many to many there will be multiple rows containing the same gene
     identifier.  Alternatively, 'gene2cat' can take a list, where the
     names are the genes and the entries are the GO categories
     associated with the genes.  This is the format produced by the
     'getgo' function and is more space efficient than the data.frame
     representation.

ADD COMMENT • link 5.8 years ago James W. MacDonald 65k

0

Entering edit mode

And I should note that mapIds with multiVals = "list" should produce the second form of gene2cat mentioned in the help for goseq.

ADD REPLY • link 5.8 years ago James W. MacDonald 65k

0

Entering edit mode

Hi Dr MacDonald

thanks for your response

you are right, i have read goseq manual more times and other help files too. and ran it codes more times.

as you say, goseq package has no gene length (geneLenDataBase) an Go annotation for sheep.

###############################

for gene length: i directly use nullp function:

pwf = nullp(gene.vector, "oviAri3", "ensGene", bias.data =NULL, plot.fit = TRUE)

when set bias.data to NULL, nullp attempt to fetch length using getlength.

i used Gene stable ID from ensemble BioMart.

######################################

For GO therms i tried using getgo function:

Gomap <- getgo(gene.vector, "oviAri3", "ensGene", fetch.cats=c("GO:CC","GO:BP","GO:MF")).

Error in library(paste(orgstring, "db", sep = "."), character.only = TRUE) :
there is no package called ‘NA.db’.

i thought it is because of absence of organism package for sheep and tried to make one.

how i take GO therms by getgo? how to set ,"genome" and "id"?

thanks Dr

ADD REPLY • link 5.8 years ago Mehdi • 0

0

Entering edit mode

One of the tricks to becoming proficient with R, or any programming language I suppose, is to be able to read and understand the help pages. And also to read and understand what people tell you. So let me recap. I told you (or rather, the author of goseq told you, via the help page) that you need either a data.frame with two columns, one with the GO term and one with the associated Gene IDs, or a list with each name being a Gene ID and the entries being the GO IDs. Re-read my previous comment to see that this is exactly what the help page says.

I then said you could use mapIDs, with multiVals = "list" to generate the latter format. I didn't tell you exactly how to do that, because part of getting proficient is to be able to take some information and fill in the rest. I hoped that you would do ?mapIDs and use that to figure out exactly what I meant.

You responded by telling me what you have already done, which doesn't work. That isn't useful information though. I already know you are having problems, and part of it is because oviAri3 isn't a supported organism. You need to read and understand what I am telling you, rather than repeating information that I already have!

So here is the explicit answer to your dilemma.

> library(AnnotationHub)

> hub <- AnnotationHub()
snapshotDate(): 2018-04-30
> query(hub, c("Ovis","orgdb"))
AnnotationHub with 4 records
# snapshotDate(): 2018-04-30
# $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
# $species: Ovis ammon_aries, Ovis aries, Ovis orientalis_aries, Ovis ovis
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH61970"]]'

            title                              
  AH61970 | org.Ovis_ammon_aries.eg.sqlite     
  AH61971 | org.Ovis_aries.eg.sqlite           
  AH61972 | org.Ovis_orientalis_aries.eg.sqlite
  AH61973 | org.Ovis_ovis.eg.sqlite            

> oae <- hub[["AH61971"]]
downloading 1 resources
retrieving 1 resource

> gene2cat <- mapIds(oae, keys(oae), "GOALL", "ENTREZID", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
> gene2cat[1:3]
$`442992`
[1] NA

$`442993`
 [1] "GO:0008152" "GO:0008150" "GO:0016021" "GO:0005575" "GO:0016020"
 [6] "GO:0031224" "GO:0044425" "GO:0015020" "GO:0003674" "GO:0003824"
[11] "GO:0008194" "GO:0016740" "GO:0016757" "GO:0016758"

$`442996`
 [1] "GO:0005576" "GO:0005575" "GO:0016608" "GO:0005515" "GO:0003674"
 [6] "GO:0005102" "GO:0005179" "GO:0005488" "GO:0048018" "GO:0030545"
[11] "GO:0098772"

ADD REPLY • link 5.8 years ago James W. MacDonald 65k

1

Entering edit mode

So it's probably a bit trickier than all that. You also need what nullp would produce, which again would normally require a supportedGenome, which we know oviAri3 is not. And nullp would use the genome argument to get the bias.data argument:

Usage:

     nullp(DEgenes, genome, id, bias.data=NULL,plot.fit=TRUE)
     
Arguments:

 DEgenes: A named binary vector where 1 represents DE, 0 not DE and the
          names are gene IDs.

  genome: A string identifying the genome that 'genes' refer to.  For a
          list of supported organisms run 'supportedGenomes'.

      id: A string identifying the gene identifier used by 'genes'.
          For a list of supported gene IDs run 'supportedGeneIDs'.

bias.data: A numeric vector containing the data on which the DE may
          depend.  Usually this is the median transcript length of each
          gene in bp.  If set to 'NULL' 'nullp' will attempt to fetch
          length using 'getlength'.

Since oviAri3 isn't supported, we need the median transcript length, which is a non-trivial task. We can use an EnsDb from AnnotationHub for that:

> query(hub, c("Ovis", "ensdb"))
AnnotationHub with 6 records
# snapshotDate(): 2018-04-30
# $dataprovider: Ensembl
# $species: Ovis Aries
# $rdataclass: EnsDb
# additional mcols(): taxonomyid, genome, description,
#   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
#   rdatapath, sourceurl, sourcetype
# retrieve records with, e.g., 'object[["AH53225"]]'

            title                          
  AH53225 | Ensembl 87 EnsDb for Ovis Aries
  AH53729 | Ensembl 88 EnsDb for Ovis Aries
  AH56695 | Ensembl 89 EnsDb for Ovis Aries
  AH57777 | Ensembl 90 EnsDb for Ovis Aries
  AH60796 | Ensembl 91 EnsDb for Ovis Aries
  AH61001 | Ensembl 92 EnsDb for Ovis Aries
> z <- hub[["AH61001"]]
require("ensembldb")
downloading 1 resources
retrieving 1 resource

> txs <- transcriptsBy(z, "gene")
# the txs object is a GRangesList, containing all the transcripts for each gene. We want the median length. I won't explain this next step - you can consider it homework
> txwidth <- sapply(split(width(unlist(txs)), names(unlist(txs))), median)
> head(txwidth)
ENSOARG00000000001 ENSOARG00000000002 ENSOARG00000000003 ENSOARG00000000004
                68                958                 67               1574
ENSOARG00000000005 ENSOARG00000000006
                75                955

But now our bias.data is based on Ensembl, so we can't use the gene2cat we already made (you don't want to mix'n'match EBI/EMBL and NCBI data).

You can use the txwidth object as the input for bias.data in nullp. But now we need an Ensembl based gene2cat:

> library(biomaRt)
> mart <- useMart("ensembl", "oaries_gene_ensembl")

> gene2cat <- getBM(c("ensembl_gene_id","go_id"), "ensembl_gene_id", names(txwidth), mart)
> head(gene2cat)
     ensembl_gene_id      go_id
1 ENSOARG00000000001           
2 ENSOARG00000000002           
3 ENSOARG00000000003           
4 ENSOARG00000000004           
5 ENSOARG00000000005           
6 ENSOARG00000000006 GO:0016020

And that should now all work.

ADD REPLY • link 5.8 years ago James W. MacDonald 65k

0

Entering edit mode

Hello dear Dr. Macdonald

Thanks a lot for your help. i finished it. you are right, i need to work more carefully. but it was may first R project and first GWAS-GSA project. i began it about one month ago.

#####################

my homework:

get one vector contain length of genes that are in gene.vector for nullp( , ... , bias.data= lendata, ...)

assayed.genes = array(total_gene)
de.genes = array(sig_gene)
gene.vector = as.integer(assayed.genes%in%de.genes)
names(gene.vector) = assayed.genes

GV_DF <- as.data.frame(gene.vector)

> head(GV_DF)
gene.vector
ENSOARG00000000003 1
ENSOARG00000001251 1
ENSOARG00000000240 1
ENSOARG00000000002 1
ENSOARG00000000205 1
ENSOARG00000000681 1

> txwidth_DF <- as.data.frame(txwidth)
> head(txwidth_DF)
txwidth
ENSOARG00000000001 68
ENSOARG00000000002 958
ENSOARG00000000003 67
ENSOARG00000000004 1574
ENSOARG00000000005 75
ENSOARG00000000006 955

convert row to first column for merge

library(data.table)
setDT(GV_DF, keep.rownames = TRUE)[]
setDT(txwidth_DF, keep.rownames = TRUE)[]
lenData <- merge(GV_DF, txwidth_DF, by="rn", sort=FALSE)
lenDatafinal <- lenData$txwidth

pwf = nullp(gene.vector,"oviAri3", "ensGene", bias.data = lenDatafinal )

##############################

i was founded this codes that works too, but NA.db error deviate me.

library(GenomicFeatures)

txdb <- makeTxDbFromBiomart(biomart="ENSEMBL_MART_ENSEMBL", dataset = "oaries_gene_ensembl", transcript_ids = NULL, circ_seqs=DEFAULT_CIRC_SEQS,id_prefix="ensembl_", host="www.ensembl.org", port = 80, taxonomyId = NA, miRBaseBuild = NA)

> txsByGene <- transcriptsBy(txdb,"gene")
> lengthData <- median(width(txsByGene))

thanks again Dr

Regards
Mehdi

ADD REPLY • link 5.8 years ago Mehdi • 0