Question: Ortholog conversions of non-model organisms not present in BioMart, Ensembl or KEGG
26 days ago by
laural7100 wrote:

Hi

I am wondering whether I can get some advice. I have read through the postings on this and could not find anything that helped.

I am working on rainbow trout and have gotten to the point where I have a list of DEGs (~8,000) which I would like to run through KEGG. I have checked the species list at KEGG, and rainbow trout (taxid = 8022) is not present, so the first thing I need to do is find orthologous genes in a species that is present. Rainbow trout is also not in BioMart or Ensembl.

Studies have previously used zebrafish or salmon (sasa), which are present in the KEGG database. I have checked various recommended online programmes for a conversion tool, but my species is not listed in any of them. The closest I could find to a working option is bioDBnet, and it only has the species I want to convert to, not my species.

I have managed to build an annotation package for rainbow trout within R, but it does not have KEGG information and I'm struggling now. I have converted the original RefSeq IDs to various other IDs, including GID, Entrez ID, UniProt etc., but nothing has worked. Is there a specific package for non-model organisms that actually contains non-model species?

I have submitted some of my sequences to GhostKOALA as a last-ditch effort, but is there a package within R that can do this?

Any help would be very much appreciated.

L

26 days ago by
James W. MacDonald wrote:

The annotation data in Bioconductor are, as a rule, simply re-packaging of existing data. And this in general does not include many non-model species, because (as you have found) there just isn't much data out there.

I recently did an RNA-Seq analysis using O. mykiss (using the salmon aligner - ha!), and I found that there really wasn't much difference in the number of reads that align to the S. salar transcriptome as compared to O. mykiss, so we ended up aligning to the better-annotated transcriptome, which allowed us to do GO and KEGG stuff on the back end.

If the number of reads aligning to the 'wrong' transcriptome had been much different, I probably would have done something slightly different: aligning to the O. mykiss transcriptome instead, and then trying to map the transcripts to their S. salar equivalents using BLAST. That brings up some added complexity, because you may have variable numbers of transcripts for a given gene that map across species. Since the alignments to Salmo seemed OK, we just went with the cross-species alignment.
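To make the BLAST-based alternative a bit more concrete, here is a minimal sketch in base R of reducing tabular BLAST output to one best hit per query transcript and then checking how many trout transcripts land on the same salmon transcript. The data frame is a toy stand-in using `-outfmt 6` style column names; every ID in it is hypothetical, and picking the top bitscore is just one simple way to choose a best hit, not something that was done in the analysis described above.

```r
# Toy BLAST results (columns named as in tabular -outfmt 6); all IDs are made up.
blast <- data.frame(
  qseqid   = c("omy_tx1", "omy_tx1", "omy_tx2", "omy_tx3", "omy_tx3"),
  sseqid   = c("sasa_txA", "sasa_txB", "sasa_txA", "sasa_txC", "sasa_txD"),
  pident   = c(96.2, 88.1, 94.0, 91.5, 97.3),
  evalue   = c(1e-120, 1e-60, 1e-110, 1e-80, 1e-130),
  bitscore = c(850, 410, 790, 560, 900),
  stringsAsFactors = FALSE
)

# Sort by query, then by descending bitscore, so the best hit comes first.
blast_sorted <- blast[order(blast$qseqid, -blast$bitscore), ]

# Keep only the first (highest-bitscore) row for each query transcript.
best_hits <- blast_sorted[!duplicated(blast_sorted$qseqid), ]

# Count how many query transcripts map to each subject transcript;
# values above 1 are the many-to-one cases that complicate the mapping.
hits_per_target <- table(best_hits$sseqid)
print(best_hits[, c("qseqid", "sseqid", "bitscore")])
print(hits_per_target)
```

In real data the counts would then need collapsing or summing at whatever level (transcript or gene) the downstream KEGG IDs are defined on.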

Thanks for that. It's good to know someone else has had the same issues with this species as me, even though it's ironically a model species!

It was an option, but I managed to get to the point of GO enrichment using clusterProfiler and building a rainbow trout annotation package via AnnotationForge, so I'd like to continue using the rainbow trout genome if possible, even though it's been a struggle. I've managed to get some results via GhostKOALA, but I wondered: with your alignment to the S. salar transcriptome, what was your annotation coverage like for KEGG? Mine is in the region of 22%, but I don't have anything to ground this figure against.

KEGG has pretty much all of them:

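> library(KEGGREST)
> ## list all KEGG gene entries for Atlantic salmon (organism code "sasa")
> zz <- keggList("sasa")
> length(zz)
[1] 55214
> ## read in salmon alignments using tximport and compare:
> ## fraction of quantified IDs that have a KEGG entry
> sum(row.names(counts$counts) %in% gsub("sasa:", "", names(zz)))/nrow(counts$counts)
[1] 0.9980598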
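For comparison with the 22% figure mentioned earlier in the thread, the same kind of coverage check can be sketched for a GhostKOALA result in base R. This is only an illustration under an assumption: that the result is a text file with one line per query gene, where a tab and a K number follow the gene ID only when a KO was assigned. The lines below are a toy stand-in and the gene IDs are hypothetical.

```r
# Toy stand-in for a GhostKOALA result file. Assumed layout: one line per
# query gene, with a tab and a K number only when a KO was assigned.
# All gene IDs here are hypothetical.
ghost_lines <- c("gene1\tK00844", "gene2", "gene3\tK01810", "gene4", "gene5")

# Split each line on the tab; unassigned genes yield a single field.
fields <- strsplit(ghost_lines, "\t", fixed = TRUE)
gene <- vapply(fields, function(f) f[1], character(1))
ko <- vapply(fields,
             function(f) if (length(f) >= 2) f[2] else NA_character_,
             character(1))

# Fraction of query genes that received a K number.
annotated_fraction <- mean(!is.na(ko))

# Gene-to-KO table restricted to annotated genes, for downstream use.
gene2ko <- data.frame(gene = gene, ko = ko, stringsAsFactors = FALSE)[!is.na(ko), ]
print(annotated_fraction)
print(gene2ko)
```

As far as I know, K numbers like those in `gene2ko$ko` can be passed to clusterProfiler's `enrichKEGG()` with `organism = "ko"`, which avoids needing a species-specific KEGG entry, but that is worth confirming against the clusterProfiler documentation.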