GO/KEGG enrichment for non-model bacteria - help for a beginner
Hi everyone, I have just started my RNA-seq analysis and I am on the last step.
My organism is a new strain of the genus Aminobacter, closely related to Aminobacter aminovorans KTC2477 (KEGG entry T04342, aak 83263). In fact it is the same species but, unlike the reference, my strain acquired a symbiotic island from another organism (Mesorhizobium sp.), and the insertion is located on the chromosome. My strain is the first Aminobacter strain able to establish symbiosis with plants, and it can also survive under abiotic stress.
I want to know which genes and pathways are involved in the abiotic stress response. To that end, I performed RNA-seq under abiotic stress. I have the complete genome sequence and I annotated it with the RAST server. The server used the KTC2477 genome as reference, and both files (FASTA and GTF) were used for the analysis. The differential expression analysis was done with DESeq2.
This is what my GTF file looks like (its header says GFF version 3). Next to the ID (ID=fig|83263.11.peg.1) is the description (Name):
##gff-version 3
tig00000001 FIG CDS 190 300 . + 1 ID=fig|83263.11.peg.1;Name=hypothetical protein
tig00000001 FIG CDS 892 1479 . + 1 ID=fig|83263.11.peg.2;Name=hypothetical protein
tig00000001 FIG CDS 1839 1967 . - 0 ID=fig|83263.11.peg.3;Name=hypothetical protein
tig00000001 FIG CDS 1971 2294 . + 0 ID=fig|83263.11.peg.4;Name=hypothetical protein
tig00000001 FIG CDS 2322 2594 . + 0 ID=fig|83263.11.peg.5;Name=Dipeptide transport system permease protein DppB (TC 3.A.1.5.2)
and my DESeq2 results look like this:
DataFrame with 663 rows and 6 columns
baseMean log2FoldChange lfcSE
<numeric> <numeric> <numeric>
fig|83263.11.peg.3018 1470.74139158347 -4.56799911487077 0.209938589442722
fig|83263.11.peg.6033 499.052105615201 4.93572771540093 0.271157509696713
fig|83263.11.peg.2326 1561.17740754287 -4.09525319727112 0.236243487651701
fig|83263.11.peg.2325 694.205461173177 -3.85382768516696 0.226404901957689
fig|83263.11.peg.6032 304.943042427634 4.89500515555429 0.314517930030089
... ... ... ...
fig|83263.11.peg.1111 715.51725115005 0.523448875593969 0.189880233819811
fig|83263.11.peg.3889 177.084420993171 -0.842781044899269 0.305803225177248
fig|83263.11.peg.580 275.712626883396 0.688122866894606 0.249859595748499
fig|83263.11.peg.240 242.504276377507 0.625038523181459 0.227011499884674
fig|83263.11.peg.3003 287.946224525665 0.63544858587411 0.231341793195383
I would like to use clusterProfiler, but as you can see the IDs look like "fig|83263.11.peg.3018", so I cannot run the GO classification in the usual way. I don't know how to approach this. Can anyone suggest a solution?
Also, since my organism carries a chromosomal island inserted from another bacterium, could that be a problem for the gene ID annotation?
Any answer will be deeply appreciated!
Thanks
Liz
PS: My organism has no OrgDb package, and none is available through an AnnotationHub query.
Yes, strain ATCC2477 is the closest one. My problem is specifically about the IDs in my DE file. They look like "fig|83263.11.peg.3018". Are they suitable for the analysis? Can I use them for enrichment? From what I have read, I should annotate my IDs first, but I don't know how to do that. Sorry for all the questions, this is my first analysis.
If you are using AnnotationHub to retrieve an OrgDb, then check here - https://annotationhub.bioconductor.org/species - whether your strain is present. Then you can check whether your IDs belong to that strain. If yes, you can go ahead with enrichment using clusterProfiler, as you mentioned.
If not, then you need to take the sequences for those IDs from your data and do an orthology search. That way your sequences get mapped to the IDs of that strain; take those ortholog-mapped IDs for the enrichment.
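To make the mapping step concrete, here is a rough sketch, not a definitive recipe. It assumes the orthology search was run with BLAST and saved as 12-column tabular output (-outfmt 6); that tool choice is my assumption, and any ortholog finder that gives you query/hit pairs would do. The function name best_hits, the e-value cutoff and the "aak:HYPOTHETICAL_..." subject IDs are made up for illustration. It keeps the highest-bitscore hit per query, which is only a crude stand-in for true orthology (reciprocal best hits would be stricter).

```python
import csv
import io


def best_hits(blast_tab, max_evalue=1e-5):
    """Map each query ID to its best subject ID from BLAST tabular output.

    Expects the standard 12 columns of -outfmt 6:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    The cutoff of 1e-5 is an arbitrary illustrative default.
    """
    best = {}
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit too weak to trust
        # Keep only the highest-scoring hit seen so far for this query
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


# Invented rows for illustration only; the subject IDs are placeholders.
example = "\n".join([
    "fig|83263.11.peg.3018\taak:HYPOTHETICAL_0001\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "fig|83263.11.peg.3018\taak:HYPOTHETICAL_0002\t45.0\t180\t99\t2\t5\t184\t3\t182\t1e-20\t95",
    "fig|83263.11.peg.6033\taak:HYPOTHETICAL_0003\t30.0\t60\t42\t1\t1\t60\t1\t60\t0.5\t30",
])

mapping = best_hits(io.StringIO(example))
for fig_id, ref_id in mapping.items():
    print(fig_id, ref_id, sep="\t")
```

Genes with no acceptable hit (peg.6033 in the invented rows) simply drop out of the mapping. That is what you should expect for part of the inserted island, since those genes may have no counterpart in the reference strain.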
Hope this helps.
I used AnnotationHub, and my organism does not have an OrgDb.
I will try the orthology search. Thanks for your advice!