Question

Need to convert gene IDs for GAGE pathway analysis of honey bee RNA-seq data

0

Entering edit mode

joanito.liberti • 0

@joanitoliberti-9686

Last seen 6.6 years ago

Hi everybody,

I'm trying to perform a pathway analysis with the R Bioconductor packages Gage/Pathview on honey bee RNA-seq data. I think I got to more or less write a script that would work, except it doesn't because my gene expression data uses gene IDs from BeeBase, while the Kegg package of the honey bee uses ncbi gene ids, so I would need to convert the gene names of my differential expression results. Here's the code I wrote so far:

`res_ut_ins.fc=res_ut_ins$log2FoldChange

names(res_ut_ins.fc)=rownames(res_ut_ins)

exp.fc=res_ut_ins.fc

out.suffix="deseq2"


require(gage)

datakegg.gs)

kg.ame=kegg.gsets("ame")

names(kg.ame)

lapply(kg.ame[1:3], head)

lapply(exp.fc[1:3],head)

#Use the signaling pathways and metabolic pathways in the analysis. Extract those pathways:
kegg.gs=kg.ame$kg.sets[kg.ame$sigmet.idx]

#now call gage with your data like:
fc.kegg.p <- gage(exp.fc, gsets = kegg.gs, ref = NULL, samp = NULL, gene.idtype="RefSeq" )

sel <- fc.kegg.p$greater[, "q.val"] < 1 &
  !is.na(fc.kegg.p$greater[, "q.val"])

path.ids <- rownames(fc.kegg.p$greater)[sel]

sel.l <- fc.kegg.p$less[, "q.val"] < 1 &
  !is.na(fc.kegg.p$less[,"q.val"])

path.ids.l <- rownames(fc.kegg.p$less)[sel.l]

path.ids2 <- substr(c(path.ids, path.ids.l), 1, 8)

lapplykegg.gs[1:3], head, 3)

lengthkegg.gs)

head(exp.fc)

str(exp.fc)

head(path.ids)

head(path.ids2)

head(fc.kegg.p$greater)

require(pathview)
#view first 3 pathways as demo
pv.out.list <- sapply(path.ids2, function(pid) pathview(
  gene.data= exp.fc, pathway.id=pid, 
  species="ame", out.suffix=outsuffix, gene.idtype="Kegg"))`

So I think the problem is that my kg.sets look like:

$kg.sets$`ame04512 ECM-receptor interaction`
 [1] "100576742" "406097"    "408393"    "408551"    "408552"    "408826"    "409448"   
 [8] "409474"    "409924"    "410038"    "410775"    "412663"    "412941"    "413739"   
[15] "413782"    "724950"    "726736"    "726871"

While my exp.fc looks like:

GB40001-RA GB40002-RA GB40003-RA GB40004-RA GB40005-RA GB40006-RA 
 0.0000000         NA -0.1594682         NA -0.1337054 -0.1054865

Looking around for related posts I've found this solution (http://seqanswers.com/forums/archive/index.php/t-35472.html) to create a mapping table for conversion:

"You can download the gene_info data file from NCBI ftp site:ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Invertebrates/All_Invertebrates.gene_info.gz under unix/linux shell, do: gunzip All_Invertebrates.gene_info.gz egrep '(^7460)' All_Invertebrates.gene_info >>am.gene_info.txt

Column 2-6 are (Entrez) GeneID, Symbol, LocusTag, Synonyms, dbXrefs. You see GB numbers in Column 5 and 6. After get the ID mapping, you can use mol.sum in pathview package to merge redundant IDs into a unified expression level. ?pathview::mol.sum "

But my problem is that I don't know how to do the last bit of what is suggested here, getting the ID mapping in r and using mol.sum to merge the redundant IDs. Could someone check my script so far and help writing the missing lines of the script?

Thanks a lot in advance for your kind help.

Best, Joanito

R gage package rnaseq pathview • 1.9k views

ADD COMMENT • link 9.5 years ago joanito.liberti • 0

score 1 · Answer 1 · 2016-08-16

First off, please don't cross post. Either post here or on BioStars, but it's considered bad form to post in both places.

I don't know anything about mol.sum in the pathview package, but a quick perusal of the help page makes me think it isn't really the tool for the job. I could be mistaken, but this just seems like an issue of mapping IDs, which is to me just a base R thing. I took a quick look at beebase.org, but they don't seem to have anything useful for this step, so you might need to go with the NCBI file that the person on seqanswers came up with. The basic idea is to use the match function. I'm not going to download some monster file to show this, so I will just fake something up. Say you have a named vector exp.fc:

> exp.fc
GB40001-RA GB40002-RA GB40003-RA GB40004-RA GB40005-RA GB40006-RA
     0.000         NA     -0.150         NA     -0.130     -0.105

And you have the downloaded data from NCBI, with two columns; one contains the BeeBase ID, and the other the Gene ID:

> mapper
     beebase   egid
1 GB40001-RA 229416
2 GB40002-RA 892858
3 GB40003-RA 516761
4 GB40004-RA 382197
5 GB40005-RA 208267
6 GB40006-RA 901554

Mapping the beebase ID to Gene ID is a simple use of match.

> names(exp.fc) <- mapper[match(names(exp.fc), mapper$beebase),2]
> exp.fc
229416 892858 516761 382197 208267 901554
 0.000     NA -0.150     NA -0.130 -0.105

Note that match chooses the first match it sees, which may or may not be what you want. For any one-to-many matches (e.g. one BeeBase ID matches to two or more Gene IDs), you can do other things like randomly choose the Gene ID to use, but I'm not sure the benefits of that outweigh the difficulty of doing so.

score 0 · Answer 2 · 2016-08-16

0

Entering edit mode

joanito.liberti • 0

@joanitoliberti-9686

Last seen 6.6 years ago

Dear James,

Thank you so much for your help, it got me through the analysis. I'm sorry for posting this twice I thought the two forums were independent of one another, won't repeat!

Thanks again,

Joanito

ADD COMMENT • link 9.5 years ago joanito.liberti • 0