Question: toTable(org.Dr.egGO) into a data.frame with unique gene_id and all ensembl_ids in one column
Asked 12 months ago by Mehmet Ilyas Cosacak (Germany/Dresden/ CRTD - DZNE):

Hi,

I am trying to build a data frame from toTable(org.Dr.egENSEMBL2EG) in which each gene_id appears only once and all of its ensembl_ids are collected in a single column.

For example, I want to convert the following rows into a single row:

         gene_id         ensembl_id

16939 100000783 ENSDARG00000093071
16940 100000783 ENSDARG00000103015
16941 100000783 ENSDARG00000086233
16942 100000783 ENSDARG00000099123
16943 100000783 ENSDARG00000086304
16944 100000783 ENSDARG00000086591
16945 100000783 ENSDARG00000051736

as below:

           gene_id  ensembl_id

1       100000783 "ENSDARG00000093071,ENSDARG00000103015,ENSDARG00000086233,ENSDARG00000099123,ENSDARG00000086304,ENSDARG00000086591,ENSDARG00000051736"

 

my code is as below but it takes long time to generate the data.frame that I want to generate.

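library(org.Dr.eg.db)
nDf <- toTable(org.Dr.egENSEMBL2EG)
# split the table into the first row per gene_id and the remaining duplicates
d <- duplicated(nDf[,1])
nDb <- nDf[!d,]
tmp1 <- nDf[d,]
for(i in seq_len(nrow(nDb))){
    idxs <- which(tmp1[,1] == nDb[i,1])
    # c() drops an empty match, so genes with a single mapping get no trailing comma
    nDb[i,2] <- paste(c(nDb[i,2], tmp1[idxs,2]), collapse = ",")
}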

best,

ilyas.
 

Answered 12 months ago by James W. MacDonald (United States):

You should be using mapIds for this sort of thing. There are any number of ways you could do what you want, and probably better ways to present the data than a comma-separated string.

> z <- mapIds(org.Dr.eg.db, keys(org.Dr.eg.db), "ENSEMBL", "ENTREZID",multiVals="CharacterList")
> zz <- DataFrame(ENTREZID = names(z), ENSEMBL = z)
> zz
DataFrame with 37241 rows and 2 columns
             ENTREZID            ENSEMBL
          <character>    <CharacterList>
30037           30037 ENSDARG00000021948
30038           30038 ENSDARG00000010770
30065           30065 ENSDARG00000101744
30066           30066 ENSDARG00000077840
30067           30067 ENSDARG00000019588
...               ...                ...
105751184   105751184                 NA
105751185   105751185                 NA
106023290   106023290                 NA
106144553   106144553                 NA
106144554   106144554                 NA

> zzz <- data.frame(ENTREZID = names(z), ENSEMBL = sapply(z, paste, collapse = ", "))
> head(zzz)
      ENTREZID            ENSEMBL
30037    30037 ENSDARG00000021948
30038    30038 ENSDARG00000010770
30065    30065 ENSDARG00000101744
30066    30066 ENSDARG00000077840
30067    30067 ENSDARG00000019588
30068    30068 ENSDARG00000104702
> head(zzz[sapply(z, length) > 1,])
      ENTREZID                                                    ENSEMBL
30163    30163                     ENSDARG00000079402, ENSDARG00000045011
30217    30217                     ENSDARG00000097238, ENSDARG00000089087
30478    30478                     ENSDARG00000009702, ENSDARG00000101628
30491    30491                     ENSDARG00000087359, ENSDARG00000052207
30593    30593                     ENSDARG00000086522, ENSDARG00000090237
30597    30597 ENSDARG00000089475, ENSDARG00000089124, ENSDARG00000088330
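For readers who want to stay with the toTable() output from the question, the same collapse can probably also be done in base R without a loop. The sketch below is not from either poster. It uses a small hand-built data frame (four rows copied from the thread) in place of toTable(org.Dr.egENSEMBL2EG), and the names groups and collapsed are made up for illustration. It groups the Ensembl IDs with split() and pastes each group once, which avoids rescanning the table for every gene as the loop in the question does.

```r
# Toy stand-in for toTable(org.Dr.egENSEMBL2EG); rows copied from the thread
nDf <- data.frame(
  gene_id = c("100000783", "100000783", "100000783", "30037"),
  ensembl_id = c("ENSDARG00000093071", "ENSDARG00000103015",
                 "ENSDARG00000086233", "ENSDARG00000021948"),
  stringsAsFactors = FALSE
)

# Group the Ensembl IDs by gene_id (a named list of character vectors)
groups <- split(nDf$ensembl_id, nDf$gene_id)

# Paste each group into one comma-separated string, one row per gene
collapsed <- data.frame(
  gene_id = names(groups),
  ensembl_id = unname(vapply(groups, paste, character(1), collapse = ",")),
  stringsAsFactors = FALSE
)
print(collapsed)
```

On the real table the same two steps should apply unchanged, with nDf coming from toTable() instead of the toy data frame.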

Thank you very much, James! Sometimes I need a data.frame or an input file in the format above. For topGO, for example, I need an input file with the ensembl_id in the first column and all of its go_ids in the second column. That is one reason I am trying to learn a quicker way to generate a data.frame that collects multiple mappings in a single column.

best,

ilyas.

Reply written 12 months ago by Mehmet Ilyas Cosacak
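As a rough illustration of the topGO-style file mentioned in the reply, the sketch below writes one line per gene, with the gene ID, a tab, and then its GO IDs separated by commas. Everything in it is an assumption for illustration: the gene names geneA and geneB are placeholders, the gene-to-GO pairs are toy values rather than real annotations, and the variable names are made up. It uses only base R so it runs without any Bioconductor package.

```r
# Toy ensembl_id -> go_id pairs; placeholder genes, not real annotations
pairs <- data.frame(
  ensembl_id = c("geneA", "geneA", "geneB"),
  go_id = c("GO:0008150", "GO:0003674", "GO:0008150"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of GO IDs per gene
go_by_gene <- split(pairs$go_id, pairs$ensembl_id)

# One line per gene: "<gene><TAB><GO id>, <GO id>, ..."
map_lines <- paste(
  names(go_by_gene),
  vapply(go_by_gene, paste, character(1), collapse = ", "),
  sep = "\t"
)

# Write to a temporary file and show the result
map_file <- tempfile(fileext = ".map")
writeLines(map_lines, map_file)
cat(readLines(map_file), sep = "\n")
```

If topGO is installed, a file in this layout should be readable with its readMappings() function, and, as far as I know, the named list go_by_gene is already the shape that topGO's gene2GO annotation expects, so writing the file may not be necessary at all.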