Hey everyone,
I am working with the AnnotationForge package, but noticed a more general issue with the select command.
The results seem to be sensitive to the order of the values passed to the columns
parameter, and in some cases do not contain all the data.
Let me give a working example to illustrate my issue:
library(AnnotationForge)
# Two toy annotation tables that only partly share their GIDs
Gid <- paste0("transcript", 1:8)
Sym <- paste0("rp", 1:8)
Chr <- c(rep(1, 3), rep(2, 5))
Gid2 <- paste0("transcript", c(7:13, 7))  # transcript7 occurs twice (one-to-many)
Pro <- 7:14
fSym <- data.frame(GID = Gid, SYMBOL = Sym, CHROM = as.integer(Chr))
fSym$GID <- as.character(fSym$GID)
fPro <- data.frame(GID = Gid2, PROTEIN = Pro)
fPro$GID <- as.character(fPro$GID)
# Build the org package from both tables, then install and load it
makeOrgPackage(gene_info = fSym, pro_info = fPro,
               version = "0.1",
               maintainer = "Some One <so@someplace.org>",
               author = "Some One <so@someplace.org>",
               outputDir = ".",
               tax_id = "42",
               genus = "Omnia",
               species = "Marsupilami")
install.packages("./org.OMarsupilami.eg.db", repos = NULL)
library(org.OMarsupilami.eg.db)
Now I call the select command three times:
select(x = org.OMarsupilami.eg.db,
keys = keys(org.OMarsupilami.eg.db),
columns = c("GID","CHROM","PROTEIN"),
keytype = "GID")
select(x = org.OMarsupilami.eg.db,
keys = keys(org.OMarsupilami.eg.db),
columns = c("CHROM","PROTEIN"),
keytype = "GID")
select(x = org.OMarsupilami.eg.db,
keys = keys(org.OMarsupilami.eg.db),
columns = c("PROTEIN","CHROM"),
keytype = "GID")
All three examples query the columns PROTEIN and CHROM. The last two request them in the two possible orders, c("CHROM","PROTEIN") and c("PROTEIN","CHROM"), and the first one additionally includes the GID column. So I had expected their outputs not to deviate much from each other, but the results are rather remarkable:
            GID CHROM PROTEIN
1   transcript1     1    <NA>
2   transcript2     1    <NA>
3   transcript3     1    <NA>
4   transcript4     2    <NA>
5   transcript5     2    <NA>
6   transcript6     2    <NA>
7   transcript7     2       7
8   transcript7     2      14
9   transcript8     2       8
10  transcript9  <NA>       9
11 transcript10  <NA>      10
12 transcript11  <NA>      11
13 transcript12  <NA>      12
14 transcript13  <NA>      13
This is the output of the first example, which is pretty much exactly what I expected. However, the other two made me raise my eyebrows:
            GID CHROM PROTEIN
1   transcript1     1    <NA>
2   transcript2     1    <NA>
3   transcript3     1    <NA>
4   transcript4     2    <NA>
5   transcript5     2    <NA>
6   transcript6     2    <NA>
7   transcript7     2       7
8   transcript7     2      14
9   transcript8     2       8
10  transcript9  <NA>    <NA>
11 transcript10  <NA>    <NA>
12 transcript11  <NA>    <NA>
13 transcript12  <NA>    <NA>
14 transcript13  <NA>    <NA>

            GID PROTEIN CHROM
1   transcript1    <NA>  <NA>
2   transcript2    <NA>  <NA>
3   transcript3    <NA>  <NA>
4   transcript4    <NA>  <NA>
5   transcript5    <NA>  <NA>
6   transcript6    <NA>  <NA>
7   transcript7       7     2
8   transcript7      14     2
9   transcript8       8     2
10  transcript9       9  <NA>
11 transcript10      10  <NA>
12 transcript11      11  <NA>
13 transcript12      12  <NA>
14 transcript13      13  <NA>
It is striking that neither of these two examples returns all the data. Furthermore, the data in the last column seems to be dominated by the first column: values are only returned for keys that also have a value in the first requested column, so the result is order dependent. This dominance of the first column is my attempt to explain why all data is returned in the first example, where GID comes first.
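To make this guess more concrete, here is a minimal base-R sketch that mimics the pattern with merge() on the two input tables from above. It only contrasts left joins with a full outer join; that select does something similar internally is purely my assumption. The names chrom_first, protein_first and full_join are made up for this illustration, and unlike select the sketch does not append all-NA rows for keys without a match.

```r
# Rebuild the two input tables (SYMBOL is left out, it is not queried)
fSym <- data.frame(GID = paste0("transcript", 1:8),
                   CHROM = as.integer(c(rep(1, 3), rep(2, 5))),
                   stringsAsFactors = FALSE)
fPro <- data.frame(GID = paste0("transcript", c(7:13, 7)),
                   PROTEIN = 7:14,
                   stringsAsFactors = FALSE)

# Left join starting from the CHROM table: PROTEIN is only filled in
# for GIDs that have a CHROM entry (compare with the second output)
chrom_first <- merge(fSym, fPro, by = "GID", all.x = TRUE)

# Left join starting from the PROTEIN table: CHROM is only filled in
# for GIDs that have a PROTEIN entry (compare with the third output)
protein_first <- merge(fPro, fSym, by = "GID", all.x = TRUE)

# Full outer join: every GID from either table keeps its values
# (compare with the first output)
full_join <- merge(fSym, fPro, by = "GID", all = TRUE)

print(chrom_first)
print(protein_first)
print(full_join)
```

If the guess is right, only the full outer join keeps both the CHROM values of transcript1 to transcript6 and the PROTEIN values of transcript9 to transcript13.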
So, this behaviour boils down to the following three questions:
1. Is this behaviour due to the order of a certain SQL join (left, right, ...)?
2. Is there a simple way to overcome it and return a full join?
3. Or is the workaround to always include GID as the very first column in the query?
Does anyone out there have an idea?
And you could argue that a DataFrame is an even better thing to use, as there are one to many mappings in there: