trouble using select() to get all GO terms when have duplicate query IDs
1
0
Entering edit mode
Jenny Drnevich ★ 2.0k
@jenny-drnevich-2812
Last seen 6 months ago
United States

Hello,

I just spent WAY too much time tracking this down, but there's been a change somewhere between R 3.1.3 and R 3.2.2 in how select() handles duplicated query IDs, particularly for GO terms that have 1:many relationships. (I'm having trouble getting R 3.2.0 and 3.2.1 not to use 3.2.2's packages, so I'm not sure when the change happened). In R 3.1.3,  select() would give you a warning about duplicate query keys, but would still give you all the mappings:

R version 3.1.3 (2015-03-09) -- "Smooth Sidewalk"
#lines removed...
> library(org.Sc.sgd.db)
#lines removed...
> 
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16389
> 
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' resulted in 1:many mapping between keys and
return rows
> dim(all.sgd.go)
[1] 87827     4
> 
> #duplicate just a few SGD:
> 
> dup.sgd <- c(all.sgd,all.sgd[1:3])
> 
> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
Warning message:
In .generateExtraRows(tab, keys, jointype) :
  'select' and duplicate query keys resulted in 1:many
mapping between keys and return rows
> dim(dup.sgd.go)
[1] 87827     4
> sessionInfo()
R version 3.1.3 (2015-03-09)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] org.Sc.sgd.db_3.0.0  RSQLite_1.0.0       
[3] DBI_0.3.1            AnnotationDbi_1.28.1
[5] GenomeInfoDb_1.2.4   IRanges_2.0.1       
[7] S4Vectors_0.4.0      Biobase_2.26.0      
[9] BiocGenerics_0.12.1 

loaded via a namespace (and not attached):
[1] tools_3.1.3

 

In R 3.2.2, I do like that select now outputs whether you get 1:1, 1:many or many:many mappings, but there seems to be a bug in how it treats many:many mapping, because it's only listing one GO term per query key, including duplicates:

R version 3.2.2 (2015-08-14) -- "Fire Safety"
#lines removed
> library(org.Sc.sgd.db)
#lines removed
> 
> #get all SGD ids for yeast
> all.sgd <- keys(org.Sc.sgd.db, keytype = "SGD")
> length(all.sgd)
[1] 16450
> 
> all.sgd.go <- select(org.Sc.sgd.db, keys = all.sgd, keytype = "SGD", columns = "GO")
'select()' returned 1:many mapping between keys and columns
> dim(all.sgd.go)
[1] 88004     4
> 
> #duplicate just a few SGD:
> 
> dup.sgd <- c(all.sgd,all.sgd[1:3])
> length(dup.sgd)
[1] 16453
> dup.sgd.go <- select(org.Sc.sgd.db, keys = dup.sgd, keytype = "SGD", columns = "GO")
'select()' returned many:many mapping between keys and columns
> dim(dup.sgd.go)
[1] 16453     4
> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] org.Sc.sgd.db_3.2.3  RSQLite_1.0.0        DBI_0.3.1           
[4] AnnotationDbi_1.32.0 IRanges_2.4.1        S4Vectors_0.8.2     
[7] Biobase_2.30.0       BiocGenerics_0.16.1 

loaded via a namespace (and not attached):
[1] tools_3.2.2

Is this a bug or a desired behavior? 

Thanks,

Jenny

 

 
AnnotationDbi • 12k views
ADD COMMENT
0
Entering edit mode
@james-w-macdonald-5106
Last seen 1 hour ago
United States

Hi Jenny,

It's probably a bug, but what was happening before was probably a bug as well. Or maybe not. It depends on expectations, I suppose.

In other words, what should be returned if there are duplicate keys? You could argue that the return object should conform to the inputs, in which case if you pass in the same key 5 times you should get five sets of results back out, which is what happens currently:

> select(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "ORF", "SGD")
'select()' returned many:1 mapping between keys and columns
         SGD     ORF
1 S000000001 YAL001C
2 S000000001 YAL001C
3 S000000001 YAL001C
4 S000000001 YAL001C
5 S000000001 YAL001C

But not consistently. I patched AnnotationDbi to do what it used to do, and this is what you get:

> select(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "GO", "SGD")
'select()' returned many:many mapping between keys and columns
          SGD         GO EVIDENCE ONTOLOGY
1  S000000001 GO:0000127      IDA       CC
2  S000000001 GO:0000799      IDA       CC
3  S000000001 GO:0001002      IDA       MF
4  S000000001 GO:0001003      IDA       MF
5  S000000001 GO:0001005      IDA       MF
6  S000000001 GO:0001008      IDA       MF
7  S000000001 GO:0001009      IDA       BP
8  S000000001 GO:0001041      IDA       BP
9  S000000001 GO:0003677      IEA       MF
10 S000000001 GO:0005634      IEA       CC
11 S000000001 GO:0005739      IDA       CC
12 S000000001 GO:0005739      IEA       CC
13 S000000001 GO:0006351      IEA       BP
14 S000000001 GO:0006355      IEA       BP
15 S000000001 GO:0008301      IDA       MF
16 S000000001 GO:0042791      IDA       BP
17 S000000001 GO:0042791      IMP       BP
18 S000000001 GO:0071168      IMP       BP

That is not the same - to be consistent we should get these results repeated five times. But that's sort of silly, isn't it? But mapIds() does more what I suppose we should expect:

> mapIds(org.Sc.sgd.db, keys(org.Sc.sgd.db, "SGD")[rep(1,5)], "GO", "SGD", multiVals="CharacterList")
'select()' returned many:many mapping between keys and columns
CharacterList of length 5
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168
[["S000000001"]] GO:0000127 GO:0000799 GO:0001002 ... GO:0042791 GO:0071168

You could argue that duplicate keys imply an error on the part of the end user, in which case you would first subset to unique keys and then do the mapping, with a message saying that duplicate keys have been removed. That would certainly make the code simpler, and would remove any of these ambiguities. But maybe there is a reasonable use case for duplicated keys, in which case select() should be further modified to return duplicated results.

ADD COMMENT
0
Entering edit mode

Hi Jenny,

I have patched AnnotationDbi to do what is consistent, which is NOT the same thing as it did before. In other words, if you pass in duplicated keys, you get duplicated results, even if it's a 1:many mapping. As an example:

> select(org.Hs.eg.db, c("1","1"), "GO")
'select()' returned many:many mapping between keys and columns
   ENTREZID         GO EVIDENCE ONTOLOGY
1         1 GO:0003674       ND       MF
2         1 GO:0005576      IDA       CC
3         1 GO:0005615      IDA       CC
4         1 GO:0008150       ND       BP
5         1 GO:0070062      IDA       CC
6         1 GO:0072562      IDA       CC
7         1 GO:0003674       ND       MF
8         1 GO:0005576      IDA       CC
9         1 GO:0005615      IDA       CC
10        1 GO:0008150       ND       BP
11        1 GO:0070062      IDA       CC
12        1 GO:0072562      IDA       CC

Previously you would have received the same result as if you passed in a set of unique keys, but that wasn't a consistent result, as you got duplicates for a 1:1 mapping. This also affects mapIds():

> mapIds(org.Hs.eg.db, keys(org.Hs.eg.db)[c(1:5,1)], "GO", "ENTREZID", multiVals = "CharacterList")
'select()' returned 1:many mapping between keys and columns
CharacterList of length 6
[["1"]] GO:0003674 GO:0005576 GO:0005615 GO:0008150 GO:0070062 GO:0072562
[["2"]] GO:0001869 GO:0002020 GO:0002576 ... GO:0051056 GO:0070062 GO:0072562
[["3"]] <NA>
[["9"]] GO:0004060 GO:0005829 GO:0006805 GO:0044281
[["10"]] GO:0004060 GO:0005515 GO:0005829 GO:0006805 GO:0044281
[["1"]] GO:0003674 GO:0005576 GO:0005615 GO:0008150 GO:0070062 GO:0072562
> mapIds(org.Hs.eg.db, keys(org.Hs.eg.db)[c(1:5,1)], "GO", "ENTREZID", multiVals = "list")
'select()' returned 1:many mapping between keys and columns
$`1`
[1] "GO:0003674" "GO:0005576" "GO:0005615" "GO:0008150" "GO:0070062"
[6] "GO:0072562"

$`2`
 [1] "GO:0001869" "GO:0002020" "GO:0002576" "GO:0004867" "GO:0005102"
 [6] "GO:0005515" "GO:0005576" "GO:0005576" "GO:0005829" "GO:0007264"
[11] "GO:0007596" "GO:0007597" "GO:0010951" "GO:0010951" "GO:0019838"
[16] "GO:0019899" "GO:0019959" "GO:0019966" "GO:0022617" "GO:0030168"
[21] "GO:0030198" "GO:0031093" "GO:0043120" "GO:0048306" "GO:0048863"
[26] "GO:0051056" "GO:0070062" "GO:0072562"

$`3`
[1] NA

$`9`
[1] "GO:0004060" "GO:0005829" "GO:0006805" "GO:0044281"

$`10`
[1] "GO:0004060" "GO:0005515" "GO:0005829" "GO:0006805" "GO:0044281"

$`1`
[1] "GO:0003674" "GO:0005576" "GO:0005615" "GO:0008150" "GO:0070062"
[6] "GO:0072562"

Thanks for the bug report!

ADD REPLY
0
Entering edit mode

Thanks for making the changes. I’m out all this week so I won’t get to try it out until next week. I think an argument can be made that in the case of a X:many mapping using select(), that any duplicated keys could be removed safely because there are a variable number of return rows per key. However, I can see why you’d want the behavior to be similar to mapIds() or select() with X:1 mappings, where it makes sense to output one row per input key. Most importantly is that select() with many:many no longer just outputs one of the many! Happy Thanksgiving, Jenny

ADD REPLY

Login before adding your answer.

Traffic: 593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6