Putting this back on the list.
On 07/26/2013 01:59 PM, Marc Carlson wrote:
> On 07/26/2013 01:46 PM, Hervé Pagès wrote:
>> Hi Marc,
>>
>> On 07/26/2013 12:57 PM, Marc Carlson wrote:
>> ...
>>> Hello everyone,
>>>
>>> Sorry that I saw this thread so late. Basically, select() does *try*
>>> to keep your initial keys and map them each to an equivalent number
>>> of unique values. We did actually anticipate that people would
>>> *want* to cbind() their results.
>>>
>>> But as you discovered there are many circumstances where the data
>>> make this kind of behavior impossible.
>>>
>>> So passing in NAs as keys for example can't ever find anything
>>> meaningful. Those will simply have to be removed before we can
>>> proceed. And, it is also impossible to maintain a 1:1 mapping if
>>> you retrieve fields that have many to one relationships with your
>>> initial keys (also seen here).
>>>
>>> For convenience, when this kind of 1:1 output is already impossible
>>> (as it is for most of your examples), select will also try to
>>> simplify the output by removing rows that are identical all the way
>>> across etc.
>>>
>>> My aim was that select should try to do the most reasonable thing
>>> possible based on the data we have in each case. The rationale is
>>> that in the case where there are 1:many mappings, you should not be
>>> planning to bind those directly onto any other data.frames anyways
>>> (as this circumstance would require you to call merge() instead).
>>> So in that case, non-destructive simplification seems beneficial.
>>
>> Other tools in our infrastructure use an extra argument to pick-up 1
>> thing in case of multiple mapping e.g. findOverlaps() has the
>> 'select' argument with possible values "all", "first", "last", and
>> "arbitrary". Also nearest() and family have this argument and it
>> accepts similar values.
>>
>> Couldn't select() use a similar approach? The default should be
>> "all" so the current behavior is preserved but if it's something
>> else then the returned data.frame should align with the input.
>>
>> Thanks,
>> H.
>
> Hi Herve,
>
> I know that for things like findOverlaps it can sometimes make some
> sense to allow this kind of behavior. But in this case, I really
> don't think that this is a good idea. Biologically speaking, this is
> not something anyone should really ever do with annotations like
> this, so I really see no upside to making it more convenient for
> people to do stuff that they should basically never do anyways.

I agree that picking up one thing when your key is mapped to more than
one thing should be done carefully and the user needs to understand the
consequences so I'm all for keeping the current behavior as it is.
However I think there are a few legitimate situations where the user
might actually want to get rid of this complexity. For example:

  (a) One of them is the "multiple probe" situation i.e. when a probe
      is mapped to more than 1 gene. This is currently handled thru
      toggleProbes() when working directly with the Bimap API. However
      there is no equivalent for the select() API. This could be
      handled by supporting "none" in addition to disambiguation
      modes "all", "first", "last", and "arbitrary".

  (b) Another situation is when the user only wants to know whether
      the keys map to something or not. If s/he could disambiguate
      with "first" (or "last", or "arbitrary", it doesn't matter) then
      s/he would get back a data.frame that aligns with his/her vector
      of keys which is very convenient.

Given that those mappings are retrieved from a database, the notion of
"first" and "last" is maybe not clear and we might just want to support
"all", "arbitrary", and "none".

>
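To make the proposal above a bit more concrete, here is a rough sketch in base R of how such disambiguation could be applied after the fact to the data.frame returned by select(). Note that disambiguate() and its 'multi' argument are made up for this illustration and are not part of AnnotationDbi, and the toy 'res' data.frame only stands in for a real select() result (first column = keys). "arbitrary" is implemented here as "whichever row match() finds first", which is an assumption of the sketch and not a statement about how select() orders its rows.

```r
# Hypothetical helper, NOT part of AnnotationDbi: post-process a
# select()-like result 'res' (first column = keys) so that it aligns
# with the user-supplied 'keys' vector.
disambiguate <- function(res, keys, multi = c("all", "arbitrary", "none"))
{
    multi <- match.arg(multi)
    if (multi == "all")
        return(res)                      # current behavior, unchanged
    res_keys <- res[[1L]]
    if (multi == "none") {
        # Drop every key that maps to more than one row.
        n_per_key <- table(res_keys)
        ambiguous <- names(n_per_key)[as.vector(n_per_key) > 1L]
        res <- res[!(res_keys %in% ambiguous), , drop = FALSE]
        res_keys <- res[[1L]]
    }
    # match() returns the first matching row, so this keeps one row per
    # key and produces an all-NA row for keys that have no mapping.
    ans <- res[match(keys, res_keys), , drop = FALSE]
    ans[[1L]] <- keys
    rownames(ans) <- NULL
    ans
}

# Toy stand-in for a select() result (values made up for illustration).
res <- data.frame(PROBEID = c("p1", "p2", "p2", "p3"),
                  SYMBOL  = c("A",  "B",  "C",  "D"),
                  stringsAsFactors = FALSE)
keys <- c("p2", "p1", "zz", "p2")

arb  <- disambiguate(res, keys, "arbitrary")
none <- disambiguate(res, keys, "none")
```

With anything other than "all" the result has exactly one row per input key, so situation (b) reduces to !is.na(arb$SYMBOL), and the result can be cbind()'ed to other data that is aligned with 'keys'.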
> On the other hand, I suppose that I could have support for DataFrames
> as an output. But I worry that this would be a ton of work since the
> code would have to compress things at the right times?

Very roughly each column (except the 1st one) of the data.frame
returned by select() would need to go thru splitAsList() using the
1st column as the common split factor, and then go thru subsetting
by the user-supplied 'keys' vector. Here is a simple wrapper to
select() that does this but it doesn't know how to handle NAs in
the input:

  ## Requires select() from AnnotationDbi, and DataFrame() and
  ## splitAsList() from the IRanges/S4Vectors infrastructure.
  selectAsDataFrame <- function(x, keys, columns, keytype)
  {
      if (any(is.na(keys)))
          stop("NAs in 'keys' are not supported yet")
      keys0 <- unique(keys)
      ans0 <- select(x, keys0, columns, keytype)
      stopifnot(names(ans0)[1L] == keytype)
      f <- ans0[[1L]]  # 1st column = common split factor
      if (ncol(ans0) == 1L) {
          ans_col1 <- unique(f)
          m <- match(keys, ans_col1)
          ans <- DataFrame(ans_col1[m])
      } else {
          ## Split the 2nd column by key, then realign with 'keys'.
          ans_col2 <- unique(splitAsList(ans0[[2L]], f))
          ans_col1 <- names(ans_col2)
          m <- match(keys, ans_col1)
          ans <- DataFrame(ans_col1[m], unname(ans_col2)[m])
          if (ncol(ans0) >= 3L) {
              ## Same treatment for any remaining columns.
              ans_cols <- lapply(ans0[-(1:2)],
                  function(col)
                      unname(unique(splitAsList(col, f)[m])))
              ans <- cbind(ans, DataFrame(ans_cols))
          }
      }
      colnames(ans) <- colnames(ans0)
      ans
  }

Then:

  > library(org.Hs.eg.db)
  > selectAsDataFrame(org.Hs.eg.db,
                      keys=c("ALOX5", "HTR7", "XX", "ALOX5"),
                      keytype="ALIAS", columns=c("ENTREZID", "ENSEMBL"))
  DataFrame with 4 rows and 3 columns
          ALIAS        ENTREZID                         ENSEMBL
    <character> <characterlist>                 <characterlist>
  1       ALOX5             240 ENSG00000012779,ENSG00000262552
  2        HTR7            3363                 ENSG00000148680
  3          XX              NA                              NA
  4       ALOX5             240 ENSG00000012779,ENSG00000262552

An unpleasant effect of this approach is that every column is now a
List object even when not needed (e.g. ENTREZID).
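
One possible way around this, sketched here with plain R lists standing in for the List columns of a DataFrame, would be to turn a column back into an ordinary vector whenever none of its elements holds more than one value. simplifyListColumn() is a made-up name for this illustration, and an equivalent check would still have to be written for real CharacterList columns.

```r
# Hypothetical helper: turn a list column back into a plain vector when
# no element holds more than one value; otherwise return it unchanged.
simplifyListColumn <- function(col)
{
    if (any(lengths(col) > 1L))
        return(col)
    # Zero-length elements (no mapping) become NA.
    col[lengths(col) == 0L] <- NA
    unlist(col, use.names = FALSE)
}

# Plain-list stand-ins for the two List columns shown above.
entrezid <- list("240", "3363", NA_character_, "240")
ensembl  <- list(c("ENSG00000012779", "ENSG00000262552"),
                 "ENSG00000148680",
                 NA_character_,
                 c("ENSG00000012779", "ENSG00000262552"))

entrezid2 <- simplifyListColumn(entrezid)  # plain character vector
ensembl2  <- simplifyListColumn(ensembl)   # left as a list
```

Applied column by column, this would keep ENSEMBL as a list while giving ENTREZID back as an ordinary character vector.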

> Is there an automagical way to have DataFrames automagically pack up
> redundant columns? Of course, what then? Could they then cbind() a
> DataFrame to a data.frame?

No but they could cbind() it to another DataFrame.
My feeling is that the DataFrame route is maybe a little bit more
complicated than introducing an extra argument for disambiguating
the ordinary data.frame but I could be wrong. They are complementary
though so we could also support both ;-) Depends what the people on
the list find more useful...
Thanks,
H.
>
>
>
> Marc
>
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319