Question

DataFrame compatible left_join (merge) operation supporting S4 complex columns that never reorders rows

Michael Steinbaugh ▴ 90

@mjsteinbaugh

Cambridge, MA

Is there a left join operation that works on DataFrame class objects and NEVER rearranges the rows? The merge method for DataFrame (i.e. S4Vectors::merge) currently reorders rows, even when sort = FALSE.

# This supports S4 columns but will flip rows.
m <- S4Vectors::merge(
    x = x, y = y,
    by = "gene_id",
    all.x = TRUE,
    sort = FALSE
)

I'd like to be able to use something like dplyr::left_join() that supports complex S4 columns (e.g. CompressedCharacterList), rather than just the atomic and list columns supported in tibbles.

# This never flips rows, but doesn't support S4.
m <- dplyr::left_join(
    x = x, y = y,
    by = "gene_id"
)

s4 dataframe left_join merge • 1.7k views

updated 6.8 years ago by Hervé Pagès 16k • written 6.8 years ago by Michael Steinbaugh ▴ 90

score 2 · Accepted Answer · 2019-04-22

Hi,

A workaround is to perform the merge yourself, e.g. with something like this:

library(S4Vectors)
x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])
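# For each row of x, index of the matching row in y (NA if there is none).
m <- match(x$gene_id, y$gene_id)
# Subsetting y by m lines its rows up with x, so the order of x is kept.
cbind(x, y[m, ])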
# DataFrame with 7 rows and 4 columns
#         tx_id   gene_id   gene_id   gene_name
#   <character> <numeric> <integer> <character>
# 1           a         3         3           C
# 2           b        19        NA          NA
# 3           c         4         4           D
# 4           d         1         1           A
# 5           e         1         1           A
# 6           f         3         3           C
# 7           g         1         1           A

There is one problem, though, if the right DataFrame has an S4 column whose class doesn't support subsetting by a subscript containing NAs:

library(GenomicRanges)
y$range <- GRanges("chr1", IRanges(11:15, width=5))
y
# DataFrame with 5 rows and 3 columns
#     gene_id   gene_name      range
#   <integer> <character>  <GRanges>
# 1         1           A chr1:11-15
# 2         2           B chr1:12-16
# 3         3           C chr1:13-17
# 4         4           D chr1:14-18
# 5         5           E chr1:15-19

cbind(x, y[m, ])
# Error: subscript contains NAs

That's because GRanges objects don't accept NAs in the subscript:

y$range[m]
# Error: subscript contains NAs

One way to deal with this is to make sure that all the gene ids in the left DataFrame are mapped to a gene id in the right DataFrame. This will guarantee that the call to match() doesn't return any NA.
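
To check whether that condition holds before merging, something along these lines could be used. It is only a minimal sketch: `x_ids` and `y_ids` are illustrative stand-ins for `x$gene_id` and `y$gene_id` from the example above, so the snippet runs in base R on its own.

```r
# Stand-ins for x$gene_id and y$gene_id (illustrative names, not from the post).
x_ids <- c(3, 19, 4, 1, 1, 3, 1)
y_ids <- 1:5

# match() returns NA for every id in x_ids that is absent from y_ids.
m <- match(x_ids, y_ids)

if (anyNA(m)) {
    # Collect the offending ids so they can be fixed or filtered upstream.
    unmatched <- unique(x_ids[is.na(m)])
    message("No match in y for gene_id: ", paste(unmatched, collapse = ", "))
}
```

With the example data, only the id 19 is reported as unmatched.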

Another way is to exclude from the results the rows in x that are not matched to a row in y:

keep_idx <- !is.na(m)
cbind(x[keep_idx, ], y[m[keep_idx], ])
# DataFrame with 6 rows and 5 columns
#         tx_id   gene_id   gene_id   gene_name      range
#   <character> <numeric> <integer> <character>  <GRanges>
# 1           a         3         3           C chr1:13-17
# 2           c         4         4           D chr1:14-18
# 3           d         1         1           A chr1:11-15
# 4           e         1         1           A chr1:11-15
# 5           f         3         3           C chr1:13-17
# 6           g         1         1           A chr1:11-15

This is equivalent to calling merge() with all.x=FALSE, except that we've preserved the original order of the rows in x.
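
If this comes up often, the steps above could be wrapped in a small helper. The sketch below is an illustration, not part of S4Vectors: the name `join_keep_order` is hypothetical, it handles a single key column, and it assumes the keys in `y` are unique, because `match()` only reports the first hit. It also leaves the key column out of `y` so that `gene_id` is not duplicated in the result.

```r
library(S4Vectors)

# Hypothetical helper (not an S4Vectors function): join y onto x by one key
# column while keeping the row order of x. Rows of x without a match in y
# are dropped, as in the snippet above.
join_keep_order <- function(x, y, by) {
    # Assumption: keys in y are unique, since match() returns the first hit only.
    stopifnot(!anyDuplicated(y[[by]]))
    m <- match(x[[by]], y[[by]])
    keep_idx <- !is.na(m)
    # Leave the key column out of y so it is not duplicated in the result.
    y_cols <- setdiff(colnames(y), by)
    cbind(x[keep_idx, , drop = FALSE], y[m[keep_idx], y_cols, drop = FALSE])
}

x <- DataFrame(tx_id = letters[1:7], gene_id = c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id = 1:5, gene_name = LETTERS[1:5])
res <- join_keep_order(x, y, by = "gene_id")
```

Because the subscript passed to `y` never contains NAs, the same call should also work when `y` carries a column such as the GRanges one shown earlier.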

Hope this helps,

H.