Question: DataFrame compatible left_join (merge) operation supporting S4 complex columns that never reorders rows
Michael Steinbaugh (Constellation Pharmaceuticals) wrote, 5 months ago:

Is there a left join operation that works on DataFrame class objects and NEVER rearranges the rows? The base merge operation (i.e. S4Vectors::merge) currently will reorder rows, even when sort = FALSE.

# This supports S4 columns but will flip rows.
m <- S4Vectors::merge(
    x = x, y = y,
    by = "gene_id",
    all.x = TRUE,
    sort = FALSE
)

I'd like to be able to use something like dplyr::left_join() that supports complex S4 columns (e.g. CompressedCharacterList), rather than only the atomic and list columns that tibbles support.

# This never flips rows, but doesn't support S4.
m <- dplyr::left_join(
    x = x, y = y,
    by = "gene_id"
)
Answer: Hervé Pagès (United States) wrote, 5 months ago:

Hi,

A workaround is to perform the merge yourself with match(), e.g. with something like this:

library(S4Vectors)
x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])
m <- match(x$gene_id, y$gene_id)
cbind(x, y[m, ])
# DataFrame with 7 rows and 4 columns
#         tx_id   gene_id   gene_id   gene_name
#   <character> <numeric> <integer> <character>
# 1           a         3         3           C
# 2           b        19        NA          NA
# 3           c         4         4           D
# 4           d         1         1           A
# 5           e         1         1           A
# 6           f         3         3           C
# 7           g         1         1           A

There is one problem, though: this fails if the right DataFrame has an S4 column that doesn't support subsetting by a subscript containing NAs:

library(GenomicRanges)
y$range <- GRanges("chr1", IRanges(11:15, width=5))
y
# DataFrame with 5 rows and 3 columns
#     gene_id   gene_name      range
#   <integer> <character>  <GRanges>
# 1         1           A chr1:11-15
# 2         2           B chr1:12-16
# 3         3           C chr1:13-17
# 4         4           D chr1:14-18
# 5         5           E chr1:15-19

cbind(x, y[m, ])
# Error: subscript contains NAs

That's because GRanges objects don't accept NAs in the subscript:

y$range[m]
# Error: subscript contains NAs

One way to deal with this is to make sure that all the gene ids in the left DataFrame are mapped to a gene id in the right DataFrame. This will guarantee that the call to match() doesn't return any NA.
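
A quick way to check for this before merging could look like the sketch below. It uses plain vectors standing in for x$gene_id and y$gene_id; the names x_ids, y_ids, has_unmatched and unmatched_ids are made up for illustration.

```r
# Stand-ins for x$gene_id and y$gene_id from the example above
# (x_ids and y_ids are illustrative names, not from the original post).
x_ids <- c(3, 19, 4, 1, 1, 3, 1)
y_ids <- 1:5

m <- match(x_ids, y_ids)

# TRUE means at least one key in x has no partner in y, so y[m, ] would
# fail on columns (like GRanges) that reject NA subscripts.
has_unmatched <- anyNA(m)

# The offending keys, reported once each.
unmatched_ids <- unique(x_ids[is.na(m)])
unmatched_ids
# [1] 19
```

If unmatched_ids is empty, match() returned no NA and the cbind(x, y[m, ]) approach should be safe even with S4 columns in y.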

Another way is to exclude from the results the rows in x that are not matched to a row in y:

keep_idx <- !is.na(m)
cbind(x[keep_idx, ], y[m[keep_idx], ])
# DataFrame with 6 rows and 5 columns
#         tx_id   gene_id   gene_id   gene_name      range
#   <character> <numeric> <integer> <character>  <GRanges>
# 1           a         3         3           C chr1:13-17
# 2           c         4         4           D chr1:14-18
# 3           d         1         1           A chr1:11-15
# 4           e         1         1           A chr1:11-15
# 5           f         3         3           C chr1:13-17
# 6           g         1         1           A chr1:11-15

This is equivalent to calling merge() with all.x=FALSE, except that we've preserved the original order of the rows in x.
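
If you need this in more than one place, the two variants above could be wrapped in a small helper along these lines. This is only a sketch: ordered_join() and its drop_unmatched argument are hypothetical names, not part of S4Vectors, and it assumes a single key column whose values are unique in y. It also leaves the key column out of y so that gene_id doesn't show up twice in the result.

```r
library(S4Vectors)

# Hypothetical helper, not part of S4Vectors. Assumes a single key column
# `by` whose values are unique in y.
ordered_join <- function(x, y, by, drop_unmatched = FALSE) {
    stopifnot(anyDuplicated(y[[by]]) == 0L)
    m <- match(x[[by]], y[[by]])
    if (drop_unmatched) {
        # Keep only the rows of x that have a partner in y.
        keep_idx <- !is.na(m)
        x <- x[keep_idx, , drop = FALSE]
        m <- m[keep_idx]
    }
    # Leave the key column out of y so it is not duplicated in the result.
    extra <- setdiff(colnames(y), by)
    cbind(x, y[m, extra, drop = FALSE])
}

x <- DataFrame(tx_id = letters[1:7], gene_id = c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id = 1:5, gene_name = LETTERS[1:5])

# All 7 rows of x, in their original order; gene_name is NA for gene_id 19.
res_all <- ordered_join(x, y, by = "gene_id")

# Only the 6 matched rows of x, still in their original order.
res_matched <- ordered_join(x, y, by = "gene_id", drop_unmatched = TRUE)
```

Note that with drop_unmatched = FALSE the helper still hits the "subscript contains NAs" error when y has a column like GRanges and some keys in x are unmatched, so in that case use drop_unmatched = TRUE or make sure every key is mapped first.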

Hope this helps,

H.


Thanks Hervé! That's really clever, and exactly what I'm looking for.

Best, Mike

Reply written 5 months ago by Michael Steinbaugh