Hi,
A workaround is to perform your own merge, e.g. with something like this:
library(S4Vectors)
x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])
m <- match(x$gene_id, y$gene_id)
cbind(x, y[m, ])
# DataFrame with 7 rows and 4 columns
#         tx_id   gene_id   gene_id   gene_name
#   <character> <numeric> <integer> <character>
# 1           a         3         3           C
# 2           b        19        NA          NA
# 3           c         4         4           D
# 4           d         1         1           A
# 5           e         1         1           A
# 6           f         3         3           C
# 7           g         1         1           A
There is one problem, though: this fails if the right DataFrame has a column that is an S4 object that doesn't support subsetting by a subscript containing NAs:
library(GenomicRanges)
y$range <- GRanges("chr1", IRanges(11:15, width=5))
y
# DataFrame with 5 rows and 3 columns
#     gene_id   gene_name      range
#   <integer> <character>  <GRanges>
# 1         1           A chr1:11-15
# 2         2           B chr1:12-16
# 3         3           C chr1:13-17
# 4         4           D chr1:14-18
# 5         5           E chr1:15-19
cbind(x, y[m, ])
# Error: subscript contains NAs
That's because GRanges objects don't accept NAs in the subscript:
y$range[m]
# Error: subscript contains NAs
One way to deal with this is to make sure that all the gene ids in the left DataFrame are mapped to a gene id in the right DataFrame. This will guarantee that the call to match() doesn't return any NA.
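For instance, a quick check along these lines (just a sketch, reusing the toy x and y from above; the variable name unmapped is only for illustration) shows which gene ids in x have no counterpart in y before you attempt the merge:

```r
library(S4Vectors)

x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])

# Gene ids present in x but absent from y
unmapped <- setdiff(x$gene_id, y$gene_id)
unmapped
# [1] 19

# Equivalent check on the result of match(): TRUE means at least one
# row of x has no match in y
m <- match(x$gene_id, y$gene_id)
anyNA(m)
# [1] TRUE
```

If unmapped is empty then m contains no NAs and y[m, ] is safe even when y has a GRanges column.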
Another way is to exclude from the results the rows in x that are not matched to a row in y:
keep_idx <- !is.na(m)
cbind(x[keep_idx, ], y[m[keep_idx], ])
# DataFrame with 6 rows and 5 columns
#         tx_id   gene_id   gene_id   gene_name      range
#   <character> <numeric> <integer> <character>  <GRanges>
# 1           a         3         3           C chr1:13-17
# 2           c         4         4           D chr1:14-18
# 3           d         1         1           A chr1:11-15
# 4           e         1         1           A chr1:11-15
# 5           f         3         3           C chr1:13-17
# 6           g         1         1           A chr1:11-15
This is equivalent to calling merge() with all.x=FALSE, except that we've preserved the original order of the rows in x (and that the gene_id column appears twice in the result, once from each side).
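If you need this in more than one place, the steps above could be wrapped in a small helper. This is only a sketch: match_merge() is a made-up name, not something provided by S4Vectors, and it assumes a single key column with the same name in both DataFrames. It also drops the key column from y so that it doesn't show up twice in the result:

```r
library(S4Vectors)

# Hypothetical helper (not part of S4Vectors): keep only the rows of 'x'
# that have a match in 'y' on column 'by', preserving the row order of 'x'.
match_merge <- function(x, y, by) {
    m <- match(x[[by]], y[[by]])
    keep_idx <- !is.na(m)
    # Drop the key column from 'y' so it is not duplicated in the result
    y_cols <- setdiff(colnames(y), by)
    cbind(x[keep_idx, , drop=FALSE], y[m[keep_idx], y_cols, drop=FALSE])
}

x <- DataFrame(tx_id=letters[1:7], gene_id=c(3, 19, 4, 1, 1, 3, 1))
y <- DataFrame(gene_id=1:5, gene_name=LETTERS[1:5])

res <- match_merge(x, y, "gene_id")
colnames(res)
# [1] "tx_id"     "gene_id"   "gene_name"
res$tx_id
# [1] "a" "c" "d" "e" "f" "g"
```

Because the NAs are removed from m before y is subsetted, this should also work when y contains a column (like a GRanges) that doesn't accept NAs in the subscript.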
Hope this helps,
H.
Thanks Hervé! That's really clever, and exactly what I'm looking for.
Best, Mike