Question

Correct way to split and unsplit a DataFrame

Michael Steinbaugh ▴ 90

@mjsteinbaugh

Cambridge, MA

I'm having trouble splitting and unsplitting a DataFrame, using the methods defined in IRanges. Here's an attempt at a minimal reprex.

library(IRanges)
df <- DataFrame(
    a = seq_len(4L),
    b = as.factor(rep(c("b", "a"), each = 2L)),
    row.names = LETTERS[seq_len(4L)]
)
print(df)

DataFrame with 4 rows and 2 columns
          a        b
  <integer> <factor>
A         1        b
B         2        b
C         3        a
D         4        a

split <- split(x = df, f = df[["b"]])
print(split)

SplitDataFrameList of length 2
$a
DataFrame with 2 rows and 2 columns
          a        b
  <integer> <factor>
C         3        a
D         4        a

$b
DataFrame with 2 rows and 2 columns
          a        b
  <integer> <factor>
A         1        b
B         2        b

This works as expected and lets me manipulate the DataFrame by a grouping factor, similar to group_by() in dplyr. However, I'm having trouble coercing the split back to a standard DataFrame via unsplit().

unlist() coerces back to a DataFrame, but the rows come out ordered by factor level (a, then b) rather than in their original order, because the grouping factor is not tracked:

unlist(split, use.names = FALSE)

DataFrame with 4 rows and 2 columns
          a        b
  <integer> <factor>
C         3        a
D         4        a
A         1        b
B         2        b
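
One possible workaround, sketched here under the assumption that the row names are unique (as they are in this example), is to index the unlisted result by the original row names. This is not the intended unsplit() behaviour, only a stopgap; the names `grouped`, `unlisted`, and `restored` are illustrative, and `grouped` is used instead of `split` to avoid masking the function.

```r
library(IRanges)  # attaches S4Vectors, which provides DataFrame

df <- DataFrame(
    a = seq_len(4L),
    b = as.factor(rep(c("b", "a"), each = 2L)),
    row.names = LETTERS[seq_len(4L)]
)
grouped <- split(x = df, f = df[["b"]])

# Rows come back grouped by factor level, not in the original order.
unlisted <- unlist(grouped, use.names = FALSE)

# Assumption: row names are unique, so they can act as a lookup key
# that puts the rows back into their original order.
restored <- unlisted[rownames(df), , drop = FALSE]
print(restored)
```

Without row names, or with duplicated ones, this lookup is not available and the original row positions would have to be tracked some other way.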

Neither of these approaches with unsplit() works:

unsplit(split, f = df[["b"]])

## Error in unsplit(split, f = df[["b"]]) : 
##   Length of 'unlist(value)' must equal length of 'f'

unsplit(split, f = split[, "b"])

## Error in `splitAsList<-`(`*tmp*`, f, drop = drop, value = value) : 
##   Length of 'value' must equal the length of a split on 'f'
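
For comparison, here is the same round trip sketched on a base R data.frame, where unsplit() called with the original grouping factor puts the rows back in their original order. This only illustrates the behaviour being asked for; it says nothing about how the IRanges method is implemented, and `base_df`, `pieces`, and `roundtrip` are illustrative names.

```r
base_df <- data.frame(
    a = seq_len(4L),
    b = as.factor(rep(c("b", "a"), each = 2L)),
    row.names = LETTERS[seq_len(4L)]
)

# Split by the factor, then reverse the split with the same factor.
pieces <- split(x = base_df, f = base_df[["b"]])
roundtrip <- unsplit(value = pieces, f = base_df[["b"]])
print(roundtrip)
```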

See the related S4 method definition:

getMethod(
    f = "unsplit",
    signature = "List",
    where = asNamespace("IRanges")
)

s4vectors iranges • 1.6k views

updated 4.7 years ago by Michael Lawrence ★ 11k • written 4.7 years ago by Michael Steinbaugh ▴ 90

The stack() function also gets close, but it keeps the rows in split order and adds an index column, so it doesn't return the original DataFrame unmodified:

help(topic = "SplitDataFrameList", package = "IRanges")
stack(x = split, index.var = ".idx")

DataFrame with 4 rows and 3 columns
   .idx         a        b
  <Rle> <integer> <factor>
C     a         3        a
D     a         4        a
A     b         1        b
B     b         2        b

4.7 years ago by Michael Steinbaugh ▴ 90

score 1 · Accepted Answer · 2019-08-24

Michael Lawrence ★ 11k

@michael-lawrence-3846

United States

Thanks, fixed in version 2.18.2, to appear.

4.7 years ago by Michael Lawrence ★ 11k

Perfect, thanks Michael!

4.7 years ago by Michael Steinbaugh ▴ 90