Question

Printing DataFrame with nested DataFrames causes error

1

Welliton de Souza ▴ 70

@wdesouza

Last seen 3.4 years ago

Brazil

I would like to use the DataFrame class to represent a data.frame with nested data frames, for example a data frame that has a list of data frames as a column (one data frame for each row).

library(S4Vectors)
df <- DataFrame(a=c(1,2,3), b=c("a","b","c"))
df

Outputs:

DataFrame with 3 rows and 2 columns
          a           b
  <numeric> <character>
1         1           a
2         2           b
3         3           c

Now add a list of data frames as a new column of the DataFrame. These data frames may have different columns and different numbers of rows.

df$c <- list(DataFrame(x=c(1,2)), DataFrame(x=1,y=2), DataFrame())
df

Outputs an error:

DataFrame with 3 rows and 3 columns
Error in as.vector(x, mode = "character") : 
  no method for coercing this S4 class to a vector

But extracting a single cell works:

df[2, 3]

[[1]]
DataFrame with 1 row and 2 columns
          x         y
  <numeric> <numeric>
1         1         2

df[1, 3]

[[1]]
DataFrame with 2 rows and 1 column
          x
  <numeric>
1         1
2         2

However, it returns a list of 1 element rather than the nested DataFrame itself.

Is there a better way to work with nested data frames using Bioconductor base classes?

s4vectors dataframe • 2.2k views

ADD COMMENT • link updated 7.2 years ago by Hervé Pagès 16k • written 7.2 years ago by Welliton de Souza ▴ 70

0

I wonder whether these nested-data-frame structures are really consistent with R's vectorization and end-user (including the person who creates these objects!) comprehension?

For me, a more natural way to represent this (when all nested DataFrames have the same columns) would be a single data frame with one or more columns (e.g., df$group) describing the partitioning of rows into groups. Operations on columns (e.g., 'take the log of column x') are easily vectorized (df$logx <- log(df$x)), and many group-wise operations can be efficiently implemented using the *List infrastructure (e.g., the mean of column x by group, mean(splitAsList(df$x, df$group))).
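
A minimal sketch of that layout, assuming the IRanges package is installed (it attaches S4Vectors and supplies the mean() method for the list returned by splitAsList()). The object name df_long and its values are made up for illustration; the rows correspond to the first two nested tables of the question, stacked, with group recording which row of the parent each came from.

```r
library(IRanges)  # also attaches S4Vectors (DataFrame, splitAsList)

# Hypothetical long-format table: one row per row of a nested table,
# with 'group' identifying the parent row it belongs to.
df_long <- DataFrame(group = c("r1", "r1", "r2"),
                     x     = c(1, 2, 1))

# Column-wise operations stay vectorized.
df_long$logx <- log(df_long$x)

# Group-wise summary via the *List infrastructure:
# split x by group, then take the mean of each list element.
group_means <- mean(splitAsList(df_long$x, df_long$group))
group_means
```

The trade-off is that tables with different columns have to be padded or stored separately, which is the situation raised in the reply below.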

Even if the data frames have different structures, I do think that a 'tidy' data structure will in the end be more useful.

ADD REPLY • link 7.2 years ago Martin Morgan 25k

0

Thank you, Martin, for the comment. Actually, the nested data frames may have different shapes (numbers of columns and rows). The data I am working on come from web APIs (via the httr and jsonlite packages). I will update my example.

ADD REPLY • link 7.2 years ago Welliton de Souza ▴ 70

1

Hervé Pagès 16k

@herve-pages-1542

Last seen 17 hours ago

Seattle, WA, United States

Hi,

Note that this kind of nesting also "works" with ordinary data frames:

df <- data.frame(a=1:4, b=LETTERS[1:4])
df$c <- list(data.frame(x=1:2),
             NULL,
             data.frame(x=1:3,y=LETTERS[7:9],
                        stringsAsFactors=FALSE),
             data.frame())

Displaying the object doesn't raise an error as in the DataFrame case, but it doesn't really do a good job either:

df
#   a b                c
# 1 1 A             1, 2
# 2 2 B             NULL
# 3 3 C 1, 2, 3, G, H, I
# 4 4 D             NULL

2D-style subsetting works and also returns a data frame wrapped in a list of length 1:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

This behaves as expected if we think of 2D-style subsetting df[4, 3] as equivalent to df[[3]][4]. One could argue that this semantic is a little bit arbitrary and that we should rather think of it as equivalent to df[[3]][[4]]. However, the df[[j]][[i]] semantic would not be desirable in certain situations, e.g. when the j-th column of a DataFrame is an IRanges object. It would also cause some surprises, e.g. when i is an integer vector that is the result of a computation and is expected to be of arbitrary length but ends up being of length 1 in some situations.

One can always work around the small inconvenience of the current semantic (df[[j]][i]) by doing df[[3]][[4]].
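
As a small sketch of the difference, using only base R and the same df as above (the variable names cell, inner and third are just for illustration):

```r
df <- data.frame(a = 1:4, b = LETTERS[1:4])
df$c <- list(data.frame(x = 1:2),
             NULL,
             data.frame(x = 1:3, y = LETTERS[7:9],
                        stringsAsFactors = FALSE),
             data.frame())

# df[i, j] follows the df[[j]][i] semantic: a list of length 1.
cell <- df[4, 3]
is.data.frame(cell)    # FALSE
length(cell)           # 1

# df[[j]][[i]] reaches the nested data frame itself.
inner <- df[[3]][[4]]
is.data.frame(inner)   # TRUE

third <- df[[3]][[3]]
dim(third)             # 3 2
```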

So it looks like all that needs to be fixed is the display of a DataFrame with columns that are lists of data-frame-like objects.

Cheers,

H.

ADD COMMENT • link 7.2 years ago Hervé Pagès 16k

0

The display has been fixed in devel. The dropping behavior is already complex enough, so the goal is just consistency with data.frame.

ADD REPLY • link 7.2 years ago Michael Lawrence ★ 11k

0

Thanks Michael.

I forgot about df[[i, j]] (I never use it) but it works on ordinary data frames and does df[[j]][[i]]:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

df[[4, 3]]
# data frame with 0 columns and 0 rows

Maybe DataFrame objects could support it too.

H.

ADD REPLY • link 7.2 years ago Hervé Pagès 16k

0

I also forgot about that, thanks for the reminder. Support added.

ADD REPLY • link 7.2 years ago Michael Lawrence ★ 11k

0

Great, thanks! I should probably do the same for DelayedArray objects.

H.

ADD REPLY • link 7.2 years ago Hervé Pagès 16k

0

Thank you, Hervé, for the explanation. It was very clarifying. The fix in the development version worked for me. I agree with you and Michael that the behavior of DataFrame should be consistent with base data.frame.

ADD REPLY • link 7.2 years ago Welliton de Souza ▴ 70

score 2 · Accepted Answer · 2017-02-11

2

Michael Lawrence ★ 11k

@michael-lawrence-3846

Last seen 2.4 years ago

United States

A fix for the display issue will soon propagate. For the extraction issue, what were you expecting if not a single-element list?

ADD COMMENT • link 7.2 years ago Michael Lawrence ★ 11k

0

Thank you, Michael. I updated my R installation and got the latest version of the S4Vectors package. The error does not occur anymore. ~~I expected the DataFrame object itself instead of a list.~~ I tested Hervé's example with base data frames and the behavior was the same.

ADD REPLY • link 7.2 years ago Welliton de Souza ▴ 70