Question: Printing DataFrame with nested DataFrames causes error
Welliton Souza (Brazil) wrote, 17 months ago:

I would like to use the DataFrame class to represent a data.frame with nested data frames, for example a data frame that has a list of data frames as a column (one data frame for each row).

library(S4Vectors)
df <- DataFrame(a=c(1,2,3), b=c("a","b","c"))
df

Outputs:

DataFrame with 3 rows and 2 columns
          a           b
  <numeric> <character>
1         1           a
2         2           b
3         3           c

Now add a list of data frames as a new column of the DataFrame. These data frames may have different columns and different numbers of rows.

df$c <- list(DataFrame(x=c(1,2)), DataFrame(x=1,y=2), DataFrame())
df

Outputs an error:

DataFrame with 3 rows and 3 columns
Error in as.vector(x, mode = "character") :
  no method for coercing this S4 class to a vector

But extracting a single cell works:

df[2, 3]
[[1]]
DataFrame with 1 row and 2 columns
          x         y
  <numeric> <numeric>
1         1         2

df[1, 3]
[[1]]
DataFrame with 2 rows and 1 column
          x
  <numeric>
1         1
2         2

However, it returns a list of 1 element.

Is there a better way to work with nested data frames using Bioconductor base classes?

I wonder whether these nested-data-frame structures are really consistent with R's vectorization and with end-user comprehension (including that of the person who creates these objects!). For me, a more natural way to represent this (when all nested DataFrames have the same columns) would be a single data frame with one or more columns, e.g. df$group, describing the partitioning of rows into groups. Operations on columns (e.g., 'take the log of column x') are easily vectorized (df$logx <- log(df$x)), and many group-wise operations can be efficiently implemented using the *List infrastructure (e.g., the mean of column x by group, mean(splitAsList(df$x, df$group))).

Even if the data frames have different structure, I do think that a 'tidy' data structure will in the end be more useful.
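
A minimal sketch of the flat layout suggested in this comment, assuming the nested frames share a column x. The column names group and x and the values are made up for illustration; splitAsList() is provided by the IRanges package.

```r
library(IRanges)  # also attaches S4Vectors, which provides DataFrame

# One row per nested row; 'group' records which parent row each one belongs to.
# (Hypothetical values: group 1 holds x = 1, 2 and group 2 holds x = 1.)
df <- DataFrame(group = c(1, 1, 2), x = c(1, 2, 1))

# Column-wise operations stay vectorized.
df$logx <- log(df$x)

# Group-wise summaries via the *List infrastructure.
x_by_group <- splitAsList(df$x, df$group)  # a NumericList, one element per group
group_means <- mean(x_by_group)            # one mean per group
group_means
```

The same pattern should extend to other summaries that have *List methods, such as sum() or max().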

Thank you Martin for the comment. Actually, the nested data frames may have different shapes (number of columns and rows). The data I am working with come from web APIs (using the httr and jsonlite packages). I will update my example.
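
To illustrate how this kind of nesting typically arises, here is a small sketch using jsonlite::fromJSON() on a made-up payload. The field names id, items, x, and y are hypothetical and not taken from any real API.

```r
library(jsonlite)

# Hypothetical API payload: each record carries a nested array of objects,
# and the nested objects do not all have the same fields.
json <- '[
  {"id": 1, "items": [{"x": 1}, {"x": 2}]},
  {"id": 2, "items": [{"x": 1, "y": 2}]}
]'

res <- fromJSON(json)   # arrays of objects are simplified to data frames by default

class(res)              # a data.frame with one row per record
class(res$items)        # a list column ...
lapply(res$items, dim)  # ... whose elements are data frames of different shapes
```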

Michael Lawrence (United States) wrote, 17 months ago:

A fix will soon propagate for the display issue. For the extraction issue, what were you expecting if not a single element list?

Thank you Michael. I updated my R installation and got the latest version of the S4Vectors package. The error does not occur anymore. I expected the DataFrame object itself instead of a list. I tested Hervé's example with base data frames and the behavior was the same.

Hervé Pagès (United States) wrote, 17 months ago:

Hi,

Note that this kind of nesting also "works" with ordinary data frames:

df <- data.frame(a=1:4, b=LETTERS[1:4])
df$c <- list(data.frame(x=1:2),
             NULL,
             data.frame(x=1:3, y=LETTERS[7:9],
                        stringsAsFactors=FALSE),
             data.frame())


Trying to display the object doesn't raise an error like in the DataFrame case but doesn't really do a good job:

df
#   a b                c
# 1 1 A             1, 2
# 2 2 B             NULL
# 3 3 C 1, 2, 3, G, H, I
# 4 4 D             NULL


2D-style subsetting works and also returns a data frame wrapped in a list of length 1:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows


This behaves as expected if we think of 2D-style subsetting df[4, 3] as equivalent to df[[3]][4]. One could argue that this semantic is a little bit arbitrary and that we should rather think of it as equivalent to df[[3]][[4]] . However the df[[j]][[i]] semantic would not be desirable in certain situations e.g. when the j-th column of a DataFrame is an IRanges object. It would also cause some surprises e.g. when i is an integer vector that is the result of a computation and is expected to be of arbitrary length but ends up being of length 1 in some situations.

One can always work around the small inconvenience of the current semantic (df[[j]][i]) by doing df[[3]][[4]].
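
The two semantics can be compared side by side on an ordinary data frame. This is only a sketch that rebuilds the example above; the index i computed at the end is a hypothetical stand-in for "the result of a computation".

```r
# Rebuild the example data frame with a list column of data frames.
df <- data.frame(a = 1:4, b = LETTERS[1:4])
df$c <- list(data.frame(x = 1:2),
             NULL,
             data.frame(x = 1:3, y = LETTERS[7:9], stringsAsFactors = FALSE),
             data.frame())

# df[i, j] follows the df[[j]][i] semantic: the result is a list,
# whatever the length of i.
cell  <- df[3, 3]        # list of length 1
cells <- df[c(1, 3), 3]  # list of length 2
stopifnot(identical(cell, df[[3]][3]),
          identical(cells, df[[3]][c(1, 3)]))

# To get the nested data frame itself, use [[ on the column.
inner <- df[[3]][[3]]
stopifnot(is.data.frame(inner), nrow(inner) == 3)

# Because the return type does not depend on length(i), code that iterates
# over a computed index works the same for one row as for many.
i <- which(df$a > 2)  # length 2 here, but could be length 1 for other data
n_rows <- vapply(df[i, 3], NROW, integer(1))
n_rows
```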

So it looks like all that needs to be fixed is the display of a DataFrame with columns that are lists of data-frame-like objects.

Cheers,

H.

The display has been fixed in devel. The dropping behavior is already complex enough, so the goal is just consistency with data.frame.

Thanks Michael.

I forgot about df[[i, j]] (I never use it) but it works on ordinary data frames and does df[[j]][[i]]:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

df[[4, 3]]
# data frame with 0 columns and 0 rows

Maybe DataFrame objects could support it too.

H.

Great, thanks! I should probably do the same for DelayedArray objects.

H.