Printing DataFrame with nested DataFrames causes error
@wdesouza
Last seen 10 months ago
Brazil

I would like to use the DataFrame class to represent a data.frame with nested data frames, for example a data frame that has a list of data frames as a column (one data frame for each row).

library(S4Vectors)
df <- DataFrame(a=c(1,2,3), b=c("a","b","c"))
df

Outputs:

DataFrame with 3 rows and 2 columns
          a           b
  <numeric> <character>
1         1           a
2         2           b
3         3           c

Now add a list of data frames as a new column of the DataFrame. These data frames may have different columns and numbers of rows.

df$c <- list(DataFrame(x=c(1,2)), DataFrame(x=1,y=2), DataFrame())
df

Outputs an error:

DataFrame with 3 rows and 3 columns
Error in as.vector(x, mode = "character") :
  no method for coercing this S4 class to a vector

But it works:

df[2, 3]
[[1]]
DataFrame with 1 row and 2 columns
          x         y
  <numeric> <numeric>
1         1         2

df[1, 3]
[[1]]
DataFrame with 2 rows and 1 column
          x
  <numeric>
1         1
2         2

However, it returns a list of 1 element.

Is there a better way to work with nested data frames using Bioconductor base classes?

I wonder whether these nested-data-frame structures are really consistent with R's vectorization and with end-user comprehension (including that of the person who creates these objects!). For me a more natural way to represent this (when all nested DataFrames have the same columns) would be a single data frame with one or more columns, e.g. df$group, describing the partitioning of rows into groups. Operations on columns (e.g., 'take the log of column x') are easily vectorized (df$logx <- log(df$x)), and many group-wise operations can be efficiently implemented using the *List infrastructure (e.g., the mean of column x by group, mean(splitAsList(df$x, df$group))).

Even if the data frames have different structure, I do think that a 'tidy' data structure will in the end be more useful.
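The flat layout suggested above can be sketched as follows. This is only an illustration under stated assumptions: the object name flat, the column names, and the values are invented for the example, and it assumes the IRanges package is installed (it provides splitAsList and attaches S4Vectors).

```r
library(IRanges)  # assumed installed; attaches S4Vectors, which provides DataFrame

# Hypothetical flat table: each row records the group it belongs to,
# instead of storing one nested data frame per group.
flat <- DataFrame(group = c("a", "a", "b", "c", "c", "c"),
                  x     = c(1, 2, 4, 8, 16, 32))

# A column-wise operation is vectorized over all rows at once.
flat$logx <- log2(flat$x)

# A group-wise operation: split x by group into a list-like object,
# then take the mean of each list element.
group_means <- mean(splitAsList(flat$x, flat$group))
group_means
```

The grouping column plays the role that the nesting played before: the rows of one group correspond to one nested data frame.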


Thank you Martin for the comment. Actually, the nested data frames may have different shapes (number of columns and rows). The data I am working with come from web APIs (retrieved using the httr and jsonlite packages). I will update my example.

@michael-lawrence-3846
Last seen 7 weeks ago
United States

A fix will soon propagate for the display issue. For the extraction issue, what were you expecting if not a single element list?


Thank you Michael. I updated my R installation and got the latest version of the S4Vectors package. The error no longer occurs. I expected the DataFrame object itself instead of a list. I tested Hervé's example with base data frames and the behavior was the same.

@herve-pages-1542
Last seen 2 days ago
Seattle, WA, United States

Hi,

Note that this kind of nesting also "works" with ordinary data frames:

df <- data.frame(a=1:4, b=LETTERS[1:4])
df$c <- list(data.frame(x=1:2),
             NULL,
             data.frame(x=1:3, y=LETTERS[7:9],
                        stringsAsFactors=FALSE),
             data.frame())


Trying to display the object doesn't raise an error like in the DataFrame case but doesn't really do a good job:

df
#   a b                c
# 1 1 A             1, 2
# 2 2 B             NULL
# 3 3 C 1, 2, 3, G, H, I
# 4 4 D             NULL


2D-style subsetting works and also returns a data frame wrapped in a list of length 1:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows


This behaves as expected if we think of 2D-style subsetting df[4, 3] as equivalent to df[[3]][4]. One could argue that these semantics are a little bit arbitrary and that we should rather think of it as equivalent to df[[3]][[4]]. However, the df[[j]][[i]] semantics would not be desirable in certain situations, e.g. when the j-th column of a DataFrame is an IRanges object. They would also cause some surprises, e.g. when i is an integer vector that is the result of a computation and is expected to be of arbitrary length but ends up being of length 1 in some situations.
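To make the IRanges case concrete, here is a small sketch. The object ranges_df and its values are made up for illustration, and it assumes the IRanges package is installed.

```r
library(IRanges)  # assumed installed; attaches S4Vectors, which provides DataFrame

# Hypothetical DataFrame whose second column is an IRanges object.
ranges_df <- DataFrame(id = 1:2)
ranges_df$r <- IRanges(start = c(1, 10), width = c(5, 3))

# df[[j]][i] semantics: the result is still an IRanges, just a shorter one.
one_range <- ranges_df[2, "r"]

# df[[j]][[i]] semantics: the range is expanded into its integer positions,
# which is a different kind of object altogether.
positions <- ranges_df$r[[2]]
```

With the first form the class of the result does not depend on how many rows were selected, which is the property the argument above relies on.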

One can always work around the small inconvenience of the current semantics (df[[j]][i]) by doing df[[3]][[4]].
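The difference between the two forms can be checked with a few lines of base R. The data frame nested below is a made-up example, not the one from the thread.

```r
# Hypothetical data frame with a list column holding one data frame per row.
nested <- data.frame(id = 1:3)
nested$payload <- list(data.frame(x = 1:2),
                       data.frame(x = 1, y = "g"),
                       data.frame())

# 2D-style subsetting behaves like nested[[2]][2]:
# a list of length 1 that wraps the nested data frame.
cell_as_list <- nested[2, 2]

# The workaround nested[[2]][[2]] returns the nested data frame itself.
cell_itself <- nested[[2]][[2]]

# With the [i] form the result is a list whatever the length of i,
# so code written for a vector of row indices also works for a single index.
rows <- 2:3
several_cells <- nested[[2]][rows]
```

Here cell_as_list[[1]] and cell_itself are the same data frame; only the wrapping differs.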

So it looks like all that needs to be fixed is the display of a DataFrame with columns that are lists of data-frame-like objects.

Cheers,

H.


The display has been fixed in devel. The dropping behavior is already complex enough, so the goal is just consistency with data.frame.


Thanks Michael.

I forgot about df[[i, j]] (I never use it) but it works on ordinary data frames and does df[[j]][[i]]:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

df[[4, 3]]
# data frame with 0 columns and 0 rows

Maybe DataFrame objects could support it too.

H.


Great, thanks! I should probably do the same for DelayedArray objects.

H.


Thank you Hervé for the explanation. It was very clarifying. I think the fix in the development version worked for me. I agree with you and Michael that the behavior of DataFrame should be consistent with base data.frame.