Question: Printing DataFrame with nested DataFrames causes error
Welliton Souza (Brazil) wrote 9 months ago:

I would like to use the DataFrame class to represent a data frame with nested data frames, for example a data frame that has a list of data frames as a column (one data frame per row).

library(S4Vectors)
df <- DataFrame(a=c(1,2,3), b=c("a","b","c"))
df

Outputs:

DataFrame with 3 rows and 2 columns
          a           b
  <numeric> <character>
1         1           a
2         2           b
3         3           c

Now add a list of data frames as a new column of the DataFrame. These data frames may have different columns and different numbers of rows.

df$c <- list(DataFrame(x=c(1,2)), DataFrame(x=1,y=2), DataFrame())
df

Outputs an error:

DataFrame with 3 rows and 3 columns
Error in as.vector(x, mode = "character") : 
  no method for coercing this S4 class to a vector

But extracting a single cell works:

df[2, 3]

[[1]]
DataFrame with 1 row and 2 columns
          x         y
  <numeric> <numeric>
1         1         2

df[1, 3]

[[1]]
DataFrame with 2 rows and 1 column
          x
  <numeric>
1         1
2         2

However, it returns a list of 1 element rather than the DataFrame itself.

Is there a better way to work with nested data frames using Bioconductor base classes?

modified 9 months ago by Hervé Pagès • written 9 months ago by Welliton Souza

I wonder whether these nested-data-frame structures are really consistent with R's vectorization and end-user (including the person who creates these objects!) comprehension?

For me a more natural way to represent this (when all nested DataFrames have the same columns) would be a single data frame with one or more columns (e.g., df$group) describing the partitioning of rows into groups. Operations on columns (e.g., 'take the log of column x') are easily vectorized (df$logx <- log(df$x)) and many group-wise operations can be efficiently implemented using the *List infrastructure (e.g., the mean of column x by group, mean(splitAsList(df$x, df$group))).

Even if the data frames have different structure, I do think that a 'tidy' data structure will in the end be more useful.
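
A minimal base R sketch of the 'tidy' layout described above, using the three nested data frames from the question. It uses plain data.frame and tapply() in place of DataFrame and splitAsList() so that it runs without Bioconductor; the helper pad_and_tag() and the column name group are made up for illustration.

```r
# Nested data frames from the question (base data.frame used here for portability).
nested <- list(data.frame(x = c(1, 2)), data.frame(x = 1, y = 2), data.frame())

# Union of all column names, so every piece can be padded to the same shape.
all_cols <- unique(unlist(lapply(nested, names)))

# Hypothetical helper: pad missing columns with NA and tag each row
# with the index of the parent row it belongs to.
pad_and_tag <- function(d, i) {
  if (nrow(d) == 0L) return(NULL)  # empty data frames contribute no rows
  for (nm in setdiff(all_cols, names(d))) d[[nm]] <- NA_real_
  cbind(group = i, d[all_cols])
}

pieces <- Map(pad_and_tag, nested, seq_along(nested))
tidy <- do.call(rbind, Filter(Negate(is.null), pieces))

# Column-wise operations stay vectorized ...
tidy$logx <- log(tidy$x)

# ... and group-wise summaries follow from the grouping column.
group_means <- tapply(tidy$x, tidy$group, mean)
```

In this layout the third (empty) data frame simply has no rows, so if its existence matters it has to be recorded in the parent table.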

modified 9 months ago • written 9 months ago by Martin Morgan

Thank you, Martin, for the comment. Actually, the nested data frames may have different shapes (number of columns and rows). The data I am working with come from web APIs (using the httr and jsonlite packages). I will update my example.
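
For context, a small sketch of how such a structure can arise, assuming the jsonlite package is installed. The JSON string is invented for illustration and stands in for an API response; with its default simplification, fromJSON() turns a nested array of records into a list column holding one data frame per row.

```r
library(jsonlite)

# Made-up payload standing in for a web API response.
json <- '[
  {"a": 1, "c": [{"x": 1}, {"x": 2}]},
  {"a": 2, "c": [{"x": 1, "y": 2}]}
]'

api_df <- fromJSON(json)

# Column c is a list with one data frame per row, each with its own shape.
str(api_df$c)
```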

written 9 months ago by Welliton Souza

Michael Lawrence (United States) wrote 9 months ago:

A fix will soon propagate for the display issue. For the extraction issue, what were you expecting if not a single element list?

written 9 months ago by Michael Lawrence

Thank you, Michael. I updated my R installation and got the latest version of the S4Vectors package. The error no longer occurs. I expected the DataFrame object itself instead of a list. I tested Hervé's example with base data frames and the behavior was the same.

modified 9 months ago • written 9 months ago by Welliton Souza

Hervé Pagès (United States) wrote 9 months ago:

Hi,

Note that this kind of nesting also "works" with ordinary data frames:

df <- data.frame(a=1:4, b=LETTERS[1:4])
df$c <- list(data.frame(x=1:2),
             NULL,
             data.frame(x=1:3,y=LETTERS[7:9],
                        stringsAsFactors=FALSE),
             data.frame())

Trying to display the object doesn't raise an error as in the DataFrame case, but the output isn't very informative:

df
#   a b                c
# 1 1 A             1, 2
# 2 2 B             NULL
# 3 3 C 1, 2, 3, G, H, I
# 4 4 D             NULL

2D-style subsetting works and also returns a data frame wrapped in a list of length 1:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

This behaves as expected if we think of 2D-style subsetting df[4, 3] as equivalent to df[[3]][4]. One could argue that this semantics is a little bit arbitrary and that we should rather think of it as equivalent to df[[3]][[4]]. However, the df[[j]][[i]] semantics would not be desirable in certain situations, e.g. when the j-th column of a DataFrame is an IRanges object. It would also cause some surprises, e.g. when i is an integer vector that is the result of a computation and is expected to be of arbitrary length but ends up being of length 1 in some situations.

One can always work around the small inconvenience of the current semantics (df[[j]][i]) by doing df[[3]][[4]].
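
The two semantics can be compared side by side on an ordinary data frame. This is a small sketch with made-up data, not taken from the thread's own example objects; the variable names are chosen for illustration only.

```r
df <- data.frame(a = 1:3)
df$c <- list(data.frame(x = 1:2), data.frame(x = 1, y = 2), data.frame())

# 2D-style subsetting follows df[[j]][i]: the result is a list
# with one element per selected row.
one_row  <- df[2, "c"]     # list of length 1
two_rows <- df[2:3, "c"]   # list of length 2

# Double-bracket extraction on the column, df[[j]][[i]], gives the element itself.
cell <- df[["c"]][[2]]     # a data frame with 1 row and 2 columns

# When the row index is computed, the result type does not depend on its length.
rows <- which(df$a >= 3)   # happens to be of length 1 here
computed <- df[rows, "c"]  # still a list, not a bare data frame
```

Code that loops over the result of df[rows, "c"] therefore works the same way whether rows selects one row or many.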

So it looks like all that needs to be fixed is the display of a DataFrame with columns that are lists of data-frame-like objects.

Cheers,

H.

modified 9 months ago • written 9 months ago by Hervé Pagès

The display has been fixed in devel. The dropping behavior is already complex enough, so the goal is just consistency with data.frame.

written 9 months ago by Michael Lawrence

Thanks Michael.

I forgot about df[[i, j]] (I never use it) but it works on ordinary data frames and does df[[j]][[i]]:

df[4, 3]
# [[1]]
# data frame with 0 columns and 0 rows

df[[4, 3]]
# data frame with 0 columns and 0 rows

Maybe DataFrame objects could support it too.

H.

written 9 months ago by Hervé Pagès

I also forgot about that, thanks for the reminder. Support added.

written 9 months ago by Michael Lawrence

Great, thanks! I should probably do the same for DelayedArray objects.

H.

written 9 months ago by Hervé Pagès

Thank you, Hervé, for the explanation. It was very clarifying. I think the fix in the development version worked for me. I agree with you and Michael that the behavior of DataFrame should be consistent with base data.frame.

written 9 months ago by Welliton Souza