S4Vectors DataFrame bug with haven_labelled data
@daniel-e-weeks-10677
Pittsburgh, Pennsylvania, United States…

Creating a DataFrame of haven_labelled data results in a bug where the DataFrame won't print properly.

Minimal working example

library(S4Vectors)
library(haven)
dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
attr(dn,"format.spss") <- "F3.0"
a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
DataFrame(a)
DataFrame(zap_labels(a))
packageVersion("S4Vectors")
packageVersion("haven")
R.version$version.string
BiocManager::version()
sessionInfo()

Note the error generated (below) when we attempt to execute the DataFrame(a) command.

Also, as you can see from the example below, this error no longer occurs if we zap the labels, as DataFrame(zap_labels(a)) works fine.

Stepping through makeNakedCharacterMatrixForDisplay to see which line generates the error, it looks to me like it is generated by the head(x, nhead) part of this line:

            m <- rbind(makeNakedCharacterMatrixForDisplay(head(x, nhead)),
                       rbind(rep.int("...", x_ncol)),
                       makeNakedCharacterMatrixForDisplay(tail(x, ntail)))

So maybe the head method for a DataFrame doesn't like haven_labelled data?

Thank you for any guidance or insight you may be able to provide.

Results of running the minimal working example code above:

> library(S4Vectors)
> library(haven)
> dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
> attr(dn,"format.spss") <- "F3.0"
> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
> DataFrame(a)
DataFrame with 15 rows and 3 columns
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions
> DataFrame(zap_labels(a))
DataFrame with 15 rows and 3 columns
            a         sex        dn
    <integer> <character> <numeric>
1           1           M         1
2           2           M         1
3           3           M         1
4           4           M         1
5           5           M         1
...       ...         ...       ...
11         11           F         2
12         12           F         2
13         13           F         2
14         14           F         2
15         15           F         2
> packageVersion("S4Vectors")
[1] ‘0.36.0’
> packageVersion("haven")
[1] ‘2.5.1’
> R.version$version.string
[1] "R version 4.2.2 (2022-10-31)"
> BiocManager::version()
[1] ‘3.16’
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.7

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets 
[7] methods   base     

other attached packages:
[1] haven_2.5.1         S4Vectors_0.36.0    BiocGenerics_0.44.0

loaded via a namespace (and not attached):
 [1] fansi_1.0.3         utf8_1.2.2          digest_0.6.30      
 [4] lifecycle_1.0.3     magrittr_2.0.3      evaluate_0.18      
 [7] pillar_1.8.1        rlang_1.0.6         cli_3.4.1          
[10] rstudioapi_0.14     ellipsis_0.3.2      vctrs_0.5.1        
[13] rmarkdown_2.18      forcats_0.5.2       tools_4.2.2        
[16] glue_1.6.2          hms_1.1.2           xfun_0.35          
[19] yaml_2.3.6          fastmap_1.1.0       compiler_4.2.2     
[22] pkgconfig_2.0.3     BiocManager_1.30.19 htmltools_0.5.3    
[25] knitr_1.41          tibble_3.1.8
@james-w-macdonald-5106
United States

It's not really a bug if a package designed to be interoperable with the set of packages within Bioconductor is not interoperable with a CRAN package. As you have already found the solution (remove the labels that you added using the haven package), it seems you have already answered your own question?


Understood, but I would certainly like to be able to carry through the haven-labelled data to later parts of our workflow instead of having to strip the labels out, as carrying this 'data dictionary' information forward would be helpful to us. Also, I'd like to understand why this behavior is happening.


It's hijacking the dispatch for the extractROWS function.

> library(S4Vectors)
> library(haven)
> dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
> attr(dn,"format.spss") <- "F3.0"
> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
> showMethods(extractROWS)
Function: extractROWS (package S4Vectors)
x="ANY", i="ANY"
x="array", i="RangeNSBS"
x="data.frame", i="RangeNSBS"
x="DataFrame", i="ANY"
x="LLint", i="ANY"
x="LLint", i="NSBS"
x="LLint", i="RangeNSBS"
x="Rle", i="ANY"
x="Rle", i="NSBS"
x="Rle", i="RangeNSBS"
x="Rle", i="RleNSBS"
x="SortedByQueryHits", i="ANY"
x="TransposedDataFrame", i="ANY"
x="Vector", i="ANY"
x="vector_OR_factor", i="RangeNSBS"

## Seems OK so far
> DataFrame(a)
DataFrame with 15 rows and 3 columns
Error in h(simpleError(msg, call)) : 
  error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions

> showMethods(extractROWS)
Function: extractROWS (package S4Vectors)
x="ANY", i="ANY"
x="array", i="RangeNSBS"
x="data.frame", i="RangeNSBS"
x="DataFrame", i="ANY"
x="DFrame", i="integer"
    (inherited from: x="DataFrame", i="ANY")
x="haven_labelled", i="NativeNSBS"   <------------------- It is now being dispatched to the generic x = "ANY", i = "ANY" method
    (inherited from: x="ANY", i="ANY")
x="integer", i="NativeNSBS"
    (inherited from: x="ANY", i="ANY")
x="LLint", i="ANY"
x="LLint", i="NSBS"
x="LLint", i="RangeNSBS"
x="Rle", i="ANY"
x="Rle", i="NSBS"
x="Rle", i="RangeNSBS"
x="Rle", i="RleNSBS"
x="SortedByQueryHits", i="ANY"
x="TransposedDataFrame", i="ANY"
x="Vector", i="ANY"
x="vector_OR_factor", i="RangeNSBS"

## Here's the last bit of traceback() after the error

> traceback()
28: h(simpleError(msg, call))
27: .handleSimpleError(function (cond) 
    .Internal(C_tryCatchHelper(addr, 1L, cond)), "incorrect number of dimensions", 
        base::quote(proxy[, ..., drop = FALSE]))
26: vec_index(x, i, ...)
25: `[.vctrs_vctr`(structure(c("M", "M", "M", "M", "M", "M", "M", 
    "F", "F", "F", "F", "F", "F", "F", "F"), labels = c(Male = "M", 
    Female = "F"), class = c("haven_labelled", "vctrs_vctr", "character"
    )), 1:5, drop = FALSE)
24: .Primitive("[")(structure(c("M", "M", "M", "M", "M", "M", "M", 
    "F", "F", "F", "F", "F", "F", "F", "F"), labels = c(Male = "M", 
    Female = "F"), class = c("haven_labelled", "vctrs_vctr", "character"
    )), 1:5, drop = FALSE)
23: do.call(`[`, args)
22: do.call(`[`, args)
21: FUN(X[[i]], ...)
20: FUN(X[[i]], ...)
19: lapply(as.list(x), extractROWS, i)
18: lapply(as.list(x), extractROWS, i)
17: extractROWS(x, i)

The extractROWS function comes from S4Vectors and is used by head (or show) to extract the first N rows from the various S4 objects defined in that package. But the columns of your 'a' data.frame aren't any of the listed objects, so they are given the default method, which in essence sends them off to the tidyverse vctrs package, which predictably blows up.
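
The mechanism can be reproduced without S4Vectors or haven. The sketch below is a simplified, hypothetical stand-in for the default row-extraction path: `extract_rows_sketch` and the `strict_vec` class are illustrative names, not part of either package. It only assumes what the traceback above shows, namely that the default method ends up calling `[` with `drop = FALSE`.

```r
# Simplified stand-in for the default row extraction: always call `[` with
# drop = FALSE, as seen in frames 22-24 of the traceback above.
extract_rows_sketch <- function(x, i) {
  do.call(`[`, list(x, i, drop = FALSE))
}

# Toy class (hypothetical) whose `[` method rejects any argument besides `i`,
# loosely mimicking how the labelled vector reacts to `drop`.
`[.strict_vec` <- function(x, i, ...) {
  if (...length() > 0L) stop("incorrect number of dimensions")
  structure(unclass(x)[i], class = "strict_vec")
}

plain  <- c(10, 20, 30, 40)
strict <- structure(c(10, 20, 30, 40), class = "strict_vec")

# Works: base `[` ignores drop on a plain vector.
extract_rows_sketch(plain, 1:2)

# Fails: the class's own `[` method receives drop = FALSE and errors.
res <- tryCatch(extract_rows_sketch(strict, 1:2),
                error = function(e) conditionMessage(e))
res
```

The point of the toy class is only that the failure comes from the column's own `[` method, not from the code that calls it.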


Perhaps you could explain your use case? It's not apparent to me what sort of labeled data would best be encapsulated in a DataFrame, but perhaps I am missing something. If you are already a tidyverse aficionado, is there a particular reason a tibble isn't a better data.frame-like substance?


I mean

> tibble(a)
# A tibble: 15 × 3
       a sex        dn         
   <int> <chr+lbl>  <dbl+lbl>  
 1     1 M [Male]    1 [Male]  
 2     2 M [Male]    1 [Male]  
 3     3 M [Male]    1 [Male]  
 4     4 M [Male]    1 [Male]  
 5     5 M [Male]    1 [Male]  
 6     6 M [Male]    1 [Male]  
 7     7 M [Male]    1 [Male]  
 8     8 F [Female] NA         
 9     9 F [Female]  2 [Female]
10    10 F [Female]  2 [Female]
11    11 F [Female]  2 [Female]
12    12 F [Female]  2 [Female]
13    13 F [Female]  2 [Female]
14    14 F [Female]  2 [Female]
15    15 F [Female]  2 [Female]

Our use case is an Illumina EPIC chip analysis pipeline, where the sample data gets embedded in the rgSet object as a DataFrame. Many of our source data files are in SPSS *.sav format, so they are automatically read in as haven_labelled data, with embedded value labels that we would like to carry through our pipeline instead of removing. I think we should be able to do this by modifying the [ operator method:

setMethod("[", "DataFrame", ...

so that it can handle haven_labelled data by zapping the labels before handing off to extractROWS and extractCOLS.


Looks like DataFrame isn't happy with the tibble version either:

DataFrame(tibble(a))
DataFrame with 15 rows and 3 columns
 Error in h(simpleError(msg, call)) : 
error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions

To attempt to fix this, I forked the S4Vectors package from Bioconductor:

https://github.com/Bioconductor/S4Vectors

Then, in subsetting-utils.R, I changed the default_extractROWS function by adding a line at the beginning to zap the labels from x:

diff --git a/R/subsetting-utils.R b/R/subsetting-utils.R
index 2119c35..edb66d9 100644
--- a/R/subsetting-utils.R
+++ b/R/subsetting-utils.R
@@ -519,6 +519,9 @@ default_extractROWS <- function(x, i)
 {
   if (is.null(x) || missing(i))
     return(x)
+  if (any(sapply(x, inherits, "haven_labelled"))) {
+     x <- haven::zap_labels(x)
+  }
   ## dynamically call [i,,,..,drop=FALSE] with as many "," as length(dim)-1
   ndim <- max(length(dim(x)), 1L)
   i <- normalizeSingleBracketSubscript(i, x, allow.NAs=TRUE, allow.append=TRUE)

Using the modified S4Vectors package, a sub-settable rgSet can be created without needing to zap the labels from the haven_labelled columns.

However, these haven labels are preserved in the 'parent' rgSet but are not preserved in a subsetted 'child' rgSet:

> r1 <- rgSet[,c(1:10)]
> table(sapply(pData(rgSet), inherits, "haven_labelled"))

FALSE  TRUE 
  160    83 
> table(sapply(pData(r1), inherits, "haven_labelled"))

FALSE 
  243

So the goal of supporting haven_labelled data in rgSets has only been partly achieved.
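
One possible way to keep the 'data dictionary' available even when the labels have to be removed is to save it separately before stripping the columns. The following is only a base-R sketch under stated assumptions: the `sex` vector is a hand-built stand-in for a labelled column (assuming the variable label and value labels live in the `label` and `labels` attributes), `is_lab`, `dict`, `plain` and `pd` are hypothetical names, and nothing here addresses how to store the dictionary inside an rgSet.

```r
# Toy stand-in for a labelled column (hypothetical); real ones would come
# from reading a *.sav file with haven.
sex <- structure(c(1, 1, 2),
                 label  = "Biological sex",
                 labels = c(Male = 1, Female = 2),
                 class  = c("haven_labelled", "vctrs_vctr", "double"))
cols <- list(id = 1:3, sex = sex)

is_lab <- function(col) inherits(col, "haven_labelled")

# 1. Save the dictionary: variable label and value labels per labelled column.
dict <- lapply(Filter(is_lab, cols),
               function(col) attributes(col)[c("label", "labels")])

# 2. Strip every attribute (class included) so only plain vectors remain.
plain <- lapply(cols, function(col) {
  if (is_lab(col)) attributes(col) <- NULL
  col
})
pd <- data.frame(plain)

# 3. Look values up in the dictionary whenever the labels are needed.
value_labels <- dict$sex$labels
names(value_labels)[match(pd$sex, value_labels)]
```

With a real data.frame, step 2 corresponds to the `zap_labels()` call used earlier in this thread; the dictionary from step 1 is the part that would otherwise be lost.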


As I said, the right thing to do is to fix [ on haven_labelled objects in the haven package itself. Any attempt to fix this elsewhere is not satisfactory and won't prevent those objects from causing problems. We must be able to use these objects anywhere double vectors are used (haven_labelled inherits from double). Right now we can't because they break on a common operation (x[i, drop=TRUE] or x[i, drop=FALSE]) that works just fine on double objects. It's impossible to know exactly how many places in the vast Bioconductor + CRAN ecosystem use x[i, drop=TRUE] or x[i, drop=FALSE] on double objects, but there are probably a great many. This means that haven_labelled objects will cause problems in many places unless they can also handle x[i, drop=TRUE] and x[i, drop=FALSE].


Thank you for the additional explanation and insight - I will open an issue about this with the haven maintainers.


OK, opened an issue with the haven maintainers about this at:

https://github.com/tidyverse/haven/issues/698

Thank you!


Hi!

haven co-maintainer here - just wanted to add here for reference that this is an issue with the vctrs package, which we use to implement haven_labelled and haven_labelled_spss, and this error will likely also come up with other classes that use vctrs as a base.

I've opened an issue over there: https://github.com/r-lib/vctrs/issues/1751

Danny


Thank you for the helpful and clear explanation!

@herve-pages-1542
Seattle, WA, United States

Hi,

This fails because [ doesn't work properly on those haven_labelled objects:

library(haven)

dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))

dn[1:3]
# <labelled<double>[3]>: Biological sex
# [1] 1 1 1
# 
# Labels:
#  value   label
#     -2     N/A
#     -1 Missing
#      1    Male
#      2  Female

dn[1:3, drop=TRUE]
# Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions

dn[1:3, drop=FALSE]
# Error in proxy[, ..., drop = FALSE] : incorrect number of dimensions

This is something that would need to be fixed in the haven package.

I'm not saying that using drop=TRUE or drop=FALSE is useful or should do something special when subsetting those objects, but at least specifying the drop argument should not cause an error. (Plus the error is really cryptic and misleading in the case of drop=TRUE.)

For example, on an ordinary vector, the argument is just ignored (i.e. has no effect):

letters[1:3, drop=TRUE]
# [1] "a" "b" "c"

letters[1:3, drop=FALSE]
# [1] "a" "b" "c"

In other words: haven_labelled objects should not break on that kind of operation if they are to work seamlessly with the huge amount of existing code that operates on vector-like objects (e.g. with S4Vectors::extractROWS() and many other things in the Bioconductor ecosystem and elsewhere).
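
A quick way to find out which columns would trip over this before building a DataFrame is to probe each one. This is a minimal sketch: `tolerates_drop` is a hypothetical helper, and `fragile_vec` is a toy class standing in for any vector class whose `[` method rejects `drop`.

```r
# Hypothetical helper: TRUE if `x[i, drop = FALSE]` succeeds on this vector.
tolerates_drop <- function(x) {
  tryCatch({
    x[seq_len(min(1L, length(x))), drop = FALSE]
    TRUE
  }, error = function(e) FALSE)
}

# Toy class (hypothetical) whose `[` method refuses extra arguments.
`[.fragile_vec` <- function(x, i, ...) {
  if (...length() > 0L) stop("extra subscripts are not supported")
  structure(unclass(x)[i], class = "fragile_vec")
}

cols <- list(a = 1:3,
             b = letters[1:3],
             c = structure(c(1, 2, 3), class = "fragile_vec"))

# One logical per column; FALSE marks columns that would break.
ok <- vapply(cols, tolerates_drop, logical(1))
names(ok)[!ok]
```

The same helper could be applied to the columns of a real data.frame with `vapply(df, tolerates_drop, logical(1))`; given the errors shown above, labelled columns would be expected to come back `FALSE` with the package versions in this thread.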

Cheers,

H.

sessionInfo():

R version 4.2.0 Patched (2022-05-04 r82318)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.1 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-4.2.r82318/lib/libRblas.so
LAPACK: /home/hpages/R/R-4.2.r82318/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB              LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] haven_2.5.1         S4Vectors_0.36.0    BiocGenerics_0.44.0

loaded via a namespace (and not attached):
 [1] fansi_1.0.3     utf8_1.2.2      lifecycle_1.0.3 magrittr_2.0.3 
 [5] pillar_1.8.1    rlang_1.0.6     cli_3.4.1       vctrs_0.5.1    
 [9] ellipsis_0.3.2  tools_4.2.0     forcats_0.5.2   glue_1.6.2     
[13] hms_1.1.2       compiler_4.2.0  pkgconfig_2.0.3 tcltk_4.2.0    
[17] tibble_3.1.8

But it seems OK for the data.frame that the OP is using.

> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)

> a[,3,drop = TRUE]
<labelled<double>[15]>: Biological sex
 [1]  1  1  1  1  1  1  1 NA  2  2  2  2  2  2  2

Labels:
 value   label
    -2     N/A
    -1 Missing
     1    Male
     2  Female
> a[1:5,3,drop = TRUE]
<labelled<double>[5]>: Biological sex
[1] 1 1 1 1 1

Labels:
 value   label
    -2     N/A
    -1 Missing
     1    Male
     2  Female

> a[1:5,3,drop = FALSE]
  dn
1  1
2  1
3  1
4  1
5  1

Where by 'OK' I mean it doesn't error out. I don't really get what the drop argument is doing in this instance, but not being a tidyverse person, I probably don't need to know.


Thank you for looking into this - yes, the source of the problem is that the S4Vectors [ operator doesn't work properly on the haven_labelled objects. It looks to me like the [ operator is redefined within the S4Vectors code, so it is probably simpler to modify the S4Vectors [ method for DataFrame objects so that it can accommodate DataFrames that contain haven_labelled variables than to try to make a fix in the haven package.

Actually, the S4Vectors documentation states some important points to be aware of regarding its [ operator: the man page for DataFrame says that when

x is a DataFrame.

x[i,j,drop]: Behaves very similarly to the [.data.frame method, except i can be a logical Rle object and subsetting by matrix indices is not supported. Indices containing NA's are also not supported.


There seems to be some confusion. It's not about using drop=TRUE or drop=FALSE on the data.frame or DataFrame that contains haven_labelled objects in its columns. It's about specifying the drop argument when subsetting the haven_labelled object _directly_. As you can see in the example I show above, this breaks [. This example does not involve the S4Vectors package, only the haven package.

So again, dn[1:3, drop=TRUE] should work on haven_labelled object dn, independently of what we think about the relevance or usefulness of specifying drop=TRUE in that case. That's because haven_labelled objects inherit from double objects, and doing x[1:3, drop=TRUE] on a double vector DOES work (drop=TRUE is ignored but it works). An important principle in OOP is that when your objects of class B extend class A, one should be able to use your objects in any place where A objects are used. In other words, you should make sure that B objects are compatible with A's API. You can extend the API, that is, you can introduce new operations for your B objects that are not supported by A objects, but you should never break A's API.
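
To make the principle concrete, here is a minimal sketch of a subclass whose `[` method stays compatible with the base vector API by accepting `drop` and ignoring it. The `toy_labelled` class and `new_lab` constructor are hypothetical and are not how haven or vctrs implement their classes.

```r
# Toy subclass (hypothetical) of a plain vector that carries value labels.
new_lab <- function(x, labels) {
  structure(x, labels = labels, class = "toy_labelled")
}

# `drop` is accepted and ignored, matching what base R does for plain vectors.
`[.toy_labelled` <- function(x, i, ..., drop = TRUE) {
  new_lab(unclass(x)[i], attr(x, "labels"))
}

v <- new_lab(c(1, 1, 2, NA), labels = c(Male = 1, Female = 2))

r_none  <- v[1:3]
r_true  <- v[1:3, drop = TRUE]
r_false <- v[1:3, drop = FALSE]

# All three calls succeed and return the same labelled subset.
identical(r_none, r_true) && identical(r_none, r_false)
```

Because `drop` is a named formal after `...`, it is matched by name and never reaches the underlying subsetting call, so the labels and class survive every variant.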

Cheers,

H.


Yes, I was confused. Thank you for taking the time to provide the additional explanation and clarification. I understand now.
