Creating a DataFrame of haven_labelled data results in a bug where the DataFrame won't print properly.
Minimal working example
library(S4Vectors)
library(haven)
dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
attr(dn,"format.spss") <- "F3.0"
a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
DataFrame(a)
DataFrame(zap_labels(a))
packageVersion("S4Vectors")
packageVersion("haven")
R.version$version.string
BiocManager::version()
sessionInfo()
Note the error generated (below) when we attempt to execute the DataFrame(a) command.
Also, as you can see from the example below, this error no longer occurs if we zap the labels, as DataFrame(zap_labels(a)) works fine.
Stepping through makeNakedCharacterMatrixForDisplay to see which line generates the error, it looks to me like it is generated by the head(x, nhead) part of this line:
m <- rbind(makeNakedCharacterMatrixForDisplay(head(x, nhead)),
rbind(rep.int("...", x_ncol)),
makeNakedCharacterMatrixForDisplay(tail(x, ntail)))
So maybe the head method for a DataFrame doesn't like haven_labelled data?
Thank you for any guidance or insight you may be able to provide.
Results of running the minimal working example code above:
> library(S4Vectors)
> library(haven)
> dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
> attr(dn,"format.spss") <- "F3.0"
> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
> DataFrame(a)
DataFrame with 15 rows and 3 columns
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions
> DataFrame(zap_labels(a))
DataFrame with 15 rows and 3 columns
            a         sex        dn
    <integer> <character> <numeric>
1           1           M         1
2           2           M         1
3           3           M         1
4           4           M         1
5           5           M         1
...       ...         ...       ...
11         11           F         2
12         12           F         2
13         13           F         2
14         14           F         2
15         15           F         2
> packageVersion("S4Vectors")
[1] ‘0.36.0’
> packageVersion("haven")
[1] ‘2.5.1’
> R.version$version.string
[1] "R version 4.2.2 (2022-10-31)"
> BiocManager::version()
[1] ‘3.16’
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.7
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] haven_2.5.1 S4Vectors_0.36.0 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] fansi_1.0.3 utf8_1.2.2 digest_0.6.30
[4] lifecycle_1.0.3 magrittr_2.0.3 evaluate_0.18
[7] pillar_1.8.1 rlang_1.0.6 cli_3.4.1
[10] rstudioapi_0.14 ellipsis_0.3.2 vctrs_0.5.1
[13] rmarkdown_2.18 forcats_0.5.2 tools_4.2.2
[16] glue_1.6.2 hms_1.1.2 xfun_0.35
[19] yaml_2.3.6 fastmap_1.1.0 compiler_4.2.2
[22] pkgconfig_2.0.3 BiocManager_1.30.19 htmltools_0.5.3
[25] knitr_1.41 tibble_3.1.8

Understood, but I would certainly like to be able to carry through the haven-labelled data to later parts of our workflow instead of having to strip the labels out, as carrying this 'data dictionary' information forward would be helpful to us. Also, I'd like to understand why this behavior is happening.
It's hijacking the dispatch for the extractROWS function.

The extractROWS function comes from S4Vectors and is used by head (or show) to extract the first N rows from the various S4 objects defined in that package. But the columns of your 'a' data.frame aren't any of the listed objects, so they are given the default method, which in essence sends them off to the tidyverse vctrs package, which predictably blows up.

Perhaps you could explain your use-case? It's not apparent to me what sort of labeled data would best be encapsulated in a DataFrame, but perhaps I am missing something. If you are already a tidyverse aficionado, is there a particular reason a tibble isn't a better data.frame-like substance?

I mean

Our use case is an Illumina EPIC chip analysis pipeline, where the sample data gets embedded in the rgSet object as a DataFrame. Many of our source data files are in SPSS *.sav format, so they are automatically read in as haven_labelled data, with embedded value labels that we would like to carry along through our pipeline instead of removing. I think we should be able to do this by modifying the [ operator method so that it can handle haven_labelled data by zapping the labels before handing off to extractROWS and extractCOLS.

Looks like DataFrame isn't happy with the tibble version either.

To attempt to fix this, I forked the S4Vectors package from Bioconductor, from here:
https://github.com/Bioconductor/S4Vectors

Then, in subsetting-utils.R, I changed the default_extractROWS function by adding a line at the beginning to zap the labels from x.

Using the modified S4Vectors package, a sub-settable rgSet can be created without needing to zap the labels from the haven_labelled columns.

However, the haven labels are preserved in the 'parent' rgSet but are not preserved in a subsetted 'child' rgSet.

So the goal of supporting haven_labelled data in rgSets has only been partly achieved.

As I said, the right thing to do is to fix [ on haven_labelled objects in the haven package itself. Any attempt to fix this elsewhere is not satisfactory and won't prevent those objects from causing problems. We must be able to use these objects anywhere double vectors are used (haven_labelled inherits from double). Right now we can't, because they break on a common operation (x[i, drop=TRUE] or x[i, drop=FALSE]) that works just fine on double objects. It's impossible to know exactly how many places in the vast Bioconductor + CRAN ecosystem use x[i, drop=TRUE] or x[i, drop=FALSE] on double objects, but there are probably many, many of them. This means that haven_labelled objects will cause problems in many, many places unless they can also handle x[i, drop=TRUE] and x[i, drop=FALSE].

Thank you for the additional explanation and insight - I will open an issue about this with the haven maintainers.

OK, opened an issue with the haven maintainers about this at:
https://github.com/tidyverse/haven/issues/698
Thank you!
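
For anyone who wants to check the subsetting behaviour described above on their own installation, here is a minimal sketch. It assumes only that haven is installed; the variable names (x_plain, x_lab, plain_result, lab_result) are made up for illustration. The labelled case is wrapped in tryCatch because the outcome may depend on the installed haven and vctrs versions.

```r
# Sketch: compare subsetting with an explicit `drop` argument on a plain
# double vector and on a haven_labelled vector built from the same values.
library(haven)

x_plain <- c(1, 1, 2)
x_lab <- labelled(x_plain, labels = c(Male = 1, Female = 2))

# On a plain double vector, `drop` is accepted and has no effect.
plain_result <- x_plain[1:2, drop = FALSE]
print(plain_result)

# On the labelled vector, capture either the subset or the error message,
# so the script runs whichever way the installed packages behave.
lab_result <- tryCatch(
  x_lab[1:2, drop = FALSE],
  error = function(e) paste("Error:", conditionMessage(e))
)
print(lab_result)
```

If lab_result comes back as an error message rather than a labelled vector, that is the same kind of failure that DataFrame runs into when it tries to show the first and last rows.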
Hi!
haven co-maintainer here - just wanted to add here for reference that this is an issue with the vctrs package, which we use to implement haven_labelled and haven_labelled_spss, and this error will likely also come up with other classes that use vctrs as a base.

I've opened an issue over there: https://github.com/r-lib/vctrs/issues/1751
Danny
Thank you for the helpful and clear explanation!