haven_labelled data results in a bug where the DataFrame won't print properly.
Minimal working example
library(S4Vectors)
library(haven)
dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
attr(dn,"format.spss") <- "F3.0"
a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
DataFrame(a)
DataFrame(zap_labels(a))
packageVersion("S4Vectors")
packageVersion("haven")
R.version$version.string
BiocManager::version()
sessionInfo()
Note the error generated (below) when we attempt to execute the DataFrame(a) call.
Also, as you can see from the example below, this error no longer occurs if we zap the labels, as DataFrame(zap_labels(a)) works fine.
Looking at the code that calls makeNakedCharacterMatrixForDisplay to see which line generates the error, it looks to me like it is generated by the head(x, nhead) part of this line:
m <- rbind(makeNakedCharacterMatrixForDisplay(head(x, nhead)),
           rbind(rep.int("...", x_ncol)),
           makeNakedCharacterMatrixForDisplay(tail(x, ntail)))
So maybe the head method for a DataFrame doesn't like haven_labelled data?
Thank you for any guidance or insight you may be able to provide.
Results of running the minimal working example code above:
> library(S4Vectors)
> library(haven)
> dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
> attr(dn,"format.spss") <- "F3.0"
> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
> DataFrame(a)
DataFrame with 15 rows and 3 columns
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions
> DataFrame(zap_labels(a))
DataFrame with 15 rows and 3 columns
            a         sex        dn
    <integer> <character> <numeric>
1           1           M         1
2           2           M         1
3           3           M         1
4           4           M         1
5           5           M         1
...       ...         ...       ...
11         11           F         2
12         12           F         2
13         13           F         2
14         14           F         2
15         15           F         2
> packageVersion("S4Vectors")
[1] ‘0.36.0’
> packageVersion("haven")
[1] ‘2.5.1’
> R.version$version.string
[1] "R version 4.2.2 (2022-10-31)"
> BiocManager::version()
[1] ‘3.16’
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.7

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets
[7] methods   base

other attached packages:
[1] haven_2.5.1         S4Vectors_0.36.0    BiocGenerics_0.44.0

loaded via a namespace (and not attached):
 [1] fansi_1.0.3         utf8_1.2.2          digest_0.6.30
 [4] lifecycle_1.0.3     magrittr_2.0.3      evaluate_0.18
 [7] pillar_1.8.1        rlang_1.0.6         cli_3.4.1
[10] rstudioapi_0.14     ellipsis_0.3.2      vctrs_0.5.1
[13] rmarkdown_2.18      forcats_0.5.2       tools_4.2.2
[16] glue_1.6.2          hms_1.1.2           xfun_0.35
[19] yaml_2.3.6          fastmap_1.1.0       compiler_4.2.2
[22] pkgconfig_2.0.3     BiocManager_1.30.19 htmltools_0.5.3
[25] knitr_1.41          tibble_3.1.8
Understood, but I would certainly like to be able to carry through the haven-labelled data to later parts of our workflow instead of having to strip the labels out, as carrying this 'data dictionary' information forward would be helpful to us. Also, I'd like to understand why this behavior is happening.
It's hijacking the dispatch for the extractROWS function. The extractROWS function comes from S4Vectors and is used (e.g. by show) to extract the first N rows from the various S4 objects defined in that package. But the columns of your 'a' data.frame aren't any of those objects, so they are given the default method, which in essence sends them off to the tidyverse vctrs package, which predictably blows up.
Perhaps you could explain your use-case? It's not apparent to me what sort of labeled data would best be encapsulated in a DataFrame, but perhaps I am missing something. If you are already a tidyverse aficionado, is there a particular reason a tibble isn't a better choice?
Our use case is an Illumina EPIC chip analysis pipeline, where the sample data gets embedded in the rgSet object as a DataFrame. Many of our source data files are in SPSS *.sav format, and so they are automatically read in as haven_labelled data, with embedded value labels that would be nice to carry along through our pipeline instead of removing. I think we should be able to do this by modifying the
so it can handle haven_labelled data by zapping the labels before handing off to
DataFrame isn't happy with the
To attempt to fix this, I forked the S4Vectors library from Bioconductor from here:
In subsetting-utils.R, I changed the default_extractROWS function by adding a line at the beginning to zap the labels from
Using the modified S4Vectors package, a sub-settable rgSet can be created without needing to zap the labels from the
However, these haven labels are preserved in the 'parent' rgSet but are not preserved in a subsetted 'child' rgSet:
So the goal of supporting haven_labelled data in rgSets has only been partly achieved.
As I said, the right thing to do is to fix haven_labelled objects in the haven package itself. Any attempt to fix this elsewhere is not satisfactory and won't prevent those objects from causing problems. We must be able to use these objects anywhere double vectors are used (their underlying type is double). Right now we can't because they break on a common operation (x[i, drop=FALSE]) that works just fine on double objects. It's impossible to know exactly how many places in the vast Bioconductor + CRAN ecosystem use this operation on double objects, but there are probably many of them. This means that haven_labelled objects will cause problems in many places, unless they can also handle x[i, drop=FALSE].
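The operation in question can be tried in isolation, outside of S4Vectors. The following is only a minimal sketch: the variable names are made up for illustration, the haven part is skipped if haven is not installed, and whether the labelled case actually errors depends on the installed haven and vctrs versions, so the call is wrapped in try() rather than assumed to fail.

```r
# Plain double vector: base R accepts the extra 'drop' argument
# when subsetting a vector with a single index.
plain <- c(1, 2, 1)
plain_subset <- plain[1:2, drop = FALSE]
print(plain_subset)

# Same operation on a haven_labelled vector. The thread reports that this
# fails with haven 2.5.1 / vctrs 0.5.1; other versions may behave differently.
if (requireNamespace("haven", quietly = TRUE)) {
  lab <- haven::labelled(c(1, 2, 1), labels = c(Male = 1, Female = 2))
  lab_subset <- try(lab[1:2, drop = FALSE], silent = TRUE)
  if (inherits(lab_subset, "try-error")) {
    # Report the error message instead of stopping the script.
    message("Subsetting with drop = FALSE failed: ",
            conditionMessage(attr(lab_subset, "condition")))
  } else {
    print(lab_subset)
  }
}
```

If the labelled case reports an error while the plain double case does not, that is the mismatch described above.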
Thank you for the additional explanation and insight - I will open an issue about this with the haven maintainers.
OK, opened an issue with the haven maintainers about this at:
haven co-maintainer here - just wanted to add for reference that this is an issue with the vctrs package, which we use to implement haven_labelled and haven_labelled_spss, and this error will likely also come up with other classes that use vctrs as a base.
I've opened an issue over there: https://github.com/r-lib/vctrs/issues/1751
Thank you for the helpful and clear explanation!