Creating a DataFrame of haven_labelled data results in a bug where the DataFrame won't print properly.
Minimal working example
library(S4Vectors)
library(haven)
dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
attr(dn,"format.spss") <- "F3.0"
a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
DataFrame(a)
DataFrame(zap_labels(a))
packageVersion("S4Vectors")
packageVersion("haven")
R.version$version.string
BiocManager::version()
sessionInfo()
Note the error generated (below) when we attempt to execute the DataFrame(a) command.
Also, as you can see from the example below, this error no longer occurs if we zap the labels, as DataFrame(zap_labels(a)) works fine.
Stepping through makeNakedCharacterMatrixForDisplay to see which line generates the error, it looks to me like it is generated by the head(x, nhead) part of this line:
m <- rbind(makeNakedCharacterMatrixForDisplay(head(x, nhead)),
rbind(rep.int("...", x_ncol)),
makeNakedCharacterMatrixForDisplay(tail(x, ntail)))
So maybe the head method for a DataFrame doesn't like haven_labelled data?
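One way to narrow this down is to run each candidate step separately and record whether it fails. The following is only a sketch under assumptions: try_step() is a hypothetical helper written for this illustration (it is not part of S4Vectors or haven), and which steps fail may depend on the installed versions of S4Vectors, haven and vctrs.

```r
library(S4Vectors)
library(haven)

# A smaller version of the labelled column from the example above.
df <- data.frame(
  id  = 1:15,
  sex = labelled(c(rep("M", 7), rep("F", 8)), c(Male = "M", Female = "F"))
)
DF <- DataFrame(df)  # construction only; nothing is printed here

# try_step() is a hypothetical helper: it evaluates an expression and
# returns "ok" or the error message instead of stopping.
try_step <- function(expr) {
  tryCatch({
    force(expr)
    "ok"
  }, error = function(e) conditionMessage(e))
}

results <- c(
  head_DataFrame = try_step(head(DF, 5)),
  tail_DataFrame = try_step(tail(DF, 5)),
  head_column    = try_step(head(df$sex, 5)),
  bracket_drop   = try_step(df$sex[1:5, drop = FALSE])
)
print(results)
```

Comparing the head_DataFrame entry with the head_column and bracket_drop entries should indicate whether the failure comes from the DataFrame method itself or from how the labelled column responds to subsetting.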
Thank you for any guidance or insight you may be able to provide.
Results of running the minimal working example code above:
> library(S4Vectors)
> library(haven)
> dn <- labelled(c(rep(1,7),NA,rep(2,7)),label="Biological sex", labels=c(`N/A` = -2, Missing=-1,Male=1,Female=2))
> attr(dn,"format.spss") <- "F3.0"
> a <- data.frame(a=c(1:15),sex =labelled(c(rep("M",7),rep("F",8)), c(Male = "M", Female = "F")),dn=dn)
> DataFrame(a)
DataFrame with 15 rows and 3 columns
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'makeNakedCharacterMatrixForDisplay': incorrect number of dimensions
> DataFrame(zap_labels(a))
DataFrame with 15 rows and 3 columns
            a         sex        dn
    <integer> <character> <numeric>
1           1           M         1
2           2           M         1
3           3           M         1
4           4           M         1
5           5           M         1
...       ...         ...       ...
11         11           F         2
12         12           F         2
13         13           F         2
14         14           F         2
15         15           F         2
> packageVersion("S4Vectors")
[1] ‘0.36.0’
> packageVersion("haven")
[1] ‘2.5.1’
> R.version$version.string
[1] "R version 4.2.2 (2022-10-31)"
> BiocManager::version()
[1] ‘3.16’
> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Big Sur 11.6.7
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets
[7] methods base
other attached packages:
[1] haven_2.5.1 S4Vectors_0.36.0 BiocGenerics_0.44.0
loaded via a namespace (and not attached):
[1] fansi_1.0.3 utf8_1.2.2 digest_0.6.30
[4] lifecycle_1.0.3 magrittr_2.0.3 evaluate_0.18
[7] pillar_1.8.1 rlang_1.0.6 cli_3.4.1
[10] rstudioapi_0.14 ellipsis_0.3.2 vctrs_0.5.1
[13] rmarkdown_2.18 forcats_0.5.2 tools_4.2.2
[16] glue_1.6.2 hms_1.1.2 xfun_0.35
[19] yaml_2.3.6 fastmap_1.1.0 compiler_4.2.2
[22] pkgconfig_2.0.3 BiocManager_1.30.19 htmltools_0.5.3
[25] knitr_1.41 tibble_3.1.8
Understood, but I would certainly like to be able to carry through the haven-labelled data to later parts of our workflow instead of having to strip the labels out, as carrying this 'data dictionary' information forward would be helpful to us. Also, I'd like to understand why this behavior is happening.
It's hijacking the dispatch for the extractROWS function. The extractROWS function comes from S4Vectors and is used by head (or show) to extract the first N rows from the various S4 objects defined in that package. But the columns of your 'a' data.frame aren't any of the listed objects, so they are given the default method, which in essence sends them off to the tidyverse vctrs package, which predictably blows up.
Perhaps you could explain your use-case? It's not apparent to me what sort of labeled data would best be encapsulated in a DataFrame, but perhaps I am missing something. If you are already a tidyverse aficionado, is there a particular reason a tibble isn't a better data.frame-like substance?
I mean
Our use case is an Illumina EPIC chip analysis pipeline, where the sample data gets embedded in the rgSet object as a DataFrame. Many of our source data files are in SPSS *.sav format, and so they are automatically read in as haven_labelled data, with embedded value labels that would be nice to carry along through our pipeline instead of removing. I think we should be able to do this by modifying the [ operator method so it can handle haven_labelled data by zapping the labels before handing off to extractROWS and extractCOLS.
Looks like DataFrame isn't happy with the tibble version either.
To attempt to fix this, I forked the S4Vectors library from Bioconductor from here: https://github.com/Bioconductor/S4Vectors
Then, in subsetting-utils.R, I changed the default_extractROWS function by adding a line at the beginning to zap the labels from x.
Using the modified S4Vectors package, a sub-settable rgSet can be created without needing to zap the labels from the haven_labelled columns.
However, these haven labels are preserved in the 'parent' rgSet but are not preserved in a subsetted 'child' rgSet.
So the goal of supporting haven_labelled data in rgSets has only been partly achieved.
As I said, the right thing to do is to fix [ on haven_labelled objects in the haven package itself. Any attempt to fix this elsewhere is not satisfactory and won't prevent those objects from causing problems. We must be able to use these objects anywhere double vectors are used (haven_labelled inherits from double). Right now we can't because they break on a common operation (x[i, drop=TRUE] or x[i, drop=FALSE]) that works just fine on double objects. It's impossible to know exactly how many places in the vast Bioconductor + CRAN ecosystem use x[i, drop=TRUE] or x[i, drop=FALSE] on double objects, but there are probably many, many of them. This means that haven_labelled objects will cause problems in many, many places, unless they can also handle x[i, drop=TRUE] and x[i, drop=FALSE].
Thank you for the additional explanation and insight - I will open an issue about this with the haven maintainers.
OK, opened an issue with the haven maintainers about this at: https://github.com/tidyverse/haven/issues/698
Thank you!
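The subsetting behavior discussed above can be checked directly. This is a minimal sketch, not taken from the thread: the variable names are illustrative, the haven part only runs if haven is installed, and its outcome is deliberately captured rather than assumed, since it may differ between haven and vctrs versions.

```r
# A plain double vector: base R accepts the drop= argument on a
# dimensionless vector and simply ignores it.
x <- c(1.5, 2.5, 3.5)
plain <- x[2:3, drop = TRUE]
print(plain)

# The same operation on a haven_labelled vector. The result is either the
# subsetted vector or the error message, depending on installed versions.
labelled_result <- if (requireNamespace("haven", quietly = TRUE)) {
  lab <- haven::labelled(x, labels = c(Low = 1.5))
  tryCatch(lab[2:3, drop = TRUE], error = function(e) conditionMessage(e))
} else {
  NULL  # haven not available; nothing to compare
}
print(labelled_result)
```

If labelled_result comes back as an error message while plain is an ordinary double vector, that reproduces the mismatch described above in isolation from S4Vectors.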
Hi!
haven co-maintainer here - just wanted to add here for reference that this is an issue with the vctrs package, which we use to implement haven_labelled and haven_labelled_spss, and this error will likely also come up with other classes that use vctrs as a base.
I've opened an issue over there: https://github.com/r-lib/vctrs/issues/1751
Danny
Thank you for the helpful and clear explanation!