Hin-Tak Leung
Hi,
Commit r41352 from j.gosink broke the Bioconductor nightly check for
flowFlowJo for most of summer/autumn 2009. Just before the BioC 2.5
code freeze, p.aboyoun committed r42419, which uses iconv() to strip
multibyte data so that the nightly check passes. Unfortunately it
"fixes" some FlowJo workspace files but breaks others. I finally found
the time to look at it: it is actually fairly serious and causes
silent data corruption. Here is the fix; please review and commit.
The underlying issue is this: FlowJo workspace files are, in most
(possibly all) cases, XML with iso8859-1 encoding (a.k.a. 'latin1').
On win32, R defaults to codepage 1252 (a superset of latin1), so R
check passes: everything is in latin1 and the data stripping has no
effect. On Linux and other "modern" unix systems, which default to
UTF-8, R check fails: not all iso8859-1 text is valid UTF-8 text and
vice versa, and the multibyte data strip also causes data corruption.
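To make the failure mode concrete, here is a minimal base-R sketch. It is
not the code from r42419; the sample string and the exact iconv() call are
my own assumptions, chosen only to show how a latin1 byte is rejected as
UTF-8 and how stripping it silently changes the data.

```r
# A latin1 string built from raw bytes: "na", i-diaeresis (0xEF), "ve".
# Hypothetical sample data, not taken from a real FlowJo workspace.
x <- rawToChar(as.raw(c(0x6e, 0x61, 0xef, 0x76, 0x65)))

# Interpreted as UTF-8, the byte sequence is invalid.
is_valid_utf8 <- validUTF8(x)

# Stripping approach (assumed form): treat the input as UTF-8 and drop
# whatever cannot be converted. The accented character is silently lost.
stripped <- iconv(x, from = "UTF-8", to = "UTF-8", sub = "")

# Declaring the real source encoding keeps the character intact.
converted <- iconv(x, from = "latin1", to = "UTF-8")

print(is_valid_utf8)
print(nchar(converted, type = "chars"))
```

On a UTF-8 system the stripped string is typically shorter than the
input, with no warning, which is the silent corruption described above.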
The proper fix is to query libxml2 about the xml encoding and set the
encoding explicitly; it is a substantial rewrite. As a side effect,
the code possibly runs faster as well: most of the gsub() calls do not
need to be global ('g'), so sub() is enough. The regular expressions
are only concerned with manipulating the header and only need to match
the first instance.
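A small sketch of the sub() versus gsub() point, in base R only. This is
not the patch itself: the sample document, the element name and the
regular expression are assumptions for illustration, and the real fix
asks libxml2 for the encoding rather than parsing the declaration by hand.

```r
# Hypothetical two-line workspace file with an XML declaration.
header <- '<?xml version="1.0" encoding="ISO-8859-1"?>'
body <- '<Workspace name="example"/>'
doc <- paste(header, body, sep = "\n")

# Read the declared encoding out of the header.
m <- regmatches(doc, regexec('encoding="([^"]+)"', doc))[[1]]
declared <- m[2]

# The declaration occurs once, so replacing only the first match with
# sub() gives the same result as scanning the whole string with gsub().
pattern <- 'encoding="[^"]+"'
via_sub <- sub(pattern, 'encoding="UTF-8"', doc)
via_gsub <- gsub(pattern, 'encoding="UTF-8"', doc)

print(declared)
print(identical(via_sub, via_gsub))
```

sub() can stop at the first match, while gsub() has to keep scanning the
rest of the input, which matters for large workspace files.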
Cheers,
Hin-Tak Leung