silent data corruption in flowFlowJo, and fix
@hin-tak-leung-3977
Hi,

Commit r41352 from j.gosink broke flowFlowJo in Bioc's nightly check for most of summer/autumn 2009. Just before the BioC 2.5 code freeze, p.aboyoun committed r42419, which uses iconv() to strip multibyte data so that the nightly check passes. Unfortunately it "fixes" some FlowJo workspace files but breaks others. I finally found the time to look at it. It is actually fairly serious and causes silent data corruption. Here is the fix; please review and commit.

The underlying issue is this: FlowJo workspace files are, in most (possibly all) cases, XML with iso8859-1 encoding (a.k.a. 'latin1'). With win32 R, which defaults to codepage 1252 (a superset of latin1), R check passes: everything is in latin1 and the data stripping has no effect. On Linux and other "modern" unix systems, which default to UTF-8, R check fails: not all iso8859-1 text is valid UTF-8 text and vice versa, and the multibyte data strip also causes data corruption.

The proper fix is to query libxml2 about the XML encoding and set the encoding explicitly. It is a substantial rewrite. As a side effect, the code may also run faster: most of the gsub() calls do not need the 'g'. The regular expressions are only concerned with manipulating the header and only need to match the first instance.

Cheers,
Hin-Tak Leung
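For readers who want to see the failure mode described above, here is a minimal base-R sketch. It is not the code from r41352, r42419, or the proposed patch. The sample string, the header text, and all variable names are made up for illustration, and the iconv() call with `sub = ""` is only an assumed stand-in for the stripping step. The actual fix queries libxml2 for the encoding; this sketch simply assumes the source encoding is known to be latin1.

```r
# Hypothetical sample: the word "cafe" with an e-acute, stored as latin1 bytes.
# In latin1 the e-acute is the single byte 0xE9.
latin1_text <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xE9)))

# Read as UTF-8, a lone 0xE9 byte is not a valid sequence.
is_valid_utf8 <- validUTF8(latin1_text)

# Stripping approach (assumed stand-in): drop every byte that cannot be
# read as UTF-8. No error is raised, but the accented character is gone.
stripped <- iconv(latin1_text, from = "UTF-8", to = "UTF-8", sub = "")

# Explicit approach: state the source encoding and convert it.
converted <- iconv(latin1_text, from = "latin1", to = "UTF-8")

# Header rewriting: only the first occurrence needs to change, so sub()
# suffices. gsub() scans the whole document and also touches later matches.
xml_text <- paste0('<?xml version="1.0" encoding="ISO-8859-1"?>',
                   '<Workspace comment="exported as ISO-8859-1"/>')
first_only <- sub("ISO-8859-1", "UTF-8", xml_text, fixed = TRUE)
everywhere <- gsub("ISO-8859-1", "UTF-8", xml_text, fixed = TRUE)
```

On a UTF-8 system, `stripped` ends up one character shorter than `converted`, which is the silent loss in question. `first_only` changes just the XML declaration, while `everywhere` also rewrites the attribute text further down.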
