rhdf5 32-bit unsigned int issue
wrob311 • 0
@wrob311-8268
Last seen 5.8 years ago
United States

Hi,

There is a bug in the "rhdf5" library (Bioconductor 3.1) when handling 32-bit unsigned data types.

Create an HDF5 file containing a dataset with a 32-bit unsigned int data type, e.g. with h5py in Python:

import h5py

f = h5py.File('t.h5', 'w')
ds = f.create_dataset('test', dtype='u4', shape=(4,))
ds[:] = [1, 2 ** 31 + 2, 3, 4]  # note 2nd value is 32-bit unsigned int equal to 2147483650
f.close() 

Read the dataset with the "rhdf5" library in R:

library(rhdf5)
f = h5read('t.h5', 'test')
print(f)
[1]          1 2147483647          3          4   # expected 2nd number is 2147483650

It seems that the "u4" data type is read as a 32-bit signed int type (by the way, using bit64conversion='int' does not help).
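The value printed by R is exactly the largest signed 32-bit integer, which is consistent with the out-of-range value being clamped during conversion. The Python sketch below only illustrates that arithmetic. The clamping behaviour is an assumption inferred from the printed output, and `clamp_to_int32` is a hypothetical helper, not part of rhdf5 or h5py.

```python
# Illustration only: 2147483650 does not fit into a signed 32-bit integer.
INT32_MAX = 2**31 - 1    # 2147483647, largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # 4294967295, largest unsigned 32-bit value

stored = [1, 2**31 + 2, 3, 4]  # the values written with h5py above


def clamp_to_int32(value):
    """Hypothetical helper: saturate at the signed 32-bit maximum (assumed behaviour)."""
    return min(value, INT32_MAX)


clamped = [clamp_to_int32(v) for v in stored]
print(clamped)  # [1, 2147483647, 3, 4]
```

Under that assumption, the result matches the vector printed by R, including the incorrect second element.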

This was tried on both Linux 64-bit and Windows 64-bit, R 3.2.1 64-bit.

Regards,

Artur

rhdf5 bug • 1.7k views
Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.3 years ago
Germany / Heidelberg / DKFZ

Dear Artur,

I'm not a Python user, so I cannot reproduce the example. Also, I do not know exactly what the 'u4' datatype is. As far as I know, HDF5 only provides 'uint8' as the smallest unsigned integer, right?

Can you send me the file 't.h5' for inspection (by email)?

Bernd

wrob311 • 0
@wrob311-8268
Last seen 5.8 years ago
United States

This is what h5dump says about the file:

$ h5dump t.h5 
HDF5 "t.h5" {
GROUP "/" {
   DATASET "test" {
      DATATYPE  H5T_STD_U32LE
      DATASPACE  SIMPLE { ( 4 ) / ( 4 ) }
      DATA {
      (0): 1, 2147483650, 3, 4
      }
   }
}
}

The "u4" type is an unsigned 32-bit integer.
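As a small illustration of what H5T_STD_U32LE means at the byte level, the following Python sketch packs the second value as an unsigned 32-bit little-endian integer and then decodes the same four bytes both ways. It uses only the standard `struct` module and does not touch the HDF5 file itself.

```python
import struct

# H5T_STD_U32LE: unsigned 32-bit integer, little-endian ("<I" in struct notation).
raw = struct.pack("<I", 2147483650)        # the 4 bytes of the second element

as_unsigned = struct.unpack("<I", raw)[0]  # correct interpretation
as_signed = struct.unpack("<i", raw)[0]    # same bytes read as signed 32-bit

print(raw.hex(), as_unsigned, as_signed)   # 02000080 2147483650 -2147483646
```

Note that a plain reinterpretation of the bytes would give a negative number, not the 2147483647 reported above, so the reader appears to convert the value rather than merely reinterpret it.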

Sending the file via e-mail as well.

 

Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.3 years ago
Germany / Heidelberg / DKFZ

Dear Artur,

Yes, the data is read in as signed integers. Therefore, numbers larger than 2^31-1 are not read in correctly. Since R does not have an unsigned integer datatype, there will not be a solution for the problem, but I will introduce warnings whenever a number cannot be represented in R. I will let you know when it is ready.

Bernd

 

 

 


Is there any chance of converting them to floating-point numbers, as suggested here?

    http://r.789695.n4.nabble.com/Calling-C-code-fom-R-How-to-export-C-quot-unsigned-quot-integer-to-R-td799550.html

Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.3 years ago
Germany / Heidelberg / DKFZ

The code for reading integers has been changed completely to fix bugs when reading unsigned integers. Unsigned integers (32-bit as well as 64-bit) are now read in properly. However, since not all values can be represented in R,

1.) a section was added to the rhdf5 vignette explaining the ranges of the different types in R and HDF5.

2.) unsigned integers can now be read into an R double or an integer64 (package bit64) using the additional parameter bit64conversion="double" or bit64conversion="bit64". For 32-bit unsigned integers, this never loses integer precision.

3.) a warning is thrown, whenever there is an integer overflow while reading the data.

4.) a warning is thrown as well, whenever the integer precision gets lost while reading to a double with bit64conversion="double".
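The precision statements in points 2 and 4 follow from how doubles work: an IEEE 754 double holds every integer up to 2^53 exactly, which covers all unsigned 32-bit values but not all 64-bit ones. A minimal Python sketch of that check is shown here; Python floats are IEEE 754 doubles, and `survives_double` is a hypothetical helper name used only for illustration.

```python
# Python floats are IEEE 754 doubles, the same representation as an R double.
UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


def survives_double(n):
    """True if integer n is unchanged by a round trip through a double."""
    return int(float(n)) == n


print(survives_double(2147483650))  # the value from this thread
print(survives_double(UINT32_MAX))  # largest unsigned 32-bit value
print(survives_double(2**53 + 1))   # smallest positive integer a double cannot hold
print(survives_double(UINT64_MAX))  # largest unsigned 64-bit value
```

The first two checks succeed and the last two fail, which is why a precision warning is only relevant for 64-bit data.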

It seems that the Bioconductor build pipeline didn't run last night. Once the devel version of Bioconductor is built again, the updates will appear in the devel branch with the version number 2.13.3.

Best,

Bernd

Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.3 years ago
Germany / Heidelberg / DKFZ

...

5.) Values that cannot be represented in the chosen R integer representation are replaced by NA.
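The replacement rule in point 5 can be sketched as follows for the default 32-bit R integer. This is an illustration in Python, not the rhdf5 implementation: `None` stands in for R's NA, and `to_r_integer` is a hypothetical name. The lower bound reflects the fact that R reserves -2^31 as its integer NA marker.

```python
# Sketch of "out-of-range values become NA" for a 32-bit R integer.
# None stands in for R's NA; to_r_integer is a hypothetical name, not an rhdf5 function.
R_INT_MAX = 2**31 - 1
R_INT_MIN = -(2**31 - 1)  # R reserves -2**31 as NA_integer_


def to_r_integer(values):
    """Return each value unchanged when representable, otherwise None (NA)."""
    return [v if R_INT_MIN <= v <= R_INT_MAX else None for v in values]


result = to_r_integer([1, 2**31 + 2, 3, 4])
print(result)  # [1, None, 3, 4]
```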

wrob311 • 0
@wrob311-8268
Last seen 5.8 years ago
United States

Thanks a lot.

wrob311 • 0
@wrob311-8268
Last seen 5.8 years ago
United States

There is a problem when using "bit64conversion='bit64'" and loading a group of data.

For example, loading a single dataset works (the same file as in my first post):

> f = h5read('t.h5', '/test', bit64conversion='bit64')

But loading the whole group always gives a warning (despite specifying the 'bit64conversion' option):

> f = h5read('t.h5', '/', bit64conversion='bit64')
Warning message:
In H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,  :
  NAs produced by integer overflow while converting 64-bit integer or unsigned 32-bit integer from HDF5 to a 32-bit integer in R. Choose bit64conversion='bit64' or bit64conversion='double' to avoid data loss and see the vignette 'rhdf5' for more details about 64-bit integers.

 

Bernd Fischer ▴ 540
@bernd-fischer-5348
Last seen 4.3 years ago
Germany / Heidelberg / DKFZ

I spotted the error and will fix it soon.

 
