A Python program (https://github.com/MannLabs/alphapeptdeep) created the following HDF5 file: https://drive.google.com/file/d/1Ct2B7IU2WsqJfT3eGoR1xn3GSOffFqtN/view?usp=sharing
I can successfully open it up in HDFView, view metadata, and even view the data for floating point value fields.
However, for string fields, it simply gives an error (which I suspect is misleading).
For example, if one tries to view the data for this field:
/library/mod_seq_df/sequence
It gives the following (misleading?) error:
failed to read scalar dataset: Filter not available exception: Read Failed
I installed HDFView 3.1.4 on a clean Debian 11 Docker container, and I also installed the HDF5 filter plugins from the HDF5-1.14.0 installation scripts.
Thoughts?
Answers
After a little more investigation, I have some good news. I also found lots of challenges in the different HDF5 APIs.
First the good news. I can access the data in your file using h5py (a Python package). So, your HDF5 file appears to be fine. While the problems with HDFView are a headache, your errors are not caused by data corruption (or problems with compression filters).
This is what I have determined:
Although the groups have names like something_df, and have some attributes that look like they "could be" Pandas attributes, this file was not created by Pandas. On closer inspection, several Pandas attributes you would expect to find are missing.
There is an is_pd_dataframe attribute that is saved as an 8-bit Enum (Boolean). Apparently PyTables doesn't support that datatype, and it also reports that "variable length strings are not supported yet". This is consistent with the Pandas error message in my earlier comment, and is further confirmation the file probably wasn't created by Pandas.
I included my Python code (below) that extracts some data. (I know you want to work in Java, but this confirms the data is accessible.) At this point I suggest 2 paths: 1) add HDF5/Java tags to your SO question to see if the Java community has an answer, and/or 2) contact The HDF Group about the HDFView problems (you can post a question on their forum at: https://forum.hdfgroup.org/).
Python/h5py solution:
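A minimal, self-contained sketch of the h5py approach. Since the Google Drive file may not be at hand, it first writes a tiny file with the same layout as the paths in the question (a group per "dataframe", a variable-length string dataset, and an is_pd_dataframe attribute); the sample sequence values are assumptions, not data from the real file:

```python
import h5py
import numpy as np

# Build a small synthetic file mimicking the layout described in the question.
with h5py.File("example.h5", "w") as h5:
    grp = h5.create_group("library/mod_seq_df")
    grp.attrs["is_pd_dataframe"] = True
    # Variable-length UTF-8 strings, written from an object-dtype ndarray
    str_dt = h5py.string_dtype(encoding="utf-8")
    grp.create_dataset(
        "sequence",
        data=np.array(["PEPTIDE", "SEQUENCE"], dtype=object),
        dtype=str_dt,
    )

# Read it back the same way you would read the real file.
with h5py.File("example.h5", "r") as h5:
    ds = h5["library/mod_seq_df/sequence"]
    print(ds.dtype)  # object dtype carrying vlen-string metadata
    # h5py 3.x returns bytes for vlen strings; decode to Python str
    seqs = [s.decode() if isinstance(s, bytes) else s for s in ds[:]]
    print(seqs)
    print(dict(h5["library/mod_seq_df"].attrs))
```

To read the real file, replace "example.h5" with its path and keep the same dataset path (/library/mod_seq_df/sequence).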
Some information that may help: we used Python code (https://github.com/MannLabs/alphabase/blob/main/alphabase/io/hdf.py#L242 and https://github.com/MannLabs/alphabase/blob/main/alphabase/io/hdf.py#L213) to create the datasets in the HDF file. Although we call it a "df" in this HDF file, it is still an HDF group object containing pd.Series as HDF datasets, plus an is_pd_dataframe attribute. A string dataset is a numpy.ndarray with dtype('O'), and no encoding (e.g. .encode('ascii')) is involved. I guess a string value in the string dataset is still a 32-bit (Unicode) array?