A Python program (https://github.com/MannLabs/alphapeptdeep) created the following HDF5 file: https://drive.google.com/file/d/1Ct2B7IU2WsqJfT3eGoR1xn3GSOffFqtN/view?usp=sharing

I can successfully open it up in HDFView, view metadata, and even view the data for floating point value fields.

However, for string fields, it simply gives an error (which I suspect is misleading).

For example, if one tries to view the data for this field:

/library/mod_seq_df/sequence

It gives the following (misleading?) error:

failed to read scalar dataset: Filter not available exception: Read Failed

I installed HDFView 3.1.4 in a clean Debian 11 Docker container. I also installed the HDF5 filter plugins from the HDF5-1.14.0 installation scripts.

Thoughts?

2 Answers


  1. After a little more investigation, I have some good news. I also found lots of challenges in the different HDF5 APIs.

    First the good news. I can access the data in your file using h5py (a Python package). So, your HDF5 file appears to be fine. While the problems with HDFView are a headache, your errors are not caused by data corruption (or problems with compression filters).

    This is what I have determined:

    1. Although the groups are named something_df and have some
      attributes that look like they "could be" Pandas attributes, this
      file was not created by Pandas. On closer inspection, several Pandas
      attributes you would expect to find are missing.
    2. I opened the file with PyTables (the underlying HDF5 technology for Pandas) and got an error accessing Groups, caused by an attribute named is_pd_dataframe that is saved as an 8-bit Enum (Boolean). Apparently PyTables doesn’t support that datatype.
    3. When I used PyTables to read a dataset of string values, I got this error: variable length strings are not supported yet. This is consistent with the Pandas error message in my earlier comment, and is further confirmation that the file probably wasn’t created by Pandas.
    4. I don’t know whether items 2 or 3 are related to your HDFView and Java problems. In HDFView, I can open and view these attributes on each group’s Object Attribute Info tab. (As I understand it, HDFView is built with Java, so those 2 problems are consistent.)
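
Since the HDFView error mentions a filter, one way to check whether a filter is actually applied to a dataset is to list its filter pipeline with h5py. This is only a minimal sketch, assuming h5py 3.x. The stand-in file, the `values` dataset, and the `list_filters` helper are made up for illustration; substitute the real file name and dataset paths.

```python
import os
import tempfile

import h5py
import numpy as np


def list_filters(ds):
    """Return (filter_code, name) for each filter in a dataset's pipeline.

    Hypothetical helper; it uses h5py's low-level dataset creation
    property list.
    """
    plist = ds.id.get_create_plist()
    filters = []
    for idx in range(plist.get_nfilters()):
        code, _flags, _values, name = plist.get_filter(idx)
        filters.append((code, name))
    return filters


# Build a small stand-in file that mimics the layout described above.
path = os.path.join(tempfile.mkdtemp(), "standin.hdf")
with h5py.File(path, "w") as h5f:
    grp = h5f.create_group("/library/mod_seq_df")
    # h5py writes a Python bool as an 8-bit HDF5 enum.
    grp.attrs["is_pd_dataframe"] = True
    # Variable-length UTF-8 strings, no compression.
    grp.create_dataset(
        "sequence", data=["YLQEREQR", "SMLRWMER"], dtype=h5py.string_dtype()
    )
    # A numeric dataset with gzip compression, for comparison.
    grp.create_dataset(
        "values", data=np.arange(100, dtype=np.float32), compression="gzip"
    )

with h5py.File(path, "r") as h5f:
    seq = h5f["/library/mod_seq_df/sequence"]
    val = h5f["/library/mod_seq_df/values"]
    string_info = h5py.check_string_dtype(seq.dtype)
    seq_filters = list_filters(seq)
    val_filters = list_filters(val)
    print("sequence string info:", string_info)
    print("sequence filters:", seq_filters)
    print("values filters:", val_filters)
```

If the string dataset in the real file reports an empty filter list, that would support the suspicion that the "Filter not available" message is misleading.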

    I included my Python code (below) that extracts some data. (I know you want to work in Java, but this confirms the data is accessible.) At this point I suggest 2 paths: 1) adding HDF5/Java tags to your SO question to see if the Java community has an answer, and/or 2) contacting The HDF Group about the HDFView problems (you can post a question on their forum at: https://forum.hdfgroup.org/).

    Python/h5py solution:

    import h5py
    with h5py.File('predict.speclib.hdf','r') as h5f:           
        # read group attributes:
        grp = h5f['/library/mod_seq_df']
        print(f"is_pd_dataframe attribute value: {grp.attrs['is_pd_dataframe']}")
        print(f"last_updated attribute value: {grp.attrs['last_updated']}")
        print()        
        # read varlength string dataset:
        ds = h5f['/library/mod_seq_df/sequence']
        print(ds.shape, ds.dtype)
        for i in range(0,5):
            print(f'{i}: {ds[i]}')
        for i in range(ds.shape[0]-5,ds.shape[0]):
            print(f'{i}: {ds[i]}')       
        print()        
        # read float32 dataset:
        ds = h5f['/library/fragment_intensity_df/y_z1']
        print(ds.shape, ds.dtype)
        for i in range(0,5):
            print(f'{i}: {ds[i]}')
        for i in range(ds.shape[0]-5,ds.shape[0]):
            print(f'{i}: {ds[i]}') 
    

    Output looks like this:

    is_pd_dataframe attribute value: True
    last_updated attribute value: Sat Dec 31 16:26:42 2022
    
    (21785,) object
    0: b'YLQEREQR'
    1: b'SMLRWMER'
    2: b'FIQERFER'
    3: b'ENFRECLR'
    4: b'FLRLCHFK'
    21780: b'SGSGNETPLALKSGGGGGGSQTPR'
    21781: b'AAPLLAALTALLAAAAAGGDAPPGK'
    21782: b'STAVPPVPGPGPGPGPGPGPGSTSR'
    21783: b'GDPGDVGGPGPPGASGEPGAPGPPGK'
    21784: b'GSIFGSGGGGMSGGGGGAGGGGGGSSHR'
    
    (277994,) float32
    0: 0.0
    1: 0.0649685338139534
    2: 0.012746012769639492
    3: 0.036795008927583694
    4: 0.12544597685337067
    277989: 0.10976477712392807
    277990: 0.06583086401224136
    277991: 0.0935806930065155
    277992: 0.08901204913854599
    277993: 0.2575165033340454
    
  2. Some information that may help: we used Python code (https://github.com/MannLabs/alphabase/blob/main/alphabase/io/hdf.py#L242 and https://github.com/MannLabs/alphabase/blob/main/alphabase/io/hdf.py#L213) to create the datasets in the HDF file. Although we call it df in this HDF file, it is still an HDF group object containing pd.Series as HDF datasets, plus an is_pd_dataframe attribute.

    A string dataset is a numpy.ndarray with dtype('O'), and no explicit encoding (e.g. .encode('ascii')) is involved. I guess each string value in the string dataset is still a 32-bit (Unicode) array?
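
To see how such values end up on disk, a small round trip with h5py can help. This is a sketch under assumptions, not the alphabase code: it passes an explicit `h5py.string_dtype()` when writing (the linked functions may do this differently), uses a made-up file name, and assumes h5py 3.x, where variable-length strings are read back as bytes.

```python
import os
import tempfile

import h5py
import numpy as np

# Object array of Python str, similar to a pandas Series of strings
# converted to NumPy.
values = np.array(["YLQEREQR", "SMLRWMER"], dtype=object)

path = os.path.join(tempfile.mkdtemp(), "strings.hdf")
with h5py.File(path, "w") as h5f:
    # Assumption: explicit variable-length string dtype (UTF-8 by default).
    h5f.create_dataset("sequence", data=values, dtype=h5py.string_dtype())

with h5py.File(path, "r") as h5f:
    ds = h5f["sequence"]
    info = h5py.check_string_dtype(ds.dtype)  # encoding and length
    raw = ds[0]            # bytes object under h5py 3.x
    text = ds.asstr()[0]   # decoded to a Python str
    print(ds.dtype, info)
    print(repr(raw), repr(text))
```

In this sketch the values are stored as variable-length UTF-8 strings rather than fixed-width 32-bit Unicode arrays, which would match the `object` dtype and the `b'...'` values in the h5py output from the first answer.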
