skip to Main Content

I have a dataset in a mysql DB, where there is no documentation nor anybody who could tell me why things have been done the way they are or how should I deal with certain things – I think one might call this, reverse engineering 🙂

There are strings that are structured using ASCII control characters. So the string will look like this for exapmle:

^AKEY^BVALUE^AKEY^BVALUE^AKEY^BVALUE

According to https://www.geeksforgeeks.org/control-characters/
^A is "Start of Header"
^B is "Start of Text"

to me, this seems a super old school way to structure data and it is the first time i see something like this.

I am looking for a way how to parse this data, prefarrably a python library that I could parse the string and get some key-value map that I can control programatically.

Any help on how to deal with such data is highly appreciated.

2

Answers


  1. Here’s an example of a solution in Python:

    I create a string like yours:

    >>> s="^AKEY1^BVALUE1^AKEY2^BVALUE2^AKEY3^BVALUE3"
    

    Then I use list comprehensions to split the string by the delimiters, and make that into a dict:

    >>> {k: v for (k, v) in (kv.split("^B") for kv in s.split("^A")[1:])}
    {'KEY1': 'VALUE1', 'KEY2': 'VALUE2', 'KEY3': 'VALUE3'}
    
    Login or Signup to reply.
  2. I call ^A, ^B and so on "caret values" – not sure if that is the correct/official term for them (see caret notation). This is how a tool such as vi displays the related ASCII control characters.

    ASSUMPTION: These are single characters – different from a ^ ASCII character followed by an A ASCII character. I am assuming that your source data contains these as control characters, not as ^ followed by A and so on. If my assumption is wrong, then the following will not help you.

    As you note, ^A is the caret value for SOH – start of header – which, in turn, can be represented by the hex string x01.

    For example, in Python that would be:

    my_soh = 'x01'
    

    You can use this (and x02 for ^B) with Python’s split() function.

    I don’t think you necessarily need a Python library for that. Just use:

    key_val_pairs = my_string.split('x01')
    

    to split the initial string into an array of individual key/value strings – and then split each of those into their separate key and value using:

    key_val_pair.split('x02')
    

    Or maybe create a dictionary:

    d = dict(s.split('x02') for s in key_val_pairs )
    

    If you expect to find missing keys and/or values, you would need code to handle those.

    There are various Stack Overflow questions which discuss splitting a Python string into key value pairs.

    Those typically use non-control characters for the separators. But you can replace those with the ones you need for your data.


    You can see a more complete list of "caret values" here for reference.

    Here are some of them:

    Name                Abbrev  Caret   Dec  Hex    Oct  Esc
    Null                NUL     ^@      0    00     000  
    Start of heading    SOH     ^A      1    01     001 
    Start of text       STX     ^B      2    02     002 
    Line feed           LF      ^J      10   0a     012  n
    Vertical tab        VT      ^K      11   0b     013  v
    

    Speculating here, since we don’t know the origins of your specific data set – but you do sometimes see these control characters used in this way, because these control characters do not typically occur in the data keys and values. They are therefore good candidates to be used as field separators, line terminators, and so on. I have seen this approach used in some enterprise software (Ab Initio is one).

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search