I have a dataset in a mysql DB, where there is no documentation nor anybody who could tell me why things have been done the way they are or how should I deal with certain things – I think one might call this, reverse engineering 🙂
There are strings that are structured using ASCII control characters. So the string will look like this for exapmle:
^AKEY^BVALUE^AKEY^BVALUE^AKEY^BVALUE
According to https://www.geeksforgeeks.org/control-characters/
^A is "Start of Header"
^B is "Start of Text"
to me, this seems a super old school way to structure data and it is the first time i see something like this.
I am looking for a way how to parse this data, prefarrably a python library that I could parse the string and get some key-value map that I can control programatically.
Any help on how to deal with such data is highly appreciated.
2
Answers
Here’s an example of a solution in Python:
I create a string like yours:
Then I use list comprehensions to split the string by the delimiters, and make that into a dict:
I call
^A
,^B
and so on "caret values" – not sure if that is the correct/official term for them (see caret notation). This is how a tool such as vi displays the related ASCII control characters.ASSUMPTION: These are single characters – different from a
^
ASCII character followed by anA
ASCII character. I am assuming that your source data contains these as control characters, not as^
followed byA
and so on. If my assumption is wrong, then the following will not help you.As you note,
^A
is the caret value for SOH – start of header – which, in turn, can be represented by the hex stringx01
.For example, in Python that would be:
You can use this (and
x02
for^B
) with Python’ssplit()
function.I don’t think you necessarily need a Python library for that. Just use:
to split the initial string into an array of individual key/value strings – and then split each of those into their separate key and value using:
Or maybe create a dictionary:
If you expect to find missing keys and/or values, you would need code to handle those.
There are various Stack Overflow questions which discuss splitting a Python string into key value pairs.
Those typically use non-control characters for the separators. But you can replace those with the ones you need for your data.
You can see a more complete list of "caret values" here for reference.
Here are some of them:
Speculating here, since we don’t know the origins of your specific data set – but you do sometimes see these control characters used in this way, because these control characters do not typically occur in the data keys and values. They are therefore good candidates to be used as field separators, line terminators, and so on. I have seen this approach used in some enterprise software (Ab Initio is one).