I have a json array that I need to pull a string from. We’re storing this data in Bigquery and for the most part it’s been pretty easy, except for extracting the id from the user_id field from the array. There are seven possible strings on how the user_id
could appear. For instance, in the json array, user_id
could be something like this: "user_id": "client-id-12345678-abcd-0926-zyxw-de07e3f9ce85"
or it could be "user_id": "operator-id-12345678-abcd-0926-zyxw-de07e3f9ce85"
or "user_id": "auto_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
Here is what the sample json looks like:
{
"members": [
{
"channel_count": 0,
"profile_url": "something.png",
"push_enabled": true,
"push_trigger_option": "default",
"state": "joined",
"user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
}
]
}
I thought the best way to handle this originally was to run a sql replace like so:
replace(replace('client-us-operator-us-iddddddd','client-us-', '' ), 'operator-us-', '')
But because there are a couple different user_id prefixes it’s making it difficult. I am not sure if those are all the possible edge cases. If it’s just client and operator it’s 12345678-abcd-0926-zyxw-de07e3f9ce85
and that works as expected. When it’s "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
it pulls in the whole string, and sometime the user_id
appears as "agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
I just need everything to the right of the prefix. So in this case it would be 12345678-abcd-0926-zyxw-de07e3f9ce85
. However depending on if it’s an automated message, or a user responding the prefix length changes, as well as the hyphen or underscore.
I’m not the strongest when using regex, so I’m on here to see if there a better way of handling the extraction of this id from the whole user_id string.
2
Answers
REGEXP
is what you will want to use since it allows for a defined set of pattern matching both for including and excluding. Answer at bottom.Breakdown of what the functions are doing:
JSON_EXTRACT_SCALAR(json_string, json_path)
$.fieldname
to point to the root of the JSON object (the $), and the specific field to locate (.fieldname).JSON_EXTRACT_SCALAR(your_json_object, '$.user_id')
-From my json, find the root key value pair for user_id.
REGEXP_EXTRACT(string, regex_pattern)
:for a string, match a pattern.
JSON_EXTRACT_SCALAR()
r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$'
^
points to start of string(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)
non capturing group.(?...)
syntax groups prefixes without capturing them in the output. This is where you would put your different prefix possibilities.([a-zA-Z0-9-]+)
This is the capturing group that you want in the output. from left to right it says: capture anything that is from a-z (lowercase) A-Z (uppercase), 0-9 (digits), – (hyphens). If you expect ids to have other captured formating, you would include it here. Such as a period or slash.$
says the end of the string.From my understanding of the question you simply need to fetch whatever is after id-xxx or id_xxxx in that case you can just look for the keyword id and then take everything unless there are more conditions in that case would be good if you add more details to the question:
Step 1: Locate the right JSON Block i.e Members,
Step 2: Access the member details using JSON_VALUE, specifically access the user_id key
Step 3: Apply the regex function on the user_id key to extract the values you need.
My regex captures the group id_ or id- and takes any letter or digit after it. You can modify the last part as you see fit