How to retrieve a fixed length string from json array

Kourvoisier26
August 24, 2024
153 views
0 votes
2 Answers

I have a json array that I need to pull a string from. We’re storing this data in Bigquery and for the most part it’s been pretty easy, except for extracting the id from the user_id field from the array. There are seven possible strings on how the user_id could appear. For instance, in the json array, user_id could be something like this: "user_id": "client-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or it could be "user_id": "operator-id-12345678-abcd-0926-zyxw-de07e3f9ce85" or "user_id": "auto_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

Here is what the sample json looks like:

{
    "members": [
        {
            "channel_count": 0,
            "profile_url": "something.png",
            "push_enabled": true,
            "push_trigger_option": "default",
            "state": "joined",
            "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
        }
    ]
}

I thought the best way to handle this originally was to run a sql replace like so:
replace(replace('client-us-operator-us-iddddddd','client-us-', '' ), 'operator-us-', '') But because there are a couple different user_id prefixes it’s making it difficult. I am not sure if those are all the possible edge cases. If it’s just client and operator it’s 12345678-abcd-0926-zyxw-de07e3f9ce85 and that works as expected. When it’s "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85" it pulls in the whole string, and sometime the user_id appears as "agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"

I just need everything to the right of the prefix. So in this case it would be 12345678-abcd-0926-zyxw-de07e3f9ce85. However depending on if it’s an automated message, or a user responding the prefix length changes, as well as the hyphen or underscore.

I’m not the strongest when using regex, so I’m on here to see if there a better way of handling the extraction of this id from the whole user_id string.

Answers

- hSin
- August 21, 2024 at 8:54 pm
- 0 votes
0
REGEXP is what you will want to use since it allows for a defined set of pattern matching both for including and excluding. Answer at bottom.

Breakdown of what the functions are doing:

JSON_EXTRACT_SCALAR(json_string, json_path)
- json_string: the json string where you are trying to extract a value.
- json_path: path to the value within the json, using $.fieldname to point to the root of the JSON object (the $), and the specific field to locate (.fieldname).
JSON_EXTRACT_SCALAR(your_json_object, '$.user_id')
-From my json, find the root key value pair for user_id.
- This also applies to the UNNEST of all the JSONs for members.
REGEXP_EXTRACT(string, regex_pattern):
for a string, match a pattern.
- string: the extracted portion of user_id from JSON_EXTRACT_SCALAR()
- regex_pattern: r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$'
  - ^ points to start of string
  - (?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_) non capturing group. (?...) syntax groups prefixes without capturing them in the output. This is where you would put your different prefix possibilities.
  - ([a-zA-Z0-9-]+) This is the capturing group that you want in the output. from left to right it says: capture anything that is from a-z (lowercase) A-Z (uppercase), 0-9 (digits), – (hyphens). If you expect ids to have other captured formating, you would include it here. Such as a period or slash.
  - $ says the end of the string.
```
WITH sample_data AS (
  SELECT '''
  {
    "members": [
        {
            "channel_count": 0,
            "profile_url": "something.png",
            "push_enabled": true,
            "push_trigger_option": "default",
            "state": "joined",
            "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
        },
        {
            "channel_count": 1,
            "profile_url": "another.png",
            "push_enabled": false,
            "push_trigger_option": "custom",
            "state": "joined",
            "user_id": "operator-id-87654321-zyxw-0926-abcd-ce85de07e3f9"
        }
    ]
  }
  ''' AS json_data
)

SELECT 
  REGEXP_EXTRACT(JSON_EXTRACT_SCALAR(m, '$.user_id'), r'^(?:client-id-|operator-id-|auto_agent_id_|desk_agent_id_)([a-zA-Z0-9-]+)$') AS extracted_user_id
FROM 
  sample_data,
  UNNEST(JSON_EXTRACT_ARRAY(json_data, "$.members")) AS m
```
Login or Signup to reply.

From my understanding of the question you simply need to fetch whatever is after id-xxx or id_xxxx in that case you can just look for the keyword id and then take everything unless there are more conditions in that case would be good if you add more details to the question:

Step 1: Locate the right JSON Block i.e Members,
Step 2: Access the member details using JSON_VALUE, specifically access the user_id key
Step 3: Apply the regex function on the user_id key to extract the values you need.

My regex captures the group id_ or id- and takes any letter or digit after it. You can modify the last part as you see fit

WITH sample_table AS (
    SELECT 
        '''
        {
            "members": [
                {
                    "channel_count": 0,
                    "profile_url": "something.png",
                    "push_enabled": true,
                    "push_trigger_option": "default",
                    "state": "joined",
                    "user_id": "desk_agent_id_12345678-abcd-0926-zyxw-de07e3f9ce85"
                },
                {
                  
                    "user_id": "desk_agent_id-12345678-abcd-0926-zyxw-de07e3f9ce85"
                }
                ,

                 {
                  
                    "user_id": "xxxt-id-12345678-abcd-0926-zyxw-de07e3f9ce85"
                }
            ]
        }
        ''' AS json_data
)
SELECT 
    regexp_extract(JSON_VALUE(member_details, "$.user_id"),r'id[_-]([a-zA-Z0-9-]+)') AS user_id_string
FROM 
    sample_table,
    UNNEST(JSON_EXTRACT_ARRAY(json_data, '$.members')) AS member_details;

Please signup or login to give your own answer.

Click here to cancel reply.