
I am trying to extract data from a complex data structure (example below). I am not sure which method is best (e.g. BeautifulSoup) for parsing the text and getting the desired fields into a table. For example, I need the number, date and e-mail of each block:

var gIg35809469970000987890data = {
  "values": [
    [
      "33765",
      "33765",
      "06-03-2023",
      "[email protected]",
      "indoor 1",
      "",
      "1",
      "10",
      "16",
      "",
      "DELETE",
      "33765",
      {
        "salt": "abSaocf8wyyJVMYVCEyAlg",
        "protected": "/54hAxJ90PKjrfjGC3Y_a6vaKuq6wF2a3LCPRBN-RlVRZxzepbuNLBRmI2MPaiYoOPPI0miY-MTodCl2rrwBwAg",
        "rowVersion": "vCwg8r2ZJr9wjvHwoZVrvxkfaCrWiTnUcosn89iFOO2yV-UvFxc1oo9AWsJomlw1IpKd-IZTUHLJjefknOMc5g",
        "fields": {
          "DEL": {
            "url": "javascript:void(null);"
          }
        }
      }
    ],
    [
      "33623",
      "33623",
      "03-03-2023",
      "[email protected]",
      "indoor 1",
      "",
      "1",
      "10",
      "16",
      "",
      "DELETE",
      "33623",
      {
        "salt": "KHpHaz4-fwN4l3fLmPX6AQ",
        "protected": "/B-ZmlAlvRzPee4kU-QvteJQUy0aP89g08BkpdE5CE-i8_JcsN2sKLELqYh2ZZ9vWZTbp4DtWFYjfO5NDAoKsmA",
        "rowVersion": "3JRQAE4fTETSgERkg3kRCuW2nZiUL_jOcSvLGXNkV6-lpfLOLPhduXAlmgcqEI6gSWX-yI-Fd5uMBbU5iqFXZA",
        "fields": {
          "DEL": {
            "url": "javascript:void(null);"
          }
        }
      }
    ]
  ]
}

(Shortened the file)

I was thinking of BeautifulSoup plus regex, but I am not sure.

Also, I am not sure what the data type is: dictionary + array?

Thanks in advance.

3 Answers


  1. Chosen as BEST ANSWER

    I got it working in an ugly way, but I will try pandas as well. That makes more sense anyway, since I want to do some data manipulation.

    I will include my code below. Basically, it takes the source HTML page, cuts away the pieces I don't need, and parses the rest as JSON.

    Suggestions are always welcome. Thanks.

    import json
    
    ### First, open the file with the test data
    
    with open('test_data_bookings.html', 'r') as f:
        html = f.read()
    
    ### Cut off everything before the search string (this could be done more elegantly)
    
    output_json = html.split("data =", 1)[1]
    print(type(output_json))
    
    ### Cut off everything after the search string
    
    res = output_json.partition("</script>")[0]
    final = res[0:-1]
    
    ### Create a function to cut off the last line of a string
    
    def remove_last_line_from_string(s):
        return s[:s.rfind('\n')]
    
    ### Call the function twice (this could be done more efficiently)
    ### Optional: uncomment the print statements to print the result
    
    string = final
    
    string = remove_last_line_from_string(string)
    # print('\n\n' + string)
    
    string = remove_last_line_from_string(string)
    # print('\n\n' + string)
    
    ### Parse the JSON string into a Python object
    
    pretty = json.loads(string)
    
    ### Check whether our object is a Dictionary
    
    print(type(pretty))
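
The slicing above depends on exactly how many characters and lines follow the JSON in the page. As a hedged alternative, here is a minimal sketch that lets the `json` module find the end of the object itself via `JSONDecoder.raw_decode`. The helper name `extract_embedded_json` and the `sample_html` string are made up for illustration; the sketch only assumes the page contains a `data =` marker followed by a JSON object, as in the question.

```python
import json


def extract_embedded_json(html: str, marker: str = "data =") -> dict:
    """Parse the first JSON object that follows `marker` in `html`."""
    # Find the first opening brace after the marker.
    start = html.index("{", html.index(marker))
    # raw_decode parses one JSON value starting at `start` and ignores
    # whatever comes after it (";", "</script>", more HTML, ...).
    obj, _end = json.JSONDecoder().raw_decode(html, start)
    return obj


# Hypothetical, shortened stand-in for the real page.
sample_html = """<html><body><script>
var gIg35809469970000987890data = {
  "values": [
    ["33765", "33765", "06-03-2023", "[email protected]", "indoor 1"]
  ]
};
</script></body></html>"""

bookings = extract_embedded_json(sample_html)
print(type(bookings))            # <class 'dict'>
print(bookings["values"][0][2])  # 06-03-2023
```

With the real file, the call would be `extract_embedded_json(html)` after reading the page into `html`.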
    

  2. Based on the information provided, there are many options. I'll demonstrate Python's built-in json package:

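    import json

    # JSON text from the question (everything after "data ="), trimmed here to one row
    str_data = """
    {"values": [["33765", "33765", "06-03-2023", "[email protected]", "indoor 1", "",
                 "1", "10", "16", "", "DELETE", "33765",
                 {"salt": "abSaocf8wyyJVMYVCEyAlg"}]]}
    """
    obj_data = json.loads(str_data)
    # data is loaded into a dict
    print(obj_data['values'][0][0])  # prints 33765
    print(obj_data['values'][0][12]['salt'])  # prints abSaocf8wyyJVMYVCEyAlg
    # it's a plain old Python object, so you can modify it
    obj_data['values'][0][12]['salt'] = 'abced8dbt6'
    # when finished, convert back to JSON:
    str_result = json.dumps(obj_data)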
    

    Explanation of JSON notation
    Braces surround dictionaries; square brackets surround lists (just like in Python).
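
To make the mapping concrete, here is a small sketch showing which Python types `json.loads` produces for the structure in the question. The `text` string is a trimmed, illustrative version of one row, not the full data.

```python
import json

# Trimmed version of the question's structure: an object holding a list of rows,
# where each row is a list that ends with a nested object.
text = '{"values": [["33765", "06-03-2023", {"salt": "abSaocf8wyyJVMYVCEyAlg"}]]}'
parsed = json.loads(text)

print(type(parsed))                  # <class 'dict'>  (JSON object)
print(type(parsed["values"]))        # <class 'list'>  (JSON array)
print(type(parsed["values"][0]))     # <class 'list'>  (one row)
print(type(parsed["values"][0][2]))  # <class 'dict'>  (nested JSON object)
```

So the data in the question is a dict whose `"values"` key holds a list of lists, with a dict as the last element of each row.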

  3. The data looks like JSON text returned from an API.

    Given:

    import json

    data_text = """
    {
      "values": [
        [
          "33765",
          "33765",
          "06-03-2023",
          "[email protected]",
          "indoor 1",
          "",
          "1",
          "10",
          "16",
          "",
          "DELETE",
          "33765",
          {
            "salt": "abSaocf8wyyJVMYVCEyAlg",
            "protected": "/54hAxJ90PKjrfjGC3Y_a6vaKuq6wF2a3LCPRBN-RlVRZxzepbuNLBRmI2MPaiYoOPPI0miY-MTodCl2rrwBwAg",
            "rowVersion": "vCwg8r2ZJr9wjvHwoZVrvxkfaCrWiTnUcosn89iFOO2yV-UvFxc1oo9AWsJomlw1IpKd-IZTUHLJjefknOMc5g",
            "fields": {
              "DEL": {
                "url": "javascript:void(null);"
              }
            }
          }
        ],
        [
          "33623",
          "33623",
          "03-03-2023",
          "[email protected]",
          "indoor 1",
          "",
          "1",
          "10",
          "16",
          "",
          "DELETE",
          "33623",
          {
            "salt": "KHpHaz4-fwN4l3fLmPX6AQ",
            "protected": "/B-ZmlAlvRzPee4kU-QvteJQUy0aP89g08BkpdE5CE-i8_JcsN2sKLELqYh2ZZ9vWZTbp4DtWFYjfO5NDAoKsmA",
            "rowVersion": "3JRQAE4fTETSgERkg3kRCuW2nZiUL_jOcSvLGXNkV6-lpfLOLPhduXAlmgcqEI6gSWX-yI-Fd5uMBbU5iqFXZA",
            "fields": {
              "DEL": {
                "url": "javascript:void(null);"
              }
            }
          }
        ]
      ]
    }
    """
    data = json.loads(data_text)
    

    You can probably load it directly into a dataframe using:

    import pandas
    print(pandas.DataFrame(data["values"]))
    

    You can also pick out just the columns you need manually:

    import json
    import pandas
    
    # columns 1-3 of each row hold the number, date and e-mail
    data_dict = [
        row[1:4]
        for row in data["values"]
    ]
    print(data_dict)
    

    and then convert it to a DataFrame if you wish:

    df = pandas.DataFrame(data_dict)
    print(df)
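
If named fields are easier to work with than positional slices, the same selection can be sketched with the standard library alone. The field names in `FIELD_INDEX` are assumptions based on the question's description (number, date, e-mail at positions 1 to 3), and `data_text` here is a trimmed copy of the two rows shown above.

```python
import json

# Trimmed copy of the two rows from the question (only the first five columns).
data_text = """
{
  "values": [
    ["33765", "33765", "06-03-2023", "[email protected]", "indoor 1"],
    ["33623", "33623", "03-03-2023", "[email protected]", "indoor 1"]
  ]
}
"""
data = json.loads(data_text)

# Assumed meaning of the columns; adjust the indexes if the real layout differs.
FIELD_INDEX = {"number": 1, "date": 2, "email": 3}

# Build one dict per row containing only the wanted fields.
records = [
    {name: row[i] for name, i in FIELD_INDEX.items()}
    for row in data["values"]
]

# Print a simple fixed-width table.
print(f"{'number':<8}{'date':<12}{'email'}")
for rec in records:
    print(f"{rec['number']:<8}{rec['date']:<12}{rec['email']}")
```

A list of dicts like `records` can also be passed to `pandas.DataFrame(records)`, which gives a frame with `number`, `date` and `email` as column names.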
    