
I have been asking for help in importing JSON-formatted data from URLs (I am a newbie as far as dealing with JSON) and received a great answer in response to this question.

However, I have encountered a complication. Some of my property names contain spaces. For example, "Property1" and several other property names from my previous question might actually be "Property1_word1 Property1_word2." The current solution preserves only the first word of a property name. I could get away with that at first but now need all words. If anyone could please point me to any tips, I would be grateful. I haven’t managed to find any so far.


Edit (providing all information here so that there’s no need to refer to previous posts):

I want to import data from a website. First I save the contents (below) of the website as a file. In my previous question, each property name was made up of only one word. Now I’m dealing with property names that are made up of multiple words. I have provided an example below, where Property1, Property4, and Property8 have names with multiple words.

{
    "payload": {
        "allShortcutsEnabled": false,
        "fileTree": {
            "": {
                "items": [
                    {
                        "name": "thing",
                        "path": "thing",
                        "contentType": "directory"
                    },
                    {
                        "name": ".repurlignore",
                        "path": ".repurlignore",
                        "contentType": "file"
                    },
                    {
                        "name": "README.md",
                        "path": "README.md",
                        "contentType": "file"
                    },
                    {
                        "name": "thing2",
                        "path": "thing2",
                        "contentType": "file"
                    },
                    {
                        "name": "thing3",
                        "path": "thing3",
                        "contentType": "file"
                    },
                    {
                        "name": "thing4",
                        "path": "thing4",
                        "contentType": "file"
                    },
                    {
                        "name": "thing5",
                        "path": "thing5",
                        "contentType": "file"
                    },
                    {
                        "name": "thing6",
                        "path": "thing6",
                        "contentType": "file"
                    },
                    {
                        "name": "thing7",
                        "path": "thing7",
                        "contentType": "file"
                    },
                    {
                        "name": "thing8",
                        "path": "thing8",
                        "contentType": "file"
                    },
                    {
                        "name": "thing9",
                        "path": "thing9",
                        "contentType": "file"
                    },
                    {
                        "name": "thing10",
                        "path": "thing10",
                        "contentType": "file"
                    },
                    {
                        "name": "thing11",
                        "path": "thing11",
                        "contentType": "file"
                    }
                ],
                "totalCount": 500
            }
        },
        "fileTreeProcessingTime": 5.262188,
        "foldersToFetch": [],
        "reducedMotionEnabled": null,
        "repo": {
            "id": 1234567,
            "defaultBranch": "main",
            "name": "repository",
            "ownerLogin": "contributor",
            "currentUserCanPush": false,
            "isFork": false,
            "isEmpty": false,
            "createdAt": "2023-10-31",
            "ownerAvatar": "https://avatars.repurlusercontent.com/u/98765432?v=1",
            "public": true,
            "private": false,
            "isOrgOwned": false
        },
        "symbolsExpanded": false,
        "treeExpanded": true,
        "refInfo": {
            "name": "main",
            "listCacheKey": "v0:13579",
            "canEdit": false,
            "refType": "branch",
            "currentOid": "identifier"
        },
        "path": "thing2",
        "currentUser": null,
        "blob": {
            "rawLines": [
                "        C_1H_4   Methane                  ",
                "            5.00000        Property1_word1 Property1_word2                              ",
                "             20.00000        Property2                     ",
                "           500.66500        Property3                              ",
                "           100.00000        Property4_word1 Property4_word2                                           ",
                "         -4453.98887        Property5                                      ",
                "           100.48200        Property6                                   ",
                "            59.75258        Property7                                         ",
                "             5.33645        Property8_word1 Property8_word2                                         ",
                "             0.00000        Property9         ",
                "           645.07777        Property10                                       ",
                "             0.00000        Property11                           ",
                "             0.00000        Property12                           ",
                "             0.00000        Property13                             ",
                "             0.00000        Property14                             ",
                "             0.00000        Property15                             ",
                "             0.00000        Property16                             ",
                "             0.00000        Property17                   ",
                "             0.00000        Property18                            ",
                "             0.00000        Property19                   ",
                "             0.00000        Property20                             ",
                "             0.00000        Property21                   ",
                "             0.00000        Property22                             ",
                "             0.00000        Property23                   ",
                "             0.00000        Property24                    ",
                "             0.00000        Property25                    ",
                "             0.57876        Property26                                           ",
                "             4.00000        Property27                                               ",
                "             0.00000        Property28                    ",
                "             0.00000        Property29               ",
                "             0.00000        Property30                  ",
                "             0.00000        Property31            ",
                "             0.00000        Property32                  ",
                "             1.00000        Property33                         ",
                "             0.00000        Property34                       ",
                "            26.00000        Property35                             ",
                "             1.44571        Property36                               ",
                "             1.08756        Property37                            ",
                "             0.00000        Property38                          ",
                "             0.00000        Property39                        ",
                "             0.00000        Property40                        ",
                "             6.00000        Property41                       ",
                "             9.00000        Property42                                         ",
                "             0.00000        Property43                                         "
            ],
            "stylingDirectives": [
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                [],
                []
            ],
            "csv": null,
            "csvError": null,
            "dependabotInfo": {
                "showConfigurationBanner": false,
                "configFilePath": null,
                "networkDependabotPath": "/contributor/repository/network/updates",
                "dismissConfigurationNoticePath": "/settings/dismiss-notice/dependabot_configuration_notice",
                "configurationNoticeDismissed": null,
                "repoAlertsPath": "/contributor/repository/security/dependabot",
                "repoSecurityAndAnalysisPath": "/contributor/repository/settings/security_analysis",
                "repoOwnerIsOrg": false,
                "currentUserCanAdminRepo": false
            },
            "displayName": "thing2",
            "displayUrl": "https://repurl.com/contributor/repository/blob/main/thing2?raw=true",
            "headerInfo": {
                "blobSize": "3.37 KB",
                "deleteInfo": {
                    "deleteTooltip": "You must be signed in to make or propose changes"
                },
                "editInfo": {
                    "editTooltip": "XXX"
                },
                "ghDesktopPath": "https://desktop.repurl.com",
                "repurlLfsPath": null,
                "onBranch": true,
                "shortPath": "5678",
                "siteNavLoginPath": "/login?return_to=identifier",
                "isCSV": false,
                "isRichtext": false,
                "toc": null,
                "lineInfo": {
                    "truncatedLoc": "33",
                    "truncatedSloc": "33"
                },
                "mode": "executable file"
            },
            "image": false,
            "isCodeownersFile": null,
            "isPlain": false,
            "isValidLegacyIssueTemplate": false,
            "issueTemplateHelpUrl": "https://docs.repurl.com/articles/about-issue",
            "issueTemplate": null,
            "discussionTemplate": null,
            "language": null,
            "languageID": null,
            "large": false,
            "loggedIn": false,
            "newDiscussionPath": "/contributor/repository/issues/new",
            "newIssuePath": "/contributor/repository/issues/new",
            "planSupportInfo": {
                "repoOption1": null,
                "repoOption2": null,
                "requestFullPath": "/contributor/repository/blob/main/thing2",
                "repoOption4": null,
                "repoOption5": null,
                "repoOption6": null,
                "repoOption7": null
            },
            "repoOption8": {
                "repoOption9": "/settings/dismiss-notice/repoOption10",
                "releasePath": "/contributor/repository/releases/new=true",
                "repoOption11": false,
                "repoOption12": false
            },
            "rawBlobUrl": "https://repurl.com/contributor/repository/raw/main/thing2",
            "repoOption13": false,
            "richText": null,
            "renderedFileInfo": null,
            "shortPath": null,
            "tabSize": 8,
            "topBannersInfo": {
                "overridingGlobalFundingFile": false,
                "universalPath": null,
                "repoOwner": "contributor",
                "repoName": "repository",
                "repoOption14": false,
                "citationHelpUrl": "https://docs.repurl.com/en/repurl/archiving/about",
                "repoOption15": false,
                "repoOption16": null
            },
            "truncated": false,
            "viewable": true,
            "workflowRedirectUrl": null,
            "symbols": {
                "timedOut": false,
                "notAnalyzed": true,
                "symbols": []
            }
        },
        "collabInfo": null,
        "collabMod": false,
        "wtsdf_signifier": {
            "/contributor/repository/branches": {
                "post": "identifier"
            },
            "/repos/preferences": {
                "post": "identifier"
            }
        }
    },
    "title": "repository/thing2 at main \u0000 contributor/repository"
}

Here is the code that deals with property names made up of one word (the step that collapses whitespace and then splits on spaces means that only the first word of a multi-word name is imported):

import json
import pandas as pd

# Open the saved file; the with block closes it even if loading fails
with open("yourJson.json", "r") as f:
    data = json.load(f)

# Get what we want to extract from the json
to_extract = data["payload"]["blob"]["rawLines"]

# Remove useless whitespace
stripped = [e.strip() for e in to_extract]
trimmed = [" ".join(e.split()) for e in stripped]

# Transform the list of string to a dict
as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}

# Load the dict with pandas
df = pd.DataFrame(as_dict.items(), columns=['Value', 'Property'])

I have experimented with various solutions (e.g., not stripping whitespace, specifying the exact property names associated with the data I need), but I am new enough to JSON that the resulting errors mean little to me.

2 Answers


  1. JSON keys may contain spaces; that is perfectly valid, if that was your question.

    {
        "My name is": "Efe"
    }
    

    Also, if you want to remove leading and trailing whitespace from a string, you can use strip():

    mystring = " Hello "
    mystring = mystring.strip()
    
    #'Hello'
    

    Also, it would be easier to see the problem and the code if you edited the question to include all the material in one place, without referring to older posts.

  2. Let’s break down your example to just two lines of data.

    to_extract = [
        "        C_1H_4   Methane                  ",
        "            5.00000        Property1_word1 Property1_word2                              ",
    ]
    stripped = [e.strip() for e in to_extract]
    trimmed = [" ".join(e.split()) for e in stripped]
    print(f"{trimmed=}")
    

    This gives us the cleaned data:

    trimmed=['C_1H_4 Methane', '5.00000 Property1_word1 Property1_word2']

    In the next part of your code you split the strings in this list and construct the dictionary. Let’s see what we get here:

    for e in trimmed:
        print(e.split(' '))
    

    The resulting lists look like this:

    ['C_1H_4', 'Methane']
    ['5.00000', 'Property1_word1', 'Property1_word2']
    

    As you can see, the second string was split into a list with 3 parts, and the third part (index 2) is lost because your code only reads indices 0 and 1. You could join the parts back together, but there is an easier way: the split method has a maxsplit parameter, and setting it to 1 performs only one split.

    for e in trimmed:
        print(e.split(' ', 1))
    

    Both lists now have only 2 entries.

    ['C_1H_4', 'Methane']
    ['5.00000', 'Property1_word1 Property1_word2']
    

    So you just have to change your old code

    as_dict = {e.split(' ')[0]: e.split(' ')[1] for e in trimmed}

    to

    as_dict = {e.split(' ')[0]: e.split(' ', 1)[1] for e in trimmed}


    Additionally: I don’t like that we call split twice per line. Splitting and then rejoining the strings just to construct trimmed also seems like more work than necessary.

    We can throw out the intermediate creation of stripped and trimmed and boil all of this down to:

    as_dict = dict(line.strip().split(None, 1) for line in to_extract)

    The result is:

    {'C_1H_4': 'Methane', '5.00000': 'Property1_word1 Property1_word2'}
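
A caveat that applies to both the original code and the versions above: the numeric value is used as the dictionary key, so lines that share a value (the sample has many 0.00000 rows) overwrite one another and only the last one survives. A minimal sketch of one way around that, assuming you want one row per line, is to keep a list of [value, name] pairs instead of a dict. The `raw_lines` sample and the `rows` name below are illustrative stand-ins for `data["payload"]["blob"]["rawLines"]`, not part of the original code.

```python
# Illustrative sample; in the real script this would be
# data["payload"]["blob"]["rawLines"] from the loaded JSON.
raw_lines = [
    "        C_1H_4   Methane                  ",
    "            5.00000        Property1_word1 Property1_word2        ",
    "             0.00000        Property9         ",
    "             0.00000        Property11                           ",
]

rows = []
for line in raw_lines:
    # strip() drops the padding; split(None, 1) splits once on the
    # first run of whitespace, so multi-word names stay intact.
    parts = line.strip().split(None, 1)
    if len(parts) == 2:  # skip blank lines or lines with a single token
        rows.append(parts)

# A list keeps every line, while a dict keyed by value drops duplicates.
print(len(rows))        # 4
print(len(dict(rows)))  # 3, the two 0.00000 lines collapse into one key
```

Passing `rows` to `pd.DataFrame(rows, columns=['Value', 'Property'])` should then give one row per input line, in the original order. The `len(parts) == 2` check is a defensive assumption: the one-line `dict(...)` version raises a ValueError if any line contains fewer than two tokens.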
