skip to Main Content

So using a script, with Python, in my CSV file, I want to see if IP and timestamp values exists in some way in the line entries of the JSON log file, and if so, return that specific JSON log entry to another file. I tried to make it universal so it’s applicable to all IP addresses. Here’s what the sample CSV file would look like;

"clientip",""destip","dest_hostname","timestamp"
 "127.0.0.1","0.0.0.0","randomhost","2023-09-09T04:18:22.542Z"

A sample line entry from the Json Log File

{"log": "09-Sept-2023 rate-limit: info: client @xyz 127.0.0.1, "stream":"stderr", "time": 2023-09-09T04:18:22.542Z"}

It’s the lines from the JSON log file we want to return in the output.txt file when there’s a match. The JSON file doesn’t have the same fields and organization like the CSV does (with clientip, destip, dest_hostname, timestamp, but I was hoping that I could still at least return lines from the JSON log files to a new file that had matches on the clientip (like we see here with 127.0.0.1 in "info: client @xyz 127.0.0.1) and maybe the timestamp.

I tried shell previously but could not get any matches. I tried the join command join file.csv xyz-json.log > output.txt but it didn’t yield anything, neither did awk with specification like "NR==FR".

That’s why I’m trying to get this done in Python now. I’m new to Python as well, but this is what I roughly had in mind, ignoring indentation for now.

import csv
for line in csv
for line in json-logs
if csv == json-logs
print l1 == l2

I would appreciate any help/assistance with this!

2

Answers


  1. One possibility would be to read both csv and json files into a dataframe; extract any ip values from the json log then do an inner merge from the json file on ip and time and output rows remaining after the merge:

    dfc = pd.read_csv('test.csv')
    dfj = pd.read_json('test.jsonl', lines=True)
    dfj['ip'] = dfj['log'].str.extract(r'(d+(?:.d+){3})')
    res = dfj.merge(dfc, left_on=['ip', 'time'], right_on=['clientip','timestamp'],how='inner')
    res[['log', 'stream', 'time']].to_json('result.jsonl', orient='records', lines=True)
    
    Login or Signup to reply.
  2. Here is some code structure to get you started. I had to take some liberties with the CSV and JSON you provided as they don’t appear to be well formed (as already mentioned in comments), so I coerced them to make them usable.

    import io
    import csv
    import json
    
    csv_str = '"clientip","destip","dest_hostname","timestamp"n' + 
        '"127.0.0.1","0.0.0.0","randomhost","2023-09-09T04:18:22.542Z"'
    json_str = '[{"log": "09-Sept-2023", "rate-limit": "somethingelse?", ' + 
        '"info": "client @xyz 127.0.0.1", "stream":"stderr", ' + 
        '"time": "2023-09-09T04:18:22.542Z"}]'
    
    # https://stackoverflow.com/a/3305964
    # Read the CSV data (this can read from the file directly instead)
    csv_data = csv.DictReader(io.StringIO(csv_str))
    # Read the JSON data (this can also be read from the file directly instead)
    json_data = json.loads(json_str)
    
    # Check each line in the CSV
    for csv_line in csv_data:
        # Check each line the JSON
        for json_line in json_data:
            # Check if this line matches, and act if it did
            if csv_line['timestamp'] == json_line['time']:
                # Do whatever you want with the information
                print(json_line)
    
    

    If you are able to post some better examples of the CSV and JSON I can update this to match, presuming the structures I guessed are inaccurate. Feel free to ask for additional guidance on any part of this.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search