How to parse log text file, parse datetimes, and get sum of timedeltas - Photoshop

SimranSharma
September 8, 2020
3 Answers

I tried various methods to open the file and process it as a whole, but I could not get it to work: the output is either zero or an empty set.

I have a log file containing data such as:

Time Log Nitrogen:
5/1/12: 3:39am - 4:43am data file study
        3:57pm - 5:06pm bg ui, combo boxes
        7:44pm - 8:50pm bg ui with scaler; slider
        10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,
5/9/12: 3:05pm - 3:42pm wholeMapMC, subMapMC, AS3 functions reading
        10:35pm - 1:33am whole view data; scrollpane; 
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
         1:30pm - 5:00pm Nitrate bar
         11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
         5:45pm - 8:00pm costs bar, embed font
         9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
         2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
         7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
         10:30am - 11:45am meet with Dr. Lant and Blanca
         3:09pm - 5:05pm crop yield and sub sections pink bar
         7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
         7:30pm - 8:30pm 
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12:  1:30am - 3:00am continue the research
         9:31am - 12:45pm experiment on the combobox-subitem concept
         3:45pm - 5:00pm
         6:23pm - 8:14pm give up
         8:18pm - 10:00pm zone change
         11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
         4:15pm - 5:05pm fine-tune the whole view map
         7:36pm - 8:46pm 
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
         10:31am - 11:40am restorable wetlands implementation
         4:00pm - 5:00pm 
8/2/12:  12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"

How can I calculate the total time spent by parsing the time log file? I am unable to parse the file as a whole.

I tried:

import re
import datetime

text="""5/1/12: 3:39am - 4:43am data file study
        3:57pm - 5:06pm bg ui, combo boxes
        7:44pm - 8:50pm bg ui with scaler; slider
        10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,"""

total=re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)",text)

print(sum([datetime.datetime.strptime(t[1],"%I:%M%p")-datetime.datetime.strptime(t[0],"%I:%M%p") for t in total],datetime.timedelta()))

When I run this, I get a negative total. How can I fix it?

Answers

To account for intervals that cross midnight, calculate the duration on each of the two days separately and add them together.
Please refer to the code below.

import re
from datetime import datetime as dt, timedelta as td
strp=dt.strptime
with open("log.txt","r") as f:
    total=re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)",f.read())
    print(sum([strp(t[1],"%I:%M%p")-strp(t[0],"%I:%M%p") if strp(t[1],"%I:%M%p")>strp(t[0],"%I:%M%p") else (strp("11:59pm","%I:%M%p")-strp(t[0],"%I:%M%p"))+(strp(t[1],"%I:%M%p")-strp("12:00am","%I:%M%p"))+td(minutes=1) for t in total],td()))

Output

4 days, 9:13:00
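
As a worked check of the midnight split, take the entry 10:30pm - 12:48am: 1:29 up to 11:59pm, plus one minute to reach midnight, plus 0:48 after midnight. The sketch below is only an illustration of that arithmetic; the helper name split_at_midnight is hypothetical and not part of the code above.

```python
from datetime import datetime, timedelta

FMT = "%I:%M%p"

def split_at_midnight(start_text, end_text):
    """Duration of an interval, split at midnight when it wraps (hypothetical helper)."""
    start = datetime.strptime(start_text, FMT)
    end = datetime.strptime(end_text, FMT)
    if end > start:
        return end - start
    # Part before midnight: up to 11:59pm, plus the final minute.
    before = datetime.strptime("11:59pm", FMT) - start + timedelta(minutes=1)
    # Part after midnight: counted from 12:00am.
    after = end - datetime.strptime("12:00am", FMT)
    return before + after

print(split_at_midnight("10:30pm", "12:48am"))  # 2:18:00
print(split_at_midnight("3:39am", "4:43am"))    # 1:04:00
```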

You could parse your log file into a pandas DataFrame and then easily make your calculations:

import re
from datetime import timedelta
import pandas as pd
import dateparser

x="""5/1/12: 3:39am - 4:43am data file study
            3:57pm - 5:06pm bg ui, combo boxes
            7:44pm - 8:50pm bg ui with scaler; slider
            10:30pm - 12:48am state texts; slider
    5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
    5/8/12: 11:00pm - 11:40pm mapMC,
    5/9/12: 3:05pm - 3:42pm wholeMapMC, subMapMC, AS3 functions reading
            10:35pm - 1:33am whole view data; scrollpane; 
    5/10/12: 6:10pm - 8:13pm blue slider
    5/11/12: 8:45am - 12:10pm purple slider
             1:30pm - 5:00pm Nitrate bar
             11:18pm - 12:03am change NitrogenViewBase to static
    5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
             5:45pm - 8:00pm costs bar, embed font
             9:51pm - 12:31am costs bar
    5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
    5/15/12: 2:07am - 5:09am corn
             2:06pm - 5:11pm hypoxic zone
    5/16/12: 2:53pm - 5:09pm data re-structure
             7:00pm - 9:10pm sub sections watershed data
    5/17/12: 12:30am - 2:32am sub sections sliders
             10:30am - 11:45am meet with Dr. Lant and Blanca
             3:09pm - 5:05pm crop yield and sub sections pink bar
             7:00pm - 7:50pm sub sections nitrate to gulf bar
    5/18/12: 3:15pm - 3:52pm sub sections slider legend
    5/27/12: 5:46pm - 7:30pm feedback fixes
    6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
             7:30pm - 8:30pm 
    6/22/12: 3:40pm - 5:00pm
    6/25/12: 3:24pm - 5:00pm
    6/26/12: 11:24am - 12:35pm
    7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
    7/5/12:  1:30am - 3:00am continue the research
             9:31am - 12:45pm experiment on the combobox-subitem concept
             3:45pm - 5:00pm
             6:23pm - 8:14pm give up
             8:18pm - 10:00pm zone change
             11:07pm - 12:00am
    7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
             4:15pm - 5:05pm fine-tune the whole view map
             7:36pm - 8:46pm 
    7/11/12: 1:38am - 4:42am
    7/31/12: 11:26am - 1:18pm study photoshop path shape
    8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
             10:31am - 11:40am restorable wetlands implementation
             4:00pm - 5:00pm 
    8/2/12:  12:20am - 4:42am
    8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
    3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"
"""



#We will store records there
records = []

#Loop through lines
for line in x.split("\n"):
    
    #Look for a date in line
    match_date = re.search(r'(\d+/\d+/\d+)',line)
    
    if match_date is not None:
        #If a date exists, store it in a variable
        date = match_date.group(1)
    #Extract times
    times =  re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)",line)
    #if there's no valid time in the line, skip it
    if len(times) == 0: continue
    #parse dates
    start = dateparser.parse(date + " " + times[0][0], languages=['en'])
    end = dateparser.parse(date + " " + times[0][1], languages=['en'])
    content =line.split(times[0][1])[-1].strip()
    #Append records
    records.append(dict(date=date, start= start, end = end, content =content))
    
df = pd.DataFrame(records)

#Correct end time if it's lower than start time 
df.loc[df.start>df.end,"end"] = df[df.start>df.end].end + timedelta(days=1)

print("Total spent time :", (df.end - df.start).sum())

Output

Total spent time : 4 days 09:13:00
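
The date carry-forward used above (continuation lines inherit the most recent date) can also be sketched with only the standard library. This is an illustrative variant, not the code from this answer: the function name parse_entries is hypothetical, and the format string "%m/%d/%y %I:%M%p" assumes the month/day/year order suggested by the sample log.

```python
import re
from datetime import datetime, timedelta

DATE_RE = re.compile(r"(\d+/\d+/\d+):")
TIME_RE = re.compile(r"(\d{1,2}:\d{2}[ap]m)\s*-\s*(\d{1,2}:\d{2}[ap]m)")
FMT = "%m/%d/%y %I:%M%p"  # assumption: dates are month/day/two-digit year

def parse_entries(text):
    """Return (start, end) datetime pairs, one per log entry (hypothetical helper)."""
    entries = []
    date = None
    for line in text.splitlines():
        found = DATE_RE.search(line)
        if found:
            date = found.group(1)  # remember the date for continuation lines
        times = TIME_RE.search(line)
        if times is None or date is None:
            continue  # header or blank line
        start = datetime.strptime(f"{date} {times.group(1)}", FMT)
        end = datetime.strptime(f"{date} {times.group(2)}", FMT)
        if end < start:
            end += timedelta(days=1)  # the interval crosses midnight
        entries.append((start, end))
    return entries

sample = """5/1/12: 3:39am - 4:43am data file study
        10:30pm - 12:48am state texts; slider
5/8/12: 11:00pm - 11:40pm mapMC,"""

entries = parse_entries(sample)
print(sum((end - start for start, end in entries), timedelta()))  # 4:02:00
```

Keeping full datetimes rather than bare times means the end of a late-night entry lands on the following calendar day, which is useful if the entries are later grouped by date.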

You already have two interesting and working solutions from Liju and Sebastien D. Here I propose two new variants that, while similar, have important performance advantages.

The two current solutions approach the problem in this way:

Solution by one_pass, proposed by Liju: takes the regex matches and sums a list built with a list comprehension. Inside that comprehension, strptime calls on the same two strings appear in three places (the > comparison, the if branch, and the else branch), so each string is parsed twice per match.
Solution by w_dateparser, proposed by Sebastien D: takes each line of text, tries to regex a date out of it, then tries to find the start/end times in that same line (this could be reduced to a single regex, but the regex is not this solution’s bottleneck). It then uses dateparser to combine date and time, and also collects the text description. This is closer to a full-fledged parser, but for the timing tests I removed the description functionality.

The two new solutions are similar:

Solution by two_pass: similar to one_pass but in the first pass it just parses the strings to datetime and in the second pass it evaluates start > end and sums the correct timedelta. The main advantage is that it only parses dates once, with the downside of having to iterate twice.
Solution by w_pure_pandas: similar to w_dateparser, but calls the regex only once and uses pandas’ built-in to_datetime for parsing.
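
To make the cost of repeated parsing concrete, the sketch below counts strptime calls for the two list-comprehension styles on two entries, one of which crosses midnight. It is a simplified illustration, not the benchmarked code: parse, counter, one_pass_style and two_pass_style are hypothetical names introduced here.

```python
from datetime import datetime, timedelta

FMT = "%I:%M%p"
counter = {"calls": 0}

def parse(text):
    """strptime wrapper that counts how often it is called."""
    counter["calls"] += 1
    return datetime.strptime(text, FMT)

def one_pass_style(a, b):
    # Parses inside the comparison and again inside the chosen branch.
    return (parse(b) - parse(a) if parse(b) > parse(a)
            else parse("11:59pm") - parse(a) + parse(b) - parse("12:00am")
            + timedelta(minutes=1))

def two_pass_style(a, b):
    # Parses each string once, then reuses the datetime objects.
    start, end = parse(a), parse(b)
    return end - start if end > start else end - start + timedelta(days=1)

for style in (one_pass_style, two_pass_style):
    counter["calls"] = 0
    total = style("3:39am", "4:43am") + style("10:30pm", "12:48am")
    print(style.__name__, total, counter["calls"])
# one_pass_style 3:22:00 10
# two_pass_style 3:22:00 4
```

Both styles return the same total; only the number of strptime calls differs.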

If we compare the performance of all these solutions with different text lengths, we can see that w_dateparser is by far the least performant solution.

If we zoom in to compare the other three solutions, we see that w_pure_pandas is a little slower than the others for short texts, but it excels on longer texts by taking advantage of numpy's C implementations (as opposed to the list comprehensions used by the other solutions). Secondly, two_pass is generally faster than one_pass, and increasingly so for longer texts.

The code for two_pass and w_pure_pandas:

def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'),
            datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        )
        , datetime.timedelta()
    )


def w_pure_pandas(text):
    import pandas as pd
    
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # pandas.to_datetime has issues with date compatibility
        # but since we only care for time deltas,
        # we can just use the default behavior
        df[col] = pd.to_datetime(df[col])
    
    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)
    
    return df.diff(axis=1).sum()['end']

The full code for all solutions and time testing:

import re
import datetime
import timeit
import pandas as pd  # needed at module level for the timings DataFrame below
from matplotlib import pyplot as plt

text = '''
Time Log Nitrogen:
5/1/12: 3:39am - 4:43am data file study
        3:57pm - 5:06pm bg ui, combo boxes
        7:44pm - 8:50pm bg ui with scaler; slider
        10:30pm - 12:48am state texts; slider
5/2/12: 10:00am - 12:00am discuss with Blanca about the data file
5/8/12: 11:00pm - 11:40pm mapMC,
5/9/12: 3:05pm - 3:42pm wholeMapMC, subMapMC, AS3 functions reading
        10:35pm - 1:33am whole view data; scrollpane; 
5/10/12: 6:10pm - 8:13pm blue slider
5/11/12: 8:45am - 12:10pm purple slider
         1:30pm - 5:00pm Nitrate bar
         11:18pm - 12:03am change NitrogenViewBase to static
5/12/12: 8:06am - 9:47am correct data and change NitrogenViewBase to static
         5:45pm - 8:00pm costs bar, embed font
         9:51pm - 12:31am costs bar
5/13/12: 7:45am - 8:45am read the Nitrogen Game doc
5/15/12: 2:07am - 5:09am corn
         2:06pm - 5:11pm hypoxic zone
5/16/12: 2:53pm - 5:09pm data re-structure
         7:00pm - 9:10pm sub sections watershed data
5/17/12: 12:30am - 2:32am sub sections sliders
         10:30am - 11:45am meet with Dr. Lant and Blanca
         3:09pm - 5:05pm crop yield and sub sections pink bar
         7:00pm - 7:50pm sub sections nitrate to gulf bar
5/18/12: 3:15pm - 3:52pm sub sections slider legend
5/27/12: 5:46pm - 7:30pm feedback fixes
6/20/12: 2:57pm - 5:00pm Teachers' feedback fixes
         7:30pm - 8:30pm 
6/22/12: 3:40pm - 5:00pm
6/25/12: 3:24pm - 5:00pm
6/26/12: 11:24am - 12:35pm
7/4/12:  1:00pm - 10:00pm research on combobox with dropdown subitem - to no avail
7/5/12:  1:30am - 3:00am continue the research
         9:31am - 12:45pm experiment on the combobox-subitem concept
         3:45pm - 5:00pm
         6:23pm - 8:14pm give up
         8:18pm - 10:00pm zone change
         11:07pm - 12:00am
7/10/12: 11:32am - 12:03pm added BASE_X and BASE_Y to the NitrogenSubView
         4:15pm - 5:05pm fine-tune the whole view map
         7:36pm - 8:46pm 
7/11/12: 1:38am - 4:42am
7/31/12: 11:26am - 1:18pm study photoshop path shape
8/1/12:  2:00am - 3:41am collect the coordinates of wetland shapes
         10:31am - 11:40am restorable wetlands implementation
         4:00pm - 5:00pm 
8/2/12:  12:20am - 4:42am
8/10/12: 2:30am - 4:55am sub watersheds color match; wetland color & size change 
3/13/13: 6:06pm - 6:32pm Make the numbers in the triangle sliders bigger and bolder; Larger font on "Crop Yield Reduction"
'''

def one_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    return sum(
        [
            datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            if datetime.datetime.strptime(t[1], '%I:%M%p') >
                datetime.datetime.strptime(t[0], '%I:%M%p')
            else
            datetime.datetime.strptime('11:59pm', '%I:%M%p')
            - datetime.datetime.strptime(t[0], '%I:%M%p')
            + datetime.datetime.strptime(t[1], '%I:%M%p')
            - datetime.datetime.strptime('12:00am', '%I:%M%p')
            + datetime.timedelta(minutes=1)
            for t in total
        ]
        , start=datetime.timedelta()
    )


def w_dateparser(text):
    import pandas as pd
    import dateparser
    
    #We will store records there
    records = []
    #Loop through lines
    # t0 = t1 = t2 = 0
    for line in text.split("\n"):
        #Look for a date in line
        # t0 = time() - t0
        match_date = re.search(r'(\d+/\d+/\d+)',line)
        if match_date is not None:
            #If a date exists, store it in a variable
            date = match_date.group(1)
        #Extract times
        times =  re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)",line)
        # t0 = time() - t0
        #if there's no valid time in the line, skip it
        if len(times) == 0: continue
        # t1 = time() - t1
        #parse dates
        start = dateparser.parse(date + " " + times[0][0], languages=['en'])
        end = dateparser.parse(date + " " + times[0][1], languages=['en'])
        # content = line.split(times[0][1])[-1].strip()
        # t1 = time() - t1
        #Append records
        # records.append(dict(date=date, start= start, end = end, content =content))
        records.append(dict(date=date, start= start, end = end))
        
    # t2 = time() - t2
    df = pd.DataFrame(records)
    # print(df)
    #Correct end time if it's lower than start time 
    df.loc[df.start>df.end,"end"] = df[df.start>df.end].end + datetime.timedelta(days=1)
    # t2 = time() - t2
    # print(t0, t1, t2)
    return (df.end - df.start).sum()


def two_pass(text):
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    total = [
        (datetime.datetime.strptime(t[0], '%I:%M%p'),
            datetime.datetime.strptime(t[1], '%I:%M%p'))
        for t in total
    ]
    return sum(
        (
            end - start if end > start
            else end - start + datetime.timedelta(days=1)
            for start, end in total
        )
        , datetime.timedelta()
    )


def w_pure_pandas(text):
    import pandas as pd
    
    total = re.findall(r"(\d{1,2}:\d{1,2}[ap]m)\s*-\s*(\d{1,2}:\d{1,2}[ap]m)", text)
    df = pd.DataFrame(total, columns=['start', 'end'])
    for col in df:
        # pandas.to_datetime has issues with date compatibility
        # but since we only care for time deltas,
        # we can just use the default behavior
        df[col] = pd.to_datetime(df[col])
    
    df.loc[df.start > df.end, 'end'] += datetime.timedelta(days=1)
    
    return df.diff(axis=1).sum()['end']

timings = {}
for l in [1, 5, 10, 50, 100]:
    text_long = text * l
    n = 2
    timings[l] = {}
    for func in ['two_pass', 'one_pass', 'w_pure_pandas', 'w_dateparser']:
        t = timeit.timeit(f"{func}(text_long)", number=n, globals=globals()) / n
        timings[l][func] = t

timings = pd.DataFrame(timings).T
timings.info()
print(timings)

timings.plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')

timings[['two_pass', 'one_pass', 'w_pure_pandas']].plot()
plt.xlabel('multiplier for lines of text')
plt.ylabel('runtime (s)')
plt.grid(True)
plt.show()
plt.close('all')
