Extract rows from CSV based on column data - CentOS

cuddlydingo
March 25, 2021

I have a report that is generated at the beginning of each month, in .csv format. Currently, the report contains a series of columns with assorted data; one of the columns is an ‘add_date’ field containing data in "YYYY-mm-dd HH:MM:SS" format.

My end goal is to parse this source CSV so that only rows containing ‘add_date’ cells with dates from the previous month remain. So for example, if the script were run on February 1st 2021, only the rows containing dates from January 2021 would remain in the output CSV file.

This is an example of the source CSV contents:

Name,Data1,add_date
jasmine,stuff ,2021-01-26 17:29:46
ariel,things,2021-01-26 17:48:04
ursula,foo,2016-11-02 19:32:09
belle,bar,2016-01-21 18:47:33

and this is the Python script I have so far:

#!/usr/bin/env python3

import csv
filtered_rows = []

with open('test123.csv', newline='') as csvfile:
    rowreader = csv.reader(csvfile, delimiter=',')
    for row in rowreader:
        if row["2021-01"] in csvfile.add_date:
            filtered_rows.append(row) 
        print(filtered_rows)

which I call with the following command:

./testscript.py > testfile.csv

Currently, when I run the above command I am greeted with the following error message:

Traceback (most recent call last):
  File "./testscript.py", line 9, in <module>
    if row["2021-01"] in csvfile.add_date:
TypeError: list indices must be integers or slices, not str

My current Python version is Python 3.6.4, running in CentOS Linux release 7.6.1810 (Core).

Tags: csv python

Answers

- Arturo
- March 25, 2021 at 11:46 pm
- 0 votes
If I understood correctly, you can do something like this:
```
import pandas as pd

df = pd.read_csv('test.csv', sep=',', header=0)
df['add_date'] = pd.to_datetime(df['add_date'])
# Half-open range [Jan 1, Feb 1) so timestamps during January 31 are kept.
filtered = df[(df.add_date >= pd.Timestamp('2021-01-01')) & (df.add_date < pd.Timestamp('2021-02-01'))]
```

To do this properly you need to determine the previous month and year, then compare them to the add_date field of each row. The year matters for the December → January transition, and because the file can contain rows from the same month in other years.

Here’s what I mean.

import csv
import datetime

filename = 'test123.csv'
ADD_DATE_COL = 2

# Determine previous month and year.
first = datetime.date.today().replace(day=1)
last = first - datetime.timedelta(days=1)
previous_month, previous_year = last.month, last.year

# Extract rows for previous month.
filtered_rows = []
with open(filename, newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    next(reader)  # Skip the header row.
    for row in reader:
        add_date = datetime.datetime.strptime(row[ADD_DATE_COL], '%Y-%m-%d %H:%M:%S')
        if add_date.month == previous_month and add_date.year == previous_year:
            filtered_rows.append(row)

print(filtered_rows)
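
Since the question redirects the script's output into testfile.csv, note that print(filtered_rows) writes a Python list literal rather than CSV. Here is a minimal sketch of one way to emit real CSV (header included) instead. It assumes the column is named add_date as in the sample; the function name filter_previous_month and its today parameter are hypothetical, added only so the example can be run against a fixed date.

```python
import csv
import datetime
import io


def filter_previous_month(infile, outfile, today=None):
    """Copy the header plus the rows whose add_date is in the month before `today`."""
    today = today or datetime.date.today()
    # Last day of the previous month: first of this month minus one day.
    last = today.replace(day=1) - datetime.timedelta(days=1)

    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames,
                            lineterminator='\n')
    writer.writeheader()
    for row in reader:
        add_date = datetime.datetime.strptime(row['add_date'],
                                              '%Y-%m-%d %H:%M:%S')
        if (add_date.year, add_date.month) == (last.year, last.month):
            writer.writerow(row)


# Demo on the sample data from the question, pretending today is 2021-02-01.
sample = io.StringIO(
    "Name,Data1,add_date\n"
    "jasmine,stuff ,2021-01-26 17:29:46\n"
    "ariel,things,2021-01-26 17:48:04\n"
    "ursula,foo,2016-11-02 19:32:09\n"
    "belle,bar,2016-01-21 18:47:33\n"
)
out = io.StringIO()
filter_previous_month(sample, out, today=datetime.date(2021, 2, 1))
print(out.getvalue(), end='')
```

For the real report, the same function could be called with open('test123.csv', newline='') as the input and sys.stdout as the output, so that ./testscript.py > testfile.csv produces a valid CSV file.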

I got the basic idea of how to determine the date of the previous month from @bgporter’s answer to the question "How to determine date of the previous month?".
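
To see why the year has to be carried along, the same trick can be wrapped in a small function and tried on fixed dates. This is only a sketch; previous_month is a hypothetical helper name, not something from the standard library.

```python
import datetime


def previous_month(today):
    """Return (year, month) of the month before the one containing `today`."""
    first = today.replace(day=1)                # first day of the current month
    last = first - datetime.timedelta(days=1)   # last day of the previous month
    return last.year, last.month


print(previous_month(datetime.date(2021, 2, 1)))   # (2021, 1)
print(previous_month(datetime.date(2021, 1, 15)))  # (2020, 12), year rolls back
print(previous_month(datetime.date(2020, 3, 31)))  # (2020, 2)
```

Stepping back one day from the first of the month avoids any special-casing of January or of month lengths.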
