I am writing Python code that automatically processes and combines CSV datasets.
Each folder I access contains two datasets, for example
- download/2019/201901/dragon.csv
- download/2019/201901/kingdom.csv
and the file names are the same across all folders. In other words, every folder has two datasets with the names above.
The ‘download’ folder contains four folders, 2019, 2020, 2021, and 2022, and
each year's folder contains one folder per month, e.g., 2019/201901, 2019/201902, and so on.
In this situation, I want to process only the ‘dragon.csv’ files. How can I do that? My current code is
    import os
    import pandas as pd
    import numpy as np

    path = 'download/2019'
    save_path = 'download'

    class Preprocess:
        def __init__(self, path, save_path):
            self.path = path
            self.save_path = save_path

after finishing processing,

    def save_dataset(path, save_path):
        for dir in os.listdir(path):
            for file in os.listdir(os.path.join(path, dir)):
                if file[-3:] == 'csv':
                    df = pd.read_csv(os.path.join(path, dir, file))
                    print(f'Reading data from {os.path.join(path, dir, file)}')
                    print('Start Preprocessing...')
                    df = preprocessing(df)
                    print('Finished!')
                    if not os.path.exists(os.path.join(save_path, dir)):
                        os.makedirs(os.path.join(save_path, dir))
                    df.to_csv(os.path.join(save_path, dir, file), index=False)

    save_dataset(path, save_path)
2 Answers
You can use pathlib’s glob method:
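A minimal sketch of such a call, assuming the folder layout from the question. The recursive pattern `**/dragon.csv` and the helper name `find_dragon_files` are assumptions made for illustration, not necessarily the exact form the answer had in mind:

```python
from pathlib import Path


def find_dragon_files(root):
    """Return a lazy iterator over every dragon.csv at any depth below root.

    Hypothetical helper: the '**' in the pattern makes the search recursive,
    so it covers download/<year>/<month>/dragon.csv for every year and month.
    """
    return Path(root).glob("**/dragon.csv")


# Nothing is read from disk until the iterator is consumed,
# e.g. with: for csv_path in dragons_paths: ...
dragons_paths = find_dragon_files("download")
```

Each item produced is a `pathlib.Path`, which `pd.read_csv` accepts directly, so the loop body from the question could stay largely unchanged.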
dragons_paths contains a generator that will yield the path of every dragon.csv file under the download folder.
PS. You should avoid shadowing the built-in dir; maybe call your variable dir_ or d instead.

If I understand your question, you only want to process files whose names include the substring "dragon". You could do this by adding a condition to your if-statement. So instead of writing
    if file[-3:] == 'csv'
simply write
    if file[-3:] == 'csv' and 'dragon' in file
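Dropped into the question's nested loop, that check might look like the sketch below. The function name `list_dragon_csvs` and the `os.path.isdir` guard are illustrative additions, not part of the original code. Note that `'dragon' in file` is a substring test, so it would also match a name such as `dragonfly.csv`; comparing with `file == 'dragon.csv'` is stricter if only that exact file is wanted.

```python
import os


def list_dragon_csvs(path):
    """Collect CSV files whose name contains 'dragon', one folder level below path.

    Hypothetical helper mirroring the loop in save_dataset: path is a year
    folder such as 'download/2019', and each sub-folder is a month.
    """
    matches = []
    for dir_ in sorted(os.listdir(path)):  # dir_ avoids shadowing the built-in dir
        folder = os.path.join(path, dir_)
        if not os.path.isdir(folder):
            continue  # skip stray files sitting next to the month folders
        for file in sorted(os.listdir(folder)):
            if file[-3:] == 'csv' and 'dragon' in file:
                matches.append(os.path.join(folder, file))
    return matches
```

The returned paths can then be read and preprocessed one by one, exactly as the original loop does for every CSV.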