
I am writing Python code that automatically processes and combines CSV datasets.
Each folder I access contains two datasets, for example:

  • download/2019/201901/dragon.csv
  • download/2019/201901/kingdom.csv

The file names are the same across all folders; in other words, every folder contains two datasets with the names above.
The ‘download’ folder contains four folders (2019, 2020, 2021, 2022), and
each year folder contains one folder per month, e.g., 2019/201901, 2019/201902, and so on.
In this situation, I want to process only the ‘dragon.csv’ files. How can I do that? My current code is:

import os
import pandas as pd
import numpy as np

path = 'download/2019'
save_path = 'download'

class Preprocess:
    
    def __init__(self, path, save_path):  
        self.path = path
        self.save_path = save_path

After processing is finished,

def save_dataset(path, save_path):

    for dir in os.listdir(path):
        for file in os.listdir(os.path.join(path, dir)):
            if file[-3:] == 'csv':
                df = pd.read_csv(os.path.join(path, dir, file))
                print(f'Reading data from {os.path.join(path, dir, file)}')

                print('Start Preprocessing...')
                df = preprocessing(df)
                print('Finished!')
                
                if not os.path.exists(os.path.join(save_path, dir)):
                    os.makedirs(os.path.join(save_path, dir))
                df.to_csv(os.path.join(save_path, dir, file), index=False)

save_dataset(path, save_path)

2 Answers


  1. You can use pathlib’s glob method:

    from pathlib import Path
    
    p = Path()  # leave empty if you run from the folder containing `download`, otherwise point to that folder
    
    dragons_paths = p.glob("download/**/dragon.csv")
    

    dragons_paths is a generator that yields the path of every dragon.csv file under the download folder, at any depth.

    PS. You should avoid shadowing the built-in dir; call your loop variable dir_ or d instead.
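As a minimal sketch of how the glob result could feed the loop from the question: the helper names `find_named_files` and `mirrored_target` are made up for illustration, and mirroring the year/month layout under the save folder is an assumption about the desired output, not something the question specifies.

```python
from pathlib import Path


def find_named_files(root, name="dragon.csv"):
    """Return every file called `name` anywhere below `root`, sorted by path."""
    # "**" matches any number of nested folders (year, month, ...).
    return sorted(p for p in Path(root).glob(f"**/{name}") if p.is_file())


def mirrored_target(src, root, save_root):
    """Map a source file to the same relative location under `save_root` (assumed layout)."""
    return Path(save_root) / Path(src).relative_to(root)
```

Each returned path can be passed straight to pd.read_csv, and the target's parent folder can be created with target.parent.mkdir(parents=True, exist_ok=True) before calling df.to_csv.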

  2. If I understand your question, you only want to process files whose names include the substring "dragon". You can do this by adding a condition to your if statement: instead of writing if file[-3:] == 'csv', write if file[-3:] == 'csv' and 'dragon' in file. Note that this also matches any other CSV file with "dragon" in its name.
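The extra condition can be sketched as below, keeping the os.listdir walk from the question. The function names `is_dragon_csv` and `list_dragon_files` are hypothetical, and `endswith(".csv")` is swapped in for the `file[-3:] == 'csv'` slice; comparing with `file == 'dragon.csv'` would be a stricter alternative if only that exact name should match.

```python
import os


def is_dragon_csv(filename):
    """True for CSV files whose name contains 'dragon', e.g. 'dragon.csv'."""
    return filename.endswith(".csv") and "dragon" in filename


def list_dragon_files(path):
    """Walk path/<subfolder>/ one level deep, as the loop in the question does."""
    matches = []
    for d in sorted(os.listdir(path)):
        folder = os.path.join(path, d)
        if not os.path.isdir(folder):
            continue  # skip stray files sitting next to the month folders
        for file in sorted(os.listdir(folder)):
            if is_dragon_csv(file):
                matches.append(os.path.join(folder, file))
    return matches
```

With path = 'download/2019', this would return only the dragon files from each month folder, and each one can then be read and saved as in the original save_dataset.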
