
I have a ~100 GB .csv file with the following columns:

sex;name;dob;hash

This file was created after some processing of another .csv file, and it can contain duplicate tuples, which is why there is the hash column. What I need is to delete the duplicates from this file based on the ‘hash’ column and save the new data into output.csv.

I’m looking for both speed and memory efficiency. I’ve tried uploading the data to PostgreSQL using Polars and then selecting only the unique rows:

    df.write_database(
        table_name=input_table,
        connection=connection_for_db,
        if_table_exists='append'
    )

    SELECT sex, name, dob
      FROM (
          SELECT DISTINCT ON (hash)
              hash, sex, name, dob
          FROM input_table
      ) with_hash

But uploading to the DB is way too slow!
I then tried reading the .csv with a Polars LazyFrame:

    df = pl.scan_csv("input.csv", separator=';', infer_schema_length=0,
                     ignore_errors=True, rechunk=False)
    uni_df = df.unique(subset=['hash']).drop('hash')
    uni_df.sink_csv("output.csv", separator=';', maintain_order=False)

But Python takes too much memory; as a result my PC ran out of memory and the script got killed. The last thing I tried was to load and clean the data using Dask:

    import dask.dataframe as dsk

    ddf = dsk.read_csv("input.csv", sep=';')
    ddf = ddf.drop_duplicates(subset=['hash'])
    ddf.to_csv("output.csv", sep=';', single_file=True)  # single_file=True writes one file instead of one per partition

It’s working, but it also takes too much system memory. So I’m really out of options here and would be very glad if someone could help me solve this issue, as I may have even bigger files in the future and would like to know how to deal with them.

Thanks in advance!

2 Answers


  1. You can use =UNIQUE in Excel or the like (if the file opens in Excel at all); see this example: Delete row if the cell name already exists in previous row in Excel

    This would blank out the duplicate cells, and then you can remove all rows where a given cell is empty; see https://superuser.com/questions/474530/excel-delete-row-if-cell-in-certain-column-is-blank.

    If this is not feasible for you, then you can do inserts into the database, but I would not do it with DISTINCT values. Instead, I would create a unique index on your hash column in the database table and do an insert which does nothing when the constraint is violated. Example:

    INSERT INTO target_table (hash, sex, name, dob)
    SELECT hash, sex, name, dob
    FROM input_table
    ON CONFLICT (hash) DO NOTHING;
    

    If the hash is computed based on these three fields, then you probably do not need a hash at all, but rather a UNIQUE (sex, name, dob) constraint in your target database.
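
    As a minimal sketch of that flow driven from Python (purely illustrative: it assumes the raw rows were already staged into input_table, e.g. with the write_database call from the question, and it uses psycopg2 as the driver, though any PostgreSQL client works; the connection string and table names are made up):

    import psycopg2

    # Placeholder connection string -- adjust to your own database.
    conn = psycopg2.connect("dbname=mydb user=me host=localhost")

    with conn, conn.cursor() as cur:
        # Deduplicated target table; the UNIQUE constraint on hash is what
        # lets ON CONFLICT ... DO NOTHING skip duplicate rows.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS target_table (
                hash text UNIQUE,
                sex  text,
                name text,
                dob  text
            )
        """)

        # Insert everything from the staging table; rows whose hash already
        # exists violate the unique constraint and are silently dropped.
        cur.execute("""
            INSERT INTO target_table (hash, sex, name, dob)
            SELECT hash, sex, name, dob
            FROM input_table
            ON CONFLICT (hash) DO NOTHING
        """)

    conn.close()

    The deduplicated rows can then be exported back to CSV, e.g. with COPY (SELECT sex, name, dob FROM target_table) TO a server-side file, or psql’s client-side \copy equivalent.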

  2. I’m assuming that the crux of the issue is that even maintaining a set() of hashes might not be practical, as a 100 GB file might potentially have 25 GB or more of hex-string hash data.

    If this is the case, we can take our original 100 GB file and break it into chunks (temp files) based on a partial value of each row’s hash. Then, once we have that set of temp files, we just use them one at a time to construct the final distinct file, while using only about 1/16th of the memory to maintain the set().

    Note: VERY lightly tested

    import contextlib
    import csv
    import glob
    import os
    
    filename_in = "./my_huge_file.csv"
    filename_out = "./my_huge_file.distinct.csv"
    scratch_folder = "./scratch"
    
    ## ----------------------
    ## Ensure a clean slate
    ## ----------------------
    if os.path.exists(scratch_folder):
        raise FileExistsError(f"The folder {scratch_folder} already exists.")
    os.mkdir(scratch_folder)
    ## ----------------------
    
    ## ----------------------
    ## Split our monster file into chunks based on the last character in "hash"
    ## ----------------------
    writers = {}
    with contextlib.ExitStack() as stack:
        reader = csv.reader(stack.enter_context(open(filename_in, "r", newline="")), delimiter=";")
        header = next(reader)  # skip the sex;name;dob;hash header row so it is not bucketed as data
        for row in reader:
            key = row[3][-1]  # the last character of the hash
    
            ## ----------------------
            ## Get or create the writer for this key
            ## ----------------------
            writer = writers.setdefault(
                key,
                csv.writer(stack.enter_context(open(f"{scratch_folder}/{key}.csv", "w", newline="")), delimiter=";")
            )
            ## ----------------------
    
            writer.writerow(row)
    ## ----------------------
    
    ## ----------------------
    ## Now we apply our "seen" test to each of the scratch files
    ## in turn.
    ## ----------------------
    with open(filename_out, "w", newline="") as file_out:
        writer = csv.writer(file_out, delimiter=";")
        writer.writerow(header)  # emit the header once, at the top of the output
        for filename in glob.glob(f"{scratch_folder}/*.csv"):
            seen_hashes = set()
            with open(filename, "r") as file_in:
                for row in csv.reader(file_in, delimiter=";"):
                    key = row[3]
                    if key not in seen_hashes:
                        seen_hashes.add(key)
                        writer.writerow(row)
            os.remove(filename)
    ## ----------------------
    
    ## ----------------------
    ## clean up the scratch folder
    ## ----------------------
    os.rmdir(scratch_folder)
    ## ----------------------
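
    If 16 scratch files still leave each per-file set() too big (for even larger inputs), the same idea scales by bucketing on a longer hash suffix. Assuming the hashes are hex strings, keying on the last two characters gives up to 256 scratch files and only about 1/256th of the hash data in memory at a time; the only line that changes is the bucketing key:

    key = row[3][-2:]  # last two hex characters -> up to 256 buckets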
    