I have a ~100 GB CSV file with the following columns:
sex;name;dob;hash
This file was created after some processing of another .csv file, and it can contain duplicate tuples; that’s why there is this hash column. What I need is to delete duplicates from this file based on the ‘hash’ column and save the new data into output.csv.
I’m looking for both speed and efficiency. I’ve tried uploading to PostgreSQL using Polars and then selecting only the unique rows:
df.write_database(
    table_name=input_table,
    connection=connection_for_db,
    if_table_exists='append'
)
SELECT sex, name, dob
FROM (
    SELECT DISTINCT ON (hash)
        hash, sex, name, dob
    FROM input_table
) with_hash
But the upload to the DB is way too slow!
I’ve tried to read this .csv using a Polars LazyFrame:
df = pl.scan_csv('input.csv', separator=';', infer_schema_length=0,
                 ignore_errors=True, rechunk=False)
uni_df = df.unique(subset=['hash']).drop('hash')
uni_df.sink_csv('output.csv', separator=';', maintain_order=False)
But Python takes too much memory; as a result, my PC ran out of memory and the script got killed. The last thing I’ve tried was to load and clean the data using Dask:
ddf = dsk.read_csv('input.csv', sep=';')
ddf = ddf.drop_duplicates(subset=['hash'])
ddf.to_csv('output.csv', sep=';')
It’s working, but it is also taking too much system memory. So I’m really out of options here and will be very glad if someone can help me solve this issue, as I may have bigger files in the future and would like to know how to deal with them.
Thanks in advance!
2 Answers
You can use =UNIQUE in Excel or the like (if the file opens at all in Excel); see this example: Delete row if the cell name already exists in previous row in Excel. This would empty duplicate cells, and then you can remove all rows where a given cell is empty; see https://superuser.com/questions/474530/excel-delete-row-if-cell-in-certain-column-is-blank.
If this is not feasible for you, then you can do inserts into the database, but I would not do it with distinct values. Instead, I would create a unique index in the database table on your hash and do an insert which does nothing when the constraint is violated; see the example sketched below. If the hash is computed based on these three fields, then you probably do not need a hash at all, but a unique(sex, name, dob) constraint in your target database.
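For instance, a minimal sketch of the index-plus-insert idea, assuming a PostgreSQL target and psycopg2; the connection string, table name, and sample row are placeholders:

import psycopg2

# Placeholder connection string; replace with your own DSN.
conn = psycopg2.connect("postgresql://user:password@localhost/mydb")
cur = conn.cursor()

# A unique index on hash lets the database reject duplicates for us.
cur.execute("""
    CREATE TABLE IF NOT EXISTS input_table (
        sex  text,
        name text,
        dob  text,
        hash text
    );
""")
cur.execute("CREATE UNIQUE INDEX IF NOT EXISTS input_table_hash_idx ON input_table (hash);")

# Inserts that hit an already-seen hash are silently skipped.
cur.execute("""
    INSERT INTO input_table (sex, name, dob, hash)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (hash) DO NOTHING;
""", ("F", "Jane Doe", "1990-01-01", "abc123"))  # example row

conn.commit()
cur.close()
conn.close()

For 100 GB of data you would batch the inserts (executemany, or COPY into a staging table and INSERT ... ON CONFLICT from there), but the unique constraint is what does the de-duplication.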
I’m assuming that the crux of the issue is that even maintaining a set() of hashes might not be practical, as a 100 GB file might potentially hold 25 GB or more of hex-string hash data.
If this is the case, we can take our original 100 GB file and break it into chunks (temp files) based on a partial value of each row’s hash. Then, once we have the set of temp files, we just iterate over them to construct the final distinct file, using only about 1/16th of the memory to maintain the set() at any one time.
Note: VERY lightly tested.
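A minimal sketch of that two-pass approach, assuming the layout from the question (a sex;name;dob;hash header, ';' as the separator, a hex hash) and illustrative file names:

import csv

INPUT = "input.csv"            # illustrative paths
OUTPUT = "output.csv"
BUCKETS = "0123456789abcdef"   # one temp file per leading hex digit of the hash

# Pass 1: split rows into 16 temp files keyed on the first character of the hash.
bucket_files = {c: open(f"bucket_{c}.csv", "w", newline="") for c in BUCKETS}
bucket_writers = {c: csv.writer(f, delimiter=";") for c, f in bucket_files.items()}

with open(INPUT, newline="") as src:
    reader = csv.reader(src, delimiter=";")
    header = next(reader)                 # sex;name;dob;hash
    hash_idx = header.index("hash")
    for row in reader:
        bucket_writers[row[hash_idx][0].lower()].writerow(row)

for f in bucket_files.values():
    f.close()

# Pass 2: dedupe one bucket at a time, so the set() only ever holds ~1/16th of
# the hashes, and drop the hash column on the way out.
with open(OUTPUT, "w", newline="") as dst:
    writer = csv.writer(dst, delimiter=";")
    writer.writerow([c for c in header if c != "hash"])
    for c in BUCKETS:
        seen = set()
        with open(f"bucket_{c}.csv", newline="") as src:
            for row in csv.reader(src, delimiter=";"):
                h = row[hash_idx]
                if h not in seen:
                    seen.add(h)
                    writer.writerow(row[:hash_idx] + row[hash_idx + 1:])

Rows whose hash does not start with a hex digit would need an extra catch-all bucket, and the temp files can be deleted once the output is written.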