
I have a not-so-massive table of about 500k records.
It comes as a JSON file.
I have to load it, parse it, and flatten each row (each row is a dict).
This takes quite some time, but it is manageable (around 4-5 minutes).

However, at some point I need to take all these rows and uj them:

    (uj/) enlist each row    / this takes quite a bit
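
For context, a rough sketch of what such a pipeline might look like in q (the file name, the use of .j.k, and the flattenRow helper are illustrative assumptions, not the actual code):

    raw:.j.k raze read0 `:data.json      / one dictionary per record (assumed parse step)
    flat:flattenRow each raw             / flattenRow: hypothetical per-row flattener
    tab:(uj/) enlist each flat           / the slow step: ~500k one-row tables unioned pairwise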

I only have 2 cores to use in dev and maybe 4 in prod/dr

Using 2 cores saved some time, but not enough to motivate me to update our infrastructure.

Is there anything I am doing wrong?
Anything I am missing?

I know you could use some data, but creating a synthetic table won’t be much help.

Should I maybe consider starting 10 new secondary processes, splitting the ~500k rows into 50k chunks, and passing those chunks to the processes to be handled?

This table has 492000 records and 103 columns.

2 Answers


  1. When working with a table of around 500k records and 103 columns, parsing and flattening JSON into a manageable format can be time-consuming, especially with limited computational resources. Given that you’ve already tried using 2 cores, and the time saved wasn’t significant, it’s understandable to wonder if there’s a more efficient approach.
    
    Here are a few thoughts:
    
    Parallel Processing: Splitting the data into smaller chunks (e.g., 50k records) and processing them in parallel using multiple processes could help. If you can spin up 10 secondary processes, that could speed things up. However, keep in mind that there’s overhead in managing multiple processes, and the actual speedup depends on how well your code scales with parallelism. For 2 cores in dev, parallelism might not yield massive gains, but in prod with 4 cores, it could be more effective.
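
    As an illustration, a minimal q sketch of the chunked approach, assuming q is started with secondary threads (e.g. -s 2) and that flat is the list of flattened row dictionaries; the names and the thread-based flavour of secondaries are assumptions:

        chunks:50000 cut flat                       / ~10 chunks of 50k rows each
        parts:{(uj/) enlist each x} peach chunks    / build one table per chunk in parallel
        result:(uj/) parts                          / union the handful of chunk tables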
    
    Optimizing the Parsing/Flattening: If the bottleneck is in parsing and flattening the JSON, see if you can optimize that step. Using libraries like pandas or dask in Python, which are optimized for handling larger datasets, could provide a speed boost. Sometimes, just changing the way data is processed (e.g., using vectorized operations) can make a big difference.
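
    In q terms, the vectorized idea is to work column-at-a-time rather than row-at-a-time. A toy comparison on synthetic, fully conformant records (real rows may differ):

        rows:10000#enlist `a`b`c!(1;2.0;`x)                       / 10k identical toy records
        \t (uj/) enlist each rows                                 / row-at-a-time: repeated pairwise unions
        \t flip key[first rows]!{rows[;x]} each key first rows    / column-at-a-time: one gather per column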
    
    Infrastructure Consideration: While optimizing the code is crucial, sometimes the problem is simply resource-bound. If you’re consistently working with larger datasets and performance is a recurring issue, it might be worth considering an infrastructure upgrade, even if the gains seem modest. Dev might not need the power, but production environments often justify the expense.
    
    In summary, splitting the workload across more processes could help, but ensure the overhead doesn’t outweigh the benefits. Also, consider optimizing the JSON parsing/flattening process itself or re-evaluating your infrastructure if this is a frequent task.
    
  2. A few questions: How exactly are you parsing the JSON file? Are you using the built-in JSON parser, https://code.kx.com/q/ref/dotj/? Are the dictionaries conformant, i.e. do they all have the same keys? If so, you don’t have to enlist them: a list of conformant dictionaries is basically already a table. You can read more about dictionaries and tables here: https://www.defconq.tech/docs/concepts/dictionariesTables
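
    A minimal illustration of that point, assuming the file is a JSON array of uniform objects and is parsed with the built-in .j.k:

        j:"[{\"a\":1,\"b\":\"x\"},{\"a\":2,\"b\":\"y\"}]"      / toy JSON payload
        t:.j.k j                                               / list of conformant dictionaries, i.e. already a table
        t~(uj/) enlist each t                                  / 1b here: the enlist/uj fold just rebuilds what .j.k returned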
