This is more of a thought experiment than something I’m actually working on. But I have a script that is processing a very large (300 GB) text file, and I was thinking of ways I might make it multithreaded. You can ‘hack’ your way to multithreading easily enough, as long as your primary script can call some child scripts and those scripts can send messages to each other. Short messages indicating state can easily be done by just creating blank files with touch() and having the parent check for them with file_exists().
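For example, a minimal sketch of that flag-file signalling (the worker script name and flag path are made up for illustration):

```php
<?php
// Parent: start a worker in the background, then poll for its "done" flag.
$flag = '/tmp/worker1.done';
@unlink($flag);
exec('php worker1.php > /dev/null 2>&1 &');   // launch the child without blocking

while (!file_exists($flag)) {
    usleep(100000); // poll roughly ten times a second
}
echo "worker1 finished\n";

// The worker (worker1.php) would end with something like:
//   touch('/tmp/worker1.done');
```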
Anyway, to the point… in order for this particular setup to work, I’d still need the parent script to be able to send some large chunks of data (figure between 10 MB and 100MB) to child scripts. Are there any good ways of doing this?
-
My first thought was to just include the "payload" data in the exec() call as a command-line argument. But it turns out the argument-length limit on my OS is about 250 KB, and that would not be ideal: passing the data in pieces that small would cause too much disk IO, and things would likely bog way down.
-
The second option I thought of was just writing the payload data to a separate file, so the child script could read it and delete the temp file once it starts processing. This would also significantly increase disk IO, though, because you’re reading and writing every bit of data twice. That would be less of an issue on a super-fast SSD, but in my case I’m actually working off a hard drive, so limiting IO is beneficial.
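A rough sketch of that hand-off (the child script name and the $payload variable are placeholders; the child reads the file and unlinks it once it begins processing):

```php
<?php
// Parent: write the next payload chunk to a temp file and hand its path to a child.
$chunkFile = tempnam(sys_get_temp_dir(), 'chunk_');
file_put_contents($chunkFile, $payload);   // $payload: a 10-100 MB chunk read earlier
exec('php child.php ' . escapeshellarg($chunkFile) . ' > /dev/null 2>&1 &');

// Child (child.php): pick the chunk up, remove the temp file, then process it.
// $data = file_get_contents($argv[1]);
// unlink($argv[1]);
// process($data);
```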
-
The third and best way I thought of was to use SQL! Just create a memory table and use it for thread communication: load the data payloads into it and have the children scoop them up and begin processing them. I actually love this idea; however, no database is installed on the hypothetical machine I’d be running this on, so doing it without SQL would be preferred.
Are there any other ways to do this that won’t add extra IO? I guess technically I could use my second idea but create a RAM disk on the system, so those temp files are written there. But I’m hoping there’s a more direct way to send the data: at least 10 MB chunks’ worth, but ideally larger.
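For the RAM-disk variant, a small tweak to the option-2 sketch would do it; on Linux, /dev/shm is usually already a tmpfs mount, so no separate RAM disk needs to be created (paths and the child script name are again placeholders):

```php
<?php
// Same hand-off as option 2, but the temp file lives in RAM instead of on the hard drive.
$chunkFile = tempnam('/dev/shm', 'chunk_');   // /dev/shm is typically tmpfs on Linux
file_put_contents($chunkFile, $payload);      // $payload: the chunk read by the parent
exec('php child.php ' . escapeshellarg($chunkFile) . ' > /dev/null 2>&1 &');
```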
Also, I call this a thought experiment because I’m not actually making this script. I already made the normal single-threaded version, and I’m not going to go through all this work to make a multithreaded version, as this is a one-time-use script and it will finish before I’m done rewriting it 🙂 But I’m just curious how you COULD do it. I can see this being really useful nowadays, when most machines have tons of cores. If this were a script I’d be using regularly, I’d definitely go through with it.
2 Answers
This would be better suited to a scanner main process which looks through the data file and splits it into chunks just by examining the data, rather than reading it into arrays etc. The bounds of each chunk can then be passed to separate reader processes, each of which works on just its own chunk of the original file.
To make it simpler, I’ve put in some control variables: a chunk size, which is the amount of the file you want each process to handle, and a guess at a line size. The guess doesn’t have to be accurate; it just gives the scanner a part of the chunk to read when looking for the last delimiter in that chunk.
It then reads the file, skipping to the segment just before the end of the chunk where it should find a delimiter, reads that last part and scans backwards from the end for the delimiter.
This then forms the end of the record and is adjusted to give the start of the next record. Continue until there is no more data to read…
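A minimal sketch of such a scanner (the file name, chunk size and line-size guess are arbitrary placeholders):

```php
<?php
// Scanner: walk the big file and record chunk boundaries that end on a "\n" delimiter.
$file      = 'bigdata.txt';    // placeholder name
$chunkSize = 50 * 1024 * 1024; // how much each reader process should handle
$lineGuess = 4096;             // rough upper bound on the length of one line

$size  = filesize($file);
$fh    = fopen($file, 'rb');
$parts = [];
$start = 0;

while ($start < $size) {
    $end = $start + $chunkSize;
    if ($end >= $size) {
        $parts[] = [$start, $size - $start];   // last chunk: whatever is left
        break;
    }
    // Read only the tail of the chunk and scan backwards for the last delimiter.
    fseek($fh, $end - $lineGuess);
    $tail = fread($fh, $lineGuess);
    $pos  = strrpos($tail, "\n");
    if ($pos === false) {
        // No delimiter in the guess window; fall back to the raw boundary (simplification).
        $parts[] = [$start, $chunkSize];
        $start   = $end;
        continue;
    }
    $end     = ($end - $lineGuess) + $pos + 1;  // byte just after the delimiter
    $parts[] = [$start, $end - $start];         // [offset, length] of this chunk
    $start   = $end;                            // next record starts here
}
fclose($fh);
```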
To test this, use a small file and check that it’s split correctly…
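A simple check along these lines would do, reusing $file and $parts from the scanner sketch above:

```php
<?php
// Re-read each chunk and report its offset, length and line count; on a small
// test file these should line up with the record boundaries you expect.
$fh = fopen($file, 'rb');
foreach ($parts as $i => [$offset, $length]) {
    fseek($fh, $offset);
    $chunk = fread($fh, $length);
    printf("chunk %d: offset %d, length %d, lines %d\n",
           $i, $offset, $length, substr_count($chunk, "\n"));
}
fclose($fh);
```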
So the array $parts now has a list of chunks. Pass them in turn to your reader processes; they can use code similar to the test code (with an fseek to start the reading).
Maybe read the file in one script, line by line (or in chunks of a given size), and then use forks to process them, like this: PHP fork limit childs in the task ?
When the count of running processes is less than the number you want running at once, create another one and feed it the next line (or chunk) from the main script.
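A minimal sketch of that loop, assuming the pcntl extension is available (the file name, chunk size and process_chunk() function are placeholders):

```php
<?php
// Parent reads the file sequentially and forks a child for each chunk,
// never keeping more than $maxChildren workers running at once.
$maxChildren = 4;
$running     = 0;
$fh          = fopen('bigdata.txt', 'rb');    // placeholder file name

while (!feof($fh)) {
    $chunk = fread($fh, 10 * 1024 * 1024);    // next 10 MB chunk
    if ($chunk === '' || $chunk === false) {
        break;
    }
    if ($running >= $maxChildren) {
        pcntl_wait($status);                  // block until one child exits
        $running--;
    }
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: it already has $chunk in memory thanks to the fork.
        process_chunk($chunk);                // hypothetical processing function
        exit(0);
    }
    $running++;
}
while ($running-- > 0) {
    pcntl_wait($status);                      // reap the remaining children
}
fclose($fh);
```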