I want to store 10,000 sudoku puzzles, along with their solutions, so that’s 20,000 x 81 digits.
This is just for storage, I will parse it into something else to actually work with the data, so it can be obfuscated if necessary.
If I store a puzzle as a string like:
".........4...56.....6...95..4...8....925..8.15..19.4.23...7..9.6.9.....8.8....1.."
it will have 81 chars and be 162 bytes with default UTF-16 encoding. That’s 3.2mb of data in total.
Another option is to store a puzzle as an array of 9 (64-bit) numbers, like:
[
000000000,
000800050,
810020304,
380001060,
600009007,
004000000,
500400002,
400000000,
762005900
]
That is 72 bytes per puzzle, or 1.4mb total.
What better ways are there to store sudokus?
2
Answers
The raw file size isn’t all that huge compared to modern media so you might want to think about how small does it really need to be. Pure text based you can swap .. for : use other symbols for longer runs and also / for blanks to end of line. Take a look at FEN chess puzzle encoding for inspiration.
Provided that your Sudoku puzzles will all have one unique correct answer you don’t need to store the solutions at all! A fairly simple bitwise logic algorithm will solve clean Sudoku puzzles which have just one unique solution. They get stuck if there is a loop of values that could be either A or B and either choice is good or an unresolved loop A,B,C where only one choice is good. That can be programmed around but handling ambiguous special cases can get complicated.
Since you need to store 10 states per cell BCD decimal would be an obvious natural choice if you wanted to choose a machine supported format. That gets you down to 40.5 bytes per puzzle with a simple fixed length record (or 81 bytes to store a puzzle solution pair if that is really what you want to do).
If you are prepared to work a bit harder you could store them as 9x uint32 in 36 bytes with the digit position being significant (so 72 bytes per pair). You would have to be careful to rotate them so that no digits above 3 occur in the leading digit position though!
After that then you are into compression and there isn’t much that can be done with the solutions as they will have same frequency of every symbol and in random orders. The start positions will a priori have an insanely large number of blanks so you can exploit this and encode it as a single 0 bit and then allocate 1bbbb (or you could Huffman code it) but my guess is that the frequency of symbols 1 through 9 will be very similar. That should get the typical puzzle down to about 16 bytes or so.
I think uint32 might be the least painful way and not human readable but beware of overflowing maxint!
Each 3×3 square has a permutation of all 9 digits, and there are only 362880 of those, and that fits in 19 bits. Each digit may be shown in the puzzle or not, so you need one more bit for each of them, for a total of 28 bits.
Since strings in JavaScript are stored in UTF-16, you need two characters to store 28 bits, so you can fit each 3×3 square into 2 characters. That’s 18 characters per puzzle, or 36 bytes each, in a format that’s pretty easy to deal with.