I’ve got a PHP + MySQL website that shows data for 300,000 products: name, description, how to fix it, where to buy it, etc.
Initially, the MySQL table was designed in the following way:
- product_id: MEDIUMINT
- category_id: SMALLINT
- weight_kg: SMALLINT
- height_cm: SMALLINT
- width_cm: SMALLINT
- depth_cm: SMALLINT
- name: VARCHAR(100)
- label: VARCHAR(100)
- short_description: VARCHAR(200)
- long_description: VARCHAR(9999)
- how_to_fix_it: VARCHAR(9999)
- where_to_buy_it: VARCHAR(9999)
- similar_products: VARCHAR(9999) <--- it stores "ACME 12|acme-12#ACME 555|acme-555" to link directly to other products, avoiding subsequent queries to find the names and labels of the similar products.
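For reference, this is roughly how I split that field in PHP when rendering the “similar products” links (just a sketch; parseSimilarProducts is a made-up helper name):

<?php
// Split "ACME 12|acme-12#ACME 555|acme-555" into name/label pairs.
// '#' separates products, '|' separates a product's name from its label.
function parseSimilarProducts(string $raw): array
{
    $pairs = [];
    foreach (array_filter(explode('#', $raw)) as $entry) {
        [$name, $label] = array_pad(explode('|', $entry, 2), 2, '');
        $pairs[] = ['name' => $name, 'label' => $label];
    }
    return $pairs;
}

// [['name' => 'ACME 12', 'label' => 'acme-12'], ['name' => 'ACME 555', 'label' => 'acme-555']]
$links = parseSimilarProducts('ACME 12|acme-12#ACME 555|acme-555');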
I’ve found that the size of the table is huge, mainly due to the text fields (the VARCHAR(9999) columns), and this can slow down queries and the website. I think these VARCHAR(9999) fields do not need to live inside the table, since I do not perform SQL operations on them; I only read them when displaying a single product.
I’m considering creating 300,000 text files (one per product, named product_id.txt) for PHP to read, each one storing the text information (one field per line):
long_description: blah blah
how_to_fix_it: blah blah
where_to_buy_it: blah blah
similar_products: blah blah
Each time I display information for a single product, PHP would read that product’s text file from disk, which I think would be faster.
I would appreciate hearing about any similar experience with these issues. Is there a smart way to store the 300,000 files? I was thinking about 300 subdirectories (‘000’, ‘001’, …, ‘299’) with 1,000 files in each, to make reads faster.
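The read side would look roughly like this (only a sketch of the idea; products_txt and the modulo-300 bucketing are the hypothetical layout described above):

<?php
// Hypothetical layout: products_txt/<bucket>/<product_id>.txt, where <bucket>
// ('000' .. '299') spreads the 300,000 files over 300 subdirectories.
function readProductText(int $productId, string $baseDir = 'products_txt'): array
{
    $bucket = str_pad((string) ($productId % 300), 3, '0', STR_PAD_LEFT);
    $path   = sprintf('%s/%s/%d.txt', $baseDir, $bucket, $productId);

    $fields = [];
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$key, $value] = array_pad(explode(':', $line, 2), 2, '');
        $fields[trim($key)] = trim($value);   // e.g. 'long_description' => 'blah blah'
    }
    return $fields;
}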
2 Answers
Unless you are padding those strings out to 9999 characters, your VARCHAR(9999) does not use more space to store the 33-character string ‘ACME 12|acme-12#ACME 555|acme-555’ than a VARCHAR(256) would use. That’s the VAR part: it stores a variable-length string in a variable amount of space, but no more space than needed. The 9999 only allows longer strings; it doesn’t use that much space for every short string.
You can verify the average row length using
SHOW TABLE STATUS LIKE 'mytable'\G
in the MySQL client. That reports an estimate of the row count and the average row size, based on a sample of the pages in your table. It’s approximate, but it’s usually pretty close. I predict it will show that your rows are not as large as you think they are.

Your proposed file-based solution will not be faster for what I expect your typical queries to be. Reading files from disk is slower than reading from RAM by several orders of magnitude, and MySQL caches pages of your database in RAM as much as possible. It’s pretty good at this, and it represents many years of fine-tuning by hundreds of engineers. I doubt you can do better in your lifetime (I’m not insulting you; I don’t think any single developer can do that much work alone).
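If you would rather check the row size from PHP than from the mysql client, the same estimates are exposed through information_schema (a sketch, assuming PDO and that the table really is called mytable; adjust the DSN and credentials):

<?php
// TABLE_ROWS, AVG_ROW_LENGTH and DATA_LENGTH are estimates, like SHOW TABLE STATUS.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'password');

$stmt = $pdo->prepare(
    'SELECT TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH
       FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = ?'
);
$stmt->execute(['mytable']);
$status = $stmt->fetch(PDO::FETCH_ASSOC);

printf("~%d rows, ~%d bytes per row, ~%.1f MB of data\n",
    $status['TABLE_ROWS'], $status['AVG_ROW_LENGTH'], $status['DATA_LENGTH'] / 1048576);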
That said, it might be faster for certain specific uses of the data (for example counting products). You haven’t described any queries, so we can’t guess how you will use the data.
Any optimization strategy depends on which queries you will be executing, because every optimization strategy optimizes for one type of query at the expense of making others less optimal.
By all means, it’s a good learning experience to implement your own database, and make it as reliable and flexible as an RDBMS. You will learn a lot.
Yes — in a database! 🙂
It depends on the filesystem. Some have a big performance problem with hundreds of thousands of files. I did once implement a system to collate thousands of files into a hierarchy of subdirectories … that was in 1992, but I dare say filesystems have advanced since then! Modern filesystems should be able to handle much greater numbers of files.
It’s not necessarily a win to separate your data into 300,000 files. What if you need to query the average size of your products? You’d have to open all 300,000 files and read every one of them. Have you measured the overhead of opening that many files on your operating system? If that’s a query you need to execute and opening that many files is too time-consuming, wouldn’t it be better to keep the data together in a single place?
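To make that concrete, here is what an “average size” question costs in each design (a hedged sketch; products and weight_kg come from your schema, and I’m assuming purely for the sake of the example that the numeric field had been moved into the per-product files too):

<?php
// Design 1: the data is in the table, so the server does the aggregation.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'password');
$avgDb = $pdo->query('SELECT AVG(weight_kg) FROM products')->fetchColumn();

// Design 2: if that field lived in the per-product files, PHP would have to
// open and scan every one of the 300,000 files to answer the same question.
$sum = 0;
$count = 0;
foreach (glob('products_txt/*/*.txt') as $path) {
    foreach (file($path, FILE_IGNORE_NEW_LINES) as $line) {
        if (strncmp($line, 'weight_kg:', 10) === 0) {
            $sum += (float) trim(substr($line, 10));
            $count++;
        }
    }
}
$avgFiles = $count ? $sum / $count : 0.0;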
Can you even open that many files from a single process? On Linux, for example, the number of file descriptors a process can hold open at once is limited by the ulimit setting on your operating system. What is your ulimit value?
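You can check the limit from PHP itself (a sketch; posix_getrlimit() needs the POSIX extension, hence the shell fallback):

<?php
// Report the per-process open-file limit this PHP process runs under.
if (function_exists('posix_getrlimit')) {
    $limits = posix_getrlimit();
    printf("open files: soft %s, hard %s\n", $limits['soft openfiles'], $limits['hard openfiles']);
} else {
    echo 'open files (soft limit): ' . trim((string) shell_exec('ulimit -n')) . "\n";
}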
This is what I mean by the optimization depends on which queries you need.
Look at it this way. Fetching stuff from disk involves these steps:
1. Locate the file (directory lookup).
2. Open the file.
3. Find the desired data within it (seek to the right position).
4. Read the block(s) containing the data from disk.
5. Close the file.
For a file on disk, steps 1 through 5 are all needed. Steps 1 and 2 might be cached by the OS.
For a row in a database table, the table is already open, so 1 and 2 are already done (except when you first start the system). Steps 3 and 4 become "drill down the BTree to the row desired", and 3 is likely to be cached.
Using a database is a slam-dunk winner.
If you are having performance problems, we really need to see the queries that are so slow. Perhaps a FULLTEXT index would be very beneficial.
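For example, if the slow queries turn out to be searches inside the long text columns, it would look something like this (a sketch only; the table and column names come from the question, ft_descriptions is a made-up index name, and InnoDB needs MySQL 5.6+ for FULLTEXT):

<?php
// One-time DDL: add a FULLTEXT index over the long text columns.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'password');
$pdo->exec('ALTER TABLE products
            ADD FULLTEXT INDEX ft_descriptions (long_description, how_to_fix_it)');

// Searches can then use the index instead of scanning every row.
$stmt = $pdo->prepare(
    'SELECT product_id, name
       FROM products
      WHERE MATCH(long_description, how_to_fix_it) AGAINST (? IN NATURAL LANGUAGE MODE)
      LIMIT 20'
);
$stmt->execute(['replacement valve']);   // arbitrary example search term
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);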