I am building a web app for a client who has ~1M rows of data in their MySQL database. The data looks approximately like this:
Products
+---------+-----------+---------+-------------+---------+----------+
| Id | Name | Size | Colour | 4 Other attributes |
+---------+-----------+---------+-------------+---------+----------+
| 1 | product 1 | S,M,L | red, green | ... other options |
| 2 | product 2 | XL | blue, green | ... other options |
| ................................................................ |
On the Products page in my app, there is:
- Sidebar column that holds filters (6 filter groups from the attributes above with their respective options)
- Main area where users see the product results (50 at a time with paging)
When the user clicks on a filter option (e.g. colour=blue), I a query to update the results in the main area (e.g. all products that have colour=blue).
What query can I perform to fetch the remaining filter options for the other filter groups?
For example, if the user selects colour=blue, that will reduce the number of options in other filter groups such as size (if you look at the sample data above, the only option for size that will remain will be XL).
I can’t get the filter options from all results of the colour=blue query, because there are 100k+ of these and I only fetch 50 at a time.
Step 1: All options available.
Sizes
[ ] S
[ ] M
[ ] L
[ ] XL
Colours
[ ] red
[ ] green
[ ] blue
Step 2: User selects "blue", sizes options are updated.
Sizes
[ ] XL
Colours
[ ] red
[ ] green
[x] blue
Here is an example from another site doing something similar. (This site also has millions of products listed.)
Step 1: Initial load, no filters selected.
Step 2: One of the filter options is selected. The results on the right update and the filter options in the other filter groups in the sidebar update too.
3
Answers
The question "What query can I perform to fetch the remaining filter options for the other filter groups?"
Short answer
Use a brute force query that fetches the data and tallies the results.
Long Answer
Here are some tips on what needs to be done. It provides some suggestions on improving performance.
PRIMARY KEY
and (2) limit the heavy queries that will follow.The performance will not be great. But I suggested some things to avoid a big table scan:
Maybe…
For counts (left side of page)
For the first 20 to show (main part of page)
This is, however, complicated by not needing/wanting a column
color
for CDs, nor a columngenre
for most other items. I would solve this in either of 2 ways:100 search tables (one per category), each with appropriate search categories as columns (with suitable datatypes).
A single table with most of the search criteria in a JSON string.
Notes:
In either approach, there would probably be a column for
price
, assuming that most categories need a filter on such.In either approach, the filtering by table or by PK would shrink down the search to about 1/100 of the rows.
A third option is to
PARTITION BY
either KEY or LIST and have 100 partitions. Perhaps fewer; for example, various types of clothing might work well in the same grouping. Or media (CD, etc) or vehicles (cars, trucks).To restrict the list of sizes if the user selects a colour that does not have all sizes, generate a filter query that holds all possible combinations.
You would get back JSON that looks like:
On the client have an array of selected colours and an array of selected sizes. When these get updated redraw the filter lists using filterOptions.filter(f => {..only give me f where both the fields exist in their respective filter arrays}) and then use this filtered array to filter again into the relevant dimensions. If you have more than two dimensions this can grow quite quickly so if the dimension are different depending on product type you might need another layer of abstraction to make the filtering generic.
Perform 2 queries, calling the same stored routine. The first (param_query_mode=1) is called to get the main result set, and the second (param_query_mode=2) is called to build the new filter set. They happen simultaneously as AJAX queries.
Both queries use the same initial stage which builds a set of IDs from the product table based on the initial search criteria and the current filters. The results represent all matching records and are stored in a temp table containing just IDs.
Then the first query (param_query_mode=1) builds a result set based on the temp table but filtering out all but the required 50 (or whatever–parameter controlled). The result set contains what fields are needed to display the these 50 records.
The second query (param_query_mode=2) also builds its result set based on the temp table, but this result set contains what is needed to display the filters for all the results from stage one.
I do not have to deal with tables with millions of records, but the approach works well in my context.