skip to Main Content

I am building a web app for a client who has ~1M rows of data in their MySQL database. The data looks approximately like this:

Products

+---------+-----------+---------+-------------+---------+----------+
| Id      | Name      | Size    | Colour      | 4 Other attributes |
+---------+-----------+---------+-------------+---------+----------+
|    1    | product 1 |  S,M,L  | red, green  | ... other options  |
|    2    | product 2 |  XL     | blue, green | ... other options  |
| ................................................................ |

On the Products page in my app, there is:

  • Sidebar column that holds filters (6 filter groups from the attributes above with their respective options)
  • Main area where users see the product results (50 at a time with paging)

When the user clicks on a filter option (e.g. colour=blue), I a query to update the results in the main area (e.g. all products that have colour=blue).

What query can I perform to fetch the remaining filter options for the other filter groups?

For example, if the user selects colour=blue, that will reduce the number of options in other filter groups such as size (if you look at the sample data above, the only option for size that will remain will be XL).

I can’t get the filter options from all results of the colour=blue query, because there are 100k+ of these and I only fetch 50 at a time.

Step 1: All options available.

Sizes
[ ] S
[ ] M
[ ] L
[ ] XL

Colours
[ ] red
[ ] green
[ ] blue

Step 2: User selects "blue", sizes options are updated.

Sizes
[ ] XL

Colours
[ ] red
[ ] green
[x] blue

Here is an example from another site doing something similar. (This site also has millions of products listed.)

Step 1: Initial load, no filters selected.

enter image description here

Step 2: One of the filter options is selected. The results on the right update and the filter options in the other filter groups in the sidebar update too.

enter image description here

3

Answers


  1. The question "What query can I perform to fetch the remaining filter options for the other filter groups?"

    Short answer

    Use a brute force query that fetches the data and tallies the results.

    Long Answer

    Here are some tips on what needs to be done. It provides some suggestions on improving performance.

    1. Categorize the products into maybe 100 different groupings (CDs, dresses, cameras, auto parts, etc). This will (1) become the first column of the PRIMARY KEY and (2) limit the heavy queries that will follow.
    2. Study the data to find what groupings are relevant (price, size, color, F-stop, etc). Be sure to re-study the data periodically. It will look stupid if the RAM chips start at 1MB and stop at 256MB. Be aware that words like "size" have different meanings for T-shirts vs disk drives, and no meaning for things with a single size.
    3. Brute force search the entire group whenever the checks (or unchecks) a filtering box.
    4. Split the data into 2 tables (and lots of ancillary tables). One has just the columns that are deemed worth filtering on. One has all the other info (description, URLs of images, etc). The first table will be used for searching and for rebuilding the groupings.
    5. Keep track of what people filter on. (Who cares about the color the RAM chips!) Use this to eliminate useless filters. It will be harder to discover new filters (eg, when disk drives added "SSD" option).
    6. Be aware that users will want to check multiple checkboxes in each filter. (eg, 2 price ranges or 2 sizes)
    7. Be sure to plan the "pagination" carefully; that can be a performance killer.
    8. For filtering, don’t "normalize". That keep "XL" as "XL" not a number. However, you may need a number to order the options in a sensible way (XS<S<M<L<XL). Do keep numbers as numbers (eg, megabytes) so that they can be easily tossed into the buckets.

    The performance will not be great. But I suggested some things to avoid a big table scan:

    • Force the user to start with a grouping that drastically limits the amount of data to scan.
    • Index it appropriately so that the scan is of "consecutive" rows.
    • Vertically split — search values vs the rest of the stuff.

    Maybe…

    For counts (left side of page)

    SELECT  COUNT(*) AS total,
            SUM(genre = 'Hip Hop') AS "Hip Hop",
            ...
        FROM search_table
        WHERE category = 'CD'
          AND artist = '...'   -- already picked items
    

    For the first 20 to show (main part of page)

    SELECT  i.*
        FROM search_table AS s
        JOIN info AS i  ON s.id = s.id
        WHERE s.category = 'CD'
          AND s.artist = '...'   -- already picked items
        ORDER BY s.price ASC     -- user-chosen sort order
        LIMIT 0, 20              -- for pagination
    

    This is, however, complicated by not needing/wanting a column color for CDs, nor a column genre for most other items. I would solve this in either of 2 ways:

    • 100 search tables (one per category), each with appropriate search categories as columns (with suitable datatypes).

        PRIMARY KEY (price, id),
        INDEX(id)
      
    • A single table with most of the search criteria in a JSON string.

        PRIMARY KEY (category, price, id),
        INDEX(id)
      

    Notes:

    • In either approach, there would probably be a column for price, assuming that most categories need a filter on such.

    • In either approach, the filtering by table or by PK would shrink down the search to about 1/100 of the rows.

    A third option is to PARTITION BY either KEY or LIST and have 100 partitions. Perhaps fewer; for example, various types of clothing might work well in the same grouping. Or media (CD, etc) or vehicles (cars, trucks).

    PRIMARY KEY(price, cat, id),  -- cat not needed first
    INDEX(id)
    
    Login or Signup to reply.
  2. To restrict the list of sizes if the user selects a colour that does not have all sizes, generate a filter query that holds all possible combinations.

    You would get back JSON that looks like:

    {filters:[
      {
        color:green,
        size:sm
      },
      {
        color:green,
        size:md
      },
      {
        color:green,
        size:lg
      },
      {
        color:green,
        size:xl
      },
      {
        color:blue,
        size:xl
      },
      ...
    ]
    

    On the client have an array of selected colours and an array of selected sizes. When these get updated redraw the filter lists using filterOptions.filter(f => {..only give me f where both the fields exist in their respective filter arrays}) and then use this filtered array to filter again into the relevant dimensions. If you have more than two dimensions this can grow quite quickly so if the dimension are different depending on product type you might need another layer of abstraction to make the filtering generic.

    Login or Signup to reply.
  3. Perform 2 queries, calling the same stored routine. The first (param_query_mode=1) is called to get the main result set, and the second (param_query_mode=2) is called to build the new filter set. They happen simultaneously as AJAX queries.

    Both queries use the same initial stage which builds a set of IDs from the product table based on the initial search criteria and the current filters. The results represent all matching records and are stored in a temp table containing just IDs.

    Then the first query (param_query_mode=1) builds a result set based on the temp table but filtering out all but the required 50 (or whatever–parameter controlled). The result set contains what fields are needed to display the these 50 records.

    The second query (param_query_mode=2) also builds its result set based on the temp table, but this result set contains what is needed to display the filters for all the results from stage one.

    I do not have to deal with tables with millions of records, but the approach works well in my context.

    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search