
Background

I’m providing a map-based search for paddle boards, including searching by date – think "check in" / "check out" dates in a hotel search, where only hotels available for that entire date range are included in the results.

I have availability data for a couple of years (updated daily) for each paddle board. A paddle board is either available on a given day (for 12 noon pick-up) or not available (meaning somebody is picking it up at 12 noon and will drop it off the next morning).

There are hundreds of thousands of paddle boards, so storing separate records for each day’s availability for each is untenable (e.g., 350k paddle boards w/ 2 years availability data results in over 250 million rows).

Technology

I’m using Postgres 15, and I can add extensions if necessary. I also have Redis available in the stack, in case the best solution involves non-RDBMS storage.

The app is very read-heavy, so large / expensive indexes are OK, within reason.

Initial Idea

My initial idea for storage and efficient searching is to store each paddle board’s availability in a char field, in a format like AAANNAAAANA (730 chars long, where A represents "available" and N represents "not available", with a static starting date, e.g., Jan 1, 2023). Given a search date range (e.g., pick up on Jan 20 / drop off on the morning of Jan 25), I can construct a regexp search (e.g., ^.{19}A{5}) and achieve decent performance… but I feel like there’s something more ideal.
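
For illustration, assuming a hypothetical paddle_boards table with the string stored in an availability column, that search would look something like this:

    -- Pick up Jan 20 / drop off morning of Jan 25: days 20-24 must be 'A'.
    -- The offset of 19 is the number of days between the static start
    -- date (Jan 1, 2023) and the pick-up date.
    select id
    from paddle_boards
    where availability ~ '^.{19}A{5}';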

For example, if I were working with a nested tree structure and wanted to get all offspring nodes of a given node, I could use an algorithm like Modified Preorder Tree Traversal to make easy work of an otherwise nightmarishly slow problem. Is there no such concept / mathematical solution to the kind of problem I’m describing?

Question

My question is, aside from the regexp solution I’ve come up with, are there any algorithmic solutions that are widely used to solve problems like this? For example, I looked into segment trees and some other data structures, not completely diving into any particular one, but I couldn’t find anything that I deemed to be at least a semi-ideal solution to the problem. I’m looking for a solution that wouldn’t involve really low-level hacking; something that’s relatively expressive / maintainable, and where the main filtering work takes place in the database (vs in application space).

2 Answers


  1. There are hundreds of thousands of paddle boards, so storing separate records for each day’s availability for each is untenable (e.g., 350k paddle boards w/ 2 years availability data results in over 250 million rows).

    250 million rows can be handled, with some care, and we can likely reduce that by at least an order of magnitude.

    Your proposed solution is, effectively, an attempt to build your own indexing system. Regex searches over a database are typically quite slow. And by cramming all checkouts into one row, you cannot store any information about each checkout: who checked it out, what price they paid, in what condition the board was returned, etc.

    Instead, use a conventional table and rely on indexes and 50 years of cumulative relational database development.

    We can drastically reduce the number of rows by assuming paddleboards will spend most of their time sitting around waiting to be checked out (you can verify this assumption). Instead of storing the state of each paddleboard for each day, only store the dates when they are checked out. Use the daterange type. This lets us use range operations.
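
    For example (note that daterange is half-open by default, so the upper bound is excluded):

    -- && tests overlap, @> tests containment
    select daterange('2023-01-20', '2023-01-25') && daterange('2023-01-24', '2023-01-26');  -- true
    select daterange('2023-01-20', '2023-01-25') @> date '2023-01-22';  -- true
    select daterange('2023-01-20', '2023-01-25') @> date '2023-01-25';  -- false: upper bound excluded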

    (Note: I would recommend against datemultirange for the same reason: any solution which crams all checkouts into one row leaves nowhere to store per-checkout information. Who checked it out? In what condition was it returned? What price did they pay? Were they late?)

    (Note: Consider using a tsrange (timestamp range) instead and defaulting the time to noon. Assuming the pickup/dropoff time will always be noon is a business requirement and can change at any time; avoid hard-coding it into your schema. While you can change the column type later and convert the data, it will be painful with that much data.)
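
    A minimal sketch of that idea (the noon convention lives in the application or query layer, not the schema):

    -- date + time yields a timestamp; noon is applied here rather than
    -- hard-coded into the column definition
    select tsrange(date '2023-01-20' + time '12:00',
                   date '2023-01-25' + time '12:00');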

    create table paddleboards (
      id serial primary key
    
      -- and other information about each paddleboard
      -- who owns the paddleboard?
      -- where is it located?
      -- what condition is it in?
    );
    
    create table paddleboard_checkouts (
      paddleboard_id integer not null references paddleboards(id),
      in_use daterange not null
    
      -- and other information about this checkout
      -- who checked it out?
      -- what condition was it returned?
    );
    

    Then index paddleboard_id and in_use. These indexes will keep the queries fast even as the amount of data grows: instead of scanning the whole table, Postgres can search the index. Indexes are (often literally) the search trees you’re looking for.

    paddleboard_id can use a normal B-tree index. However, in_use should use a GiST index, because GiST supports the important range operations such as overlap (&&).

    create index paddleboard_checkouts_fk on paddleboard_checkouts (paddleboard_id);
    create index paddleboard_in_use on paddleboard_checkouts using gist (in_use);
    

    Here are some examples of queries the indexes will cover.

    -- When is a particular paddleboard in use?
    -- This will use paddleboard_checkouts_fk
    select in_use
    from paddleboard_checkouts
    where paddleboard_id = 12345;
    
    -- What paddleboards are checked out this week?
    -- This will use paddleboard_in_use
    select paddleboard_id
    from paddleboard_checkouts
    where in_use && daterange('2023-05-29', '2023-06-02');
    
    -- What paddleboards are available this week?
    -- A board is available if no overlapping checkout exists
    -- (an anti-join); the inner query can use the indexes above.
    select p.id
    from paddleboards p
    where not exists (
      select 1 from paddleboard_checkouts c
      where c.paddleboard_id = p.id
        and c.in_use && daterange('2023-05-29', '2023-06-02'));
    
    -- Is this paddleboard available this week?
    -- paddleboard_checkouts_fk narrows the scan to one board's few
    -- rows; the board is available if none of them overlap the range.
    select not exists (
      select 1 from paddleboard_checkouts
      where paddleboard_id = 12345
        and in_use && daterange('2023-05-29', '2023-06-02')) as available;
    

    Since your data will grow continuously, you may wish to consider partitioning the table by year or perhaps month. This won’t necessarily improve query performance, but it will make it easy to drop old data and keep the table from growing without bound.
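
    A sketch of what yearly partitioning could look like, as an alternative definition of the checkout table (names and bounds are illustrative):

    create table paddleboard_checkouts (
      paddleboard_id integer not null references paddleboards(id),
      in_use daterange not null
    ) partition by range (lower(in_use));

    create table paddleboard_checkouts_2023
      partition of paddleboard_checkouts
      for values from ('2023-01-01') to ('2024-01-01');

    -- later, removing an old year is a cheap metadata operation
    -- rather than a mass delete:
    drop table paddleboard_checkouts_2023;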

    The complete data can be mirrored into a data warehouse. This is a database which is optimized for storing and searching extremely large amounts of data. For example, Google BigQuery.

    Finally, if necessary for performance, you can add a cache for the results of common queries. It’s likely most of your queries will cluster around a few months or weeks in the future.
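
    If you’d rather keep that cache inside Postgres than in a separate store, one option is a materialized view refreshed on a schedule (a sketch; the 90-day window is an assumption about where queries cluster):

    -- snapshot of near-future checkouts, where most searches land
    create materialized view upcoming_checkouts as
    select paddleboard_id, in_use
    from paddleboard_checkouts
    where in_use && daterange(current_date, current_date + 90);

    -- re-run periodically (e.g. nightly) to rebuild the snapshot
    refresh materialized view upcoming_checkouts;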

    Use the data warehouse for high-latency business intelligence tasks which need to see the whole range of data such as reports, statistics, and so on. Use the PostgreSQL database for low-latency tasks such as serving web requests. Use the cache to decrease latency even further.

  2. My initial idea for storage and efficient searching involves storing availability data in a char field, for each paddle board, in a format like this: AAANNAAAANA (730 chars long, where A represents "available" and N represents "not available", with a static starting date)

    Using a character string for this is rather unwieldy. Nicest to work with, and following the best practice of database normalisation, would be to store the availability data in a separate table – with one entry per availability, per non-availability, or per date for which the status is known, depending on your modeling.

    However, for your use case – in particular, never updating individual availabilities but always the whole series, always reading the whole (or at least a large part of the) series, and never needing to store additional data per checkout – a non-atomic representation might be OK, and likely even faster.

    • One approach would be to use a bit string, which is smaller and has more appropriate operators available than the character string. As far as I can see, though, it does not provide suitable indexing capabilities.
    • Next would be an array of dates, i.e. date[], each entry representing a non-availability (which I’d guess are less common than availabilities). This probably takes much more space, but you can build a GIN index to search for arrays containing specific dates or overlapping specific date arrays.
    • Finally, Postgres 14 introduced multiranges, representing a set of non-overlapping ranges (or, if you like, a range with gaps). A datemultirange would be especially fitting if most checkouts are for multiple sequential days rather than a single day. Multiranges offer similarly useful operators and can be indexed to search for containment and overlaps with other (multi)ranges; see the sketch below.
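
    As a sketch of the multirange option (table and column names are hypothetical):

    create table board_availability (
      board_id integer primary key,
      checked_out datemultirange not null default '{}'::datemultirange
    );

    create index board_availability_gist
      on board_availability using gist (checked_out);

    -- boards free for the whole stay: no checkout overlaps the range
    select board_id
    from board_availability
    where not checked_out && datemultirange(daterange('2023-01-20', '2023-01-25'));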