Background
I’m providing a map-based search for paddle boards, including searching by date – think "check in" / "check out" dates in a hotel search, where only hotels available for that entire date range are included in the results.
I have availability data for a couple of years (updated daily) for each paddle board. A paddle board is either available on a day (for 12 noon pick-up) or not available (meaning somebody is picking it up at 12 noon and will drop it off the next morning).
There are hundreds of thousands of paddle boards, so storing separate records for each day’s availability for each board is untenable (e.g., 350k paddle boards with 2 years’ availability data results in over 250 million rows).
Technology
I’m using Postgres 15, and I can add extensions if necessary. I also have Redis available in the stack, in case the best solution involves a non-RDBMS approach.
The app is very read-heavy, so large / expensive indexes are OK, within reason.
Initial Idea
My initial idea for storage and efficient searching involves storing availability data in a char field for each paddle board, in a format like this: AAANNAAAANA (730 chars long, where A represents "available" and N represents "not available", with a static starting date, e.g., Jan 1, 2023). Given a search date range (e.g., pick up on Jan 20 / drop off on the morning of Jan 25), I can construct a regexp search (e.g., ^.{19}A{5}) and achieve decent performance… but I feel like there’s something more ideal.
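For illustration, the query looks something like this (the table and column names are placeholders):

    -- 730-char availability string per board, starting Jan 1, 2023.
    -- Skip 19 days (Jan 1-19), then require 5 available days (Jan 20-24).
    select id
    from paddle_boards
    where availability ~ '^.{19}A{5}';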
For example, if I was working with a nested tree structure and wanted to get all offspring nodes of a given node, I could use an algorithm, like Modified Preorder Tree Traversal, to make easy work of an otherwise nightmarishly slow problem. Is there no such concept / mathematical solution to the kind of problem I’m describing?
Question
My question is, aside from the regexp solution I’ve come up with, are there any algorithmic solutions that are widely used to solve problems like this? For example, I looked into segment trees and some other data structures, not completely diving into any particular one, but I couldn’t find anything that I deemed to be at least a semi-ideal solution to the problem. I’m looking for a solution that wouldn’t involve really low-level hacking; something that’s relatively expressive / maintainable, and where the main filtering work takes place in the database (vs in application space).
2 Answers
250 million rows can be handled, with some care, and we can likely reduce that by at least an order of magnitude.
Your proposed solution is, effectively, an attempt to build your own indexing system. Regex searches over a database are typically quite slow. And by cramming all checkouts into one row, you cannot store any information about each checkout: who checked it out, what price they paid, in what condition the board was returned, etc.
Instead, use a conventional table and rely on indexes and 50 years of cumulative relational database development.
We can drastically reduce the number of rows by assuming paddleboards will spend most of their time sitting around waiting to be checked out (you can verify this assumption). Instead of storing the state of each paddleboard for each day, only store the dates when they are checked out. Use the daterange type. This lets us use range operations.
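For example, the overlap operator checks whether two ranges intersect (upper bounds are exclusive by default):

    select daterange('2023-01-20', '2023-01-25') && daterange('2023-01-24', '2023-01-26');
    -- returns true: the two ranges share Jan 24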
(Note: I would recommend against datemultirange. Any solution which tries to cram all checkouts into one row means you cannot store information about each checkout. Who checked it out? In what condition was it returned? What price did they pay? Were they late?)
(Note: Consider using tsrange, a timestamp range, instead, and defaulting the time to noon. Assuming the pickup/dropoff time will always be noon is a business requirement and can change at any time; avoid hard-coding it into your schema. While you can change the column type later and convert the data, it will be painful with that much data.)
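Putting those notes together, a minimal schema sketch might look like this (table and column names here are illustrative, not prescriptive):

    -- One row per checkout; per-checkout details now have a home.
    create table checkouts (
        id                 bigint generated always as identity primary key,
        paddleboard_id     bigint not null,
        in_use             tsrange not null,  -- e.g. [2023-01-20 12:00, 2023-01-25 12:00)
        price              numeric,
        returned_condition text
    );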
Then index in_use and paddleboard_id. These indexes will make these queries fast even as the amount of data gets larger: instead of scanning the whole table, Postgres can search the index. They are (often literally) the search trees you’re wanting. paddleboard_id can use a normal B-tree index. However, in_use should use a GiST index because it supports important range operations like intersection (&&). Here are some examples of queries the indexes will cover.
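For instance, against the sketch above (the requested window and the paddleboards table are assumptions for illustration):

    create index checkouts_in_use_idx on checkouts using gist (in_use);
    create index checkouts_board_idx  on checkouts (paddleboard_id);

    -- Boards that conflict with a requested window (pick up Jan 20 at
    -- noon, drop off the morning of Jan 25):
    select distinct paddleboard_id
    from checkouts
    where in_use && tsrange('2023-01-20 12:00', '2023-01-25 12:00');

    -- Available boards: everything without a conflicting checkout.
    select p.id
    from paddleboards p
    where not exists (
        select 1
        from checkouts c
        where c.paddleboard_id = p.id
          and c.in_use && tsrange('2023-01-20 12:00', '2023-01-25 12:00')
    );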
Since your data will continuously grow, you may wish to consider partitioning the table by year or perhaps month. This won’t necessarily improve performance, but it will make it easier to drop old data and keep the table from continuously growing.
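As a sketch, a partitioned variant of the table above might look like this (assuming yearly partitions on the start of the in_use range):

    create table checkouts (
        paddleboard_id bigint not null,
        in_use         tsrange not null
    ) partition by range (lower(in_use));

    create table checkouts_2023 partition of checkouts
        for values from ('2023-01-01') to ('2024-01-01');
    create table checkouts_2024 partition of checkouts
        for values from ('2024-01-01') to ('2025-01-01');

    -- Dropping a year of old data is then a single cheap statement:
    drop table checkouts_2023;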
The complete data can be mirrored into a data warehouse. This is a database which is optimized for storing and searching extremely large amounts of data. For example, Google BigQuery.
Finally, if necessary for performance, you can add a cache for the results of common queries. It’s likely most of your queries will cluster around a few months or weeks in the future.
Use the data warehouse for high-latency business intelligence tasks which need to see the whole range of data such as reports, statistics, and so on. Use the PostgreSQL database for low-latency tasks such as serving web requests. Use the cache to decrease latency even further.
Using a character string for this is rather unwieldy. Nicest to work with, and following the best practice of database normalisation, would be to store the availability data in a separate table – with one entry per availability, per non-availability, or per date for which the status is known, depending on your modeling.
However, for your use case – in particular, never updating individual availabilities but always the whole series, always reading the whole (or at least a large part of the) series, and not ever needing to store additional data per checkout – a non-atomic representation might be ok and likely even faster.
One option is a date[] column for each board, with each element representing a non-availability (which I guess are less common than availabilities). This probably takes much more space, but you can build a GIN index to search for arrays containing specific dates or overlapping with specific date arrays.
A datemultirange would be fitting especially if most checkouts of boards are for multiple sequential days, not just a single day. Multiranges offer similarly useful operators, and can be indexed to search for containment and overlaps with other (multi)ranges.
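A sketch of both variants (table and column names are assumptions for illustration):

    -- date[] variant: one row per board, GIN-indexed blackout days.
    create table board_unavailability (
        paddleboard_id   bigint primary key,
        unavailable_days date[] not null default '{}'
    );
    create index on board_unavailability using gin (unavailable_days);

    -- Boards free Jan 20-24: no blackout day overlaps the request.
    select paddleboard_id
    from board_unavailability
    where not unavailable_days && array['2023-01-20', '2023-01-21', '2023-01-22',
                                        '2023-01-23', '2023-01-24']::date[];

    -- datemultirange variant, with overlap against a requested range:
    create table board_unavailability_mr (
        paddleboard_id bigint primary key,
        unavailable    datemultirange not null default '{}'::datemultirange
    );
    create index on board_unavailability_mr using gist (unavailable);

    select paddleboard_id
    from board_unavailability_mr
    where not unavailable && datemultirange(daterange('2023-01-20', '2023-01-25'));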