
So I’ve been building an analytics tool in crypto that requires ingesting a few billion rows of data across a couple of tables: Sales, Listings, NFTs, Metadata, Contract info, etc. I run machine learning on that data, outputting quite a bit of data for every contract every hour into another table. The data is extremely relational, and the end result is producing large aggregates based on complex queries with many joins (think taking every asset in the 500 projects with the most sales in the last 3 days, joining to their metadata attributes, joining those attributes to their most recent ML results, and then slicing that data 20 different ways).
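
To give a rough idea, the heaviest of those queries look something like the sketch below; the table and column names are simplified placeholders, not the real schema:

    -- Simplified sketch of one of the heavy queries (illustrative names only).
    WITH top_projects AS (
        SELECT contract_id
        FROM sales
        WHERE sold_at > now() - interval '3 days'
        GROUP BY contract_id
        ORDER BY count(*) DESC
        LIMIT 500
    )
    SELECT a.contract_id,
           m.attribute_name,
           m.attribute_value,
           ml.score,
           count(s.asset_id) AS sale_count,
           avg(s.price)      AS avg_sale_price
    FROM assets a
    JOIN top_projects tp ON tp.contract_id = a.contract_id
    JOIN metadata m      ON m.asset_id = a.asset_id
    JOIN LATERAL (
        SELECT r.score
        FROM ml_results r
        WHERE r.contract_id = a.contract_id
        ORDER BY r.run_at DESC
        LIMIT 1
    ) ml ON true
    LEFT JOIN sales s    ON s.asset_id = a.asset_id
    GROUP BY a.contract_id, m.attribute_name, m.attribute_value, ml.score;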

I currently have an 8 vCPU PostgreSQL RDS instance with 32 GB of RAM, and with 5-6 materialized views on it, it runs at 75%+ average CPU utilization and takes 10-15+ minutes to update some of the views, so anything that then queries those views takes an eternity to respond basically half of the time. I’ve looked into Cube, Timescale, Redshift, etc. The stream-based databases seem quite difficult to set up with highly relational historical data, and the relational databases don’t have the concept of incrementally updated materialized views yet.
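
The views themselves are refreshed roughly like this (again with simplified names); even REFRESH ... CONCURRENTLY, which needs a unique index on the view and at least lets readers keep going, still recomputes the whole query from scratch every time:

    -- Simplified example of one materialized view (illustrative names only).
    CREATE MATERIALIZED VIEW sales_per_contract_hourly AS
    SELECT contract_id,
           date_trunc('hour', sold_at) AS hour,
           count(*)                    AS sale_count,
           sum(price)                  AS volume
    FROM sales
    GROUP BY contract_id, date_trunc('hour', sold_at);

    -- A unique index is required for the concurrent refresh below.
    CREATE UNIQUE INDEX ON sales_per_contract_hourly (contract_id, hour);

    -- Does not block readers, but still recomputes the entire view.
    REFRESH MATERIALIZED VIEW CONCURRENTLY sales_per_contract_hourly;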

What should I do? I’d be willing to pay up to $1000/mo to solve the problem in a scalable way (maybe 10-100 billion rows across a few tables, with views that can be queried without holding up reads and writes). Hopefully I can raise that number soon, but for now I’m fronting all the costs with no income from this yet.

3 Answers


  1. Unfortunately, PostgreSQL does not have the techniques and features needed to get good performance for your problem:

    • PostgreSQL does not have "in-memory" tables (which reduce contention by avoiding access to data files and the transaction log)
    • PostgreSQL does not have natively compiled (C) stored procedures (which speed up the execution of complex logic)
    • PostgreSQL does not have "columnar" indexes (which reduce index size and access cost for big tables)
    • PostgreSQL is very limited in parallelizing queries (only 4 operations support multi-threading, and you have to set it up manually…)
    • PostgreSQL does not have data compression for tables or indexes (which reduces access time without needing to decompress the data to read it)
    • PostgreSQL does not have indexed views (which are always kept in sync with the source data and never need a refresh)
    • PostgreSQL does not have a "batch" processing mode that accesses data in blocks rather than row by row through indexes
    • and finally, PostgreSQL does not have a specialized analytics engine that automatically processes cubes for BI

    But those techniques and features do exist in some other RDBMSs, such as Microsoft SQL Server (automatic parallelization of "heavy" queries, columnstore indexes, in-memory tables, natively compiled procedures, indexed views, compression, batch-mode processing, the SSAS engine for BI…).
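
    For reference, here is a minimal T-SQL sketch of two of those features, a clustered columnstore index and an indexed view; the table, view, and column names are made up, and the two examples are shown on separate tables for simplicity:

        -- Clustered columnstore index: the table is stored column-wise,
        -- which brings compression and batch-mode scans for large queries.
        CREATE CLUSTERED COLUMNSTORE INDEX cci_sales ON dbo.sales;
        GO

        -- Indexed view: materialized and kept in sync by the engine,
        -- no refresh step needed. COUNT_BIG(*) is mandatory here, and
        -- price is assumed to be declared NOT NULL.
        CREATE VIEW dbo.listing_stats_per_contract
        WITH SCHEMABINDING
        AS
        SELECT contract_id,
               COUNT_BIG(*) AS listing_count,
               SUM(price)   AS total_price
        FROM dbo.listings
        GROUP BY contract_id;
        GO

        CREATE UNIQUE CLUSTERED INDEX ix_listing_stats_per_contract
            ON dbo.listing_stats_per_contract (contract_id);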

    Of course you have to pay for a license for all of this, but you would have to pay much more for a supercomputer running PostgreSQL without getting the same performance.

    As an example, I have run some performance tests on some very simple queries that demonstrate how slow PostgreSQL is compared to SQL Server:

    PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances
    This paper demonstrates that PostgreSQL is on average 90 times slower than SQL Server and, in the worst case (with columnstore indexes), 1500 times slower…

    PostGreSQL vs Microsoft SQL Server – Comparison part 1 : DBA command performances
    This paper demonstrates that PostgreSQL is about 20 times slower in basic DBA operations.

    Ask yourself why Stack Overflow runs on Microsoft SQL Server and not on PostgreSQL.

  2. I’d look hard at Redshift. It’s columnar and designed for large analytic queries. Its SQL is based on Postgres, so the porting effort could be manageable. Some aspects of your current solution may not fit well on Redshift, but it should be able to hit the cost and analytic performance goals you are stating.
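
    As a rough illustration (table and column names are just examples), most of the porting work comes down to recreating the tables with distribution and sort keys suited to the join and filter patterns:

        -- Hypothetical sales table recreated in Redshift (columnar storage).
        -- DISTKEY co-locates rows that join on contract_id on one node;
        -- SORTKEY speeds up range filters like "sales in the last 3 days".
        CREATE TABLE sales (
            contract_id VARCHAR(64)  NOT NULL,
            token_id    VARCHAR(128),
            price       NUMERIC(38, 18),
            sold_at     TIMESTAMP    NOT NULL
        )
        DISTKEY (contract_id)
        SORTKEY (sold_at);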

  3. If you want to process data in real time, you should try publishing your data into a technology designed for the job, something like Confluent Kafka.

    Once the data is being streamed to your message broker, you can process and handle it in any way you choose. You can have multiple small services, one for each of your business needs. You will probably still have to write data to a database in the end, but it will likely be a much smaller volume containing only the results you really need to store.
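
    One possible shape for that pipeline, sketched with ksqlDB (just one of several stream processors that could do this; the topic, stream, and column names are only examples), keeps the rolled-up results continuously maintained so that only aggregates reach the database:

        -- Declare a stream over a Kafka topic of raw sale events.
        CREATE STREAM sales_stream (
            contract_id VARCHAR,
            price       DOUBLE,
            sold_at     BIGINT
        ) WITH (
            KAFKA_TOPIC  = 'nft-sales',
            VALUE_FORMAT = 'JSON',
            PARTITIONS   = 1
        );

        -- Continuously maintained hourly aggregate per contract; only
        -- these results would need to be written to the database.
        CREATE TABLE sales_per_contract_hourly AS
            SELECT contract_id,
                   COUNT(*)   AS sale_count,
                   SUM(price) AS volume
            FROM sales_stream
            WINDOW TUMBLING (SIZE 60 MINUTES)
            GROUP BY contract_id
            EMIT CHANGES;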
