Dependency DAG
Description
Pretty straightforward: I am reading some Parquet files from disk with Polars; these are the source of the data. I then do some moderately heavy processing (a few million rows) to generate an intermediate data frame, and from that I generate two results which need to be written back to a database.
Technology Stack
- Ubuntu 22.04
- Python 3.10
- Polars 1.2.1
Question
Polars recommends using lazy evaluation as far as possible to optimise the execution. Now, the final results (`result_1` and `result_2`) obviously have to be materialised.
But if I call these two in sequence:

```python
#! /usr/bin/env python3
# encoding: utf-8
import polars as pl
...
result_1.collect() # Materialise result 1
result_2.collect() # Materialise result 2
```
Is the transformation from the source to the intermediate frame (the common ancestor) repeated? If so, that is clearly undesirable. In that case, I would have to materialise the intermediate frame and then do the rest of the processing in eager mode.
Is there any documentation from Polars on the expected behaviour and recommended practices around this scenario?
2 Answers
Try `pl.collect_all` to collect multiple dataframes. Reference: https://docs.pola.rs/api/python/stable/reference/api/polars.collect_all.html
Honestly, I think for production code your best bet is to `collect()` the intermediate results and then reuse them in `result_1` and `result_2`. It would be nice if `collect_all()` could find common subgraphs of the calculation and cache them, but I don’t think that is happening (although I haven’t really checked the Rust code).

You could probably try a workaround via `polars.concat()`. In the resulting query plan you can see that the intermediate part is cached during the calculation.