How do I configure a Redshift Lambda UDF to batch requests?
The page "Creating a scalar Lambda UDF – Amazon Redshift" says in its note section:
You can configure batching of multiple invocations of your Lambda function to improve performance and lower costs.
I’m testing with a hello-world Lambda that simply returns the input it is given. Here is the SQL DDL I’m using:
CREATE OR REPLACE EXTERNAL FUNCTION hello_world (varchar)
RETURNS varchar IMMUTABLE
LAMBDA 'redshift_udf_testy'
IAM_ROLE '<redacted>';
My UDF works fine; however, it doesn’t seem to batch requests. I would expect the following query:
select hello_world(generate_series(1, 500)::text);
to pass multiple rows at a time to hello_world, since the Lambda UDF JSON API specifies that the function must be able to handle arrays of arguments. Instead, it performs 500 separate invocations of my Lambda function, with a single row passed in on every invocation, which seems totally incorrect.
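For reference, the batched request I was expecting, per the Lambda UDF JSON interface, would pack several rows into one "arguments" array, roughly like this (metadata fields trimmed, values illustrative):

{
  "num_records": 3,
  "arguments": [["1"], ["2"], ["3"]]
}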
Any idea how I can configure it to batch? The docs mention it in passing, but I can’t find anything concrete.
2 Answers
You can set the maximum number of rows that Amazon Redshift sends in a single batch request for one Lambda invocation, and the maximum size of the data payload it sends in that request, by configuring the MAX_BATCH_ROWS and MAX_BATCH_SIZE parameters respectively. Public documentation can be found at: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_FUNCTION.html
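For example, the DDL from the question could be extended with those two parameters; the values below are only illustrative, so check the CREATE EXTERNAL FUNCTION reference for the allowed ranges and defaults:

CREATE OR REPLACE EXTERNAL FUNCTION hello_world (varchar)
RETURNS varchar IMMUTABLE
LAMBDA 'redshift_udf_testy'
IAM_ROLE '<redacted>'
MAX_BATCH_ROWS 500
MAX_BATCH_SIZE 1 MB;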
Make a physical table of 500 rows and pass the column from that table instead. generate_series, just like CTEs and temp tables, makes it a leader-node-only process (I think).

I passed back the length of the arguments array given to the Lambda to confirm this, just a few minutes ago actually. With a CTE it kept calling the function with one argument per call; once I switched to a physical table, the argument arrays jumped to lengths of 200-500. It only batches when the work is distributed across the compute nodes.

Also, make sure to use a multiprocessing pool in the Python Lambda to get more bang for your buck.
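Here is a minimal sketch of that diagnostic, assuming a Python Lambda and the standard Lambda UDF JSON request/response shape; it echoes the batch size back for every row, so the query output shows how many rows Redshift packed into each invocation (real per-row work, and any worker pool, would go where the results list is built):

import json

def lambda_handler(event, context):
    try:
        # Redshift sends one batch per invocation: event["arguments"] is a
        # list of argument lists, one entry per input row.
        arguments = event["arguments"]
        batch_size = len(arguments)

        # Return the batch size for every row so the query output shows how
        # many rows arrived together. Swap in the real per-row logic once
        # batching looks right.
        results = [str(batch_size) for _ in arguments]

        return json.dumps({"success": True, "num_records": batch_size, "results": results})
    except Exception as e:
        return json.dumps({"success": False, "error_msg": str(e)})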