Matillion: How to identify performance bottleneck - CentOS

dovregubben
November 29, 2021
263 views
0 votes
2 Answers

We’re running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a long time (up to hours). We’d like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.

What confuses me is that neither the m5.2xlarge EC2 instance (8 vCPU, 32 GB RAM) nor the database (Snowflake) gets very busy; both seem to be idle most of the time (judging by CPU and RAM usage as shown by top).

Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.

Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.

How can we enhance this?

Answers

- peterb
- November 29, 2021 at 6:27 pm
- 0 votes
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another will be done in sequence.

Steps that branch out from a common starting point will be done in parallel (and will depend on Snowflake warehouse size to scale).
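As a rough illustration of the sequential and parallel cases above, here is a minimal Python sketch. It does not talk to Matillion or Snowflake: `run_statement` is a hypothetical stand-in that just sleeps, and the table names are made up. It only shows why independent steps finish sooner when they are submitted together.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_statement(name, seconds=0.1):
    """Hypothetical stand-in for one SQL statement; sleeps instead of querying."""
    time.sleep(seconds)
    return name


def run_in_sequence(names):
    """Run each step only after the previous one has finished."""
    start = time.perf_counter()
    results = [run_statement(n) for n in names]
    return results, time.perf_counter() - start


def run_in_parallel(names, max_workers=4):
    """Submit all independent steps at once and wait for all of them."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_statement, names))
    return results, time.perf_counter() - start


tables = ["table_a", "table_b", "table_c", "table_d"]  # made-up names
seq_results, seq_elapsed = run_in_sequence(tables)
par_results, par_elapsed = run_in_parallel(tables)
print(f"sequence: {seq_elapsed:.2f}s, parallel: {par_elapsed:.2f}s")
```

With four steps of equal length, the sequential run takes roughly four times as long as the parallel one, provided nothing else limits concurrency.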

Also, try the Alter Warehouse Component with a higher concurrency level.
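For reference, a sketch of the statement such a change would amount to on the Snowflake side. This assumes the "concurrency level" meant here is Snowflake's `MAX_CONCURRENCY_LEVEL` warehouse parameter. The helper function and the warehouse name `ETL_WH` are hypothetical.

```python
def alter_warehouse_concurrency_sql(warehouse, level):
    """Build an ALTER WAREHOUSE statement that sets MAX_CONCURRENCY_LEVEL."""
    # Only allow simple unquoted identifiers to keep the sketch safe.
    if not warehouse.replace("_", "").isalnum():
        raise ValueError(f"unexpected warehouse name: {warehouse!r}")
    if not isinstance(level, int) or level < 1:
        raise ValueError("level must be a positive integer")
    return f"ALTER WAREHOUSE {warehouse} SET MAX_CONCURRENCY_LEVEL = {level};"


# "ETL_WH" is a made-up warehouse name.
print(alter_warehouse_concurrency_sql("ETL_WH", 16))
# prints: ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16;
```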


- 53epo
- November 30, 2021 at 7:19 am
- 0 votes
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be forced to run serially for that reason.
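One way to check whether statements really run one at a time is to take their start and end times (for example from Snowflake's query history) and count how many overlap. The following is a minimal sketch of that check. The interval values are invented, and `peak_concurrency` is a hypothetical helper, not part of any Matillion or Snowflake API.

```python
def peak_concurrency(intervals):
    """Return the largest number of (start, end) intervals that overlap."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # a statement starts
        events.append((end, -1))    # a statement ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back statements are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Invented timings in seconds since the job started.
serial_run = [(0, 10), (10, 25), (25, 30)]
overlapping_run = [(0, 10), (2, 12), (5, 8)]
print(peak_concurrency(serial_run))       # prints 1
print(peak_concurrency(overlapping_run))  # prints 3
```

A peak of 1 over a whole job run would suggest the statements are being serialized before they reach the warehouse.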

You could try increasing the number of concurrent connections in the Edit Environment dialog.

There is more information here about concurrent connections.

If you do that, a couple of things to avoid are:
- Transactions (begin, commit, etc.) will force transformation jobs to run in serial again
- If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here
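To illustrate how a connection limit caps parallelism, here is a small Python sketch in which each task must hold one of a fixed number of slots before it can run. It is only a model of the idea: the semaphore stands in for the connection limit, and nothing here reflects how Matillion is actually implemented.

```python
import threading
import time


def simulate(connection_limit, task_count=8, task_seconds=0.05):
    """Run tasks that each need one of `connection_limit` slots.

    Returns the highest number of tasks that were in flight at once.
    """
    slots = threading.Semaphore(connection_limit)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def task():
        with slots:  # wait for a free "connection"
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            time.sleep(task_seconds)  # pretend to run a statement
            with lock:
                state["active"] -= 1

    threads = [threading.Thread(target=task) for _ in range(task_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


print(simulate(1))  # a single slot means tasks run one at a time
print(simulate(4))  # never more than four tasks in flight
```

With a limit of 1 the peak is always 1, no matter how many tasks are ready to run, which matches the serial behaviour described in the question.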