skip to Main Content

I have Clickhouse-server and Airflow all are Docker containers in WSL. They’re both running and i even managed to connect to Clickhouse somehow, but when trying to query i get the following traceback in Airflow logs:

[2024-06-04, 12:17:49 UTC] {logging_mixin.py:188} INFO - >>> Running 'etl' function. Logged at 2024-06-04 12:17:49.678034+00:00
.....
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - >>> Running 'clickhouse_connect' function. Logged at 2024-06-04 12:17:53.590231+00:00
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connecting to ClickHouse Cluster
[2024-06-04, 12:17:53 UTC] {base.py:84} INFO - Using connection ID 'my_clickhouse' for task execution.
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - 8123
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connected successfully
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Querying CHDB
[2024-06-04, 12:17:53 UTC] {connection.py:408} WARNING - Failed to connect to localhost:9000
Traceback (most recent call last):
.....
clickhouse_driver.errors.NetworkError: Code: 210. Connection refused (localhost:9000)

It seems that Airflow is trying to connect to localhost:9000 which is supposed for only CLI usage, and not to localhost:8123 which is for the http web-interface.

What should I do to force 8123 port connection or am I getting it wrong? Anyways, the main problem is: I can’t work with Clickhouse via Airflow.

PS: I use clickhouse_driver.Client for connection inside PythonOperator.

And this code to connect:

class ClickHouseConnection:
    connection = None
    def get_connection(connection_name='my_clickhouse'):
        from clickhouse_driver import Client

        if ClickHouseConnection.connection:
            return connection
        db_props = BaseHook.get_connection(connection_name)
        ClickHouseConnection.connection = Client(db_props.host)
        return ClickHouseConnection.connection, db_props

@logger
def clickhouse_connect():
    print("Connecting to ClickHouse Cluster")
    ch_connection, db_props = ClickHouseConnection.get_connection()
    print(db_props.port)
    print("Connected successfully")
    filename = '/opt/airflow/data/test_csv_file.csv'
    if ch_connection:
        print("Querying CHDB")
        ch_connection.execute(
            f"INSERT INTO maindb.monitor FROM INFILE {filename} FORMAT CSV"
        )

Thank you in advance!

I tried to change port to 9000 in Airflow’s Connections and restart containers, but nothing has changed.

2

Answers


  1. Chosen as BEST ANSWER

    The issue was in host name of clickhouse connection. Don't quite remember where I read it, but I should've used host name exactly like host.docker.internal (as long as I run docker multi-container application) and port 9000 is somehow worked for me.


  2. Diplo, clickhouse_driver default port is 9000 and in the code that u post (that say u can connect), you do not changed the port in Client class, so it’s using the Default port.

    As an alternative to clickhouse_driver maybe u can use the Clickhouse community plugin to Airflow:
    https://github.com/bryzgaloff/airflow-clickhouse-plugin/tree/master

    Or trying to use BashOperator/SSHOperator to interact directly with the Clickhouse CLI.

    Or using requests:

    import requests
    
    response = requests.get(
        'http://ch_host:8123',
        params={
            'query': 'SELECT version()',
        },
        headers={
            'X-ClickHouse-User': '<DB user name>',
            'X-ClickHouse-Key': '<DB user password>',
        })
    
    response.raise_for_status() print(response.text)
    

    Are you sure that u can connect to 8123 any outside IP ? Sometimes the access to 8123 is limited to certain hosts (depending on your infrastructure).

    Try a simple curl to make sure that u can access it from outside.

    curl 'http://CH_HOST:8123/?user=dev&password=dev123'
    
    Login or Signup to reply.
Please signup or login to give your own answer.
Back To Top
Search