I have ClickHouse server and Airflow running as Docker containers in WSL. Both are up, and I even managed to connect to ClickHouse somehow, but when I try to run a query I get the following traceback in the Airflow logs:
[2024-06-04, 12:17:49 UTC] {logging_mixin.py:188} INFO - >>> Running 'etl' function. Logged at 2024-06-04 12:17:49.678034+00:00
.....
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - >>> Running 'clickhouse_connect' function. Logged at 2024-06-04 12:17:53.590231+00:00
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connecting to ClickHouse Cluster
[2024-06-04, 12:17:53 UTC] {base.py:84} INFO - Using connection ID 'my_clickhouse' for task execution.
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - 8123
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Connected successfully
[2024-06-04, 12:17:53 UTC] {logging_mixin.py:188} INFO - Querying CHDB
[2024-06-04, 12:17:53 UTC] {connection.py:408} WARNING - Failed to connect to localhost:9000
Traceback (most recent call last):
.....
clickhouse_driver.errors.NetworkError: Code: 210. Connection refused (localhost:9000)
It seems that Airflow is trying to connect to localhost:9000, which I thought was only for CLI usage, instead of localhost:8123, which is the HTTP interface.
What should I do to force the connection onto port 8123, or am I getting this wrong? Either way, the main problem is that I can't work with ClickHouse from Airflow.
PS: I use clickhouse_driver.Client for the connection inside a PythonOperator, with this code to connect:
from airflow.hooks.base import BaseHook

class ClickHouseConnection:
    connection = None

    @staticmethod
    def get_connection(connection_name='my_clickhouse'):
        from clickhouse_driver import Client
        db_props = BaseHook.get_connection(connection_name)
        # Reuse the already-created client if there is one
        if ClickHouseConnection.connection:
            return ClickHouseConnection.connection, db_props
        # Only the host is passed here, so Client falls back to its default port
        ClickHouseConnection.connection = Client(db_props.host)
        return ClickHouseConnection.connection, db_props


@logger  # custom logging decorator defined elsewhere in the DAG file
def clickhouse_connect():
    print("Connecting to ClickHouse Cluster")
    ch_connection, db_props = ClickHouseConnection.get_connection()
    print(db_props.port)
    print("Connected successfully")
    filename = '/opt/airflow/data/test_csv_file.csv'
    if ch_connection:
        print("Querying CHDB")
        ch_connection.execute(
            f"INSERT INTO maindb.monitor FROM INFILE '{filename}' FORMAT CSV"
        )
Thank you in advance!
I tried changing the port to 9000 in Airflow's Connections and restarting the containers, but nothing changed.
2 Answers
The issue was in the host name of the ClickHouse connection. I don't quite remember where I read it, but since I run a multi-container Docker application I had to use exactly
host.docker.internal
as the host name, and port 9000 then somehow worked for me.

Diplo, clickhouse_driver's default port is 9000, and in the code you posted (where you say you can connect) you never change the port passed to the Client class, so it is using that default port.
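If you do want to set it explicitly, here is a minimal sketch of building the client from the Airflow connection, passing both host and port (the connection ID my_clickhouse and the credential fallbacks are assumptions based on your code):

from airflow.hooks.base import BaseHook
from clickhouse_driver import Client

def get_client(connection_name='my_clickhouse'):
    # Read host/port/credentials from the Airflow connection instead of relying on defaults
    db_props = BaseHook.get_connection(connection_name)
    return Client(
        host=db_props.host,          # e.g. host.docker.internal or the compose service name
        port=db_props.port or 9000,  # clickhouse_driver speaks the native protocol (9000), not HTTP (8123)
        user=db_props.login or 'default',
        password=db_props.password or '',
    )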
As an alternative to clickhouse_driver, you could use the community ClickHouse plugin for Airflow:
https://github.com/bryzgaloff/airflow-clickhouse-plugin/tree/master
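For instance, a rough sketch with the plugin's ClickHouseOperator; the import path and parameters have changed between plugin versions, so treat this as an assumption and check the plugin's README:

from airflow_clickhouse_plugin.operators.clickhouse import ClickHouseOperator  # path may differ by version

count_monitor = ClickHouseOperator(
    task_id='count_monitor',
    clickhouse_conn_id='my_clickhouse',  # the same Airflow connection ID as in your code
    sql='SELECT count() FROM maindb.monitor',
)

The plugin is built on clickhouse_driver, so the connection still needs to point at the native-protocol port 9000.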
Or try a BashOperator/SSHOperator to interact directly with the ClickHouse CLI, as in the sketch below.
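A hedged sketch of the BashOperator variant, assuming clickhouse-client is installed in the Airflow container and using the host, port, and file path from your question:

from airflow.operators.bash import BashOperator

load_csv = BashOperator(
    task_id='load_csv',
    bash_command=(
        'clickhouse-client --host host.docker.internal --port 9000 '
        '--query "INSERT INTO maindb.monitor FORMAT CSV" '
        '< /opt/airflow/data/test_csv_file.csv'
    ),
)

This also sidesteps the INFILE question, because clickhouse-client streams the CSV from stdin on the client side.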
Or use the HTTP interface (port 8123) with requests.
Are you sure that you can connect to 8123 from an outside IP? Sometimes access to 8123 is limited to certain hosts (depending on your infrastructure).
Try a simple curl to make sure that you can reach it from outside.
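For example, a quick check from Python, assuming the server is reachable as host.docker.internal from inside the Airflow container:

import requests

# /ping answers "Ok." when the HTTP interface on 8123 is reachable
print(requests.get('http://host.docker.internal:8123/ping', timeout=5).text)

# A query over HTTP goes in the 'query' parameter
resp = requests.get('http://host.docker.internal:8123/', params={'query': 'SELECT 1'}, timeout=5)
print(resp.text)

If this fails while the native port works, the restriction is on the HTTP interface rather than in your Airflow code.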