apache-spark Questions

Amazon web services – AWS Glue Pyspark Job is not ending

January 6, 2025
RushHour
2 Answers

I am trying to read data from RDS Postgres with PySpark 3.3 and AWS Glue 5.0, using the command below:

    df = (
        self.config.spark_details.spark.read.format("jdbc")
        .option(
            "url",
            f"jdbc:postgresql://{self.postgres_host}:{self.postgres_port}/{self.postgres_database}",
        )
        .option("driver", "org.postgresql.Driver")
        .option("user", self.postgres_username)
        .option("password", self.postgres_password)
        .option("query", query)
        .load()
    )…


Ubuntu – unionByName is only using a single core in Apache Spark

December 26, 2024
Rockstar5645
2 Answers

This has been driving me a little crazy, so any help is greatly appreciated. I have a list of dataframes, df_list, with around 500 small dataframes (I ingested CSVs and wrote them as Parquet files, then I read each one…


Azure Synapse pipeline with dataflows failing randomly

December 4, 2024
NITHIN B
2 Answers

I am having issues with a series of pipelines that build our data platform Spark databases hosted in Azure Synapse. The pipelines host dataflows which have 'recreate table' enabled. The dataflows extract data and are supposed to recreate the tables…


Amazon web services – PySpark error: "Class org.apache.hadoop.fs.s3a.S3AFileSystem not found" in EMR 7.0.0

November 15, 2024
TripleH
2 Answers

I am using EMR 7.0.0 on AWS, which has Python 3.9, Spark 3.5.0, and Hadoop 3.3.6. I got the error:

    File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 740, in csv
    File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
    File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
    File "/usr/local/lib/python3.9/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",…


Docker – Can't connect/write stream from Spark container to table in Cassandra container

November 15, 2024
user28291353
2 Answers

I am composing these services in separate docker containers all on the same confluent network:

    broker:
      image: confluentinc/cp-server:7.4.0
      hostname: broker
      container_name: broker
      depends_on:
        zookeeper:
          condition: service_healthy
      ports:
        - "9092:9092"
        - "9101:9101"
      environment:
        KAFKA_BROKER_ID: 1
        KAFKA_ZOOKEEPER_CONNECT: 'zookeeper:2181'
        KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
        KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://broker:29092,PLAINTEXT_HOST://localhost:9092…


How to set up a connection between Spark code and Spark container using Docker?

September 19, 2024
sadegh
2 Answers

I am working with a Docker setup for Hadoop and Spark using the following repository: docker-hadoop-spark. My Docker Compose YAML configuration is working correctly, and I am able to run the containers without issues. Here is the YAML configuration: version:…


Phpmyadmin – Pyspark stream kafka debezium topic Error format, ETL

August 21, 2024
NV0C
2 Answers

I have successfully created a MariaDB database connection using Debezium and Kafka. When I tried to stream the topic using PySpark, this is the output that I get:

    -------------------------------------------
    Batch: 0
    -------------------------------------------
    +------+--------------------------------------------------------------------------------------------------------------------------+
    |key |value |
    +------+--------------------------------------------------------------------------------------------------------------------------+
    ||MaxDoe1.4.2.Finalnmysqlmariadbbtruebasecampemployees mysql-bin.000032�r�ȯݭd |…


Amazon web services – Joining two 100k-row tables takes longer than half an hour

August 16, 2024
TripleH
2 Answers

I am using PySpark to join two tables with 100k rows each (so not a skewed join). It takes longer than 30 minutes, sometimes even an hour, which makes me think something is wrong. The code is just a regular join: a =…


Spark Code Completion in Visual Studio Code

August 13, 2024
Bob
2 Answers

I am trying to use Visual Studio Code for Spark development. My PySpark code all runs fine, but there is no code completion/hints. How can I add these features for VS Code? I have the Python and Pylance extensions installed.…


Azure – How to create a continuous sequence id irrespective of the runs in Databricks

July 24, 2024
Rocking Surya
2 Answers

I have a Databricks DataFrame with columns tno and data_value. Output of the first Databricks run:

    tno, data_value
    1,hdjsjsjnsns
    2,dhjdjdjsnsn
    3,jdjsjsjsjsjjs

When I run the same notebook again after some time, it should generate:

    tno, data_value
    4,hdjsjsjnsns
    5,dhjdjdjsnsn
    6,jdjsjsjsjsjjs

Just like…

