PostgreSQL – SQL vs PySpark/Spark SQL
Could someone please help me understand why we need to use PySpark or Spark SQL etc. if the source and target of my data is the same DB? For example, let's say I need to load data to table X in…
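For context, here is a minimal sketch of what such a same-database round trip looks like in PySpark. The JDBC URL, credentials, table names, and the PostgreSQL driver version are all hypothetical placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession

# Placeholder connection details for illustration only.
spark = (SparkSession.builder
         .appName("postgres-round-trip")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
         .getOrCreate())

jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"
props = {"user": "etl_user", "password": "secret", "driver": "org.postgresql.Driver"}

# Read the source table, transform it in Spark, write back to the same DB.
src = spark.read.jdbc(jdbc_url, "public.source_table", properties=props)
out = src.filter("amount > 0")
out.write.jdbc(jdbc_url, "public.table_x", mode="append", properties=props)
```

When both ends really are one database, a plain SQL `INSERT INTO ... SELECT` inside it avoids shipping the rows through Spark at all, which is the trade-off the question is getting at.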
I was trying to read from a table in Snowflake, manipulate the data, and write it back. I was able to connect to Snowflake and read the data as a DataFrame, but I cannot write back to the table. Code…
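For the write-back step, a hedged sketch using the spark-snowflake connector; every connection value and table name below is a placeholder, and the connector version is only an example:

```python
# Assumes an existing SparkSession `spark` with the spark-snowflake connector
# on the classpath, e.g. --packages net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

# All connection values are placeholders.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "secret",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df = (spark.read.format(SNOWFLAKE_SOURCE)
      .options(**sf_options)
      .option("dbtable", "SOURCE_TABLE")
      .load())

transformed = df.dropDuplicates()

# mode("overwrite") replaces the table; use "append" to add rows instead.
(transformed.write.format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "TARGET_TABLE")
    .mode("overwrite")
    .save())
```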
I'm working on some data (~200 GB) using Spark in Azure Databricks. I am able to read the dataset (from Blob Storage) and modify it in various ways. However, every time I try to store it, either through .saveAsTable() or .csv()…
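A sketch of both store paths, under the assumption that the failure comes from oversized write tasks; the mount path, table name, and partition count are placeholders:

```python
# Assumes an existing DataFrame `df`; the mount path and table name are placeholders.
out_path = "dbfs:/mnt/mycontainer/output/dataset_csv"

# Repartitioning first keeps individual write tasks (and their memory use) bounded.
(df.repartition(200)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(out_path))

# Or persist as a managed table (Delta by default on Databricks).
df.write.mode("overwrite").saveAsTable("my_schema.my_table")
```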
I'm running a Spark application in AWS EMR. The code is like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, max as max_

with SparkSession.builder.appName(f"Spark App").getOrCreate() as spark:
    dataframe = spark.read.format('jdbc').options( ... ).load()
    print("Log A")
    max_date_result = dataframe.agg(
        max_(date_format('date', 'yyyy-MM-dd')).alias('max_date')
    ).collect()[0]
    print("Log B")
```

This application always gets stuck for a long time…
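One common cause is that `.load()` plus `.collect()` drags the entire table through a single JDBC connection before the `max` runs. A hedged sketch of pushing the aggregation into the source database instead, assuming a PostgreSQL source and placeholder connection details:

```python
# Placeholder URL/credentials; the subquery must be aliased for the JDBC source.
max_date_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "(SELECT max(date) AS max_date FROM my_table) AS t")
    .option("user", "etl_user")
    .option("password", "secret")
    .load())

# Only one row crosses the wire instead of the whole table.
max_date = max_date_df.collect()[0]["max_date"]
```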
I have an Azure storage account (ADLS Gen2) and need to copy files like config.yaml, text files, and gz files so I can reference them inside my code. I have tried the steps listed in https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is to mount…
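A sketch of the mount-and-read pattern from that doc, with hypothetical storage account, container, and linked-service names; in Synapse, mounts are job-scoped and resolve under a `/synfs/{jobId}/` prefix:

```python
from notebookutils import mssparkutils

# Hypothetical storage account, container, and linked service names.
mssparkutils.fs.mount(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net",
    "/config",
    {"linkedService": "my_linked_service"},
)

# Mounts are job-scoped in Synapse: local paths live under /synfs/{jobId}/.
job_id = mssparkutils.env.getJobId()
with open(f"/synfs/{job_id}/config/config.yaml") as f:
    print(f.read())
```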
I have a DataFrame in PySpark and would like to save it as a CSV with the current timestamp as the file name. I am executing this in an Azure Synapse notebook and would like to run the notebook…
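A minimal sketch, assuming an existing DataFrame `df` and a placeholder ADLS path. Note that Spark writes a directory of part files, so the timestamp goes into the directory name:

```python
from datetime import datetime

# Assumes an existing DataFrame `df`; container and account are placeholders.
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = f"abfss://mycontainer@myaccount.dfs.core.windows.net/exports/report_{ts}"

# coalesce(1) yields a single part file inside the timestamped directory.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(out_dir)

# To get one file literally named <timestamp>.csv, rename the part file
# afterwards, e.g. with mssparkutils.fs.mv in Synapse.
```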
I'm trying to read a CSV file in Databricks using PySpark where the columns are shuffled: instead of A, B, C they are randomly arranged, e.g. C, A, B. I tried using map(), but it throws the error 'cannot pickle '_thread.RLock' object'. I…
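That pickling error typically means the `map()` lambda captured the SparkSession or DataFrame itself. If the file has a header row, no row-level mapping is needed at all; a sketch with a placeholder path:

```python
# With a header row, Spark maps columns by name, so the physical order in
# the file is irrelevant; the path below is a placeholder.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/raw/input.csv"))

# Select into the order the downstream code expects.
df = df.select("A", "B", "C")
```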
I tried to save a PySpark DataFrame to a SQL database in Synapse:

```python
from pyspark.sql import Row

test = spark.createDataFrame([Row("Sarah", 28), Row("Anne", 5)], ["Name", "Age"])
(test.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://XXXX.sql.azuresynapse.net:1433;database=azlsynddap001;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;Authentication=ActiveDirectoryIntegrated")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "test_CP")
    .save())
```

I got the following error: `IllegalArgumentException: KrbException: Cannot locate default…`
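The `KrbException` usually arises because `Authentication=ActiveDirectoryIntegrated` falls back to Kerberos on a Linux Spark pool. A hedged sketch of the same write with a different authentication mode, assuming the required MSAL libraries are available to the driver; the user and password are placeholders:

```python
# Placeholder user/password; everything else mirrors the original options.
url = ("jdbc:sqlserver://XXXX.sql.azuresynapse.net:1433;"
       "database=azlsynddap001;encrypt=true;trustServerCertificate=false;"
       "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;"
       "Authentication=ActiveDirectoryPassword")

(test.write
    .format("jdbc")
    .option("url", url)
    .option("user", "user@mytenant.onmicrosoft.com")
    .option("password", "secret")
    .option("dbtable", "test_CP")
    .save())
```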
I'm unable to connect to Snowflake via a dockerized PySpark container. I do not find the Snowflake documentation, nor the PySpark documentation, helpful at this point in time. I'm using the following configuration, installed & can be seen…
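A sketch of the piece that most often goes missing in a container: putting the Snowflake JDBC driver and the spark-snowflake connector on the classpath at session start. The versions below are examples only and must match the container's Spark/Scala build:

```python
from pyspark.sql import SparkSession

# Example versions only; they must match the container's Spark/Scala build.
spark = (SparkSession.builder
    .appName("snowflake-docker")
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.13.30,"
            "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4")
    .getOrCreate())
```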
I have to run a Python program on Red Hat 8, so I pulled a Red Hat Docker image and wrote a Dockerfile like the following:

```dockerfile
FROM redhat/ubi8:latest
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    mkdir /home/spark && \
    mkdir /home/spark/spark && \
    mkdir…
```