PostgreSQL – SQL vs PySpark/Spark SQL
Could someone please help me understand why we need to use PySpark or Spark SQL etc. if the source and target of my data is the same DB? For example, let's say I need to load data to table X in…
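For context, here is a minimal sketch of what such a same-database round trip looks like in PySpark. The JDBC URL, credentials, table names, and the PostgreSQL driver version are all hypothetical placeholders, not taken from the question:

```python
from pyspark.sql import SparkSession

# Placeholder connection details for illustration only.
spark = (SparkSession.builder
         .appName("postgres-round-trip")
         .config("spark.jars.packages", "org.postgresql:postgresql:42.6.0")
         .getOrCreate())

jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"
props = {"user": "etl_user", "password": "secret", "driver": "org.postgresql.Driver"}

# Read the source table, transform it in Spark, write back to the same DB.
src = spark.read.jdbc(jdbc_url, "public.source_table", properties=props)
out = src.filter("amount > 0")
out.write.jdbc(jdbc_url, "public.table_x", mode="append", properties=props)
```

When both ends really are one database, a plain SQL `INSERT INTO ... SELECT` inside it avoids shipping the rows through Spark at all, which is the trade-off the question is getting at.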
I was trying to read from a table in Snowflake, manipulate the data, and write it back. I was able to connect to Snowflake and read the data as a DataFrame, but I cannot write back to the table. Code…
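For the write-back step, a hedged sketch using the spark-snowflake connector; every connection value and table name below is a placeholder, and the connector version is only an example:

```python
# Assumes an existing SparkSession `spark` with the spark-snowflake connector
# on the classpath, e.g. --packages net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4
SNOWFLAKE_SOURCE = "net.snowflake.spark.snowflake"

# All connection values are placeholders.
sf_options = {
    "sfURL": "myaccount.snowflakecomputing.com",
    "sfUser": "etl_user",
    "sfPassword": "secret",
    "sfDatabase": "MY_DB",
    "sfSchema": "PUBLIC",
    "sfWarehouse": "MY_WH",
}

df = (spark.read.format(SNOWFLAKE_SOURCE)
      .options(**sf_options)
      .option("dbtable", "SOURCE_TABLE")
      .load())

transformed = df.dropDuplicates()

# mode("overwrite") replaces the table; use "append" to add rows instead.
(transformed.write.format(SNOWFLAKE_SOURCE)
    .options(**sf_options)
    .option("dbtable", "TARGET_TABLE")
    .mode("overwrite")
    .save())
```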
I'm working on some data (~200 GB) using Spark in Azure Databricks. I am able to read the dataset (from Blob Storage) and modify it in various ways. However, every time I try to store it, either through .saveAsTable() or .csv()…
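A sketch of both store paths, under the assumption that the failure comes from oversized write tasks; the mount path, table name, and partition count are placeholders:

```python
# Assumes an existing DataFrame `df`; the mount path and table name are placeholders.
out_path = "dbfs:/mnt/mycontainer/output/dataset_csv"

# Repartitioning first keeps individual write tasks (and their memory use) bounded.
(df.repartition(200)
   .write
   .mode("overwrite")
   .option("header", "true")
   .csv(out_path))

# Or persist as a managed table (Delta by default on Databricks).
df.write.mode("overwrite").saveAsTable("my_schema.my_table")
```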
I'm running a Spark application in AWS EMR. The code is like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, max as max_

with SparkSession.builder.appName(f"Spark App").getOrCreate() as spark:
    dataframe = spark.read.format('jdbc').options( ... ).load()
    print("Log A")
    max_date_result = dataframe.agg(
        max_(date_format('date', 'yyyy-MM-dd')).alias('max_date')
    ).collect()[0]
    print("Log B")
```

This application always gets stuck for a long time…
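One common cause is that `.load()` plus `.collect()` drags the entire table through a single JDBC connection before the `max` runs. A hedged sketch of pushing the aggregation into the source database instead, assuming a PostgreSQL source and placeholder connection details:

```python
# Placeholder URL/credentials; the subquery must be aliased for the JDBC source.
max_date_df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")
    .option("dbtable", "(SELECT max(date) AS max_date FROM my_table) AS t")
    .option("user", "etl_user")
    .option("password", "secret")
    .load())

# Only one row crosses the wire instead of the whole table.
max_date = max_date_df.collect()[0]["max_date"]
```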
I have an Azure storage account (ADLS Gen2) and need to copy files like config.yaml, text files, and gz files so I can reference them inside my code. I have tried the steps listed in https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-file-mount-api, but what this does is to mount…
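A sketch of the mount-and-read pattern from that doc, with hypothetical storage account, container, and linked-service names; in Synapse, mounts are job-scoped and resolve under a `/synfs/{jobId}/` prefix:

```python
from notebookutils import mssparkutils

# Hypothetical storage account, container, and linked service names.
mssparkutils.fs.mount(
    "abfss://mycontainer@mystorageaccount.dfs.core.windows.net",
    "/config",
    {"linkedService": "my_linked_service"},
)

# Mounts are job-scoped in Synapse: local paths live under /synfs/{jobId}/.
job_id = mssparkutils.env.getJobId()
with open(f"/synfs/{job_id}/config/config.yaml") as f:
    print(f.read())
```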
I have a DataFrame in PySpark and would like to save it as a CSV with the current timestamp as the file name. I am executing this in an Azure Synapse notebook and would like to run the notebook…
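A minimal sketch, assuming an existing DataFrame `df` and a placeholder ADLS path. Note that Spark writes a directory of part files, so the timestamp goes into the directory name:

```python
from datetime import datetime

# Assumes an existing DataFrame `df`; container and account are placeholders.
ts = datetime.now().strftime("%Y%m%d_%H%M%S")
out_dir = f"abfss://mycontainer@myaccount.dfs.core.windows.net/exports/report_{ts}"

# coalesce(1) yields a single part file inside the timestamped directory.
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(out_dir)

# To get one file literally named <timestamp>.csv, rename the part file
# afterwards, e.g. with mssparkutils.fs.mv in Synapse.
```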
I'm trying to read a CSV file in Databricks using PySpark where the columns are shuffled: instead of A, B, C they are randomly arranged, e.g. C, A, B. I tried using map(), but it throws the error 'cannot pickle '_thread.RLock' object'. I…
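That pickling error typically means the `map()` lambda captured the SparkSession or DataFrame itself. If the file has a header row, no row-level mapping is needed at all; a sketch with a placeholder path:

```python
# With a header row, Spark maps columns by name, so the physical order in
# the file is irrelevant; the path below is a placeholder.
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("dbfs:/mnt/raw/input.csv"))

# Select into the order the downstream code expects.
df = df.select("A", "B", "C")
```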
I tried to save a PySpark DataFrame to a SQL database in Synapse:

```python
from pyspark.sql import Row

test = spark.createDataFrame([Row("Sarah", 28), Row("Anne", 5)], ["Name", "Age"])
(test.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://XXXX.sql.azuresynapse.net:1433;database=azlsynddap001;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;Authentication=ActiveDirectoryIntegrated")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "test_CP")
    .save())
```

I got the following error: `IllegalArgumentException: KrbException: Cannot locate default…`
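The `KrbException` usually arises because `Authentication=ActiveDirectoryIntegrated` falls back to Kerberos on a Linux Spark pool. A hedged sketch of the same write with a different authentication mode, assuming the required MSAL libraries are available to the driver; the user and password are placeholders:

```python
# Placeholder user/password; everything else mirrors the original options.
url = ("jdbc:sqlserver://XXXX.sql.azuresynapse.net:1433;"
       "database=azlsynddap001;encrypt=true;trustServerCertificate=false;"
       "hostNameInCertificate=*.sql.azuresynapse.net;loginTimeout=30;"
       "Authentication=ActiveDirectoryPassword")

(test.write
    .format("jdbc")
    .option("url", url)
    .option("user", "user@mytenant.onmicrosoft.com")
    .option("password", "secret")
    .option("dbtable", "test_CP")
    .save())
```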
I'm unable to connect to Snowflake via a dockerized PySpark container. I do not find the Snowflake documentation, nor the PySpark documentation, helpful at this point in time. I'm using the following configuration, installed & can be seen…
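A sketch of the piece that most often goes missing in a container: putting the Snowflake JDBC driver and the spark-snowflake connector on the classpath at session start. The versions below are examples only and must match the container's Spark/Scala build:

```python
from pyspark.sql import SparkSession

# Example versions only; they must match the container's Spark/Scala build.
spark = (SparkSession.builder
    .appName("snowflake-docker")
    .config("spark.jars.packages",
            "net.snowflake:snowflake-jdbc:3.13.30,"
            "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4")
    .getOrCreate())
```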
I have to run a Python program on Red Hat 8, so I pulled a Red Hat Docker image and wrote a Dockerfile like the following:

```dockerfile
FROM redhat/ubi8:latest
RUN echo "nameserver 9.9.9.9" >> /etc/resolv.conf && \
    mkdir /home/spark && \
    mkdir /home/spark/spark && \
    mkdir…
```