pyspark Questions

How to search and delete specific lines from a parquet file in pyspark? (data purge) – Amazon Web Services

I'm starting a project to adjust the data lake for the specific purge of data, to comply with data privacy legislation. Basically the owner of the data opens a call requesting the deletion of records for a specific user, and…


How to split Json string column in Pandas Dataframe with multiple lists to multiple columns?

February 14, 2023
randomguy2443
3 Answers

I have a json string column in a dataframe that looks like this. {"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag","Liability_Asset_Ratio","CBRAvailPcnt","CMSFairIsaacScore","OweTaxes_or_IRAWithdrawalHistry","eads14Fi02S","GuarantorCount","CBRRevMon","CBRInstalMon","CMSApprovedToRequested","SecIncSource","eads59Us01S_4","Liability_Asset_Ratio_40_90","CBRAvailPcnt_20_95","CMSFairIsaacScore_Fund","eads14Fi02S_2","InstalMonthlyPayments_400_3k","RevolvingMonthlyPayments_1k_cap","ApprovedToRequested_0_100","NoSecIncome","coef_eads59Us01S_4","coef_HouseDeal_flag","coef_Liability_Asset_Ratio_40_90","coef_CBRAvailPcnt_20_95","coef_CMSFairIsaacScore_Fund","coef_OweTaxes_or_IRAWithdrawalHistry","coef_eads14Fi02S_2","coef_GuarantorCount","coef_RevolvingMonthlyPayments_1k_cap","coef_InstalMonthlyPayments_400_3k","coef_ApprovedToRequested_0_100","coef_NoSecIncome","coef_Intercept"],"data":[[569325.0,2,0.0,1,92,825,0.0,4,1.0,74,854,0.51,2,2.0,0.9,92.0,825.0,4.0,854.0,1000.0,0.51,0.0,0.11716245,0.299528064,0.392119645,-0.010826643,-0.004957868,0.339407077,0.061509795,0.3685047,0.000167603,0.000225742,0.902205454,-0.371734864,2.788087559]]} I have a columns tag in there with a list of column values, and a data tag in there with the corresponding list of values for…
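As a rough illustration of the layout described in this question, the sketch below pairs the "columns" list with each row of the "data" list using only the Python standard library. The sample string is shortened to the first three columns from the excerpt, and the helper name `split_json_to_rows` is hypothetical. It is not the accepted answer, just one way the pairing could be done.

```python
import json

# Shortened sample: only the first three columns and values from the excerpt.
raw = '{"columns":["ApplicationNum","eads59Us01S","HouseDeal_flag"],"data":[[569325.0,2,0.0]]}'


def split_json_to_rows(json_string):
    """Turn a {"columns": [...], "data": [[...], ...]} string into a list of dicts.

    Hypothetical helper: each inner list of "data" is zipped with "columns".
    """
    parsed = json.loads(json_string)
    columns = parsed["columns"]
    return [dict(zip(columns, row)) for row in parsed["data"]]


rows = split_json_to_rows(raw)
print(rows)
```

In pandas, this columns/data layout matches what `pandas.read_json` calls `orient="split"`, so each list of dicts produced here could also be expanded into DataFrame columns.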


DOCKER: Pyspark reading from Postgresql doesn't show data

January 28, 2023
Luis Felipe
2 Answers

I am trying to read data from a table in a PostgreSQL database and proceed with an ETL project. I have a Docker environment using this docker-compose: version: "3.3" services: spark-master: image: docker.io/bitnami/spark:3.3 ports: - "9090:8080" - "7077:7077" volumes: -…


Azure – Pyspark – Expand column with struct of arrays into new columns

January 26, 2023
coding
2 Answers

I have a DataFrame with a single column which is a struct type and contains an array. users_tp_df.printSchema() root |-- x: struct (nullable = true) | |-- ActiveDirectoryName: string (nullable = true) | |-- AvailableFrom: string (nullable = true) |…


Unable to read bigquery table with JSON/RECORD column type into spark dataframe. ( java.lang.IllegalStateException: Unexpected type: JSON)

January 4, 2023
Nandha
2 Answers

We are trying to read a table from BigQuery into a Spark dataframe. Structure of the table is: The following PySpark code is used for reading the data: from google.oauth2 import service_account from google.cloud import bigquery import json import base64 as bs…


Read Json in Pyspark

December 20, 2022
Douglas Oliveira
2 Answers

I want to read a JSON file in PySpark, but the JSON file is in this format (without comma and square brackets): {"id": 1, "name": "jhon"} {"id": 2, "name": "bryan"} {"id": 3, "name": "jane"} Is there an easy way to…
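The format in this question, one object after another with no commas or enclosing brackets, can be parsed in plain Python by decoding objects back to back. The sketch below uses `json.JSONDecoder.raw_decode` from the standard library; the helper name `parse_concatenated_json` is hypothetical, and the sample data is copied from the excerpt. It is a sketch of the parsing idea, not the answer given in the thread.

```python
import json


def parse_concatenated_json(text):
    """Parse JSON objects that follow each other, separated only by whitespace."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records


sample = '{"id": 1, "name": "jhon"} {"id": 2, "name": "bryan"} {"id": 3, "name": "jane"}'
records = parse_concatenated_json(sample)
print(records)
```

If each object sits on its own line in the file, that is the JSON Lines layout, which `spark.read.json` expects by default, so no preprocessing should be needed in that case.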


Azure – to_csv "No Such File or Directory" But the directory does exist – Databricks on ADLS

December 14, 2022
Cole1998
2 Answers

I've seen many iterations of this question but cannot seem to understand/fix this behavior. I am on Azure Databricks working on DBR 10.4 LTS Spark 3.2.1 Scala 2.12 trying to write a single csv file to blob storage so that…


Pyspark – Flatten nested json

December 14, 2022
Felipe Lopes
2 Answers

I have a json that looks like this: [ { "event_date": "20221207", "user_properties": [ { "key": "user_id", "value": { "set_timestamp_micros": "1670450329209558" } }, { "key": "doc_id", "value": { "set_timestamp_micros": "1670450329209558" } } ] }, { "event_date": "20221208", "user_properties": [ {…
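The desired output is cut off in this excerpt, so the sketch below shows only one possible flattening: one row per entry in "user_properties", with the parent "event_date" repeated on each row. It uses plain Python rather than PySpark, the sample is limited to the first complete record visible above, and the helper name `flatten_events` is hypothetical.

```python
# First complete record from the excerpt; later records are truncated there.
events = [
    {
        "event_date": "20221207",
        "user_properties": [
            {"key": "user_id", "value": {"set_timestamp_micros": "1670450329209558"}},
            {"key": "doc_id", "value": {"set_timestamp_micros": "1670450329209558"}},
        ],
    }
]


def flatten_events(events):
    """Emit one flat dict per user property, carrying the parent event_date."""
    rows = []
    for event in events:
        for prop in event.get("user_properties", []):
            rows.append(
                {
                    "event_date": event["event_date"],
                    "key": prop["key"],
                    "set_timestamp_micros": prop["value"].get("set_timestamp_micros"),
                }
            )
    return rows


flat_rows = flatten_events(events)
for row in flat_rows:
    print(row)
```

The same shape in PySpark would typically come from exploding the array column and then selecting the nested fields, but the exact target schema depends on the part of the question that is not shown.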


.jpg file not loading in databricks from blob storage (Azure data lake)

December 12, 2022
hkay
2 Answers

I have the .jpg pictures in the data lake in my blob storage. I am trying to load the pictures and display them for testing purposes but it seems like they can't be loaded properly. I tried a few solutions…


extract multiple columns from a json string

December 6, 2022
Diwakar Jha
2 Answers

I have JSON data that I want to represent in tabular form and later write to a different format (parquet). Schema: root |-- : string (nullable = true) Sample data: +----------------------------------------------+ +----------------------------------------------+ |{"deviceTypeId":"A2A","deviceId":"123","geo...| |{"deviceTypeId":"A2B","deviceId":"456","geo...| +----------------------------------------------+ Expected Output…
