
I have a DataFrame that looks like this:

+-------------------------------------------+
|                                     output|
+-------------------------------------------+
|{"COLUMN1": "123", "COUMN2": {"A":1 "B":2}}|
+-------------------------------------------+

I just want to read the JSON as a string or a dictionary into a variable so that I can do further manipulations on it.

The problem is that collecting the struct column gives back nested Row objects instead of a JSON string:

Row(output=Row(COLUMN1='123', ...

How was the df created?

nextdf = df.select(struct(col("COLUMN1"),col("COLUMN2"),col("COLUMN3")).alias("output"))

Expected output:
{"COLUMN1": "123", "COUMN2": {"A":1 "B":2}}

Please let me know what I can try.

2 Answers


  1. You can use toJSON() for this case.

    Example:

    from pyspark.sql.functions import *

    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

    # Concatenate every column into one string per row, then collapse all
    # rows into a single row with one space-joined string column.
    df1 = (df.withColumn("temp", concat_ws(" ", *df.columns))
             .groupBy(lit(1))
             .agg(array_join(collect_list(col("temp")), " ").alias("new_column"))
             .drop("1"))

    print(df1.select(struct(col("new_column")).alias("new")).toJSON().collect()[0])
    #{"new":{"new_column":"1 foo 2 bar"}}
    

    To get the output without collecting it to the driver, write it to a file instead.

    Save as text:

    The text writer emits only the column values, one line per row, so no column name/header ends up in the output file. Note that the text source requires a single string column:

    df1.select("new_column").coalesce(1).write.format("text").save("output.txt")
    
    
  2. RDD is the old interface, DataFrame is the new/replacement interface. Use DataFrame methods.

    my_var = nextdf.collect()[0].asDict(recursive=True)['output']
    print(my_var)
    # a Python dict: {'COLUMN1': '123', 'COLUMN2': {'A': 1, 'B': 2}}
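Once the collected value is a plain Python dict, the standard `json` module turns it back into a JSON string for further manipulation. A minimal sketch with no Spark needed; the dict literal below stands in for what `collect()[0].asDict(recursive=True)['output']` would return:

```python
import json

# Stand-in for the dict collected from the DataFrame.
row_dict = {"COLUMN1": "123", "COLUMN2": {"A": 1, "B": 2}}

# Serialize the nested dict to a JSON string.
json_str = json.dumps(row_dict)
print(json_str)  # {"COLUMN1": "123", "COLUMN2": {"A": 1, "B": 2}}
```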
    