
I have a DataFrame that looks like this:

+-------------------------------------------+
|                                     output|
+-------------------------------------------+
|{"COLUMN1": "123", "COUMN2": {"A":1 "B":2}}|
+-------------------------------------------+

I just want to read the JSON as a string or a dictionary into a variable so that I can do further manipulations on it.

The problem is that collecting the struct column gives back nested Row objects instead of a JSON string:

Row(output=Row(COLUMN1='123', ...

How was the df created?

nextdf = df.select(struct(col("COLUMN1"),col("COLUMN2"),col("COLUMN3")).alias("output"))

Expected output:
{"COLUMN1": "123", "COUMN2": {"A":1 "B":2}}

Please let me know what I can try.

2 Answers


  1. You can use toJSON() for this case.

    Example:

    from pyspark.sql.functions import *

    df = spark.createDataFrame([(1, "foo"), (2, "bar")], ["id", "label"])

    # Concatenate every column into one string per row, then collapse all
    # rows into a single row with one space-joined string column.
    df1 = (df.withColumn("temp", concat_ws(" ", *df.columns))
             .groupBy(lit(1))
             .agg(array_join(collect_list(col("temp")), " ").alias("new_column"))
             .drop("1"))

    print(df1.select(struct(col("new_column")).alias("new")).toJSON().collect()[0])
    #{"new":{"new_column":"1 foo 2 bar"}}
    

    To get the output without collecting it to the driver, write it to a file instead.

    Save as text:

    The text writer emits only the column values, one line per row, so no column name/header ends up in the output file. Note that the text source requires a single string column:

    df1.select("new_column").coalesce(1).write.format("text").save("output.txt")
    
    
  2. RDD is the old interface, DataFrame is the new/replacement interface. Use DataFrame methods.

    my_var = nextdf.collect()[0].asDict(recursive=True)['output']
    print(my_var)
    # a Python dict: {'COLUMN1': '123', 'COLUMN2': {'A': 1, 'B': 2}}
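Once the collected value is a plain Python dict, the standard `json` module turns it back into a JSON string for further manipulation. A minimal sketch with no Spark needed; the dict literal below stands in for what `collect()[0].asDict(recursive=True)['output']` would return:

```python
import json

# Stand-in for the dict collected from the DataFrame.
row_dict = {"COLUMN1": "123", "COLUMN2": {"A": 1, "B": 2}}

# Serialize the nested dict to a JSON string.
json_str = json.dumps(row_dict)
print(json_str)  # {"COLUMN1": "123", "COLUMN2": {"A": 1, "B": 2}}
```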
    