I have a list of .json files
that contain person information. One file contains the info of one person. I want to load this data into a table using PySpark
in an Azure Databricks notebook.
Let’s say the files are built like this:
{
  "id": 1,
  "name": "Homer",
  "address": {
    "street": "742 Evergreen Terrace",
    "city": "Springfield"
  }
}
Fairly simple JSON here, which I can read into a dataframe with this code:
from pyspark.sql.functions import col

sourcejson = spark.read.json("path/to/json")
df = (
    sourcejson.select(
        col('id'),
        col('name'),
        col('address.street').alias('street'),
        col('address.city').alias('city')
    )
)
which gives the expected result:
id | name | street | city
1 | Homer | 742 Evergreen Terrace | Springfield
However, the problem starts when the address is unknown. In that case, the whole address struct in the JSON will just be null:
{
  "id": 2,
  "name": "Ned",
  "address": null
}
In the example file above, we don’t know Ned’s address, so we have a null. Using the code from before, I would expect a result like this:
id | name | street | city
2 | Ned | null | null
However, running the code results in an error:
[INVALID_EXTRACT_BASE_FIELD_TYPE] Can't extract a value from "address". Need a complex type [STRUCT, ARRAY, MAP] but got "STRING"
I understand the reason behind the error, but I can’t find a solution for it. Any ideas how we could handle this?
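For what it’s worth, printing the inferred schema confirms what the error says (the output comment below is what I’d expect, since Spark types a JSON field that is always null as a string):

# path is hypothetical; printSchema() is a standard DataFrame method
sourcejson = spark.read.json("path/to/ned.json")
sourcejson.printSchema()
# root
#  |-- address: string (nullable = true)
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)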
2 Answers
You’re creating (an avoidable) problem by reading one file at a time. Read all files at once:

spark.read.json('folder/with/all/json/files')

instead of:

spark.read.json('folder/with/all/json/files/file1')

and then:

spark.read.json('folder/with/all/json/files/file2')
There is a little gotcha here. In the OP you’re reading one file at a time; practically you’ll be reading all files at once. Even then, Spark will infer address as STRING when every file has address = null, unless you specify the schema while reading the files. With a schema that declares address as StructType([StructField('city', StringType(), True), StructField('street', StringType(), True)]), Spark keeps address as a struct even for null values, and your original code will work as is.
Use coalesce() if you do actually want to use some specific default value for null values. E.g. the code below translates address=null in the JSON file to {city='', street=null} in the dataframe, instead of the {city=null, street=null} that Spark produces by default when you read all files at once.
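A minimal sketch of what that could look like, assuming the files are read with the explicit schema described above (person_schema is a placeholder name):

from pyspark.sql.functions import coalesce, col, lit, struct
from pyspark.sql.types import StructType, StructField, StringType, LongType

# Explicit schema, so a null address is still typed as a struct
person_schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('address', StructType([
        StructField('city', StringType(), True),
        StructField('street', StringType(), True)
    ]), True)
])

sourcejson = spark.read.json('folder/with/all/json/files', schema=person_schema)

# Rebuild address: a missing city becomes '', while street stays null
df = sourcejson.withColumn(
    'address',
    struct(
        coalesce(col('address.city'), lit('')).alias('city'),
        col('address.street').alias('street')
    )
)

With this, Ned’s row comes out as address = {city='', street=null} rather than a fully null struct.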
When you don’t provide a schema for spark.read.json, it will be inferred from the data. So when the address is missing in all objects, Spark assumes it is a StringType, and that’s why you are getting the error. One possible solution is to read the data with a schema:
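For example (a sketch: the exact field order and nullability flags are assumptions, but spark.read.json does accept a schema argument):

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField('id', LongType(), True),
    StructField('name', StringType(), True),
    StructField('address', StructType([
        StructField('city', StringType(), True),
        StructField('street', StringType(), True)
    ]), True)
])

# address stays a struct even when it is null in every file
sourcejson = spark.read.json('folder/with/all/json/files', schema=schema)

With this schema in place, the original select returns null for both street and city on Ned’s row, as expected.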