I have a MongoDB database with multiple collections (let's say 3). I want to join them all so that I can do some aggregations. I was able to connect to a local database and one collection using documentation found online, but I want to be able to read multiple collections.
from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.12:3.0.1") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.collA") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1/mydb.output") \
    .getOrCreate()
df = my_spark.read.format("mongo").load()
df.printSchema()
The above code lets me read collA into a df, but I also want to be able to read collB, collC, and so on.
2 Answers
You can provide the source collection as part of the spark.read call, so each collection can be loaded into its own DataFrame. If the collections are schema compatible, you can also union them; see the sketch below.
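A minimal sketch of both ideas, reusing the my_spark session from the question and assuming the collections live in the same mydb database; the join key customer_id is only a placeholder for whatever field your collections actually share:

# Read each collection by overriding the "collection" option on the reader.
df_a = my_spark.read.format("mongo").option("collection", "collA").load()
df_b = my_spark.read.format("mongo").option("collection", "collB").load()
df_c = my_spark.read.format("mongo").option("collection", "collC").load()

# Join on a shared key so aggregations can run across all three collections.
joined = df_a.join(df_b, on="customer_id").join(df_c, on="customer_id")

# If the collections are schema compatible, union them instead of joining.
combined = df_a.unionByName(df_b).unionByName(df_c)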
You can pass a list of URIs into the spark.mongodb.input.uri config. The first one is the default, but switching the collection option before the load lets you read the other collections as well. Source
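A minimal sketch of this approach, assuming the same my_spark session as above; the default spark.mongodb.input.uri still points at mydb.collA, and each subsequent read only overrides the collection name:

# A plain load uses the default input URI, i.e. collA.
df_a = my_spark.read.format("mongo").load()

# Switching the collection option before the load reads the other collections.
df_b = my_spark.read.format("mongo").option("collection", "collB").load()
df_c = my_spark.read.format("mongo").option("collection", "collC").load()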