With Cassandra, if I need specific data, all I have to do is pass the keys on the join columns (with DirectJoin enabled). This works very well because it is fast. But if I do a join against a JDBC source, Spark first loads all the data and only then picks out the rows it needs. This is slow when I only need a small fraction of the data.
For example: on the right side I have a dataframe (loaded from Postgres) with 100 million rows, and on the other side there is a dataframe with 10 million rows (from another Postgres database).
In this scenario, where I need about 1/10 of the data from the second dataframe, how can I fetch only the rows that match the join columns without loading all 10 million rows? (Note: I can't just pass a filter, because the values are variable and depend on the data in the first dataframe.)
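Roughly, the scenario looks like this (table names, join keys and connection details below are just placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Cassandra side: with the connector's extensions enabled and the join keys
// covering the partition key, the join below can become a DirectJoin, so only
// the matching partitions are read from Cassandra.
val cassandraDf = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "orders"))
  .load()

// JDBC side: there is no equivalent pushdown, so Spark reads the whole
// 10M-row table before the join is executed.
val jdbcDf = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "public.orders")
  .option("user", "user")
  .option("password", "password")
  .load()

val driverDf = spark.table("big_table")      // the ~100M-row dataframe on the other side

driverDf.join(cassandraDf, Seq("order_id"))  // fast: only matching keys are fetched
driverDf.join(jdbcDf, Seq("order_id"))       // slow: full scan of the JDBC table
```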
2 Answers
No, the JDBC source in Spark doesn't support such a thing. Of the existing Spark sources that I know, only the Cassandra connector supports this kind of pattern.
To some extent this pattern is supported by Delta Lake in combination with bloom filters (right now only on Databricks; an OSS implementation is planned) and/or data skipping. So you can try another approach: implement change data capture on the PostgreSQL side, stream the changes into a Delta Lake table, and then access the data there. Your database will be very happy not to carry that additional load on the database servers 🙂
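A very rough sketch of that idea, assuming the changes are published by Debezium to Kafka (topic name, Delta paths and the change-event schema are hypothetical, and the spark-sql-kafka and Delta packages need to be on the classpath):

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()

// Simplified shape of an unwrapped Debezium change event for the source table.
val changeSchema = new StructType()
  .add("order_id", LongType)
  .add("status", StringType)
  .add("op", StringType)   // c = insert, u = update, d = delete

val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "pg.public.orders")
  .load()
  .select(from_json(col("value").cast("string"), changeSchema).as("c"))
  .select("c.*")

// Upsert every micro-batch into the Delta table that mirrors the Postgres table.
val upsert: (DataFrame, Long) => Unit = (batch, _) =>
  DeltaTable.forPath(spark, "/delta/orders").as("t")
    .merge(batch.as("s"), "t.order_id = s.order_id")
    .whenMatched("s.op = 'd'").delete()
    .whenMatched().updateAll()
    .whenNotMatched("s.op != 'd'").insertAll()
    .execute()

changes.writeStream
  .foreachBatch(upsert)
  .option("checkpointLocation", "/delta/_checkpoints/orders")
  .start()
```

Once the Delta table is in place, the join runs against it instead of hitting PostgreSQL, and Delta's data skipping (plus bloom filters where available) limits how much of it actually gets scanned.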
directJoin() is only implemented in the Spark Cassandra connector (https://github.com/datastax/spark-cassandra-connector). You will need to implement something similar yourself, along the lines of the approach that Alex Ott suggested.
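For completeness, this is roughly how the direct join is switched on when you do use the connector; it is driven by the connector's Catalyst extensions and its direct-join setting, so double-check the exact parameter names against the connector docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // The connector's Catalyst extensions add the direct-join optimization.
  .config("spark.sql.extensions",
          "com.datastax.spark.connector.CassandraSparkExtensions")
  // "auto" (the default) lets the connector decide based on relative sizes;
  // "on" forces the direct join, "off" disables it.
  .config("spark.cassandra.sql.directJoinSetting", "auto")
  .getOrCreate()
```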
P.S. Alex is one of the authors of the Spark Cassandra connector. Cheers!