
Could someone please help me understand why we need to use PySpark or Spark SQL, etc., if the source and target of my data are the same DB?

For example, let's say I need to load data into table X in a Postgres DB from tables X and Y. Would it not be simpler and faster to just do it in Postgres instead of using Spark SQL or PySpark, etc.?

I understand the need for these solutions if the data comes from multiple sources, but if it all comes from the same source, do I need to use PySpark?

3 Answers


  1. Chosen as BEST ANSWER

    Thank you all for the feedback. I think I will use Glue PySpark if the source and destination are different. Otherwise, I will use Glue Python with a JDBC connection and have one session do the tasks without bringing the data into DataFrames.
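    As a rough sketch of that second case (the connection details, credentials, table names, and join condition are placeholders, and I'm assuming a Glue Python Shell job with psycopg2 available), the idea is to issue a single INSERT ... SELECT over one connection so Postgres does all of the work in-database:

    ```python
    # Glue Python Shell sketch: push the work down to Postgres.
    # psycopg2 availability, connection details, and table/column names are assumptions.
    import psycopg2

    conn = psycopg2.connect(
        host="my-postgres-host",   # placeholder endpoint
        dbname="mydb",
        user="etl_user",
        password="...",            # in practice, fetch from a Glue connection / Secrets Manager
    )

    # One session, one statement: Postgres joins X and Y and loads the target itself,
    # so no rows ever leave the database or get pulled into DataFrames.
    with conn, conn.cursor() as cur:
        cur.execute("""
            INSERT INTO table_x_target (id, col_a, col_b)
            SELECT x.id, x.col_a, y.col_b
            FROM table_x AS x
            JOIN table_y AS y ON y.x_id = x.id
        """)

    conn.close()
    ```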


  2. You can use Spark when you want to do heavy data transformations; its distributed processing makes the data easier to load and process.

    It totally depends on how large the data is and how you want to transform it.

    Using Postgres will be a good idea if the data is relatively small and no transformation is required.

  3. It is not necessary to use PySpark. Both PySpark and Spark SQL have their value in managing/manipulating large volumes of data (a few hundred GBs, TBs, or PBs) in a distributed computing setup. If this is your case, use PySpark; it will be more efficient at loading, manipulating, and processing/shaping the data before inserting it into another table.
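    As a purely illustrative sketch of that large-data case (the JDBC URL, credentials, driver, and column names are placeholders), a PySpark job would read both tables over JDBC, do the heavy transformation on the Spark executors, and write the result back:

    ```python
    # PySpark sketch: read table_x and table_y over JDBC, join/shape them in Spark,
    # then insert the result into the target table. All names below are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("postgres-to-postgres").getOrCreate()

    jdbc_url = "jdbc:postgresql://my-postgres-host:5432/mydb"   # placeholder URL
    props = {"user": "etl_user", "password": "...", "driver": "org.postgresql.Driver"}

    x = spark.read.jdbc(url=jdbc_url, table="table_x", properties=props)
    y = spark.read.jdbc(url=jdbc_url, table="table_y", properties=props)

    # The heavy lifting (join/transform) is distributed across the cluster.
    result = x.join(y, x.id == y.x_id).select(x.id, x.col_a, y.col_b)

    result.write.jdbc(url=jdbc_url, table="table_x_target", mode="append", properties=props)
    ```

    This only pays off when the volumes are genuinely large; for small tables, the round trip out of and back into Postgres is pure overhead.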
