
As far as I know, Spark executors handle many tasks at the same time to process data in parallel. Here comes the question: when connecting to external data storage, say MySQL, how many tasks are there to finish this job? In other words, are multiple tasks created at the same time, each reading all the data, or is the data read by only one task and distributed to the cluster in some other way? And how about writing data to MySQL, how many connections are there?

Here is a piece of code that reads or writes data from/to MySQL:


    def jdbc(sqlContext: SQLContext, url: String, driver: String, dbtable: String, user: String, password: String, numPartitions: Int): DataFrame = {
      sqlContext.read.format("jdbc").options(Map(
        "url" -> url,
        "driver" -> driver,
        "dbtable" -> s"(SELECT * FROM $dbtable) $dbtable",
        "user" -> user,
        "password" -> password,
        "numPartitions" -> numPartitions.toString
      )).load()
    }

    def mysqlToDF(sparkSession: SparkSession, jdbc: JdbcInfo, table: String): DataFrame = {
      val df = sparkSession.sqlContext.read.format("jdbc")
        .option("url", jdbc.jdbcUrl)
        .option("user", jdbc.user)
        .option("password", jdbc.passwd)
        .option("driver", jdbc.jdbcDriver)
        .option("dbtable", table)
        .load()
      df.createOrReplaceTempView(s"$table")
      df
    }

2 Answers


  1. Here is a good article which answers your question:
    https://freecontent.manning.com/what-happens-behind-the-scenes-with-spark/

    In simple words: the workers split the read into several tasks, and each worker reads only a part of your input data. How many tasks the read is divided into depends on your resources and your data volume. Writing follows the same principle: Spark writes the data to a distributed storage system, such as HDFS, and in HDFS the data is stored in a distributed way: each worker writes its data to some storage node in HDFS.
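
    A quick way to see this for yourself (a minimal sketch, assuming a running SparkSession named spark and a made-up HDFS path) is to check the partition count of the DataFrame a read produces; Spark schedules one task per partition:

        val df = spark.read.parquet("hdfs:///data/events")  // hypothetical path
        println(df.rdd.getNumPartitions)                    // number of read tasks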

  2. By default, data from a JDBC source is loaded by a single thread, so you will have one task processed by one executor; that is the case you can expect with your second function, mysqlToDF.

    In the first function, jdbc, you are closer to a parallel read, but some parameters are still missing: numPartitions alone is not enough. Spark needs an integer/date column and lower/upper bounds to be able to read in parallel (it will execute numPartitions queries for partial results); see the sketch after the documentation excerpt below.

    Spark JDBC documentation: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

    In that documentation you will find:

    partitionColumn, lowerBound, upperBound (none): These options must all be specified if any of them is specified. In addition, numPartitions must be specified. They describe how to partition the table when reading in parallel from multiple workers. partitionColumn must be a numeric, date, or timestamp column from the table in question. Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in the table. So all rows in the table will be partitioned and returned. This option applies only to reading.

    numPartitions (none): The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing. Applies to read/write.
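
    To make the jdbc helper from the question actually read in parallel, the call would look roughly like this (a sketch; the URL, driver, table, credentials, column, and bounds are all made-up placeholders):

        val df = spark.read.format("jdbc")
          .option("url", "jdbc:mysql://host:3306/db")  // placeholder
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "events")                 // placeholder
          .option("user", "dbuser")                    // placeholder
          .option("password", "secret")                // placeholder
          .option("partitionColumn", "id")             // must be a numeric, date, or timestamp column
          .option("lowerBound", "1")                   // bounds set the stride only, they do not filter
          .option("upperBound", "1000000")
          .option("numPartitions", "8")
          .load()

    Each of the 8 tasks then runs its own query over one stride of the id range, so the table is read over 8 concurrent JDBC connections.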

    Regarding the write side of the question:

    How about writing data to MySQL, how many connections are there?

    As stated in the documentation, it also depends on numPartitions: if the number of partitions when writing is higher than numPartitions, Spark will notice and call coalesce. Remember that coalesce may generate skew, so sometimes it is better to repartition explicitly with repartition(numPartitions) to distribute the data equally before the write, as shown in the sketch below.

    If you don't set numPartitions, the number of parallel connections on write may be the same as the number of active tasks at a given moment, so be aware that with too high parallelism and no upper bound you may choke the source server.
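
    A minimal sketch of such a bounded write (the URL, driver, table, and credentials are made-up placeholders):

        df.repartition(8)                              // spread rows evenly first, avoiding coalesce skew
          .write
          .format("jdbc")
          .option("url", "jdbc:mysql://host:3306/db")  // placeholder
          .option("driver", "com.mysql.jdbc.Driver")
          .option("dbtable", "events_out")             // placeholder
          .option("user", "dbuser")                    // placeholder
          .option("password", "secret")                // placeholder
          .option("numPartitions", "8")                // caps concurrent JDBC connections
          .mode("append")
          .save()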
