SparkContext: Error initializing SparkContext while Running Spark Job via Google Dataproc
After I upgraded the Google Dataproc image version from 1.3.62-debian9 to 1.4-debian, all Spark Dataproc jobs started failing with an error:
22/01/09 00:36:50 INFO org.spark_project.jetty.server.Server: Started 3339ms
22/01/09 00:36:50 INFO org.spark_project.jetty.server.AbstractConnector: Started
22/01/09 00:36:50 WARN org.apache.spark.scheduler.FairSchedulableBuilder: Fair
Scheduler configuration file not found so jobs will be scheduled in FIFO order. To use
fair scheduling, configure pools in fairscheduler.xml or set
spark.scheduler.allocation.file to a file that contains the configuration.
ERROR org.apache.spark.SparkContext: Error initializing SparkContext.
java.lang.NumberFormatException: For input string: "30s"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1441)
I don't set '30s' anywhere in the Spark configuration file or on the SparkConf object.
This is how I initialize the SparkContext in my code:
val conf = new SparkConf().setAppName(getMainName().toString)
val sc = new SparkContext(conf)
Spark version: 2.3.0
I saw that the default of 'spark.scheduler.maxRegisteredResourcesWaitingTime' is the same value, 30s
(https://spark.apache.org/docs/latest/configuration.html#spark-configuration), but I did not change or update it.
I do not understand where this value comes from and why it is related to upgrading the Dataproc image.
2 Answers
It's related to Apache Hadoop: newer versions add a time unit to several defaults in hdfs-default.xml.
More info about the issue: https://issues.apache.org/jira/browse/HDFS-12920 and "Apache Tez Job fails due to java.lang.NumberFormatException for input string: "30s"".
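The stack trace fits that explanation: newer hdfs-default.xml ships values such as "30s", but an older Hadoop client jar on the job's classpath still reads them with Configuration.getLong, which accepts only a plain number. A minimal Scala sketch of the mismatch, assuming hadoop-common is on the classpath and using dfs.client.datanode-restart.timeout purely as an illustrative property name:

import java.util.concurrent.TimeUnit
import org.apache.hadoop.conf.Configuration

object TimeUnitParsingDemo {
  def main(args: Array[String]): Unit = {
    // Empty config; this property and its "30s" value stand in for the
    // HDFS defaults that gained a time-unit suffix.
    val conf = new Configuration(false)
    conf.set("dfs.client.datanode-restart.timeout", "30s")

    // Newer Hadoop clients parse the suffix with getTimeDuration:
    val secs = conf.getTimeDuration("dfs.client.datanode-restart.timeout", 30L, TimeUnit.SECONDS)
    println(s"getTimeDuration -> $secs seconds")

    // Older clients still call getLong, which cannot parse "30s":
    try {
      conf.getLong("dfs.client.datanode-restart.timeout", 30L)
    } catch {
      case e: NumberFormatException => println(s"getLong -> $e") // For input string: "30s"
    }
  }
}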
The 1.4 image is approaching EOL too; could you try 1.5 instead?
That said, the problem is probably in your app, which likely brings in some old Hadoop/Spark jars and/or configs that break Spark, because when you SSH into the 1.4 cluster's main node and execute your code in spark-shell, it works:
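A rough sketch of that kind of check in the REPL (the app name is a placeholder; the shell's pre-created context is stopped first so a fresh one can be built the same way the app does):

// On the cluster's main node: run `spark-shell`, then at the Scala prompt:
sc.stop() // the shell pre-creates a SparkContext named sc
val conf = new org.apache.spark.SparkConf().setAppName("test-app") // placeholder app name
val sc2 = new org.apache.spark.SparkContext(conf)
sc2.parallelize(1 to 10).count() // simple action to confirm the new context works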