scala - Spark driver disassociated and removed by the master
I have a cluster made up of 2 slaves and 1 master, all set up, and I submit a jar (Scala) to the Spark master (192.168.1.64):
spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar
After running fine for quite some time, it stops abruptly, the last lines on the terminal being:
...
15/08/19 17:45:24 info scheduler.taskschedulerimpl: adding task set 411292.0 6 tasks
15/08/19 17:45:24 warn scheduler.tasksetmanager: stage 411292 contains task of large size (2762 kb). maximum recommended task size 100 kb.
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 2.0 in stage 411292.0 (tid 1832, 192.168.1.64, process_local, 2828792 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 0.0 in stage 411292.0 (tid 1833, 192.168.1.62, process_local, 2310009 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 3.0 in stage 411292.0 (tid 1834, 192.168.1.64, process_local, 2669188 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 1.0 in stage 411292.0 (tid 1835, 192.168.1.62, process_local, 2295676 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 4.0 in stage 411292.0 (tid 1836, 192.168.1.64, process_local, 2847786 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 5.0 in stage 411292.0 (tid 1837, 192.168.1.64, process_local, 2913528 bytes)
killed
and the error occurring in the master log is the following:
...
15/08/19 16:09:49 info master.master: launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
15/08/19 16:09:49 info master.master: launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
15/08/19 16:15:44 info master.master: akka.tcp://sparkdriver@192.168.1.64:46823 got disassociated, removing it.
15/08/19 16:15:44 info master.master: removing app app-20150819160949-0001
15/08/19 16:15:44 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:44 warn master.master: application testpagerank still in progress, may terminated abnormally.
...
Both workers have this in their logs:
...
15/08/19 16:15:49 info worker.worker: executor app-20150819160949-0001/0 finished state exited message command exited code 1 exitstatus 1
15/08/19 16:15:50 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkexecutor@192.168.1.64:54799] has failed, address gated [5000] ms. reason is: [disassociated].
and
...
15/08/19 16:15:43 info worker.worker: executor app-20150819160949-0001/1 finished state exited message command exited code 1 exitstatus 1
15/08/19 16:15:43 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkexecutor@192.168.1.62:53325] has failed, address gated [5000] ms. reason is: [disassociated].
respectively. The work/app files contain this:
...
15/08/19 16:15:41 info executor.executor: finished task 1.0 in stage 387758.0 (tid 1803). 1911 bytes result sent driver
15/08/19 16:15:41 info executor.executor: finished task 4.0 in stage 387758.0 (tid 1806). 1911 bytes result sent driver
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_5 locally
15/08/19 16:15:41 info executor.executor: finished task 5.0 in stage 387758.0 (tid 1807). 1911 bytes result sent driver
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_3 locally
15/08/19 16:15:41 info executor.executor: finished task 3.0 in stage 387758.0 (tid 1805). 1911 bytes result sent driver
15/08/19 16:15:44 error executor.coarsegrainedexecutorbackend: driver 192.168.1.64:46823 disassociated! shutting down.
15/08/19 16:15:44 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:45 info storage.diskblockmanager: shutdown hook called
15/08/19 16:15:46 info util.utils: shutdown hook called
and
...
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_0 locally
15/08/19 16:15:41 info executor.executor: finished task 2.0 in stage 387758.0 (tid 1804). 1911 bytes result sent driver
15/08/19 16:15:41 info executor.executor: finished task 0.0 in stage 387758.0 (tid 1802). 1911 bytes result sent driver
15/08/19 16:15:42 error executor.coarsegrainedexecutorbackend: driver 192.168.1.64:46823 disassociated! shutting down.
15/08/19 16:15:42 info storage.diskblockmanager: shutdown hook called
15/08/19 16:15:42 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:42 info util.utils: shutdown hook called
respectively. There seem to be no other errors in HDFS or Spark.
I suspect the error lies in the master log, third line (15/08/19 16:15:44 info master.master: akka.tcp://sparkdriver@192.168.1.64:46823 got disassociated, removing it.), but I can't figure out why. I tried changing spark.akka.heartbeat.interval to 100 as suggested in some posts, with no luck. Does anyone know why this happens and how to solve it? Thanks very much.
As mentioned in a similar question here ("warn reliabledeliverysupervisor: association remote system has failed, address gated [5000] ms. reason: [disassociated]"), the problem is a lack of memory. Adding more memory (or, in this case, more nodes) should solve the problem.
(Alternatively, needing less memory should work too, of course.)
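If the machines have spare RAM, one way to try "adding more memory" without adding nodes is to raise the driver and executor heap sizes at submit time. The following is only a sketch: it reuses the command from the question, and the `2g` values are placeholder assumptions, not figures from the question, so they need to be adjusted to what the hosts can actually spare.

```shell
# Sketch only: the 2g sizes below are assumed placeholders, not values
# taken from the question. Tune them to the RAM available on each host.
DRIVER_MEM=2g      # heap for the driver JVM (it runs on the submitting machine here)
EXECUTOR_MEM=2g    # heap for each executor JVM on the workers

spark-submit \
  --master spark://spark-master:7077 \
  --driver-memory "$DRIVER_MEM" \
  --executor-memory "$EXECUTOR_MEM" \
  --class tests.elements \
  target/scala-2.10/zzz-project_2.10-1.0.jar
```

The bare "killed" at the end of the driver's terminal output is consistent with the operating system terminating the driver process, which may point to the driver machine itself running out of memory. In that case, raising `--driver-memory` only helps if the host has physical memory to back it.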