scala - Spark driver disassociated and removed by the master


I have a cluster made up of 2 slaves and 1 master. Once everything is set up, I submit my jar (Scala) to the Spark master (192.168.1.64):

spark-submit --master spark://spark-master:7077 --class tests.elements target/scala-2.10/zzz-project_2.10-1.0.jar 

After running fine for quite some time, it stops abruptly, the last lines on the terminal being:

...
15/08/19 17:45:24 info scheduler.taskschedulerimpl: adding task set 411292.0 6 tasks
15/08/19 17:45:24 warn scheduler.tasksetmanager: stage 411292 contains task of large size (2762 kb). maximum recommended task size 100 kb.
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 2.0 in stage 411292.0 (tid 1832, 192.168.1.64, process_local, 2828792 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 0.0 in stage 411292.0 (tid 1833, 192.168.1.62, process_local, 2310009 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 3.0 in stage 411292.0 (tid 1834, 192.168.1.64, process_local, 2669188 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 1.0 in stage 411292.0 (tid 1835, 192.168.1.62, process_local, 2295676 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 4.0 in stage 411292.0 (tid 1836, 192.168.1.64, process_local, 2847786 bytes)
15/08/19 17:45:24 info scheduler.tasksetmanager: starting task 5.0 in stage 411292.0 (tid 1837, 192.168.1.64, process_local, 2913528 bytes)
killed

The error in the master log is the following:

...
15/08/19 16:09:49 info master.master: launching executor app-20150819160949-0001/0 on worker worker-20150819160925-192.168.1.64-51640
15/08/19 16:09:49 info master.master: launching executor app-20150819160949-0001/1 on worker worker-20150819160938-192.168.1.62-38007
15/08/19 16:15:44 info master.master: akka.tcp://sparkdriver@192.168.1.64:46823 got disassociated, removing it.
15/08/19 16:15:44 info master.master: removing app app-20150819160949-0001
15/08/19 16:15:44 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:44 warn master.master: application testpagerank still in progress, may terminated abnormally.
...

Both workers have this in their logs:

...
15/08/19 16:15:49 info worker.worker: executor app-20150819160949-0001/0 finished state exited message command exited code 1 exitstatus 1
15/08/19 16:15:50 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkexecutor@192.168.1.64:54799] has failed, address gated [5000] ms. reason is: [disassociated].

and

...
15/08/19 16:15:43 info worker.worker: executor app-20150819160949-0001/1 finished state exited message command exited code 1 exitstatus 1
15/08/19 16:15:43 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkexecutor@192.168.1.62:53325] has failed, address gated [5000] ms. reason is: [disassociated].

respectively. The work/app files contain this:

...
15/08/19 16:15:41 info executor.executor: finished task 1.0 in stage 387758.0 (tid 1803). 1911 bytes result sent driver
15/08/19 16:15:41 info executor.executor: finished task 4.0 in stage 387758.0 (tid 1806). 1911 bytes result sent driver
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_5 locally
15/08/19 16:15:41 info executor.executor: finished task 5.0 in stage 387758.0 (tid 1807). 1911 bytes result sent driver
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_3 locally
15/08/19 16:15:41 info executor.executor: finished task 3.0 in stage 387758.0 (tid 1805). 1911 bytes result sent driver
15/08/19 16:15:44 error executor.coarsegrainedexecutorbackend: driver 192.168.1.64:46823 disassociated! shutting down.
15/08/19 16:15:44 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:45 info storage.diskblockmanager: shutdown hook called
15/08/19 16:15:46 info util.utils: shutdown hook called

and

...
15/08/19 16:15:41 info storage.blockmanager: found block rdd_1206_0 locally
15/08/19 16:15:41 info executor.executor: finished task 2.0 in stage 387758.0 (tid 1804). 1911 bytes result sent driver
15/08/19 16:15:41 info executor.executor: finished task 0.0 in stage 387758.0 (tid 1802). 1911 bytes result sent driver
15/08/19 16:15:42 error executor.coarsegrainedexecutorbackend: driver 192.168.1.64:46823 disassociated! shutting down.
15/08/19 16:15:42 info storage.diskblockmanager: shutdown hook called
15/08/19 16:15:42 warn remote.reliabledeliverysupervisor: association remote system [akka.tcp://sparkdriver@192.168.1.64:46823] has failed, address gated [5000] ms. reason is: [disassociated].
15/08/19 16:15:42 info util.utils: shutdown hook called

respectively. There seem to be no other errors in HDFS or Spark.

I suspect the error lies in the third line of the master log (15/08/19 16:15:44 info master.master: akka.tcp://sparkdriver@192.168.1.64:46823 got disassociated, removing it.), but I can't figure out why it happens. I tried changing spark.akka.heartbeat.interval to 100, as suggested in other posts, with no luck. Does anyone know why this happens and how to solve it? Thanks very much.

As mentioned in a similar question here ("warn reliabledeliverysupervisor: association remote system has failed, address gated [5000] ms. reason: [disassociated]"), the problem is most likely a lack of memory. Adding more memory (or, in your case, more nodes) should solve the problem.

(Alternatively, making the job need less memory should of course work too.)
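
If memory is indeed the culprit, one thing to try before adding nodes is raising the driver and executor memory at submit time. Below is a minimal sketch that reuses the command from the question. The 4g and 2g values are placeholders chosen for illustration, not values from the original post, and they have to fit within what each machine actually has available:

```shell
# Same submit command as in the question, with explicit memory settings.
# --driver-memory and --executor-memory are standard spark-submit options;
# the sizes below are assumed placeholders, so adjust them to your hardware.
spark-submit \
  --master spark://spark-master:7077 \
  --class tests.elements \
  --driver-memory 4g \
  --executor-memory 2g \
  target/scala-2.10/zzz-project_2.10-1.0.jar
```

The "killed" printed on the submitting terminal is what a shell typically shows when a process receives SIGKILL, which on Linux is often the work of the kernel's out-of-memory killer. Checking the kernel log (for example with dmesg) on 192.168.1.64 around the time of the failure may help confirm or rule this out.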

