
TFSparkNode throws AttributeError on shutdown


Environment:

  • Python version: 3.6
  • Spark version: 2.3.1
  • TensorFlow version: 1.7.0
  • TensorFlowOnSpark version: 1.4.2
  • Cluster: Standalone

Describe the bug: TFSparkNode throws AttributeError on shutdown

Logs:

2019-01-24 12:02:02,297 INFO (MainThread-13745) Feeding None into input queue
[2019-01-24 12:02:02.344] [ERROR] [Executor task launch worker for task 2] [org.apache.spark.executor.Executor] >>> [spark-] msg=Exception in task 0.0 in stage 1.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/mlp/.local/lib/python3.6/site-packages/tensorflowonspark/TFSparkNode.py", line 539, in _shutdown
AttributeError: 'AutoProxy[get_queue]' object has no attribute 'put'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/worker.py", line 234, in main
    process()
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/worker.py", line 229, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2457, in pipeline_func
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2457, in pipeline_func
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 2457, in pipeline_func
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 370, in func
  File "/home/vipshop/platform/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 819, in func
  File "/home/mlp/.local/lib/python3.6/site-packages/tensorflowonspark/TFSparkNode.py", line 542, in _shutdown
Exception: Queue 'input' not found on this node, check for exceptions on other nodes.

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:298)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:438)
	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:421)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:252)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
	at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
	at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
	at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
	at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:939)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2074)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Spark Submit Command Line: spark-submit --py-files mnist_estimator.py mnist_estimator.py
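The `AttributeError: 'AutoProxy[get_queue]' object has no attribute 'put'` in the trace above is characteristic of Python's `multiprocessing.managers` proxies: a proxy class is generated with only the methods that were registered as exposed for it, so calling anything else fails on the client side before the underlying queue is ever reached. A minimal, self-contained sketch of that mechanism (the `get_queue` registration here is illustrative, not TensorFlowOnSpark's actual code):

```python
import queue
from multiprocessing.managers import BaseManager

_q = queue.Queue()

def get_queue():
    return _q

class QueueManager(BaseManager):
    pass

# Expose only 'get'; 'put' is deliberately left off the proxy,
# mirroring a proxy whose exposed method list omits what the caller needs.
QueueManager.register('get_queue', callable=get_queue, exposed=['get'])

def shutdown_demo():
    """Return the AttributeError message raised by calling a non-exposed method."""
    mgr = QueueManager()
    mgr.start()
    try:
        proxy = mgr.get_queue()
        proxy.put(None)  # fails client-side: 'put' is not on the proxy class
        return None
    except AttributeError as err:
        return str(err)
    finally:
        mgr.shutdown()

if __name__ == '__main__':
    print(shutdown_demo())
```

The generated proxy class is named `AutoProxy[get_queue]`, which is why that name shows up in the executor traceback.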

Issue Analytics

  • State:closed
  • Created 5 years ago
  • Comments:8 (4 by maintainers)

Top GitHub Comments

1 reaction
leewyang commented, Jan 30, 2019

@manuzhang Basically, TFoS requires that you schedule only one task per executor, since we rely on the “mental model” of each Spark executor running exactly one TF node of the cluster. This makes it much simpler to understand, configure, and debug. (Imagine 20 TF node “tasks” running on one executor, with all of those TF processes writing to the executor’s single stderr log file.) That said, you can still achieve “one TF node per executor” with more cores by setting spark.task.cpus equal to spark.executor.cores. For simplicity, we just recommend 1 for both, but you can just as easily set them both to 8 or 16 or 40… as long as only one task runs on each executor in the cluster.

@lasclocker Again, the Spark settings are mostly used for scheduling tasks onto executors, and I don’t believe that they’re enforced strictly. That said, MirroredStrategy is mostly a simplification of the GPU tower architecture, so it’s more useful for GPUs than CPU cores, and CollectiveAllReduceStrategy is more about distributed compute w/o a PS, so again, it’s less about CPU cores than network I/O.

Either way, if you really need to set cores > 1, just set spark.task.cpus equal to spark.executor.cores in your job, and it should work fine…
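Extending the submit command from the issue, the setting described above might look like the following (the value 8 is an arbitrary example; pick whatever core count matches your executors):

```shell
# With spark.task.cpus == spark.executor.cores, each task claims every
# core on its executor, so Spark cannot co-schedule a second TF task there.
spark-submit \
  --conf spark.executor.cores=8 \
  --conf spark.task.cpus=8 \
  --py-files mnist_estimator.py \
  mnist_estimator.py
```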

0 reactions
lasclocker commented, Jan 30, 2019

@leewyang, when TFoS uses a distribution strategy such as MirroredStrategy or CollectiveAllReduceStrategy, does it require more than one core per executor?


