listening for reservations at wrong address
Hello,
I am using TensorFlowOnSpark inside containers across multiple hosts. I have a problem when starting a Spark job: it keeps printing "waiting for 1 reservations" and eventually fails with connection timeout errors.
The problem is that the reservation listening address is wrong. This is from my stdout log:
2018-03-22 16:26:56,485 INFO (MainThread-1769) worker node range range(1, 2), ps node range range(0, 1)
2018-03-22 16:26:56,501 INFO (MainThread-1769) listening for reservations at ('172.18.0.2', 38871)
2018-03-22 16:26:56,502 INFO (MainThread-1769) Starting TensorFlow on executors
2018-03-22 16:26:56,648 INFO (MainThread-1769) Waiting for TFSparkNodes to start
2018-03-22 16:26:56,648 INFO (MainThread-1769) waiting for 2 reservations
2018-03-22 16:26:57,650 INFO (MainThread-1769) waiting for 1 reservations
2018-03-22 16:26:58,651 INFO (MainThread-1769) waiting for 1 reservations
This address is wrong: communication between the cluster nodes (hadoop-master and the slaves) should normally use the 10.0.0.0 network. Let me explain the problem and the architecture.
I use the same image, which I built myself, on both hosts: two containers (slave2 and master) run on one host, and one container (slave1) runs on the other host. Running Docker across multiple hosts requires an overlay network, which creates two interfaces inside each container:
- the local container interface, on the 10.0.0.0 network
- another interface created to communicate with the host machine, on the 172.18.0.0 network
- hadoop-master: 10.0.0.2 and 172.18.0.2
- hadoop-slave2: 10.0.0.3 and 172.18.0.3
- hadoop-slave1: 10.0.0.4 and 172.18.0.2 (because it is not on the same host as the master)
The Hadoop cluster nodes communicate with each other using hostnames created by Docker, and this uses the 10.0.0.0 interface.
So the reservation listening address should be a 10.0.0.xxx address instead of 172.18.0.2.
I think there is a problem with my Spark or YARN configuration, but I am not sure. So I am asking: what do you use to determine the interface on which to listen for reservations?
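For reference, a common way to discover the outward-facing address of a multi-homed host is the UDP connect() trick; I believe this is roughly what util.get_ip_address does after PR #109, with a hard-coded public peer. The sketch below parameterizes the peer for illustration; the function name and signature are assumptions, not the library's actual API:

```python
import socket

def get_ip_for_peer(peer_host, peer_port=53):
    """Return the local IP the kernel routing table selects to reach
    peer_host. A UDP connect() sends no packets; it only binds the
    socket to the outgoing interface for that route."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((peer_host, peer_port))
        return s.getsockname()[0]
    finally:
        s.close()
```

In this overlay setup, calling it with a peer on the cluster network (e.g. hadoop-master's 10.0.0.2) should yield the container's own 10.0.0.x address, whereas a public peer would go out via the 172.18.0.x interface.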
Note 1: when creating the cluster on a single host, everything works fine (because all containers can see the 172.18.0.0 network).
Note 2: I think the node manager is listening on all interfaces (0.0.0.0), but I am not sure.
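Since the containers reach each other through the Docker-created hostnames, one quick sanity check (a generic sketch, not TensorFlowOnSpark code) is to see which address a hostname resolves to inside the container:

```python
import socket

def resolve_host(name=None):
    """Resolve a hostname to an IPv4 address; defaults to this
    machine's own hostname. If Docker's embedded DNS maps the
    container hostname to the overlay network, this should return
    a 10.0.0.x address rather than a 172.18.0.x one."""
    return socket.gethostbyname(name or socket.gethostname())
```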
This is my yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
</configuration>
Thank you
Issue Analytics: created 5 years ago; 9 comments (4 by maintainers)
Top GitHub Comments
@MohamedAmineOuali can you try:
This was the original code before PR #109
Hello,
We are having some trouble with the reservation process. This is our setup: in our cluster, each worker has 3 network interfaces (3 different IPs), but they can only communicate with each other through one of them (192.168.201.X).
When we run TensorFlowOnSpark version 1.3.1, the reservation process uses the IP in the 10.0.2.x range, which produces an error.
To solve the problem, we modified the get_ip_address function inside util.py to:
def get_ip_address():
    """Simple utility to get host IP address."""
    return socket.gethostname()
effectively reverting the 1.3.1 code to the old version (@leewyang https://github.com/yahoo/TensorFlowOnSpark/issues/251#issuecomment-376337722).
With this little modification, everything works fine.
But this modification will not work if the nodes cannot resolve the workers' names (no DNS or no /etc/hosts configuration). So: is there a way to enhance get_ip_address to find the correct IP when the workers have more than one?
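One possible direction, sketched under the assumption that the desired subnet is known ahead of time (the function name and CIDR parameter are hypothetical, not part of TensorFlowOnSpark), is to filter the host's candidate addresses against a configured network:

```python
import ipaddress
import socket

def get_ip_in_network(cidr, candidates=None):
    """Return the first local IPv4 address inside the given subnet,
    e.g. "192.168.201.0/24". Candidates default to the addresses the
    host's own name resolves to; a config property or CLI flag could
    supply the CIDR. Returns None if no address matches."""
    net = ipaddress.ip_network(cidr)
    if candidates is None:
        infos = socket.getaddrinfo(socket.gethostname(), None, socket.AF_INET)
        candidates = [info[4][0] for info in infos]
    for ip in candidates:
        if ipaddress.ip_address(ip) in net:
            return ip
    return None
```

This avoids depending on DNS for peer resolution, but it trades that for a piece of per-cluster configuration (the CIDR), which the caller would have to provide.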
Thanks in advance.