listening for reservations at wrong address
Hello,
I am using TensorFlowOnSpark inside containers across multiple hosts. I have a problem when starting a Spark job: it keeps printing "waiting for 1 reservations" and eventually fails with connection timeout errors.
The problem is that the reservation listening address is wrong. This is from my stdout log:
2018-03-22 16:26:56,485 INFO (MainThread-1769) worker node range range(1, 2), ps node range range(0, 1)
2018-03-22 16:26:56,501 INFO (MainThread-1769) listening for reservations at ('172.18.0.2', 38871)
2018-03-22 16:26:56,502 INFO (MainThread-1769) Starting TensorFlow on executors
2018-03-22 16:26:56,648 INFO (MainThread-1769) Waiting for TFSparkNodes to start
2018-03-22 16:26:56,648 INFO (MainThread-1769) waiting for 2 reservations
2018-03-22 16:26:57,650 INFO (MainThread-1769) waiting for 1 reservations
2018-03-22 16:26:58,651 INFO (MainThread-1769) waiting for 1 reservations
This address is wrong: communication between the cluster nodes (hadoop-master and the slaves) should normally use the 10.0.0.0 network. Let me explain the problem and the architecture.
I use the same image, which I built myself, on both hosts: two containers (slave2 and master) run on one host, and one container (slave1) runs on the other host. Running Docker across multiple hosts requires an overlay network, which creates two interfaces inside each container:
- the local container interface, on the 10.0.0.0 network
- another interface created to communicate with the host machine, on the 172.18.0.0 network
- hadoop-master: 10.0.0.2 and 172.18.0.2
- hadoop-slave2: 10.0.0.3 and 172.18.0.3
- hadoop-slave1: 10.0.0.4 and 172.18.0.2 (because it is not on the same host as the master)
The Hadoop cluster nodes communicate with each other using hostnames created by Docker, and this uses the 10.0.0.0 interface.
So the reservation listening address should be a 10.0.0.xxx address instead of 172.18.0.2.
I think there is a problem with my Spark or YARN configuration, but I am not sure. So I am asking: what do you use to determine the interface on which to listen for reservations?
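For reference, a common way to discover the outward-facing address of a multi-homed host is the UDP connect() trick; I believe this is roughly what util.get_ip_address does after PR #109, with a hard-coded public peer. The sketch below parameterizes the peer for illustration; the function name and signature are assumptions, not the library's actual API:

```python
import socket

def get_ip_for_peer(peer_host, peer_port=53):
    """Return the local IP the kernel routing table selects to reach
    peer_host. A UDP connect() sends no packets; it only binds the
    socket to the outgoing interface for that route."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect((peer_host, peer_port))
        return s.getsockname()[0]
    finally:
        s.close()
```

In this overlay setup, calling it with a peer on the cluster network (e.g. hadoop-master's 10.0.0.2) should yield the container's own 10.0.0.x address, whereas a public peer would go out via the 172.18.0.x interface.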
Note 1: when creating the cluster on a single host, everything works fine (because all containers can see the 172.18.0.0 network).
Note 2: I think the node manager is listening on all interfaces (0.0.0.0), but I am not sure.
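Since the containers reach each other through the Docker-created hostnames, one quick sanity check (a generic sketch, not TensorFlowOnSpark code) is to see which address a hostname resolves to inside the container:

```python
import socket

def resolve_host(name=None):
    """Resolve a hostname to an IPv4 address; defaults to this
    machine's own hostname. If Docker's embedded DNS maps the
    container hostname to the overlay network, this should return
    a 10.0.0.x address rather than a 172.18.0.x one."""
    return socket.gethostbyname(name or socket.gethostname())
```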
This is my yarn-site.xml:
<configuration>
<property>
<name>yarn.nodemanager.pmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>hadoop-master</value>
</property>
</configuration>
Thank you
Issue Analytics: created 5 years ago; 9 comments (4 by maintainers)
Top GitHub Comments
@MohamedAmineOuali can you try:
This was the original code before PR #109
Hello,
We are having some trouble with the reservation process. This is our setup: in our cluster, each worker has 3 network interfaces (3 different IPs), but they can only communicate with each other through one of them (192.168.201.X).
When we run TensorFlowOnSpark version 1.3.1, the reservation process uses the IP in the 10.0.2.x range, which produces an error.
To solve the problem, we modified the get_ip_address function inside util.py to:
def get_ip_address():
    """Simple utility to get host IP address."""
    return socket.gethostname()
effectively reverting the 1.3.1 code to the old version (@leewyang https://github.com/yahoo/TensorFlowOnSpark/issues/251#issuecomment-376337722).
With this little modification, everything works fine.
But this modification will not work if the nodes cannot resolve the workers' names (no DNS or no /etc/hosts configuration). So: is there a way to enhance get_ip_address to find the correct IP when the workers have more than one?
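One possible direction, sketched under the assumption that the desired subnet is known ahead of time (the function name and CIDR parameter are hypothetical, not part of TensorFlowOnSpark), is to filter the host's candidate addresses against a configured network:

```python
import ipaddress
import socket

def get_ip_in_network(cidr, candidates=None):
    """Return the first local IPv4 address inside the given subnet,
    e.g. "192.168.201.0/24". Candidates default to the addresses the
    host's own name resolves to; a config property or CLI flag could
    supply the CIDR. Returns None if no address matches."""
    net = ipaddress.ip_network(cidr)
    if candidates is None:
        infos = socket.getaddrinfo(socket.gethostname(), None, socket.AF_INET)
        candidates = [info[4][0] for info in infos]
    for ip in candidates:
        if ipaddress.ip_address(ip) in net:
            return ip
    return None
```

This avoids depending on DNS for peer resolution, but it trades that for a piece of per-cluster configuration (the CIDR), which the caller would have to provide.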
Thanks in advance.