
listening for reservations at wrong address


Hello,

I am using TensorFlowOnSpark inside containers across multiple hosts, and I have a problem when starting a Spark job: it hangs at "waiting for 1 reservations" and then fails with connection timeout errors.

The problem is that the reservation listening address is wrong. This is what I see in my stdout log:

2018-03-22 16:26:56,485 INFO (MainThread-1769) worker node range range(1, 2), ps node range range(0, 1)
2018-03-22 16:26:56,501 INFO (MainThread-1769) listening for reservations at ('172.18.0.2', 38871)
2018-03-22 16:26:56,502 INFO (MainThread-1769) Starting TensorFlow on executors
2018-03-22 16:26:56,648 INFO (MainThread-1769) Waiting for TFSparkNodes to start
2018-03-22 16:26:56,648 INFO (MainThread-1769) waiting for 2 reservations
2018-03-22 16:26:57,650 INFO (MainThread-1769) waiting for 1 reservations
2018-03-22 16:26:58,651 INFO (MainThread-1769) waiting for 1 reservations

This address is wrong: communication between the cluster nodes (hadoop-master and the slaves) should normally use the 10.0.0.0 network. Let me explain the problem and the architecture.

I am using the same image, which I built myself, on both hosts: two containers (master and slave2) run on one host, and one container (slave1) runs on a different host.

When using Docker across multiple hosts, I have to use an overlay network, which produces 2 interfaces inside my containers:

  • the local container interface, on the 10.0.0.0 network
  • another interface, created to communicate with the host machine, on the 172.18.0.0 network

  • hadoop-master: has addresses 10.0.0.2 and 172.18.0.2
  • hadoop-slave2: has addresses 10.0.0.3 and 172.18.0.3
  • hadoop-slave1: has addresses 10.0.0.4 and 172.18.0.2 (the same 172.18.x address as the master, because it is not on the same host)

The Hadoop cluster nodes communicate with each other using hostnames created by Docker, and these resolve to the 10.0.0.0 interface.

So the reservation listening address 172.18.0.2 should instead be a 10.0.0.x address.

I think there may be a problem with my Spark or YARN configuration, but I am not sure. So I am asking: what do you use to pick the interface on which to listen for reservations?
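
In case it helps the debugging, here is a small diagnostic I run inside a container to compare the two candidate answers: what the container's hostname resolves to, and which source address the kernel would pick to reach a peer. This is only a sketch (route_source_ip is my own helper name, and 10.0.0.2 is hadoop-master's overlay address from above), not a claim about what TensorFlowOnSpark actually does:

import socket

def route_source_ip(peer, port=53):
  # connect() on a UDP socket sends no packets; it only asks the
  # kernel which local address it would use to reach `peer`.
  s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  try:
    s.connect((peer, port))
    return s.getsockname()[0]
  finally:
    s.close()

# What the hostname resolves to (via Docker's DNS / /etc/hosts):
print(socket.gethostbyname(socket.gethostname()))
# Which source address would be used to reach hadoop-master's
# overlay address (10.0.0.2, from the setup described above):
print(route_source_ip("10.0.0.2"))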

Note 1: when I create the cluster on a single host, everything works fine (because all the containers can then reach each other on the 172.18.0.0 network).

Note 2: I think the NodeManager is listening on all interfaces (0.0.0.0), but I am not sure.

This is my yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop-master</value>
  </property>
</configuration>

Thank you

Top GitHub Comments

leewyang commented, Mar 26, 2018

@MohamedAmineOuali can you try:

def getIP(iter):
  # `iter` is the partition iterator passed by Spark; it is unused here.
  import socket
  return socket.gethostname()  # the hostname, which peers resolve via DNS

This was the original code before PR #109
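
For a quick check, you could run that function over a dummy RDD to see what each executor reports. This is just an illustrative sketch, assuming a live SparkContext named sc and 2 executors:

num_executors = 2  # assumed; match your --num-executors setting
hosts = (sc.parallelize(range(num_executors), num_executors)
           .mapPartitions(lambda it: [getIP(it)])
           .collect())
print(hosts)  # e.g. ['hadoop-slave1', 'hadoop-master']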

moyanojv commented, Aug 9, 2018

Hello,

We are having some trouble with reservations. This is our setup: in our cluster, each worker has 3 network interfaces (3 different IPs):

  • 10.0.2.X
  • 192.168.1.X
  • 192.168.201.X

But they can only communicate with each other through one of them (192.168.201.X).

When we run TensorFlowOnSpark version 1.3.1, the reservation process uses the IP in the 10.0.2.x range, which produces an error.

To solve the problem, we modified the get_ip_address function inside util.py to:

def get_ip_address():
  """Simple utility to get host IP address."""
  return socket.gethostname()

i.e., reverting the 1.3.1 code to the old version (@leewyang https://github.com/yahoo/TensorFlowOnSpark/issues/251#issuecomment-376337722).

With this little modification, everything works fine.

But this modification will not work if the nodes can't resolve the workers' names (no DNS and no /etc/hosts configuration). So: is there a way to enhance get_ip_address to find the correct IP when the workers have more than one?
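
One direction we have been considering is to let the user name the subnet and pick the local address that routes into it. This is just a sketch, not TensorFlowOnSpark code: TFOS_SUBNET is a made-up setting, and the UDP connect() trick only asks the kernel for a route, it sends no packets:

import ipaddress
import os
import socket

def get_ip_address():
  """Return the host IP inside the user-configured subnet, if any.

  TFOS_SUBNET is a hypothetical setting, e.g. "192.168.201.0/24".
  Falls back to the hostname-based behaviour when it is unset.
  """
  subnet = os.environ.get("TFOS_SUBNET")
  if not subnet:
    return socket.gethostname()
  net = ipaddress.ip_network(subnet)
  # connect() on a UDP socket makes the kernel pick the source
  # address it would use to reach that subnet, without sending data.
  s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
  try:
    s.connect((str(net.network_address + 1), 53))
    ip = s.getsockname()[0]
  finally:
    s.close()
  if ipaddress.ip_address(ip) in net:
    return ip
  raise RuntimeError("no interface in subnet %s" % subnet)

If guessing a peer address inside the subnet is undesirable, enumerating the interfaces with a library such as netifaces and filtering them with ipaddress would achieve the same thing.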

Thanks in advance.
