
[bug] Available IPs on AWS consumed too quickly, limits number of dask-worker nodes that can spin up

See original GitHub issue

Describe the bug


When requesting a large number of dask workers, some fraction of the worker pods never spin up. During some recent testing, I requested 250 workers and only about 190 of them spun up; roughly 190 seems to be the limit of dask-worker nodes we can spin up at one time. For the node group in question, we have set the maximum number of workers to the default of 450 in the qhub-config.yaml.

Checking the status of these nodes on AWS reveals an interesting error message:

runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized


A few screenshots of these error messages (captured 2021-09-20) are attached to the original GitHub issue.

Your environment


qhub - 0.3.12 (installed from commit 0dff706)
k8s - 1.19

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
costrouc commented, Sep 30, 2021

@iameskild let's make these VPCs as large as we can then. I see that for AWS the VPC is "10.10.0.0/16", which gives 2^16 = 65536 IPs, and we reserve the first 4 bits for the subnets, so each subnet gets 2^12 = 4096 IPs. Instead, let's use the entire 10.0.0.0/8 range and use 4 bits for the subnets.
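
As a sanity check on that arithmetic, here is a small sketch using Python's standard ipaddress module; the /16 CIDR and the 4 subnet bits come from the comment above, and the /8 figure is the proposal being discussed:

    import ipaddress

    # Current default: a /16 VPC split with 4 subnet bits -> 16 subnets of /20.
    vpc = ipaddress.ip_network("10.10.0.0/16")
    subnets = list(vpc.subnets(prefixlen_diff=4))
    print(vpc.num_addresses)                        # 65536 IPs in the VPC
    print(len(subnets), subnets[0].num_addresses)   # 16 subnets x 4096 IPs

    # Proposed: a /8 range with the same 4 subnet bits -> 16 subnets of /12.
    big = ipaddress.ip_network("10.0.0.0/8")
    big_subnets = list(big.subnets(prefixlen_diff=4))
    print(big.num_addresses)                            # 16777216 IPs
    print(len(big_subnets), big_subnets[0].num_addresses)  # 16 subnets x 1048576 IPs

One caveat worth flagging: AWS caps a VPC CIDR block at /16 (allowed sizes are /28 through /16), so the /8 here only illustrates the address arithmetic; in practice the extra headroom would have to come from how the /16 is carved up or from secondary CIDR blocks.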

0 reactions
iameskild commented, Oct 6, 2021

The possible solution outlined below is likely overly complicated, but I haven’t been able to find a simpler means of updating the WARM_IP_TARGET environment variable found in the aws-node daemonset.

I believe we are currently using the default amazon-vpc-cni-k8s plugin, wherein WARM_IP_TARGET is not set and the daemonset relies on WARM_ENI_TARGET=1; as mentioned above, this reserves a large pool of IP addresses for each node, which is the root of the problem.

One possible solution might involve using this Terraform kubectl provider along with a “custom” Kubernetes manifest/YAML for the daemonset in question. This would give us the option to make any necessary changes to the default AWS CNI settings, including updating WARM_IP_TARGET.

This solution assumes that AWS EKS clusters use the amazon-vpc-cni-k8s plugin by default (something the AWS docs seem to suggest) and that we are comfortable using a third-party Terraform provider, the kubectl provider. I also must admit that my Terraform knowledge is a bit rudimentary, so if this is all nonsense, forgive me!
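
As a rough illustration of the change being discussed (independent of the Terraform mechanics), here is a minimal sketch that patches WARM_IP_TARGET onto the aws-node daemonset using the official Kubernetes Python client; the target value of 8 is purely an assumption for illustration:

    from kubernetes import client, config

    # Assumes local kubeconfig access to the EKS cluster.
    config.load_kube_config()
    apps = client.AppsV1Api()

    # Strategic-merge patch: containers and env entries merge by "name",
    # so this only sets WARM_IP_TARGET on the aws-node container.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "aws-node",
                            "env": [{"name": "WARM_IP_TARGET", "value": "8"}],  # assumed value
                        }
                    ]
                }
            }
        }
    }

    apps.patch_namespaced_daemon_set(
        name="aws-node", namespace="kube-system", body=patch
    )

With WARM_IP_TARGET set, the CNI's IP address manager keeps only that many spare IPs warm per node instead of pre-allocating an entire ENI's worth of addresses, which is what exhausts the subnet so quickly under WARM_ENI_TARGET=1.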

