DistributedLoad fails in k8s env

Alluxio version: v2.7.2

In a k8s env where Alluxio runs in Docker containers with 1 master and 1 worker, a distributed load on a 1.9 GB directory caches only 8 MB.

Worker log: worker.log
Job worker log: job_worker.log

This problem reproduces consistently in Fluid with Alluxio 2.7.2. Alluxio 2.7.0 doesn't have the issue.

From the worker log, we found:

2022-01-27 00:59:20,564 WARN  CacheRequestManager - Failed to async cache block 9110028288 from remote worker (ip-10-0-5-96.ec2.internal/10.0.5.96:20088) on copying the block: alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 300000ms. clientClosed: false clientCancelled: false serverClosed: false (Zero Copy GrpcDataReader)

The worker is asking itself to async cache the block and is then reading the block from itself: https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/core/server/worker/src/main/java/alluxio/worker/block/CacheRequestManager.java#L225

Looking through the commits between 2.7.0 (good version) and 2.7.2 (bad version), we found a possible culprit: https://github.com/Alluxio/alluxio/commit/e765c8436d36aaf911607f2fb4b772c52ac82f0a, which adds this code and comment:

    // issues#11172: If the worker is in a container, use the container hostname
    // to establish the connection.
    if (!dataSource.getContainerHost().equals("")) {
      host = dataSource.getContainerHost();
    }

https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/job/server/src/main/java/alluxio/job/util/JobUtils.java#L215
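
To make the effect of that check concrete, here is a minimal, self-contained sketch of the host selection. `WorkerAddress` and `HostSelection` are hypothetical stand-ins written for illustration, not Alluxio classes, and the addresses in `main` are made up. It only shows that a non-empty container hostname replaces the regular host when the connection target is chosen.

```java
// Hypothetical, simplified stand-in for the worker address type;
// the real Alluxio class carries more fields.
final class WorkerAddress {
  private final String host;
  private final String containerHost;

  WorkerAddress(String host, String containerHost) {
    this.host = host;
    this.containerHost = containerHost;
  }

  String getHost() {
    return host;
  }

  String getContainerHost() {
    return containerHost;
  }
}

final class HostSelection {
  // Mirrors the quoted check: prefer the container hostname when it is set,
  // otherwise fall back to the regular host.
  static String resolveHost(WorkerAddress dataSource) {
    String host = dataSource.getHost();
    if (!dataSource.getContainerHost().equals("")) {
      host = dataSource.getContainerHost();
    }
    return host;
  }

  public static void main(String[] args) {
    // Illustrative values only.
    WorkerAddress noContainer = new WorkerAddress("10.0.5.96", "");
    WorkerAddress inContainer = new WorkerAddress("10.0.5.96", "alluxio-worker-0");
    System.out.println(resolveHost(noContainer));
    System.out.println(resolveHost(inContainer));
  }
}
```

If the container hostname is not reachable from where the read is issued, the connection would target the wrong address, which is why this commit looked suspicious at first.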

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

ssz1997 commented, Feb 19, 2022 (1 reaction)

@ZhuTopher @jja725 The root cause is that Fluid by default sets alluxio.job.worker.threadpool.size to 164. The 164 threads seem to flood the connections, and the worker then errors out saying it cannot connect to the remote block worker. I set this property to 10 in Fluid, which is the default value in the Alluxio docs, and the errors went away. @apc999 @yuzhu FYI
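
As a rough illustration of why the pool size matters, the sketch below measures how many simulated requests are in flight at once for a given fixed pool size. It is not Alluxio code: `PoolSizeDemo`, `peakConcurrency`, and the 5 ms sleep standing in for a block read are all assumptions made for the example. It only demonstrates that a fixed pool of N threads caps concurrent requests at N, so 164 threads can open far more simultaneous connections to a single worker than 10 can.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolSizeDemo {
  // Runs `tasks` simulated requests on a fixed-size pool and returns the
  // peak number of requests that were in flight at the same time.
  static int peakConcurrency(int poolSize, int tasks) {
    ExecutorService pool = Executors.newFixedThreadPool(poolSize);
    AtomicInteger inFlight = new AtomicInteger();
    AtomicInteger peak = new AtomicInteger();
    for (int i = 0; i < tasks; i++) {
      pool.submit(() -> {
        int now = inFlight.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
        try {
          Thread.sleep(5); // stand-in for a block read over the network
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          inFlight.decrementAndGet();
        }
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(30, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return peak.get();
  }

  public static void main(String[] args) {
    System.out.println("pool=10  peak=" + peakConcurrency(10, 500));
    System.out.println("pool=164 peak=" + peakConcurrency(164, 500));
  }
}
```

In an actual deployment the fix is a configuration change rather than code: setting alluxio.job.worker.threadpool.size back to 10, as described in the comment above.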

LuQQiu commented, Mar 21, 2022

Closing the issue, thanks for the investigation!
