DistributedLoad fails in k8s env
Alluxio version: v2.7.2
In a k8s environment where Alluxio runs in Docker containers, with 1 master and 1 worker, a distributed load of a 1.9 GB directory caches only 8 MB.
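For reference, this kind of load is normally kicked off with the Alluxio CLI's distributedLoad command; the exact invocation was not included in the report, so the path below is a placeholder:

```
$ ./bin/alluxio fs distributedLoad /path/to/dir
```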
worker log: worker.log
job worker log: job_worker.log
This problem reproduces consistently in Fluid with Alluxio 2.7.2; Alluxio 2.7.0 does not have the issue.
From the worker log, we found:
2022-01-27 00:59:20,564 WARN CacheRequestManager - Failed to async cache block 9110028288 from remote worker (ip-10-0-5-96.ec2.internal/10.0.5.96:20088) on copying the block: alluxio.exception.status.DeadlineExceededException: Timeout waiting for response after 300000ms. clientClosed: false clientCancelled: false serverClosed: false (Zero Copy GrpcDataReader)
The worker is asking itself to async-cache the block, and therefore reads the block from itself: https://github.com/Alluxio/alluxio/blob/d3e231a02ea4ef1415e623cfbc742c5f69e8ba8c/core/server/worker/src/main/java/alluxio/worker/block/CacheRequestManager.java#L225
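To make the failure mode concrete, here is a simplified paraphrase of that decision; the class and member names are ours, not the actual Alluxio source:

```java
import java.net.InetSocketAddress;

/**
 * Simplified paraphrase of the local-vs-remote choice CacheRequestManager
 * makes for an async-cache request. All names are illustrative; see the
 * linked Alluxio source for the real logic.
 */
class CacheSourceSketch {
  private final String mLocalHostname; // the hostname this worker knows itself by

  CacheSourceSketch(String localHostname) {
    mLocalHostname = localHostname;
  }

  void cacheBlock(long blockId, InetSocketAddress source) {
    // The source only counts as local if its hostname string matches ours.
    boolean sourceIsLocal = mLocalHostname.equals(source.getHostName());
    if (sourceIsLocal) {
      // Local path: re-read the block from the UFS, no network hop needed.
      System.out.printf("block %d: cache from UFS%n", blockId);
    } else {
      // Remote path: open a gRPC data stream to the source worker. In the
      // k8s setup above this branch fires even though the source is this
      // very worker, because the advertised container host differs from
      // mLocalHostname, so the worker streams from itself and the read
      // times out after 300000 ms.
      System.out.printf("block %d: gRPC read from %s%n", blockId, source);
    }
  }
}
```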
Looking through the commits between 2.7.0 (the good version) and 2.7.2 (the bad version), we found a likely culprit, https://github.com/Alluxio/alluxio/commit/e765c8436d36aaf911607f2fb4b772c52ac82f0a, which carries these code comments:
// issues#11172: If the worker is in a container, use the container hostname
// to establish the connection.
if (!dataSource.getContainerHost().equals("")) {
  host = dataSource.getContainerHost();
}
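A plausible mitigation, purely our sketch and not a patch from the Alluxio repository, is to also treat an address as local when it matches the container hostname the worker itself advertises:

```java
/**
 * Hypothetical guard, for illustration only: an address is local if it
 * matches either the worker's plain hostname or the container hostname it
 * advertises inside k8s. None of these names come from the Alluxio code base.
 */
class LocalityGuardSketch {
  static boolean isLocal(String candidate, String localHost, String localContainerHost) {
    return localHost.equals(candidate)
        || (!localContainerHost.isEmpty() && localContainerHost.equals(candidate));
  }

  public static void main(String[] args) {
    // Inside a pod the advertised container host can differ from the node host.
    System.out.println(isLocal("alluxio-worker-0",
        "ip-10-0-5-96.ec2.internal", "alluxio-worker-0")); // true
  }
}
```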
Top GitHub Comments
@ZhuTopher @jja725 The root cause is that Fluid by default sets alluxio.job.worker.threadpool.size to 164. The 164 threads seem to flood the connection, which errors out with "cannot connect to remote block worker". I set this property to 10 in Fluid, which is the default value in the Alluxio docs, and no more such errors appear. @apc999 @yuzhu FYI

Close the issue, thanks for the investigation!
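For anyone hitting the same symptom, the workaround boils down to pinning the property back to its documented default in alluxio-site.properties (or in the Fluid spec that generates it):

```
# Fluid was overriding this to 164; 10 is the default in the Alluxio docs.
alluxio.job.worker.threadpool.size=10
```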