
Enhance distributedLoad


Summary

Currently the maximum number of parallel jobs for distributedLoad is controlled by ACTIVE_JOB_COUNT_OPTION. For a folder that contains many small files, such as ImageNet, this number is hard to choose.

If the number is too large, many jobs will fail (because of the request limits of the UFS). If the number is too small, the load speed suffers.

Here we propose a policy to adjust this number automatically.

Policy

The basic idea is similar to TCP congestion control (https://en.wikipedia.org/wiki/TCP_congestion_control). If any job fails within a period, we set active_job_num to half of its current value and set the threshold to this new value. If all jobs succeed within a period and the current number is at or below the threshold, we double active_job_num; otherwise we increase active_job_num linearly.
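For illustration (the parameter values here are hypothetical, not part of the proposal): suppose ACTIVE_JOB_INIT_NUM = 16, ACTIVE_JOB_INCREASE_PACE = 4, and active_job_num has grown to 128. If any job fails during a one-second window, active_job_num drops to max(128 / 2, 16) = 64 and the threshold is set to 64. In the next failure-free window, 64 is not above the threshold, so it doubles to 128; after that, each failure-free window adds the pace (128 → 132 → 136 → …) until failures reappear or ACTIVE_JOB_MAX_NUM is hit.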

Required init parameters: ACTIVE_JOB_MAX_NUM, ACTIVE_JOB_INIT_NUM, ACTIVE_JOB_INCREASE_PACE, RUNNING_JOB_MAX_NUM

ACTIVE_JOB_INIT_NUM = init_active_job_num
ACTIVE_JOB_MAX_NUM = max_active_job_num
ACTIVE_JOB_INCREASE_PACE = increase_pace
RUNNING_JOB_MAX_NUM = running_job_max_num

active_job_num: the current number of active jobs allowed; its initial value is ACTIVE_JOB_INIT_NUM
activate_job_threshold: the threshold for the active job number
active_jobs: a queue holding all active jobs

The main load process:

for file_name in load_file_list
  create a job with the corresponding file name
  while active_jobs.size() >= active_job_num
    wait a while
  end
  add the job into active_jobs
end
drain active_jobs, waiting for all jobs to finish
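A minimal sketch of this loop in Java, assuming the active jobs live in a concurrent queue and the cap in an AtomicInteger updated by the adjustment thread; the class name, the one-second poll interval, and the use of a plain file-name string as the job are all illustrative choices, not existing Alluxio code:

import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadLoopSketch {
  // Cap on concurrently active jobs; updated by the adjustment thread.
  static final AtomicInteger activeJobNum = new AtomicInteger(16); // ACTIVE_JOB_INIT_NUM
  // Jobs that have been created and not yet retired; here a job is just its file name.
  static final Queue<String> activeJobs = new ConcurrentLinkedQueue<>();

  static void load(List<String> loadFileList) throws InterruptedException {
    for (String fileName : loadFileList) {
      // Wait until there is room under the current active-job cap.
      while (activeJobs.size() >= activeJobNum.get()) {
        Thread.sleep(1000); // "wait a while"
      }
      activeJobs.add(fileName); // hand the job over to the submission threads
    }
    // Drain: wait for all jobs to finish (finished jobs are removed elsewhere).
    while (!activeJobs.isEmpty()) {
      Thread.sleep(1000);
    }
  }
}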
The submission process, using multiple threads; each thread:
  get a job from active_jobs
  submit the job to the job master
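A sketch of the submission side under similar assumptions. The pseudocode takes jobs straight from active_jobs; because active_jobs also has to stay visible to the adjustment thread after submission, this sketch hands jobs to the submitters through a separate queue instead, and submitToJobMaster is a placeholder rather than a real Alluxio API:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

public class SubmitterSketch {
  // Jobs created by the main load loop but not yet submitted to the job master.
  static final BlockingQueue<String> pendingJobs = new LinkedBlockingQueue<>();

  static void startSubmitters(int numThreads) {
    ExecutorService pool = Executors.newFixedThreadPool(numThreads);
    for (int i = 0; i < numThreads; i++) {
      pool.submit(() -> {
        try {
          while (!Thread.currentThread().isInterrupted()) {
            String fileName = pendingJobs.take(); // get a job from the hand-off queue
            submitToJobMaster(fileName);          // submit it to the job master
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();     // let the thread exit cleanly
        }
      });
    }
  }

  static void submitToJobMaster(String fileName) {
    // Placeholder: a real implementation would call the job master client here.
  }
}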
The process that adjusts the active job number, running in another thread:

while !finished
  wake up every second
  completed_job_num = 0
  failed_job_num = 0
  for job in active_jobs
    if job in completed status
      completed_job_num++
    else if job in failed status
      failed_job_num++
    end
  end
  if failed_job_num != 0
    active_job_num = max(active_job_num / 2, init_active_job_num)
    activate_job_threshold = active_job_num
  else if active_job_num - completed_job_num > running_job_max_num
    // too many jobs in running status, don't adjust the number
  else if active_job_num > activate_job_threshold
    active_job_num = active_job_num + ACTIVE_JOB_INCREASE_PACE    // linear increase
  else
    active_job_num = min(active_job_num * 2, ACTIVE_JOB_MAX_NUM)  // double the number
  end
end
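The adjustment step itself can be expressed as a small pure function. The sketch below mirrors the pseudocode above; the constant values are examples and none of this is existing Alluxio code:

final class AdjustStepSketch {
  static final int INIT_ACTIVE_JOB_NUM = 16;    // ACTIVE_JOB_INIT_NUM (example value)
  static final int MAX_ACTIVE_JOB_NUM = 256;    // ACTIVE_JOB_MAX_NUM (example value)
  static final int INCREASE_PACE = 4;           // ACTIVE_JOB_INCREASE_PACE (example value)
  static final int RUNNING_JOB_MAX_NUM = 64;    // RUNNING_JOB_MAX_NUM (example value)

  // One adjustment step per one-second window; returns {newActiveJobNum, newThreshold}.
  static int[] adjust(int activeJobNum, int threshold, int completedJobNum, int failedJobNum) {
    if (failedJobNum != 0) {
      // Failures seen: back off multiplicatively and remember the back-off point.
      int next = Math.max(activeJobNum / 2, INIT_ACTIVE_JOB_NUM);
      return new int[] {next, next};
    }
    if (activeJobNum - completedJobNum > RUNNING_JOB_MAX_NUM) {
      // Too many jobs still running: keep the current cap unchanged.
      return new int[] {activeJobNum, threshold};
    }
    if (activeJobNum > threshold) {
      // Above the threshold: additive (linear) increase; the pseudocode does not cap this branch.
      return new int[] {activeJobNum + INCREASE_PACE, threshold};
    }
    // At or below the threshold: multiplicative increase, capped at the maximum.
    return new int[] {Math.min(activeJobNum * 2, MAX_ACTIVE_JOB_NUM), threshold};
  }
}

For example, adjust(128, 200, 30, 2) returns {64, 64}: failures were seen, so the cap is halved and the threshold moves to the new value, matching the worked example in the Policy section.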

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
yuzhu commented, Jun 17, 2021

@Binyang2014 the scenario I imagined is as follows; it would limit our bandwidth unnecessarily. Imagine two concurrent distributedLoad jobs, each accessing a different UFS. If the first one reaches network capacity and starts to error, it would prevent the second job from increasing the maxJobNum even if it is nowhere near the limit.

1 reaction
Binyang2014 commented, Jun 15, 2021

For @LuQQiu’s comments:

  1. Sure, we can make it optional. And I think it works for both small files and large files. For large files, since we set running_job_max_num, the number of jobs will be limited. Maybe we can give some recommended init parameters for the small-file scenario and the large-file scenario.
  2. In the current design, the retry logic is kept. We only treat a task as failed once it reaches the max retry number. Otherwise, we treat it as an active job.
  3. Maybe we can add an option to the current distributedLoad command, such as bin/alluxio distributedLoad -dynamicAdjustOption {activeJobInitNum: xx, activeJobMaxNum: xx}, to achieve backward compatibility.