Enhance distributedLoad
Summary
Currently the maximum number of parallel jobs for distributedLoad is controlled by ACTIVE_JOB_COUNT_OPTION. For a folder that contains many small files, such as ImageNet, this number is hard to choose: if it is too large, many jobs will fail (due to the UFS request-rate limits); if it is too small, the load speed suffers.
Here we propose a policy to adjust this number automatically.
Policy
The basic idea is similar to TCP congestion control (https://en.wikipedia.org/wiki/TCP_congestion_control). If any job fails during a period, we set active_job_num to half of its current value and set the threshold to this halved number. If all jobs succeed during a period and the current number is at or under the threshold, we double active_job_num; otherwise we increase active_job_num linearly.
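Distilled into code, the rule might look like the following minimal Java sketch; the names are illustrative, not existing Alluxio APIs, and the cap on the linear step is an added safeguard that the pseudocode below leaves implicit. A fuller version, wired into a monitoring loop, is sketched after the pseudocode.

```java
// Hypothetical AIMD-style update, applied once per sampling period; on
// failure the caller also assigns the returned value to the threshold.
static int nextWindow(int activeJobNum, int threshold, boolean anyFailed,
                      int initNum, int maxNum, int pace) {
  if (anyFailed) {
    // Multiplicative decrease on failure.
    return Math.max(activeJobNum / 2, initNum);
  }
  if (activeJobNum > threshold) {
    // Congestion avoidance: additive (linear) increase.
    return Math.min(activeJobNum + pace, maxNum);
  }
  // Slow start: double while at or under the threshold.
  return Math.min(activeJobNum * 2, maxNum);
}
```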
Required init parameters: ACTIVE_JOB_MAX_NUM, ACTIVE_JOB_INIT_NUM, ACTIVE_JOB_INCREASE_PACE, RUNNING_JOB_MAX_NUM
ACTIVE_JOB_INIT_NUM = init_active_job_num
ACTIVE_JOB_MAX_NUM = max_active_job_num
ACTIVE_JOB_INCREASE_PACE = increase_pace
RUNNING_JOB_MAX_NUM = running_job_max_num
active_job_num: the current number of active jobs; the initial value is ACTIVE_JOB_INIT_NUM
activate_job_threshold: the threshold for the active job number
active_jobs: a queue of all active jobs
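To make this state concrete, here is a hypothetical Java skeleton; every name (DynamicLoadDriver, LoadJob, JobMasterClient, and all fields) is illustrative rather than actual Alluxio code. The methods sketched after the pseudocode sections below fill it in.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

// Minimal job model assumed for the sketch (not Alluxio's actual classes).
enum JobStatus { CREATED, RUNNING, COMPLETED, FAILED }

class LoadJob {
  final String fileName;
  volatile JobStatus status = JobStatus.CREATED;

  LoadJob(String fileName) {
    this.fileName = fileName;
  }

  boolean isCompleted() { return status == JobStatus.COMPLETED; }
  boolean isFailed() { return status == JobStatus.FAILED; }
}

// Stand-in for the job master RPC client; submission drives the status.
interface JobMasterClient {
  void submit(LoadJob job);
}

class DynamicLoadDriver {
  // Init parameters, named after the proposal.
  final int initActiveJobNum;  // ACTIVE_JOB_INIT_NUM
  final int maxActiveJobNum;   // ACTIVE_JOB_MAX_NUM
  final int increasePace;      // ACTIVE_JOB_INCREASE_PACE
  final int runningJobMaxNum;  // RUNNING_JOB_MAX_NUM

  // Runtime state.
  volatile int activeJobNum;        // current window size
  volatile int activeJobThreshold;  // slow-start threshold
  volatile boolean finished = false;
  // All jobs admitted into the window; iterated by the adjuster thread.
  final List<LoadJob> activeJobs = new CopyOnWriteArrayList<>();
  // Admitted but not yet submitted jobs, consumed by the submitter threads.
  final BlockingQueue<LoadJob> pendingJobs = new LinkedBlockingQueue<>();

  DynamicLoadDriver(int initNum, int maxNum, int pace, int runningMax) {
    initActiveJobNum = initNum;
    maxActiveJobNum = maxNum;
    increasePace = pace;
    runningJobMaxNum = runningMax;
    activeJobNum = initNum;
    activeJobThreshold = maxNum; // assumption: no doubling ceiling at start
  }
}
```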
The main load process:
for file_name in load_file_list
    create a job with the corresponding file name
    if active_jobs.size() < active_job_num
        add the job into active_jobs
    else
        wait a while, then retry
    end
end
drain active_jobs, waiting for all jobs to finish
The submission process (multiple threads; each thread loops):
    get a job from active_jobs
    submit the job to the job master
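A sketch of the two loops above, continuing the hypothetical DynamicLoadDriver skeleton. The sleep intervals and poll timeout are placeholder values, and the separate pendingJobs queue is an interpretation: the submitters must not remove jobs from the set the adjuster monitors.

```java
// Continuing the DynamicLoadDriver sketch (same fields as above);
// ExecutorService/Executors come from java.util.concurrent.
void load(List<String> loadFileList) throws InterruptedException {
  for (String fileName : loadFileList) {
    LoadJob job = new LoadJob(fileName); // one job per file
    // Admit the job only when the window has room; the window size
    // (activeJobNum) is changed concurrently by the adjuster thread,
    // and the adjuster evicts finished jobs to free slots.
    while (activeJobs.size() >= activeJobNum) {
      Thread.sleep(100); // "wait a while"
    }
    activeJobs.add(job);
    pendingJobs.put(job);
  }
  // Drain: wait until every admitted job reaches a terminal state.
  while (!activeJobs.isEmpty()) {
    Thread.sleep(100);
  }
  finished = true; // stops the submitter and adjuster loops
}

// Submitter threads: each takes admitted jobs and sends them to the
// job master.
void startSubmitters(int threadCount, JobMasterClient client) {
  java.util.concurrent.ExecutorService pool =
      java.util.concurrent.Executors.newFixedThreadPool(threadCount);
  for (int i = 0; i < threadCount; i++) {
    pool.submit(() -> {
      try {
        while (!finished) {
          // Poll with a timeout so the thread exits once finished flips.
          LoadJob job = pendingJobs.poll(1, java.util.concurrent.TimeUnit.SECONDS);
          if (job != null) {
            client.submit(job);
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
  }
}
```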
The adjustment process for the active job number (runs in a separate thread):
while !finished
    wake up every second
    completed_job_num = 0
    failed_job_num = 0
    for job in active_jobs
        if job in completed status
            completed_job_num++
        else if job in failed status
            failed_job_num++
        end
    end
    if failed_job_num != 0
        active_job_num = max(active_job_num / 2, init_active_job_num)
        activate_job_threshold = active_job_num
    else if active_job_num - completed_job_num > running_job_max_num
        // too many jobs in running status, don't adjust the number
    else if active_job_num > activate_job_threshold
        active_job_num = active_job_num + ACTIVE_JOB_INCREASE_PACE // linear increase
    else
        active_job_num = min(active_job_num * 2, ACTIVE_JOB_MAX_NUM) // double the number
    end
end
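The adjuster, continuing the same hypothetical DynamicLoadDriver sketch. Evicting terminal jobs after counting is an assumption the pseudocode leaves implicit, but it is what frees window slots for the main load loop; the cap on the linear step is likewise an added safeguard.

```java
// Runs on its own thread; wakes up once per second and applies the rule.
void adjustLoop() throws InterruptedException {
  while (!finished) {
    Thread.sleep(1000); // wake up every second
    int completedJobNum = 0;
    int failedJobNum = 0;
    for (LoadJob job : activeJobs) {
      if (job.isCompleted()) {
        completedJobNum++;
      } else if (job.isFailed()) {
        failedJobNum++;
      }
    }
    if (failedJobNum != 0) {
      // Multiplicative decrease; the halved value becomes the new threshold.
      activeJobNum = Math.max(activeJobNum / 2, initActiveJobNum);
      activeJobThreshold = activeJobNum;
    } else if (activeJobNum - completedJobNum > runningJobMaxNum) {
      // Too many jobs still in running status; leave the window unchanged.
    } else if (activeJobNum > activeJobThreshold) {
      // Above the threshold: additive (linear) increase, capped at the max.
      activeJobNum = Math.min(activeJobNum + increasePace, maxActiveJobNum);
    } else {
      // At or under the threshold: double, as in slow start.
      activeJobNum = Math.min(activeJobNum * 2, maxActiveJobNum);
    }
    // Drop terminal jobs so new jobs can be admitted into the window.
    activeJobs.removeIf(j -> j.isCompleted() || j.isFailed());
  }
}
```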
Top GitHub Comments
@Binyang2014 The scenario I imagined is as follows: it will limit our bandwidth unnecessarily. Imagine two concurrent distributedLoad jobs, each accessing a different UFS. If the first one reaches network capacity and starts to error, it would prevent the second job from increasing the maxJobNum even if it is nowhere near the limit.
For @LuQQiu's comments: with running_job_max_num, the number of jobs will be limited. Maybe we can give some recommended init parameters for the small-file scenario and the large-file scenario. For the distributedLoad command, we can add an option such as bin/alluxio distributedLoad -dynamicAdjustOption {activeJobInitNum: xx, activeJobMaxNum: xx} to achieve backward compatibility.