Using the thread pool takes longer than dummy
This issue looked like it might shed light on my problem, but it seems I'm missing something more fundamental. When I run the script below to simply iterate through a BatchedDataLoader, the thread pool option actually takes longer than running in the main worker via dummy, and more workers always results in a longer total processing time. I'm running this on a machine with 16 vCPUs and 64 GiB of memory. Memory usage stays stable throughout, but CPU usage spikes during the iterations. Regardless of how long a single iteration takes, using multiple workers should reduce the overall time, right? The dataset I'm testing on is small: about 1.4 GiB saved in Parquet format in S3, roughly 1.5M rows, and I'm only running 100 iterations with batch size 100. Maybe the script is not working as intended?
Script:
```python
import sys
import time
from itertools import product

from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

pools = ['dummy', 'thread']
workers = [1, 4, 8]
for i in product(pools, workers):
    print(i)

# s3_path, cols and trsfm_spec are defined elsewhere; the transform spec
# applies a few different transformations to various columns of the pandas dataframe
for pool_type, workers_count in product(pools, workers):
    with make_batch_reader(s3_path,
                           workers_count=workers_count,
                           transform_spec=trsfm_spec,
                           schema_fields=cols,
                           num_epochs=10,
                           reader_pool_type=pool_type) as reader:
        loader = BatchedDataLoader(reader, batch_size=100)
        loader_iter = iter(loader)
        time_sum = 0
        batch_size_sum = 0
        batches = 100
        print('--')
        print(f'workers_count: {workers_count} and pool_type: {pool_type}')
        loop_start = time.time()
        for batch_idx in range(batches):
            start = time.time()
            batch = next(loader_iter)
            end = time.time()
            time_sum += end - start
            batch_size = sys.getsizeof(batch)
            # skip the first two batches to exclude warm-up cost
            if batch_idx not in (0, 1):
                batch_size_sum += batch_size
        loop_end = time.time()
        print(f'time sum: {time_sum}')
        print(f'average time to process batch: {time_sum / (batches - 2)}')
        print(f'loop time: {loop_end - loop_start}')
        # note: divides by the final batch_idx (99), not the 98 batches summed
        print(f'average batch size: {batch_size_sum / batch_idx}')
```
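One measurement caveat I noticed while writing this up: `sys.getsizeof(batch)` only reports the size of the top-level container object, not the tensors it holds, which is probably why "average batch size" below comes out near 1 KB for a 100-row batch. A minimal sketch of a deeper measurement, assuming the batch is a dict of torch tensors (`batch_nbytes` is an illustrative helper, not part of Petastorm):

```python
import torch

def batch_nbytes(batch):
    # Sum the storage of every tensor in the batch dict;
    # sys.getsizeof would count only the dict object itself.
    return sum(v.element_size() * v.nelement()
               for v in batch.values() if torch.is_tensor(v))
```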
Output:
```
('dummy', 1)
('dummy', 4)
('dummy', 8)
('thread', 1)
('thread', 4)
('thread', 8)
--
workers_count: 1 and pool_type: dummy
time sum: 2.927597761154175
average time to process batch: 0.029873446542389537
loop time: 2.927725315093994
average batch size: 1172.040404040404
--
workers_count: 4 and pool_type: dummy
time sum: 4.17024040222168
average time to process batch: 0.04255347349205796
loop time: 4.1703784465789795
average batch size: 1172.040404040404
--
workers_count: 8 and pool_type: dummy
time sum: 4.148790121078491
average time to process batch: 0.042334593072229504
loop time: 4.1489198207855225
average batch size: 1172.040404040404
--
workers_count: 1 and pool_type: thread
time sum: 4.524890422821045
average time to process batch: 0.04617235125327597
loop time: 4.525023937225342
average batch size: 1172.040404040404
--
workers_count: 4 and pool_type: thread
time sum: 8.192336320877075
average time to process batch: 0.08359526858037832
loop time: 8.192479372024536
average batch size: 1172.040404040404
--
workers_count: 8 and pool_type: thread
time sum: 11.62452483177185
average time to process batch: 0.11861760032420256
loop time: 11.624683380126953
average batch size: 1172.040404040404
```
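My current guess: with per-batch work this cheap, the thread pool's hand-off overhead (plus the GIL, since the transform spec runs pure-Python pandas code) could plausibly outweigh any parallelism, which would match the trend above. One thing I plan to try, sketched under the assumption that my Petastorm version accepts the 'process' pool for make_batch_reader (I haven't verified this), is moving decoding out of the GIL entirely:

```python
# same s3_path / trsfm_spec / cols as in the script above
with make_batch_reader(s3_path,
                       workers_count=4,
                       transform_spec=trsfm_spec,
                       schema_fields=cols,
                       num_epochs=10,
                       reader_pool_type='process') as reader:
    loader = BatchedDataLoader(reader, batch_size=100)
    for batch in loader:
        ...  # consume batches as before
```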
Top GitHub Comments
Do you think having an option for a user to supply their own collate function would be helpful in your case? This is something that was brought up in #647, and something I will try to address in the coming weeks.
Agreed. We’ll try moving in this direction. Thank you for your input!
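To make the request concrete, something along these lines is what we had in mind; the `collate_fn` parameter is hypothetical here, as BatchedDataLoader does not accept such an argument today:

```python
import torch

def my_collate(col_batch):
    # col_batch: mapping of field name -> numpy array for one batch
    # (illustrative: cast everything to float32 tensors)
    return {name: torch.as_tensor(vals, dtype=torch.float32)
            for name, vals in col_batch.items()}

# hypothetical signature -- the hook being requested above
loader = BatchedDataLoader(reader, batch_size=100, collate_fn=my_collate)
```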