Using the thread pool takes longer than dummy
This issue looked like it might shed light on my problem, but it seems I'm missing something more fundamental. When I run the script below to simply iterate through a BatchedDataLoader, the thread pool option actually takes longer than running in the main worker via dummy, and more workers always results in a longer total processing time. I'm running this on a machine with 16 vCPUs and 64 GiB of memory. Memory usage stays stable throughout, but CPU usage spikes during the iterations. Regardless of how long a single iteration takes, using multiple workers should reduce the overall time, right? The dataset I'm testing on is small: about 1.4 GiB saved in Parquet format in S3, roughly 1.5M rows, and I'm only running 100 iterations with batch size 100. Maybe the script is not working as intended?
Script:
```python
import sys
import time
from itertools import product

from petastorm import make_batch_reader
from petastorm.pytorch import BatchedDataLoader

pools = ['dummy', 'thread']
workers = [1, 4, 8]
for i in product(pools, workers):
    print(i)

# s3_path, cols and trsfm_spec are defined elsewhere; the transform spec
# applies a few different transformations to various columns of the pandas dataframe
for pool_type, workers_count in product(pools, workers):
    with make_batch_reader(s3_path,
                           workers_count=workers_count,
                           transform_spec=trsfm_spec,
                           schema_fields=cols,
                           num_epochs=10,
                           reader_pool_type=pool_type) as reader:
        loader = BatchedDataLoader(reader, batch_size=100)
        loader_iter = iter(loader)
        time_sum = 0
        batch_size_sum = 0
        batches = 100
        print('--')
        print(f'workers_count: {workers_count} and pool_type: {pool_type}')
        loop_start = time.time()
        for batch_idx in range(batches):
            start = time.time()
            batch = next(loader_iter)
            end = time.time()
            time_sum += end - start
            batch_size = sys.getsizeof(batch)
            # skip the first two batches to exclude warm-up cost
            if batch_idx not in (0, 1):
                batch_size_sum += batch_size
        loop_end = time.time()
        print(f'time sum: {time_sum}')
        print(f'average time to process batch: {time_sum / (batches - 2)}')
        print(f'loop time: {loop_end - loop_start}')
        # note: divides by the final batch_idx (99), not the 98 batches summed
        print(f'average batch size: {batch_size_sum / batch_idx}')
```
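One measurement caveat I noticed while writing this up: `sys.getsizeof(batch)` only reports the size of the top-level container object, not the tensors it holds, which is probably why "average batch size" below comes out near 1 KB for a 100-row batch. A minimal sketch of a deeper measurement, assuming the batch is a dict of torch tensors (`batch_nbytes` is an illustrative helper, not part of Petastorm):

```python
import torch

def batch_nbytes(batch):
    # Sum the storage of every tensor in the batch dict;
    # sys.getsizeof would count only the dict object itself.
    return sum(v.element_size() * v.nelement()
               for v in batch.values() if torch.is_tensor(v))
```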
Output:
```
('dummy', 1)
('dummy', 4)
('dummy', 8)
('thread', 1)
('thread', 4)
('thread', 8)
--
workers_count: 1 and pool_type: dummy
time sum: 2.927597761154175
average time to process batch: 0.029873446542389537
loop time: 2.927725315093994
average batch size: 1172.040404040404
--
workers_count: 4 and pool_type: dummy
time sum: 4.17024040222168
average time to process batch: 0.04255347349205796
loop time: 4.1703784465789795
average batch size: 1172.040404040404
--
workers_count: 8 and pool_type: dummy
time sum: 4.148790121078491
average time to process batch: 0.042334593072229504
loop time: 4.1489198207855225
average batch size: 1172.040404040404
--
workers_count: 1 and pool_type: thread
time sum: 4.524890422821045
average time to process batch: 0.04617235125327597
loop time: 4.525023937225342
average batch size: 1172.040404040404
--
workers_count: 4 and pool_type: thread
time sum: 8.192336320877075
average time to process batch: 0.08359526858037832
loop time: 8.192479372024536
average batch size: 1172.040404040404
--
workers_count: 8 and pool_type: thread
time sum: 11.62452483177185
average time to process batch: 0.11861760032420256
loop time: 11.624683380126953
average batch size: 1172.040404040404
```
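My current guess: with per-batch work this cheap, the thread pool's hand-off overhead (plus the GIL, since the transform spec runs pure-Python pandas code) could plausibly outweigh any parallelism, which would match the trend above. One thing I plan to try, sketched under the assumption that my Petastorm version accepts the 'process' pool for make_batch_reader (I haven't verified this), is moving decoding out of the GIL entirely:

```python
# same s3_path / trsfm_spec / cols as in the script above
with make_batch_reader(s3_path,
                       workers_count=4,
                       transform_spec=trsfm_spec,
                       schema_fields=cols,
                       num_epochs=10,
                       reader_pool_type='process') as reader:
    loader = BatchedDataLoader(reader, batch_size=100)
    for batch in loader:
        ...  # consume batches as before
```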
Top GitHub Comments
Do you think having an option for a user to supply their own collate function would be helpful in your case? This is something that was brought up in #647, and something I will try to address in the coming weeks.
Agreed. We’ll try moving in this direction. Thank you for your input!
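To make the request concrete, something along these lines is what we had in mind; the `collate_fn` parameter is hypothetical here, as BatchedDataLoader does not accept such an argument today:

```python
import torch

def my_collate(col_batch):
    # col_batch: mapping of field name -> numpy array for one batch
    # (illustrative: cast everything to float32 tensors)
    return {name: torch.as_tensor(vals, dtype=torch.float32)
            for name, vals in col_batch.items()}

# hypothetical signature -- the hook being requested above
loader = BatchedDataLoader(reader, batch_size=100, collate_fn=my_collate)
```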