Multi-Threaded Job PR Suggestion

Hi,

In reference to #299:

May I suggest an alternative to the fork-join method? These are the issues I have been having with the fork-join method:

  1. It breaks up your job into multiple jobs.

  2. It does not have the constant-memory guarantee of a single batch job. A single batch job processes one batch of size N at a time, so at most N records are in memory at any given time. With the fork-join model, memory can grow unexpectedly. If your fork threads are much slower than your file-reading thread (as is the case for me), memory can grow very fast. I ran the fork-join tutorial with a million-line file, changing the code as follows:

    private static Job buildForkJob(String jobName, File tweets, List<BlockingQueue<Record>> workQueues)
            throws FileNotFoundException {
        return aNewJob()
                .named(jobName)
                .reader(new FlatFileRecordReader(tweets))
                .writer(new RoundRobinBlockingQueueRecordWriter(workQueues))
                .jobListener(new PoisonRecordBroadcaster(workQueues))
                .build();
    }

    private static Job buildWorkerJob(String jobName, BlockingQueue<Record> workQueue, BlockingQueue<Record> joinQueue) {
        return aNewJob()
                .named(jobName)
                .reader(new BlockingQueueRecordReader(workQueue))
                .processor(new TweetProcessor(jobName))
                .processor(x -> {
                    Thread.sleep(1000);
                    return x;
                })
                .writer(new BlockingQueueRecordWriter(joinQueue))
                .build();
    }

    private static Job buildJoinJob(String jobName, File out, BlockingQueue<Record> joinQueue) throws IOException {
        return aNewJob()
                .named(jobName)
                .reader(new BlockingQueueRecordReader(joinQueue, NB_WORKERS))
                .filter(new PoisonRecordFilter())
                .writer(new FileRecordWriter(out))
                .build();
    }

[Image: memory usage of the running process]

I stopped the process early, but as you can see, the memory kept growing. I would not expect a batch size of 100 to use 1.5 GB of memory!
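To illustrate the growth pattern described above, here is a minimal sketch in plain Java. It assumes the work queues are unbounded (the tutorial's actual queue type is not shown in this issue), and `QueueGrowthSketch` and `offerWithoutConsumer` are hypothetical names, not Easy Batch APIs. It compares how many records an unbounded and a bounded queue hold when the consumer makes no progress:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not part of Easy Batch.
final class QueueGrowthSketch {

    // Offers n records to the queue while no consumer is draining it and
    // returns how many the queue accepted, i.e. how many it now holds in memory.
    static int offerWithoutConsumer(BlockingQueue<String> queue, int n) {
        int accepted = 0;
        for (int i = 0; i < n; i++) {
            // offer() returns false instead of blocking when a bounded queue is full
            if (queue.offer("record-" + i)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        int unbounded = offerWithoutConsumer(new LinkedBlockingQueue<>(), 10_000);
        int bounded = offerWithoutConsumer(new ArrayBlockingQueue<>(100), 10_000);
        System.out.println("unbounded queue holds " + unbounded + " records");
        System.out.println("bounded queue holds " + bounded + " records");
    }
}
```

With an unbounded queue every record read by the fast producer stays in memory until a slow worker takes it. A bounded queue caps that number, and a real producer would typically call `put()` so that it blocks until space is available rather than dropping records.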

To solve this issue, I would like to either create a new job type or update the existing BatchJob class.

The current BatchJob code reads as follows:

    while (moreRecords() && !isInterrupted()) {
        Batch batch = readAndProcessBatch();
        writeBatch(batch);
    }

and I would like to implement something more like this (pseudo code):

    while (moreRecords() && !isInterrupted()) {
        Batch batch = readBatch();
        for (RecordProcessor processor : recordProcessorList) {
            if (processor instanceof BatchProcessor) {
                // a batch-aware processor handles the whole batch at once
                batch = ((BatchProcessor) processor).processBatch(batch);
            } else {
                // a regular processor is applied record by record
                Batch newBatch = new Batch();
                for (Record record : batch) {
                    newBatch.addRecord(processor.processRecord(record));
                }
                batch = newBatch;
            }
        }
        writeBatch(batch);
    }

The proposed BatchProcessor type would allow a processor to run on a whole batch of records. With this, I could create a multiThreadedProcessor that runs on batches. I think this code could provide the following:

  • allow for a new type of processor that can process batches instead of individual records
  • keep constant memory
  • keep all multi-threaded code within the same job
  • keep records in a consistent order
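A minimal sketch of how such a multi-threaded batch step could keep records in order is shown below. It is plain Java, and every name in it (`ParallelBatchSketch`, `processBatch`) is hypothetical: it uses a plain `List` in place of Easy Batch's `Batch` and a `Function` in place of a `RecordProcessor`, so it only illustrates the idea, not an actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: none of these names exist in Easy Batch.
final class ParallelBatchSketch {

    // Applies the per-record function to every record of one batch on a
    // fixed-size pool and returns the results in the original record order.
    static <T, R> List<R> processBatch(List<T> batch, Function<T, R> perRecord, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (T record : batch) {
                tasks.add(() -> perRecord.apply(record));
            }
            // invokeAll blocks until every task has completed and returns the
            // futures in the same order as the submitted task list.
            List<R> results = new ArrayList<>(batch.size());
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> batch = Arrays.asList("tweet-1", "tweet-2", "tweet-3", "tweet-4");
        System.out.println(processBatch(batch, String::toUpperCase, 2));
    }
}
```

Because the call only returns once the whole batch is done, at most one batch of records is in flight at a time, which is the constant-memory property listed above. A real implementation would probably reuse one pool across batches instead of creating one per call.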

Do you think this is a bad idea? Are there any major issues that I am not addressing? If you think this is a good idea, may I attempt a PR?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

fmbenhassine commented, Oct 27, 2017 (1 reaction)

Hi @ipropper

Thank you for this analysis! I haven’t had time to look at this issue yet, but I will definitely try to do so this weekend and get back to you ASAP.

Kr Mahmoud

fmbenhassine commented, Feb 26, 2020 (0 reactions)

I was re-reading this thread and found it really interesting! Many thanks to all for sharing these nice ideas 👍 I have already explained my point of view in the previous messages, so I’m not going to repeat it here.

In hindsight, I would not introduce a multi-threaded job implementation, in order to avoid all the joy of thread safety and whatnot… The implementation complexity and the maintenance burden are higher than the added value of this feature. I do believe Easy Batch jobs are lightweight Callable objects that can be composed to create more complex workflows (with parallel jobs, loops, conditionals, etc.), either manually (as shown in the tutorials) or using a workflow engine like Easy Flows.
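As a rough illustration of that composition idea, the sketch below runs two stand-in jobs in parallel and then a third one sequentially, using only `java.util.concurrent`. Plain `Callable<String>` objects stand in for jobs here; `JobCompositionSketch`, `fakeJob`, and `runInParallel` are hypothetical names and this is not Easy Batch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Plain Callables stand in for jobs here; this is not Easy Batch code.
final class JobCompositionSketch {

    // Hypothetical stand-in for a job: returns a short report string when called.
    static Callable<String> fakeJob(String name) {
        return () -> name + " done";
    }

    // Runs the given jobs in parallel and collects their reports in submission order.
    static List<String> runInParallel(List<Callable<String>> jobs) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(Math.max(1, jobs.size()));
        try {
            List<String> reports = new ArrayList<>();
            for (Future<String> future : executor.invokeAll(jobs)) {
                reports.add(future.get());
            }
            return reports;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // A parallel step followed by a sequential step, composed by hand.
        List<String> workers = runInParallel(Arrays.asList(fakeJob("worker-1"), fakeJob("worker-2")));
        String join = fakeJob("join").call();
        System.out.println(workers + " then " + join);
    }
}
```

Loops and conditionals can be layered on in the same way with ordinary Java control flow around the `call()` and `invokeAll()` invocations.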

This feature can be implemented in a fork if needed. OSS FTW!
