Multi-Threaded Job PR Suggestion
Hi,
in reference to #299
May I suggest an alternative to the fork-join method? These are the issues I have been having with it:
- It breaks up your job into multiple jobs.
- It does not have the constant memory guarantee of a single batch job. A single batch job only processes a batch of size N, so you will only have N records in memory at any given time. With the fork-join model, memory grows somewhat unpredictably: if your fork threads are much slower than your file-reading thread (as is the case for me), memory can grow very fast. I ran the fork-join tutorial with a million-line file, changing the following code:
private static Job buildForkJob(String jobName, File tweets, List<BlockingQueue<Record>> workQueues)
        throws FileNotFoundException {
    return aNewJob()
            .named(jobName)
            .reader(new FlatFileRecordReader(tweets))
            .writer(new RoundRobinBlockingQueueRecordWriter(workQueues))
            .jobListener(new PoisonRecordBroadcaster(workQueues))
            .build();
}
private static Job buildWorkerJob(String jobName, BlockingQueue<Record> workQueue, BlockingQueue<Record> joinQueue) {
    return aNewJob()
            .named(jobName)
            .reader(new BlockingQueueRecordReader(workQueue))
            .processor(new TweetProcessor(jobName))
            .processor(x -> {
                Thread.sleep(1000); // simulate a slow worker
                return x;
            })
            .writer(new BlockingQueueRecordWriter(joinQueue))
            .build();
}
private static Job buildJoinJob(String jobName, File out, BlockingQueue<Record> joinQueue) throws IOException {
    return aNewJob()
            .named(jobName)
            .reader(new BlockingQueueRecordReader(joinQueue, NB_WORKERS))
            .filter(new PoisonRecordFilter())
            .writer(new FileRecordWriter(out))
            .build();
}
I stopped the process early, but the memory kept growing the whole time. I would not expect a batch of size 100 to use 1.5 GB of memory!
To solve this issue, I would like to either create a new job type or update the existing BatchJob class.
The current BatchJob code reads as follows:
while (moreRecords() && !isInterrupted()) {
    Batch batch = readAndProcessBatch();
    writeBatch(batch);
}
and I would like to implement something more like this (pseudocode):
while (moreRecords() && !isInterrupted()) {
    Batch batch = readBatch();
    for (RecordProcessor processor : recordProcessorList) {
        if (processor instanceof BatchProcessor) {
            // batch-aware processors receive the whole batch at once
            batch = ((BatchProcessor) processor).processBatch(batch);
        } else {
            // regular processors are applied record by record
            Batch newBatch = new Batch();
            for (Record record : batch) {
                newBatch.addRecord(processor.processRecord(record));
            }
            batch = newBatch;
        }
    }
    writeBatch(batch);
}
The BatchProcessor class allows a processor to run on a whole batch of records. With this I could create a multi-threaded processor that operates on batches. I think this code could provide the following:
- allow for a new type of processor that can process batches instead of individual records
- keep constant memory
- keep all multi threaded code within the same job
- keep records in a consistent order
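To show that such a batch-level processor can be multi-threaded and still keep records in order, here is a hedged sketch (the BatchProcessor interface and MultiThreadedProcessor class are hypothetical, not Easy Batch API): ExecutorService.invokeAll returns futures in submission order, so the output batch preserves the input order even though records are processed concurrently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.UnaryOperator;

public class OrderedBatchDemo {

    // Hypothetical batch-level processor contract.
    interface BatchProcessor<T> {
        List<T> processBatch(List<T> batch) throws Exception;
    }

    // Runs each record on a worker thread; invokeAll() returns futures
    // in submission order, so the result preserves the input order.
    static class MultiThreadedProcessor<T> implements BatchProcessor<T> {
        private final ExecutorService pool;
        private final UnaryOperator<T> recordProcessor;

        MultiThreadedProcessor(int threads, UnaryOperator<T> recordProcessor) {
            this.pool = Executors.newFixedThreadPool(threads);
            this.recordProcessor = recordProcessor;
        }

        @Override
        public List<T> processBatch(List<T> batch) throws Exception {
            List<Callable<T>> tasks = new ArrayList<>();
            for (T rec : batch) {
                tasks.add(() -> recordProcessor.apply(rec));
            }
            List<T> out = new ArrayList<>();
            for (Future<T> f : pool.invokeAll(tasks)) { // blocks until all tasks finish
                out.add(f.get());
            }
            return out;
        }

        void shutdown() {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        MultiThreadedProcessor<String> p =
                new MultiThreadedProcessor<>(4, String::toUpperCase);
        System.out.println(p.processBatch(List.of("a", "b", "c"))); // [A, B, C]
        p.shutdown();
    }
}
```

Because the batch size bounds the number of in-flight records, memory stays constant regardless of how slow the worker threads are.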
Do you think this is a bad idea? Are there any major issues that I am not addressing? If you think this is a good idea, may I attempt a PR?
Issue Analytics
- Created 6 years ago
- Comments: 15 (15 by maintainers)
Hi @ipropper
Thank you for this analysis! I didn’t have time yet to look at this issue but I will definitely try to do it this weekend and get back to you asap.
Kr Mahmoud
I was re-reading this thread and found it really interesting! Many thanks to all for sharing these nice ideas 👍 I have already explained my point of view in the previous messages, so I’m not going to repeat it here.
In hindsight, I would not introduce a multi-threaded job implementation, to avoid all the joy of thread safety and whatnot… The implementation complexity and the maintenance burden are higher than the added value of this feature. I do believe Easy Batch jobs are lightweight Callable objects that can be composed to create more complex workflows (with parallel jobs, loops, conditionals, etc) either manually (as shown in the tutorials) or using a workflow engine like Easy Flows. This feature can be implemented in a fork if needed. OSS FTW!