Running multiple jobs using one config file
We often have this kind of use case:
- Read src_accounts.csv file and write to accounts table on PostgreSQL
- Read src_payments.csv file and write to payments table on PostgreSQL
- Read src_requests.csv file and write to requests table on PostgreSQL
In this case, we need multiple configuration files and multiple embulk run commands.
The idea here is to add a simple multi-transaction feature to embulk.
For example, the configuration file can be like this:
    jobs:
      accounts:
      payments:
      requests:
    in:
      type: file
      path_prefix: src_${job}
      parser:
        type: csv
    out:
      type: postgresql
      table: table_${job}
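To make the proposal concrete, here is a minimal Python sketch of how one config with a jobs section could be expanded into one standalone config per job. This is not embulk code: the function names substitute and expand_jobs are hypothetical, the config is written as a plain dict instead of being loaded from YAML, and it assumes ${job} is the only placeholder that needs replacing.

```python
from string import Template


def substitute(node, job):
    """Recursively replace ${job} in every string value of a config tree."""
    if isinstance(node, dict):
        return {key: substitute(value, job) for key, value in node.items()}
    if isinstance(node, list):
        return [substitute(item, job) for item in node]
    if isinstance(node, str):
        # safe_substitute leaves any other $placeholder untouched
        return Template(node).safe_substitute(job=job)
    return node


def expand_jobs(config):
    """Hypothetical helper: return {job_name: standalone config}."""
    shared = {key: value for key, value in config.items() if key != "jobs"}
    return {job: substitute(shared, job) for job in config["jobs"]}


# The proposed config file, written as a dict for illustration
config = {
    "jobs": {"accounts": None, "payments": None, "requests": None},
    "in": {"type": "file", "path_prefix": "src_${job}", "parser": {"type": "csv"}},
    "out": {"type": "postgresql", "table": "table_${job}"},
}

expanded = expand_jobs(config)
print(expanded["payments"]["in"]["path_prefix"])  # src_payments
print(expanded["payments"]["out"]["table"])       # table_payments
```

Each expanded entry is equivalent to one of the separate configuration files that are needed today.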
The discussion points are guess, transaction, next config diff, and resume.
As the result of guess, embulk could output this config file:
    jobs:
      accounts:
        in:
          parser:
            type: csv
            columns:
            - {name: id, type: long}
            - {name: surename, type: string}
            - {name: givenname, type: string}
      payments:
        in:
          parser:
            type: csv
            columns:
            - {name: id, type: long}
            - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
      requests:
        in:
          parser:
            type: csv
            columns:
            - {name: request_id, type: string}
            - {name: account_id, type: long}
            - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
    in:
      type: file
      path_prefix: src_${job}
      parser:
        type: csv
    out:
      type: postgresql
      table: table_${job}
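One way to read this guessed file is that each job's section overrides the shared in/out sections. The sketch below shows that interpretation as a recursive merge in Python. The merge rule is an assumption, not something the proposal specifies, and deep_merge is a hypothetical helper rather than an embulk function.

```python
def deep_merge(base, override):
    """Return a new dict where override wins and nested dicts merge recursively."""
    merged = dict(base)
    for key, value in (override or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shared sections from the bottom of the config file
shared = {
    "in": {"type": "file", "path_prefix": "src_${job}", "parser": {"type": "csv"}},
    "out": {"type": "postgresql", "table": "table_${job}"},
}

# Per-job section that guess produced for the accounts job
guessed_accounts = {
    "in": {
        "parser": {
            "type": "csv",
            "columns": [
                {"name": "id", "type": "long"},
                {"name": "surename", "type": "string"},
                {"name": "givenname", "type": "string"},
            ],
        }
    }
}

effective = deep_merge(shared, guessed_accounts)
print(effective["in"]["type"])                    # file
print(len(effective["in"]["parser"]["columns"]))  # 3
```

Under this rule the accounts job keeps the shared file input and PostgreSQL output and only gains the guessed column list.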
For transactions, one idea is to commit jobs one by one only after all jobs have completed. For example, embulk runs transactions in this order:
- job accounts begins (1 task)
- job payments begins (2 tasks)
- job requests begins (2 tasks)
- job accounts task-1 completes
- job payments task-1 completes
- job requests task-1 completes
- job payments task-2 completes
- job requests task-2 completes
- job requests commits
- job payments commits
- job accounts commits
A concern is that requests and payments won't be rolled back when the commit of job accounts fails.
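The ordering and the rollback concern can be illustrated with a small Python simulation. It is only a sketch of the sequence listed above: run_jobs and its failing_commit parameter are hypothetical, tasks are assumed to interleave round-robin, and nothing here reflects how embulk actually schedules or commits work.

```python
def run_jobs(task_counts, failing_commit=None):
    """Simulate begin -> tasks -> commit for several jobs.

    task_counts maps job name to number of tasks; dict order is begin order.
    failing_commit names a job whose commit fails (illustration only).
    Returns (event log, jobs that were already committed).
    """
    log = []
    for job, count in task_counts.items():
        unit = "task" if count == 1 else "tasks"
        log.append(f"job {job} begins ({count} {unit})")

    # Tasks are assumed to complete round-robin across jobs
    for round_no in range(1, max(task_counts.values(), default=0) + 1):
        for job, count in task_counts.items():
            if round_no <= count:
                log.append(f"job {job} task-{round_no} completes")

    # Commit one by one, in reverse begin order, after all tasks are done
    committed = []
    for job in reversed(list(task_counts)):
        if job == failing_commit:
            log.append(f"job {job} commit FAILED")
            break
        committed.append(job)
        log.append(f"job {job} commits")
    return log, committed


counts = {"accounts": 1, "payments": 2, "requests": 2}

ok_log, ok_committed = run_jobs(counts)
fail_log, fail_committed = run_jobs(counts, failing_commit="accounts")

print("\n".join(ok_log))
print(fail_committed)  # ['requests', 'payments']
```

In the failing run, requests and payments are already committed by the time the accounts commit fails, which is exactly the case where they cannot be rolled back.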
DSL (https://github.com/embulk/embulk/issues/131) does a similar thing. Any thoughts?
Top GitHub Comments
We have that kind of multiple-in/out expansion in a (very) long-term plan, but it is still just a plan, and it may not solve your exact issue. It wouldn't be in the form described in this #167. I'd also suggest that Digdag would be the most straightforward way.
embulk-output-multi may not work with Embulk v0.10 and later.

Hello, @Ryo51289
Embulk doesn't support multi input. You need to use a workflow engine such as Digdag.
Also, the embulk-output-multi plugin may help.