Running multiple jobs using one config file

We often have this kind of use case:

  • Read src_accounts.csv file and write to accounts table on PostgreSQL
  • Read src_payments.csv file and write to payments table on PostgreSQL
  • Read src_requests.csv file and write to requests table on PostgreSQL

In this case, we need multiple configuration files and multiple embulk run commands. The idea here is to add a simple multi-transaction feature to Embulk, so that one configuration file can describe all of these jobs.

For example, the configuration file can be like this:

jobs:
    accounts:
    payments:
    requests:
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}
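
To make the proposal concrete, here is a minimal Python sketch of how the `${job}` placeholder might be expanded into one effective config per job. It is not Embulk code: `TEMPLATE`, `expand`, and `expand_jobs` are hypothetical names, and the config is written as a plain dict rather than loaded from YAML.

```python
from string import Template

# Hypothetical shared template, mirroring the proposed config above.
TEMPLATE = {
    "in": {
        "type": "file",
        "path_prefix": "src_${job}",
        "parser": {"type": "csv"},
    },
    "out": {
        "type": "postgresql",
        "table": "table_${job}",
    },
}


def expand(node, job):
    """Return a copy of node with ${job} substituted in every string value."""
    if isinstance(node, dict):
        return {key: expand(value, job) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, job) for item in node]
    if isinstance(node, str):
        # safe_substitute leaves unknown placeholders untouched instead of raising.
        return Template(node).safe_substitute(job=job)
    return node


def expand_jobs(template, jobs):
    """Build one expanded config per job name (assumed behaviour, not Embulk's)."""
    return {job: expand(template, job) for job in jobs}


configs = expand_jobs(TEMPLATE, ["accounts", "payments", "requests"])
for name, config in configs.items():
    print(name, config["in"]["path_prefix"], config["out"]["table"])
```

Each expanded dict corresponds to one of the separate configuration files that are needed today.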

The discussion points are guess, transaction, next config diff, and resume.

As the result of guess, we could output this config file:

jobs:
    accounts:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: surename, type: string}
                    - {name: givenname, type: string}
    payments:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
    requests:
        in:
            parser:
                type: csv
                columns:
                    - {name: request_id, type: string}
                    - {name: account_id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}
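
One way to read the guessed file is that each entry under `jobs` overrides the shared `in` and `out` sections. The Python sketch below shows that interpretation as a recursive merge. The merge rule is an assumption, not something the proposal specifies, and `deep_merge`, `shared`, and `jobs` are hypothetical names. The `${job}` placeholder is left unexpanded here.

```python
import copy


def deep_merge(base, override):
    """Return base with override applied; nested dicts are merged key by key."""
    merged = copy.deepcopy(base)
    for key, value in (override or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Shared sections, as in the config above.
shared = {
    "in": {"type": "file", "path_prefix": "src_${job}", "parser": {"type": "csv"}},
    "out": {"type": "postgresql", "table": "table_${job}"},
}

# Per-job overrides produced by guess; a job with no overrides maps to None.
jobs = {
    "accounts": {
        "in": {
            "parser": {
                "type": "csv",
                "columns": [
                    {"name": "id", "type": "long"},
                    {"name": "surename", "type": "string"},
                    {"name": "givenname", "type": "string"},
                ],
            }
        }
    },
    "payments": None,
}

effective = {name: deep_merge(shared, override) for name, override in jobs.items()}
print(effective["accounts"]["in"]["parser"]["columns"][0])
```

With this rule, a job keeps the shared `type` and `path_prefix` and only gains the guessed `columns`.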

As for transactions, one idea is to commit the jobs one by one only after all jobs have completed. For example, Embulk would run the transactions in this order:

  1. job accounts begins (1 task)
  2. job payments begins (2 tasks)
  3. job requests begins (2 tasks)
  4. job accounts task-1 completes
  5. job payments task-1 completes
  6. job requests task-1 completes
  7. job payments task-2 completes
  8. job requests task-2 completes
  9. job requests commits
  10. job payments commits
  11. job accounts commits

A concern is that requests and payments won't be rolled back if the commit of job accounts fails.
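
The ordering and the rollback concern can be illustrated with a small Python simulation. It is a sketch only: `Job` and `run_all` are hypothetical names, tasks run sequentially rather than interleaved, and nothing here talks to Embulk or PostgreSQL.

```python
class Job:
    """Toy stand-in for one job with its own independent transaction."""

    def __init__(self, name, task_count, fail_on_commit=False):
        self.name = name
        self.task_count = task_count
        self.fail_on_commit = fail_on_commit
        self.completed_tasks = 0
        self.committed = False

    def run_task(self, index):
        self.completed_tasks += 1

    def commit(self):
        if self.fail_on_commit:
            raise RuntimeError(f"commit of job {self.name} failed")
        self.committed = True


def run_all(jobs):
    """Run every task of every job, then commit jobs one by one in reverse order.

    Returns the names committed so far and the name of the job whose commit
    failed (or None if every commit succeeded).
    """
    for job in jobs:
        for index in range(job.task_count):
            job.run_task(index)

    committed = []
    for job in reversed(jobs):
        try:
            job.commit()
        except RuntimeError:
            # Earlier commits are already durable; there is nothing to undo them.
            return committed, job.name
        committed.append(job.name)
    return committed, None


jobs = [Job("accounts", 1, fail_on_commit=True), Job("payments", 2), Job("requests", 2)]
committed, failed = run_all(jobs)
print("committed:", committed, "failed:", failed)
```

Because each job commits independently, a failure in the last commit leaves the earlier ones in place. Making the whole run atomic would need something like a two-phase commit, which the proposal does not include.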

DSL (https://github.com/embulk/embulk/issues/131) does a similar thing. Any thoughts?

Issue Analytics

  • State: open
  • Created 8 years ago
  • Comments: 6 (5 by maintainers)

Top GitHub Comments

dmikurube commented, Nov 24, 2020

We have that kind of multiple-in/out expansion in a (very) long-term plan, but it is still only a plan, and it may not solve your exact issue. It wouldn't take the form described in this issue (#167). I'd also suggest that Digdag would be the most straightforward way.

embulk-output-multi may not work with Embulk v0.10 and later.

hiroyuki-sato commented, Nov 24, 2020

Hello, @Ryo51289

Embulk doesn’t support multiple inputs. You need to use a workflow engine such as Digdag.

The embulk-output-multi plugin may also help.
