Running multiple jobs using one config file

We often have this kind of use case:

Read src_accounts.csv file and write to accounts table on PostgreSQL
Read src_payments.csv file and write to payments table on PostgreSQL
Read src_requests.csv file and write to requests table on PostgreSQL

In this case, we need multiple configuration files and multiple embulk run commands. The idea here is to add a simple multi-transaction feature to Embulk.

For example, the configuration file can be like this:

jobs:
    accounts:
    payments:
    requests:
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}
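
As a rough illustration of how the ${job} placeholder could be expanded, here is a minimal Python sketch. It is not Embulk code: the functions expand and expand_jobs are hypothetical names, and the config is written as a plain dict instead of being parsed from YAML so that the example stays self-contained.

```python
from string import Template


def expand(node, job):
    """Recursively substitute ${job} in every string value of a config tree."""
    if isinstance(node, dict):
        return {key: expand(value, job) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, job) for item in node]
    if isinstance(node, str):
        # safe_substitute leaves any other ${...} placeholder untouched.
        return Template(node).safe_substitute(job=job)
    return node


def expand_jobs(config):
    """Return one standalone {in, out} config per entry under `jobs`."""
    shared = {key: value for key, value in config.items() if key != "jobs"}
    return {job: expand(shared, job) for job in config.get("jobs", {})}


# Same structure as the proposed config file, written as a dict.
config = {
    "jobs": {"accounts": None, "payments": None, "requests": None},
    "in": {"type": "file", "path_prefix": "src_${job}", "parser": {"type": "csv"}},
    "out": {"type": "postgresql", "table": "table_${job}"},
}

per_job = expand_jobs(config)
print(per_job["payments"]["out"]["table"])  # table_payments
```

Each value in per_job has the shape of an ordinary single-job config, so under this assumption the existing run path would not need to know about the jobs section at all.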

The discussion points are guess, transaction, next config diff, and resume.

As the result of guess, we could output this config file:

jobs:
    accounts:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: surename, type: string}
                    - {name: givenname, type: string}
    payments:
        in:
            parser:
                type: csv
                columns:
                    - {name: id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
    requests:
        in:
            parser:
                type: csv
                columns:
                    - {name: request_id, type: string}
                    - {name: account_id, type: long}
                    - {name: timestamp, type: timestamp, format: '%Y-%m-%d %H:%M:%S.%N'}
in:
    type: file
    path_prefix: src_${job}
    parser:
        type: csv
out:
    type: postgresql
    table: table_${job}
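
One way to read the guessed file above is that each entry under jobs overrides the shared in/out sections key by key. The following Python sketch shows that interpretation as a recursive merge. The function deep_merge is a hypothetical helper, not part of Embulk, and the merge rule itself is an assumption, since the issue does not define it.

```python
def deep_merge(base, override):
    """Return a new dict where `override` wins and nested dicts merge key by key.

    Nested values that are not overridden are shared with `base`, not copied.
    """
    merged = dict(base)
    for key, value in (override or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Shared sections from the bottom of the guessed config file.
shared = {
    "in": {"type": "file", "path_prefix": "src_${job}", "parser": {"type": "csv"}},
    "out": {"type": "postgresql", "table": "table_${job}"},
}

# Per-job section that guess produced for the accounts job.
guessed_accounts = {
    "in": {
        "parser": {
            "type": "csv",
            "columns": [
                {"name": "id", "type": "long"},
                {"name": "surename", "type": "string"},
                {"name": "givenname", "type": "string"},
            ],
        }
    }
}

accounts = deep_merge(shared, guessed_accounts)
print(accounts["in"]["type"])                    # file
print(len(accounts["in"]["parser"]["columns"]))  # 3
```

With this rule, a job that has no guessed section (an empty entry under jobs) simply falls back to the shared in and out settings.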

For transactions, one idea is to commit the jobs one by one only after all jobs have completed. For example, Embulk runs the transactions in this order:

job accounts begins (1 task)
job payments begins (2 tasks)
job requests begins (2 tasks)
job accounts task-1 completes
job payments task-1 completes
job requests task-1 completes
job payments task-2 completes
job requests task-2 completes
job requests commits
job payments commits
job accounts commits

A concern is that requests and payments won’t be rolled back if the commit of job accounts fails.
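
The commit order listed above, and the rollback concern, can be sketched in Python as follows. The Job class and run_all function are hypothetical stand-ins written for this illustration and do not reflect Embulk's internal API. The fail_on_commit flag only simulates a failing commit.

```python
class Job:
    """Hypothetical stand-in for one job; not part of Embulk's API."""

    def __init__(self, name, task_count, fail_on_commit=False):
        self.name = name
        self.task_count = task_count
        self.fail_on_commit = fail_on_commit
        self.state = "pending"

    def run_tasks(self):
        # A real job would run task_count tasks here.
        self.state = "tasks_done"

    def commit(self):
        if self.fail_on_commit:
            raise RuntimeError(f"commit of job {self.name} failed")
        self.state = "committed"


def run_all(jobs):
    """Run every job's tasks, then commit jobs in reverse order of begin."""
    for job in jobs:
        job.run_tasks()
    committed = []
    for job in reversed(jobs):
        try:
            job.commit()
        except RuntimeError:
            # Jobs already in `committed` are durable and cannot be undone here.
            return committed, job.name
        committed.append(job.name)
    return committed, None


jobs = [
    Job("accounts", 1, fail_on_commit=True),
    Job("payments", 2),
    Job("requests", 2),
]
committed, failed = run_all(jobs)
print(committed, failed)  # ['requests', 'payments'] accounts
```

In this run, requests and payments are already committed when accounts fails, which is exactly the partial-commit state the concern describes. Avoiding it would need something like a two-phase commit across the output plugins, which this sketch does not attempt.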

DSL (https://github.com/embulk/embulk/issues/131) does a similar thing. Any thoughts?

Issue Analytics

Created 8 years ago
Comments: 6 (5 by maintainers)

Top GitHub Comments

dmikurube commented, Nov 24, 2020

We have that kind of multiple-in/out expansion in a (very) long-term plan, but it is still only a plan, and it may not solve your exact issue. It wouldn’t take the form described in this issue (#167). I’d also suggest Digdag as the most straightforward way.

embulk-output-multi may not work with Embulk v0.10 and later.

hiroyuki-sato commented, Nov 24, 2020

Hello, @Ryo51289

Embulk doesn’t support multiple inputs. You need to use a workflow engine such as Digdag.

The embulk-output-multi plugin may also help.