Feature proposal: meltano-manifest.json

As discussed in office hours on 2022-10-12.

Feature goals/requirements:

  1. Stability. Since this would serve as the base for many interop/integration scenarios, we want to make the format as stable as possible.
  2. Self-contained. There should be no need to reference external files or services when using this format. So, for example, lock file content would be embedded so no external references are needed.
  3. Pre-calculated inheritance, precedence, and override effects. The reader shouldn’t need to know how inheritance works, or understand any other internal Meltano business logic.
  4. No secrets. This artifact essentially needs to be treated as ‘code’ and may be passed around in less-trusted contexts. As such, it should not contain: env vars from local OS or terminal context, env vars from .dotenv, settings values from systemdb or other external settings/secrets providers.
  5. As simple as possible. Bonus points if the new JSON manifest can be validated against existing JSON Schema rulesets, without creating or maintaining net-new data structures.

So, given these guidelines, here’s a possible path forward:

  1. Start with contents and structure of the meltano.yml file itself.
  2. Merge in all data from include_paths declarations.
  3. Super-populate each plugin definition in the global context:
    1. Inject all properties from the lock file, except, of course, those that have been overridden.
    2. Calculate the value of each entry of the plugin’s config.
    3. Optionally, we can store ‘extra’ info about how the evaluation was performed under a meta or vendor key that is explicitly non-stable or at least explicitly free-form.
    4. Environment variable declarations in plugin config should be left unresolved until runtime. If we are confident that the value can be resolved without leaking any sensitive info, the predicted evaluation could optionally be rendered under a meta or vendor key, without losing fidelity of the config’s env var reference.
  4. Repeat the above process for each declared Meltano Environment.
  5. Inject other top-level entities (jobs, env, schedules, etc.) into the environment declarations if applicable, so that each environment definition is standalone.
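
The steps above can be sketched roughly as follows. This is a minimal illustration, not Meltano's implementation: the function names `merge_plugin` and `compile_manifest`, and the flat `{plugin name: definition}` shape used for the project, lock file, and environment data, are assumptions made for brevity. Environment variable references such as `$GITHUB_TOKEN` are copied through as plain strings and never expanded.

```python
import copy


def merge_plugin(lower: dict, higher: dict) -> dict:
    """Overlay `higher` on `lower`; `config` is merged key by key."""
    merged = copy.deepcopy(lower)
    for key, value in higher.items():
        if key == "config" and isinstance(value, dict):
            # Keep lower-precedence config entries that are not overridden.
            merged["config"] = {**merged.get("config", {}), **value}
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def compile_manifest(project: dict, lock_entries: dict) -> dict:
    """Pre-compute plugin definitions globally and per environment.

    Hypothetical input shapes:
      project = {"plugins": {name: {...}},
                 "environments": [{"name": ..., "plugins": {name: {...}}}]}
      lock_entries = {name: {...}}
    """
    # Step 3: lock file content first, project declarations on top.
    base = {
        name: merge_plugin(lock_entries.get(name, {}), entry)
        for name, entry in project.get("plugins", {}).items()
    }
    # Steps 4 and 5: repeat per environment so each one is standalone.
    environments = []
    for env in project.get("environments", []):
        overrides = env.get("plugins", {})
        environments.append({
            "name": env["name"],
            "plugins": {
                name: merge_plugin(entry, overrides.get(name, {}))
                for name, entry in base.items()
            },
        })
    return {"version": 1, "plugins": base, "environments": environments}
```

Because every environment carries its own fully merged plugin definitions, a reader can pick one environment entry and ignore the rest without knowing the precedence rules.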

Variations:

  1. We optionally could give an environment name to the ‘global’ or --no-environment behavior, so that the top-level file is just the environments declaration: version: 1 \n environments: [...] \n <EOF>. This reduces the clutter in the file, and readers would only deserialize the environment definition they need in the given context.
  2. We could optionally create separate manifest files per environment so that the manifest is specific to what is needed exactly in a given context.

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 25 (15 by maintainers)

Top GitHub Comments

aaronsteers commented, Oct 16, 2022 (2 reactions)

@JulesHuisman - From the above, it looks like we’d ideally just capture the whole selected part of the JSON schema per stream. Assuming we just use schema as the property name for the JSON schema definition, and assuming we drop from the schema everything that has been deselected by the user, the spec starts to take shape here pretty organically…

meltano-manifest.json (yaml-converted for readability)

# ...
plugins:
  extractors:
    tap-github:
      pip_url: ...
      # ...
      select:
      # These are the user-defined rules that get applied to the raw catalog.json file.
      - repositories.*
      - "!*.reactions"
      - # ...
      streams:
      # A simplified and pre-resolved version of the upstream schema for each of the extractor's streams,
      # calculated from the most recent cache of the extractor's catalog.json file.
      - name: repositories
        key_properties: [...]
        replication_key: null
        schema: {...} # json schema here
      # ...

We already have another function using the key catalog at this level (used for catalog overrides today), but perhaps a streams key could work instead of hiding under an interim meta/vendor key. The JSON Schema is pretty non-controversial in terms of compatibility with our own Meltano/Singer paradigms and interop with other platforms - especially if we pre-filter it, since that removes the requirement for needing metadata and other Singer-specific catalog details. The primary keys and incremental replication key values would probably be worth including as well.
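
As a rough sketch of what pre-filtering could look like, the snippet below applies select rules to top-level stream properties using shell-style patterns. It is a simplification under stated assumptions: `is_selected` and `prefilter_streams` are hypothetical names, nested properties and Singer metadata are ignored, and the real selection logic in Meltano is more involved.

```python
from __future__ import annotations

from fnmatch import fnmatchcase


def is_selected(stream: str, prop: str, rules: list[str]) -> bool:
    """A property is kept if some rule includes it and no '!' rule excludes it."""
    path = f"{stream}.{prop}"
    included = any(fnmatchcase(path, r) for r in rules if not r.startswith("!"))
    excluded = any(fnmatchcase(path, r[1:]) for r in rules if r.startswith("!"))
    return included and not excluded


def prefilter_streams(streams: list[dict], rules: list[str]) -> list[dict]:
    """Drop deselected properties and any stream left with no properties."""
    result = []
    for stream in streams:
        name = stream["name"]
        properties = {
            prop: schema
            for prop, schema in stream["schema"].get("properties", {}).items()
            if is_selected(name, prop, rules)
        }
        if not properties:
            continue  # the whole stream was deselected
        result.append({
            "name": name,
            "key_properties": stream.get("key_properties", []),
            "replication_key": stream.get("replication_key"),
            "schema": {**stream["schema"], "properties": properties},
        })
    return result
```

With the rules `repositories.*` and `!*.reactions` from the example above, a `repositories` stream would keep every top-level property except `reactions`, and streams that match no inclusion rule would be omitted entirely.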

Still worth calling out that the inclusion of a streams entry presupposes that either we’ve already run discovery on the extractors, or that we have built the future meltano catalog feature, which would store those in a long-term git artifact other than the local .meltano cache files.

Assuming that discovery has run and we already have a catalog cache, this is not a lot more surface area to add, spec wise. And the fact that we would be excluding deselected streams makes me less worried about file sizes overall. Still significant, but not a ‘show stopper’, per se.

That said, including each stream’s JSON Schema very likely leans us towards a separate manifest file per Meltano Environment. That way the file size penalty for the JSON Schema is paid only once per file, instead of being multiplied by the number of Meltano environments. While we don’t generally expect schemas to vary widely across environments, there’s no reason that they have to be identical, and in ‘real-world’ scenarios it would not be uncommon for different environments to have slightly different schema definitions for the same streams.

WillDaSilva commented, Dec 22, 2022 (1 reaction)

@aaronsteers @tayloramurphy In order to have a relatively flat manifest wherein you don’t have to perform computations on the config at runtime, while still making it compatible with the Meltano project file schema, we’ll need to have more than just one manifest file per environment as suggested above.

This is because schedules currently have env blocks, and jobs likely will in the future too. We may also want to support plugin config at these levels in the future too.

To have a fully pre-computed manifest file we have to make each leaf node in the project files (i.e. contexts in which Meltano can actually do work) static. It can’t change based on whether you’re calling from one environment or another, or running under a schedule or not, or running a job or not. To accomplish this for environments we agreed that we’d create a separate manifest file for each environment, but at the time I didn’t realize that we actually need one for each combination of these 3 contexts in which Meltano can be running. If I’m missing any other such execution contexts, we’ll need to incorporate them in the same way.

To handle this, I propose we decide on an arbitrary order for these contexts, such as (environment, schedule, job), then create each manifest file for a particular combination of them in that order. These can determine the file name of each manifest file: meltano-manifest.<environment name>.<schedule name>.<job name>.json. This may be an issue if . is a legal character within any of these names, so we may have to hash each of the names to avoid that problem.
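
A minimal sketch of the hashed naming idea, assuming a truncated SHA-256 digest per name and `_` as a placeholder for an unset context. Both choices, and the helper name `manifest_filename`, are illustrative assumptions rather than anything decided in this thread.

```python
import hashlib
from typing import Optional


def _token(name: Optional[str]) -> str:
    """Map a context name to a dot-free token; None means 'no such context'."""
    if name is None:
        return "_"  # cannot collide with a hex digest
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:12]


def manifest_filename(
    environment: Optional[str],
    schedule: Optional[str],
    job: Optional[str],
) -> str:
    """Build the manifest file name in (environment, schedule, job) order."""
    tokens = [_token(part) for part in (environment, schedule, job)]
    return "meltano-manifest.{}.{}.{}.json".format(*tokens)
```

Hashing keeps names such as `a.b` from being confused with a different split of the same characters, at the cost of file names that are no longer human-readable.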

This is just the default name/location for a manifest file. When one is generated manually via meltano compile, the --output parameter can be used to save the file to a different location, e.g. --output ./manifest.json. Specifying --output should only work if a single manifest file is being generated. If some context (e.g. the schedule) is left unspecified, then an --output-dir must be specified instead, and manifests for every value of the unspecified context will be generated in that directory.

Each context has a global/none option, so you end up with $(e+1)(s+1)(j+1)$ manifest files, where $e$ is the number of environments, $s$ is the number of schedules, and $j$ is the number of jobs.
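
To make the count concrete, here is a small sketch that enumerates the triples, using `None` for the global/none option of each context. The function name and the sample environment, schedule, and job names are made up for illustration.

```python
from itertools import product


def manifest_contexts(environments, schedules, jobs):
    """Return every (environment, schedule, job) triple, None meaning global/none."""
    return list(product([None, *environments], [None, *schedules], [None, *jobs]))


triples = manifest_contexts(["dev", "prod"], ["daily"], ["el", "elt", "dbt"])
print(len(triples))  # (2+1) * (1+1) * (3+1) = 24
```

Even this small project yields 24 manifests, which is why compiling them only on demand matters.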

If we compile them only as-needed and selectively, that’s fine.

  • As-needed: when some part of meltano.core wants config for whatever it’s doing now, it requests a manifest for the current environment-schedule-job triple. If one already exists for the current hash of the project files (maybe that hash can be stored within annotations?), it’s simply read. Otherwise a new manifest is generated and saved to disk.
  • Selectively: when you’re operating outside of a Meltano process and need data about a project, you run meltano compile --environment <environment name> --schedule <schedule name> --job <job name> instead of meltano compile, and thereby generate only the manifest for the desired triple.
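
The as-needed path could look something like the sketch below. It assumes the project hash is stored under an `annotations.project_hash` key inside the manifest, which is only floated as a possibility above, and `project_hash`, `load_or_compile`, and the `compile_fn` callback are hypothetical names.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Iterable


def project_hash(files: Iterable[Path]) -> str:
    """Hash the names and contents of the project files in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def load_or_compile(
    manifest_path: Path,
    files: Iterable[Path],
    compile_fn: Callable[[], dict],
) -> dict:
    """Reuse the manifest on disk if it matches the current project hash."""
    current = project_hash(files)
    if manifest_path.exists():
        cached = json.loads(manifest_path.read_text())
        if cached.get("annotations", {}).get("project_hash") == current:
            return cached  # project files unchanged: simply read it
    # Stale or missing: generate, stamp with the hash, and save to disk.
    manifest = compile_fn()
    manifest.setdefault("annotations", {})["project_hash"] = current
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest
```

Any edit to a project file changes the hash, so the next request for that triple regenerates the manifest instead of reading a stale one.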

If it really is the case that all combinations are needed… oof. Hopefully we can generate them quickly, and the project files don’t have too many environments/schedules/jobs.
