
Change of behavior of IncludeFiles on AWS Batch between versions 2.0.3 and 2.2.2


Hi,

I am trying to upgrade to Metaflow’s latest version, but the code that was running fine under 2.0.3 is now breaking.

Here’s a minimal reproducible example of my problem.

I have these files in a directory.

$ tree
.
├── pipeline.py
├── query_one.sql
├── query_two.sql
└── sql_list.json

sql_list.json contains a list of SQL files that I want to load dynamically into my flow. Usually, this is a very long list.

# sql_list.json
{
    "queries": [
        {"name": "query_one", "full_path": "query_one.sql"},
        {"name": "query_two", "full_path": "query_two.sql"}
    ]
}

I am running python3 pipeline.py run --with batch:

# pipeline.py
import json
from pathlib import Path
from metaflow import FlowSpec, step, IncludeFile


class Flow(FlowSpec):
    @step
    def start(self):
        print(self.query_one)
        print(self.query_two)
        self.next(self.end)

    @step
    def end(self):
        ...


if __name__ == "__main__":

    def include_files(flow):
        file_list_path = "sql_list.json"

        if Path(file_list_path).exists():
            with open(file_list_path) as f:
                content = json.load(f)

            for query in content['queries']:
                name = query["name"]
                path = query["full_path"]
                setattr(flow, name, IncludeFile(name, default=path, help=""))

        return flow

    Flow = include_files(Flow)
    Flow()

2.0.3

With version 2.0.3, I would get the desired output. That is, Metaflow would print the content of query_one.sql and query_two.sql.

2020-08-25 14:58:10.738 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Task is starting (status SUBMITTED)...
2020-08-25 14:58:11.831 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Task is starting (status RUNNABLE)...
2020-08-25 14:58:12.973 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Task is starting (status STARTING)...
2020-08-25 14:58:14.101 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Task is starting (status RUNNING)...
2020-08-25 14:58:20.147 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Setting up task environment.
2020-08-25 14:58:28.542 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Downloading code package.
2020-08-25 14:58:28.543 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Code package downloaded.
2020-08-25 14:58:29.761 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] Task is starting.
2020-08-25 14:58:29.761 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] SELECT 1
2020-08-25 14:58:29.762 [1598360287065934/start/1 (pid 14046)] [1e05c59b-6007-42b9-906f-039a65c4f6b0] SELECT 2

2.2.2

With version 2.2.2, instead of the contents of the files listed in sql_list.json, I get some kind of reference to where they are stored in S3.

2020-08-25 14:59:29.440 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Task is starting (status SUBMITTED)...
2020-08-25 14:59:31.657 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Task is starting (status STARTING)...
2020-08-25 14:59:35.102 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Task is starting (status RUNNING)...
2020-08-25 14:59:42.265 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Setting up task environment.
2020-08-25 14:59:49.440 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Downloading code package.
2020-08-25 14:59:49.440 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Code package downloaded.
2020-08-25 14:59:54.380 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] Task is starting.
2020-08-25 14:59:54.380 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] {"type": "uploader-v1", "url": "s3://i-penp-b-s3-eu-west-1/data/Flow/b381cee440378c2591cd8955a04e24b7b21642b2", "is_text": true, "encoding": null}
2020-08-25 14:59:54.380 [1598360366549490/start/1 (pid 19014)] [4d6263d9-81a7-417c-a618-d990edbed596] {"type": "uploader-v1", "url": "s3://i-penp-b-s3-eu-west-1/data/Flow/12c85e48727ade1837c6356bac6f0a71d6d3a7b3", "is_text": true, "encoding": null}

Is there a way I can get this code to work again?
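
For anyone hitting the same symptom, here is a minimal diagnostic sketch for telling a file's content apart from the kind of reference shown in the 2.2.2 log. It assumes the attribute arrives as a JSON string, which the double quotes and lowercase `true`/`null` in the log suggest, and that the `"type": "uploader-v1"` field is a reliable marker. Both are inferences from the output above, not documented Metaflow behavior, and `uploader_reference` is a hypothetical helper name.

```python
import json


def uploader_reference(value):
    """Return the URL if `value` looks like an uploader reference, else None.

    Assumption: the reference is a JSON string shaped like the 2.2.2 log
    lines above, e.g. {"type": "uploader-v1", "url": "s3://...", ...}.
    """
    if not isinstance(value, str):
        return None
    try:
        parsed = json.loads(value)
    except ValueError:
        # Plain file content such as "SELECT 1" is not valid JSON.
        return None
    if isinstance(parsed, dict) and parsed.get("type") == "uploader-v1":
        return parsed.get("url")
    return None
```

Calling `uploader_reference(self.query_one)` inside the `start` step would then return `None` under the 2.0.3 behavior and the S3 URL under the 2.2.2 behavior, if the assumptions above hold.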

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
romain-intel commented, Aug 25, 2020

Let me look into this. Yes, we changed the behavior of IncludeFile to be able to support step functions specifically. It should not have broken things like this though so I will take a look.

0 reactions
denismaciel commented, Sep 2, 2020

Cool, @romain-intel. I actually went ahead and included the SQL files with --package-suffixes='.json,.sql', so I could get rid of the IncludeFile. It's great, though, to now have the environment decorator in my toolbelt. Thanks!
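
A rough sketch of what that workaround could look like on the reading side: once the .json and .sql files travel in the code package, a step can read them from disk instead of going through IncludeFile. The helper name `load_queries` is hypothetical, and the sketch assumes the packaged files end up next to pipeline.py in the task environment, which the thread does not confirm.

```python
import json
from pathlib import Path


def load_queries(base_dir, list_name="sql_list.json"):
    """Return {query name: SQL text} for every entry in the JSON list.

    Assumption: `base_dir` holds both the list file and the SQL files,
    with "full_path" entries resolved relative to it.
    """
    base = Path(base_dir)
    listing = json.loads((base / list_name).read_text())
    return {
        query["name"]: (base / query["full_path"]).read_text()
        for query in listing["queries"]
    }
```

Inside a step this might be used as `queries = load_queries(Path(__file__).parent)`, followed by `print(queries["query_one"])`.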
