GetItem Results should not directly inherit their parent's Result

Description

Below is a heavily simplified version of our pipeline, showing where it fails:

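from prefect import Flow, task
import Table  # placeholder for our own Table class

@task
def extract_tables():
	# Build one Table per name and return them all in a single dict
	processed_tables = dict()

	tables = ["companies", "people"]

	for table in tables:
		processed_tables[table] = Table(table)

	return processed_tables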





@task
def clean_companies_table():
	...

@task
def clean_people_table():
	...



with Flow("Dummy") as flow:

	all_tables = extract_tables()

	people_table = all_tables["people"]
	companies_table = all_tables["companies"]

	cleaned_companies_table = clean_companies_table(companies_table)
	...

Imagine 12 tables instead of two. This worked fine in version 0.12.6, but since upgrading to 0.13.7 we have been having issues. I confirmed that the error does not occur in 0.12.6 by reverting to it with everything else unchanged.

The flow fails with Unexpected error: TypeError("'Table' object is not subscriptable"). The failure occurs in the GetItem tasks that are generated by lines like people_table = all_tables["people"]. Initially it failed only on a second run, when the tasks were cached. Now it fails even on the first run.

With 12 tables (and their GetItem tasks), however, the error hits only a seemingly random subset of the GetItem tasks, while the others succeed. The dict is generated in a loop and all tables are handled by the same code.
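One way to picture the failure, going by the issue title and the maintainer's later remark about overwrites, is that an indexed task writes its checkpoint to the same location as its parent. The following is a toy model in plain Python, not Prefect code: `store`, `run_extract`, `run_getitem`, and the location string are all hypothetical names, and the shared location is an assumption made for illustration.

```python
# Toy model, not Prefect code: every name here is hypothetical.
class Table:
    def __init__(self, name):
        self.name = name


store = {}  # stands in for checkpoint storage, keyed by result location


def run_extract(location):
    value = {"companies": Table("companies"), "people": Table("people")}
    store[location] = value  # the parent checkpoints its dict
    return value


def run_getitem(upstream_location, key, own_location):
    upstream = store[upstream_location]  # read the parent's checkpointed value
    value = upstream[key]
    store[own_location] = value  # the child checkpoints its own value
    return value


# Assumption: the child inherits the parent's location, so both write to one key.
shared = "extract_tables.result"
run_extract(shared)
people = run_getitem(shared, "people", own_location=shared)  # overwrites the dict

try:
    # The stored value is now a Table, so indexing it raises TypeError.
    run_getitem(shared, "companies", own_location=shared)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

In this model the first index succeeds and every later one fails, so which GetItem tasks break would depend on the order in which they run. Giving each child its own location avoids the overwrite.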

Reproduction

I haven’t built a runnable reproduction yet; there is only the abstract example above.

Environment

{
  "config_overrides": {
    "server": {
      "telemetry": {
        "enabled": true
      }
    }
  },
  "env_vars": [
    "PREFECT__CONTEXT__SECRETS__CRUNCHBASE_API_KEY",
    "PREFECT__LOGGING__LEVEL",
    "PREFECT__CLOUD__AUTH_TOKEN",
    "PREFECT__LOGGING__EXTRA_LOGGERS"
  ],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_backend": "cloud",
    "prefect_version": "0.13.7",
    "python_version": "3.8.3"
  }
}

Issue Analytics

Created 3 years ago
Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions

cicdw commented, Oct 14, 2020

Update: I have a path forward that effectively “collapses” the GetItem task out of the graph and runs the __getitem__ operation within the downstream task runner (this avoids the checkpointing of the indexed result altogether, which is both inefficient and causing the overwrites you both are dealing with).
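As a rough illustration of the "collapse" idea described above, the indexing step can be applied by whatever runs the downstream function, so the indexed value never exists as a separate, checkpointed task result. This is a plain-Python sketch with hypothetical names (`run_downstream`, `clean`, the sample data); it does not reflect how Prefect's task runner is actually implemented.

```python
# Toy sketch, not Prefect internals: all names are hypothetical.
def run_downstream(fn, upstream_value, key=None):
    """Call fn on upstream_value, indexing it first when a key is attached."""
    if key is not None:
        # __getitem__ runs here, inside the downstream step; nothing is checkpointed
        upstream_value = upstream_value[key]
    return fn(upstream_value)


def clean(rows):
    # Drop missing rows
    return [row for row in rows if row is not None]


tables = {"companies": ["acme", None, "globex"], "people": ["ann", "bob"]}

cleaned = run_downstream(clean, tables, key="companies")
print(cleaned)
```

Because only the parent dict and the cleaned output are ever materialized, there is no intermediate result that could overwrite the parent's.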

The code itself is a little messy right now though, so I’d like to take some time to clean it up, test it better, etc. – we’ll plan to get it into next week’s release (0.13.12 around Oct 20th).

2 reactions

cicdw commented, Oct 9, 2020

"Why not change the implementation to do what a user (not knowing the implementation) would expect rather than requiring more docs and adding complexity to the API?"

Knowing what the user expects is not nearly as trivial as this question implies - this feature was implemented as-is by user request haha; without passing on some amount of results configuration to the autogenerated tasks, there are different edge cases involving restarting from failure that won’t work properly.

@benfuja ah that’s really interesting and makes sense! You don’t need to open a new issue, I think this one suffices for refactoring how results are configured for autogenerated tasks (and I’ll update the title).

In the meantime, there is something you both could do to hack around this (it’s ugly, but it would unblock you I think) - after you perform an index into a task, set the new task’s checkpoint attribute to False:

with Flow("DataFrame to Series") as test_pipeline:
    my_data = test_dataframe(task_args=dict(result=LOCAL_DATAFRAME_RESULTS))
    col_a = my_data["a"]
    col_a.checkpoint = False



with Flow("Dummy") as flow:

	all_tables = extract_tables()

	people_table = all_tables["people"]
	companies_table = all_tables["companies"]
	# Turn off checkpointing on the autogenerated GetItem tasks
	people_table.checkpoint = False
	companies_table.checkpoint = False

	cleaned_companies_table = clean_companies_table(companies_table)
	...

I have some ideas for how to fix this more broadly, but will need to play around with it a bit. I’ll update here when I have a working PoC!
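With a dozen tables, repeating the two-line workaround for each one gets tedious. A small helper could apply it in a loop. This is only a sketch: `index_without_checkpoint` is a hypothetical helper, not part of Prefect, and `FakeTask` is a stand-in so the example runs without Prefect installed. Inside a real `with Flow(...)` block the parent would be the result of `extract_tables()`.

```python
def index_without_checkpoint(parent, keys):
    """Index `parent` once per key and set checkpoint = False on each result."""
    indexed = {}
    for key in keys:
        item = parent[key]       # inside a Flow context this creates a GetItem task
        item.checkpoint = False  # the attribute the workaround above sets
        indexed[key] = item
    return indexed


# Stand-in for a task object, used only to make the sketch runnable.
class FakeTask:
    checkpoint = True

    def __getitem__(self, key):
        return FakeTask()


tables = index_without_checkpoint(FakeTask(), ["people", "companies"])
print({name: t.checkpoint for name, t in tables.items()})
```

Downstream tasks would then take `tables["companies"]`, `tables["people"]`, and so on as their inputs.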
