GetItem Results should not directly inherit their parent's Result
Description
Below is a massively simplified version of our pipeline, showing where it fails:
from prefect import Flow, task

import Table  # placeholder import for our own Table class

@task
def extract_tables():
    # Build one Table per source table name
    processed_tables = dict()
    tables = ["companies", "people"]
    for table in tables:
        processed_tables[table] = Table(table)
    return processed_tables

@task
def clean_companies_table(companies_table):
    ...

@task
def clean_people_table(people_table):
    ...

with Flow("Dummy") as flow:
    all_tables = extract_tables()
    # Each index below generates a GetItem task
    people_table = all_tables["people"]
    companies_table = all_tables["companies"]
    cleaned_companies_table = clean_companies_table(companies_table)
    ...
Imagine 12 tables instead. This worked fine in version 0.12.6, but since upgrading to 0.13.7 we have been having issues. I confirmed that the error does not occur in 0.12.6 by reverting to it with everything else unchanged.
The flow will fail with Unexpected error: TypeError("'Table' object is not subscriptable"). This failure occurs at the GetItem steps that are generated by lines like people_table = all_tables["people"]. Initially this failed only on a second run, when the tasks were cached. Now it has started failing even on the first run.
But with 12 tables (and their GetItem steps) this only happens at a random number of the GetItem steps, while the others succeed. The dict is generated via a loop and all tables are treated with the same code.
Reproduction
I haven’t made a reproducible example yet; there is only the abstract one above.
Environment
{
  "config_overrides": {
    "server": {
      "telemetry": {
        "enabled": true
      }
    }
  },
  "env_vars": [
    "PREFECT__CONTEXT__SECRETS__CRUNCHBASE_API_KEY",
    "PREFECT__LOGGING__LEVEL",
    "PREFECT__CLOUD__AUTH_TOKEN",
    "PREFECT__LOGGING__EXTRA_LOGGERS"
  ],
  "system_information": {
    "platform": "macOS-10.15.6-x86_64-i386-64bit",
    "prefect_backend": "cloud",
    "prefect_version": "0.13.7",
    "python_version": "3.8.3"
  }
}
Issue Analytics
- State:
- Created 3 years ago
- Comments: 11 (3 by maintainers)
Update: I have a path forward that effectively “collapses” the GetItem task out of the graph and runs the __getitem__ operation within the downstream task runner (this avoids the checkpointing of the indexed result altogether, which is both inefficient and causing the overwrites you both are dealing with). The code itself is a little messy right now though, so I’d like to take some time to clean it up, test it better, etc. We’ll plan to get it into next week’s release (0.13.12 around Oct 20th).
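To make the idea concrete, here is a toy model in plain Python. It is not Prefect code and not the maintainers' implementation: the `store` dict, `run_checkpointed`, and `run_downstream` are hypothetical stand-ins, and the assumption that the indexing step writes to its parent's checkpoint location is one reading of the issue title and the "overwrites" mentioned above.

```python
# Toy model only: a dict stands in for checkpoint storage (assumption, not Prefect's API).
store = {}


class Table:
    def __init__(self, name):
        self.name = name


def extract_tables():
    return {name: Table(name) for name in ("companies", "people")}


def run_checkpointed(location, fn):
    """Return the value stored at `location` if there is one, else compute and store it."""
    if location in store:
        return store[location]
    value = fn()
    store[location] = value
    return value


def demo_collision():
    """Indexing step reuses the parent's location, so a Table overwrites the dict."""
    store.clear()
    tables = run_checkpointed("extract", extract_tables)
    # The indexing step checkpoints its output under the parent's location.
    store["extract"] = tables["people"]
    # On the next run the parent loads a Table instead of the dict.
    cached = run_checkpointed("extract", extract_tables)
    try:
        cached["people"]
    except TypeError as exc:
        return str(exc)
    return None


def run_downstream(fn, upstream_value, key):
    """Apply the index inside the downstream step; the indexed value is never stored."""
    return fn(upstream_value[key])


def demo_collapsed():
    """With the index collapsed into the downstream step, only the dict is checkpointed."""
    store.clear()
    run_checkpointed("extract", extract_tables)
    cached = run_checkpointed("extract", extract_tables)
    return run_downstream(lambda table: f"cleaned {table.name}", cached, "people")
```

In this model `demo_collision()` ends in the same kind of TypeError reported in the issue, while `demo_collapsed()` only ever stores the dict, so there is nothing for the indexed value to overwrite.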
Knowing what the user expects is not nearly as trivial as this question implies - this feature was implemented as-is by user request haha; without passing on some amount of results configuration to the autogenerated tasks, there are different edge cases involving restarting from failure that won’t work properly.
@benfuja ah that’s really interesting and makes sense! You don’t need to open a new issue, I think this one suffices for refactoring how results are configured for autogenerated tasks (and I’ll update the title).
In the meantime, there is something you both could do to hack around this (it’s ugly, but it would unblock you I think) - after you perform an index into a task, set the new task’s checkpoint attribute to False:
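A minimal sketch of that workaround, assuming Prefect 0.13.x, where indexing a task inside a Flow context returns the autogenerated GetItem task and tasks expose a `checkpoint` attribute. The helper name `index_without_checkpoint` is hypothetical and only wraps the two steps described above; the flow usage is shown in comments because it depends on the tasks from the issue.

```python
def index_without_checkpoint(task, key):
    """Index into `task` and turn off checkpointing on the task that indexing creates.

    Hypothetical helper: meant to be called inside a `with Flow(...)` block,
    where `task[key]` yields the autogenerated GetItem task.
    """
    item = task[key]
    item.checkpoint = False
    return item


# Intended use with the flow from the issue (requires Prefect 0.13.x):
#
# with Flow("Dummy") as flow:
#     all_tables = extract_tables()
#     people_table = index_without_checkpoint(all_tables, "people")
#     companies_table = index_without_checkpoint(all_tables, "companies")
#     cleaned_companies_table = clean_companies_table(companies_table)
```

Setting the attribute directly after each index, without a helper, is equivalent; the helper only avoids repeating the assignment for every table.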
I have some ideas for how to fix this more broadly, but will need to play around with it a bit. I’ll update here when I have a working PoC!