Prefect merge discussion

This issue is to discuss how we want to merge the work in the Prefect branch (#901).

Questions

1. What are our top infrastructure issues? Does the Prefect branch get us closer to or further from addressing those issues? Some of these issues are being enumerated in #1401.

Here are a couple of our issues and how the prefect branch addresses them:

  • Cache steps of the pipeline:
    • Currently, extracting Excel files seems like an unnecessary bottleneck. The majority of the EIA processing time is spent in pd.read_excel(). Because the raw Excel files do not change, caching this extraction step and any other long-running processing tasks would dramatically speed up development and testing.
    • Prefect: The prefect branch lets you set checkpoints to determine which tasks are cached. For example, to cache only the extraction step, you would set @task(checkpoint=False) on the transformation tasks (all downstream tasks will then also not be cached). This is a bit awkward right now, but it’s a step in the right direction.
  • Data processing uniformity
    • We don’t have a common recipe for adding and processing datasets. EIA and FERC 1 are extracted in different ways, and it is unclear where certain data cleaning should happen: in the transform methods, the output tables, or the analysis tables? How do we add new datasets?
    • Prefect: The branch introduces the abstract class DatasetPipeline. The class requires you to implement the dataset settings and a build() method that adds processing tasks to the main flow. I think this is a decent start and could be expanded. Maybe instead of having a catch-all build method, a pipeline should implement extract, transform and load methods? The questions regarding how to extract data and where to apply transformations are not addressed by the branch.
  • Ability to handle large datasets (EQR)
    • EPACEMS takes about 1:20 mins to run on most of our machines. This is decent for our purposes but won’t work for larger datasets like FERC EQR.
    • Prefect: Prefect allows us to easily parallelize data processing. This branch provides a small boost in CEMS performance (~20 min).
  • Access to interim tables
    • There is some desire to access partially processed data tables for analyses. (Which tables?)
    • Prefect: The branch wasn’t designed to handle writing interim tables to an accessible/reusable location.
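
To make the caching idea above concrete, here is a minimal sketch that uses only the Python standard library. It is not the Prefect checkpoint mechanism and not code from the branch: cached_extract, slow_extract and the file names are hypothetical, and slow_extract stands in for an expensive call like pd.read_excel(). The cache is keyed on a hash of the raw file's contents, so an unchanged raw file is only extracted once.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def cached_extract(
    extract: Callable[[Path], Any], cache_dir: Path
) -> Callable[[Path], Any]:
    """Wrap an extraction function so results are reused while the raw file is unchanged."""
    cache_dir.mkdir(parents=True, exist_ok=True)

    def wrapper(raw_path: Path) -> Any:
        # Key the cache on the file contents, so an edited raw file is re-extracted.
        digest = hashlib.sha256(raw_path.read_bytes()).hexdigest()
        cache_file = cache_dir / f"{digest}.pkl"
        if cache_file.exists():
            return pickle.loads(cache_file.read_bytes())
        result = extract(raw_path)
        cache_file.write_bytes(pickle.dumps(result))
        return result

    return wrapper


calls = []  # records every real (uncached) extraction


def slow_extract(raw_path: Path) -> list:
    # Hypothetical stand-in for an expensive call such as pd.read_excel(raw_path).
    calls.append(raw_path.name)
    return raw_path.read_text().splitlines()


workdir = Path(tempfile.mkdtemp())
raw = workdir / "raw.txt"
raw.write_text("plant_id,capacity\n1,100\n")

extract = cached_extract(slow_extract, workdir / "cache")
first = extract(raw)   # runs slow_extract and writes the cache file
second = extract(raw)  # served from the on-disk cache
```

The same keying idea would apply to any long-running task whose inputs are files that rarely change; whether it belongs in our own code or in the orchestration tool is part of what this issue is meant to decide.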

2. If we are to merge the prefect branch in, what would be the next steps?

  • Create tasks for each table instead of each dataset.
  • Incorporate validation into the ETL run (not Prefect related?).
  • Orchestrate the output and analysis tables with prefect. We should probably clean up the post ETL pipeline before adding prefect.
  • Continue to make our processing more uniform.
  • Try running it in the cloud on a single machine and cluster.
  • Set up a Prefect server.
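
As one possible shape for the more uniform processing discussed above, here is a rough sketch of a pipeline base class with separate extract, transform and load methods instead of a catch-all build(). This is an assumption about where the design could go, not the DatasetPipeline class that exists in the branch; ToyPlantsPipeline, the Table alias and the run() method are all hypothetical, and plain lists of dicts stand in for DataFrames.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Table = List[Dict[str, Any]]  # stand-in for a DataFrame in this sketch


class DatasetPipeline(ABC):
    """Hypothetical base class: every dataset follows the same three-step recipe."""

    def __init__(self, settings: Dict[str, Any]):
        self.settings = settings

    @abstractmethod
    def extract(self) -> Dict[str, Table]:
        """Read raw inputs into named tables."""

    @abstractmethod
    def transform(self, raw: Dict[str, Table]) -> Dict[str, Table]:
        """Clean the raw tables; all data cleaning lives here."""

    @abstractmethod
    def load(self, clean: Dict[str, Table]) -> None:
        """Persist the cleaned tables."""

    def run(self) -> None:
        # The fixed order is the point: subclasses fill in steps, not control flow.
        self.load(self.transform(self.extract()))


class ToyPlantsPipeline(DatasetPipeline):
    """Toy dataset used only to exercise the base class."""

    def __init__(self, settings: Dict[str, Any]):
        super().__init__(settings)
        self.loaded: Dict[str, Table] = {}

    def extract(self) -> Dict[str, Table]:
        rows = [
            {"year": 2019, "capacity_mw": "50"},
            {"year": 2020, "capacity_mw": "100"},
        ]
        wanted = self.settings["years"]
        return {"plants": [r for r in rows if r["year"] in wanted]}

    def transform(self, raw: Dict[str, Table]) -> Dict[str, Table]:
        # Convert the capacity column from strings to floats.
        return {
            name: [{**r, "capacity_mw": float(r["capacity_mw"])} for r in rows]
            for name, rows in raw.items()
        }

    def load(self, clean: Dict[str, Table]) -> None:
        # Keep results in memory; a real pipeline would write to a database or files.
        self.loaded = clean


pipeline = ToyPlantsPipeline({"years": [2020]})
pipeline.run()
```

Each of the three methods could later be wrapped as a task per table, which ties in with the first bullet above.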

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
zaneselvans commented, Jan 26, 2022

A good video overview of Dagster.

After spending an hour with the tutorial and video walkthrough it seems really similar to Prefect, but with stronger opinions and more defined structures. But those opinions and structures seem to be pretty directly aimed at our patterns of use.

Another thing I liked was how easy they made it seem to swap out local disk, database, or object store persistence, so that testing, CI, and cloud versions of the same run could all be done by swapping out only the persistence layer / objects.

I like the focus on typing, enumerating, validating, and tracking the assets that are produced as much as or even more than the DAG itself. In Prefect, assets seem more like necessary glue than a focus.

The tracking of how assets evolve over time is also attractive – being able to see how the number of rows in a table has changed, and the schema of the table would be really useful.

What I really want is an opinionated tool for our use case, with good opinions. Left to our own devices we’ll probably come up with bad opinions, since we aren’t experts.

Another thing I liked was the ability to swap out different persistence layers so that local testing / CI / deployment can use the same data processing code, with the saving of assets decoupled from the rest of the system.
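
A rough illustration of what decoupling the persistence layer could look like, in plain Python. This is not Dagster's API; AssetStore, MemoryStore, JsonDiskStore and clean_plants are hypothetical names for the sketch. The processing function only talks to a small save/load interface, so tests can use an in-memory store while a local run writes JSON files.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Protocol


class AssetStore(Protocol):
    """The only interface the processing code depends on."""

    def save(self, name: str, rows: List[dict]) -> None: ...

    def load(self, name: str) -> List[dict]: ...


class MemoryStore:
    """Keeps assets in a dict; convenient for unit tests and CI."""

    def __init__(self) -> None:
        self._assets: Dict[str, List[dict]] = {}

    def save(self, name: str, rows: List[dict]) -> None:
        self._assets[name] = list(rows)

    def load(self, name: str) -> List[dict]:
        return self._assets[name]


class JsonDiskStore:
    """Writes each asset to <root>/<name>.json; stands in for disk or object storage."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, name: str, rows: List[dict]) -> None:
        (self.root / f"{name}.json").write_text(json.dumps(rows))

    def load(self, name: str) -> List[dict]:
        return json.loads((self.root / f"{name}.json").read_text())


def clean_plants(store: AssetStore) -> None:
    # Processing code: identical no matter where the assets are stored.
    raw = store.load("raw_plants")
    store.save("clean_plants", [r for r in raw if r["capacity_mw"] > 0])


raw_rows = [
    {"plant_id": 1, "capacity_mw": 100},
    {"plant_id": 2, "capacity_mw": 0},
]

memory = MemoryStore()
disk = JsonDiskStore(Path(tempfile.mkdtemp()))
for store in (memory, disk):
    store.save("raw_plants", raw_rows)
    clean_plants(store)
```

Both stores end up holding the same clean_plants asset, which is the property that would let local testing, CI and deployment share one set of processing code.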

0 reactions
bendnorman commented, Feb 24, 2022

We have decided to move forward with Dagster. See epic #1487 for the full reasoning.
