Update pipeline and components to return Woodwork data structures

#1393 updated pipelines to accept Woodwork data structures, and #1288 addresses updating pipelines and components to accept Woodwork data structures as input. However, methods like transform and predict still return pandas DataFrames, which is inconsistent. This issue tracks updating our methods to return Woodwork data structures.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
chukarsten commented, Jan 14, 2021

The third option seems like the best and cleanest one. Hopefully performance isn’t impacted, but conceptually it seems sound. Thanks for bringing it to my attention. I’m still trying to wrap my head around all of the things.

1 reaction
dsherry commented, Jan 14, 2021

@angela97lin and I checked in and discussed a few implementation options:

  1. Have component graph evaluation pass pandas to each component. To indicate woodwork (ww) types to components, either add new fields to fit etc., or stick with the text featurizer pattern of using init parameters to indicate relevant columns. Disadvantage: this is ugly from an API perspective, and avoiding it is why we created woodwork in the first place.
  2. During component graph evaluation, pass woodwork to each component. Have each component return pandas. Disadvantage: components cannot change the woodwork type of the input features or of newly generated features, except by changing the pandas dtype. However, we don’t have any components which rely on this.
  3. During component graph evaluation, pass woodwork to each component. Have each component return woodwork. Challenge: most components must convert to pandas internally to do transformations like adding, deleting or modifying features. After those transformations, we have to make sure the original woodwork types get into the newly returned woodwork datatable; otherwise user-overridden settings will get lost, as they are today.
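The challenge in option 3 can be illustrated with a minimal sketch. It does not use the woodwork or pandas APIs: `TypedTable` is a hypothetical stand-in for a woodwork datatable, `Frame` (a dict of column lists) is a hypothetical stand-in for a pandas DataFrame, and `infer_type`, `transform_preserving_types` and `add_length_feature` are names made up for illustration. The idea is only to show unwrapping, transforming, and then re-wrapping while carrying the original types forward.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a pandas DataFrame: column name -> list of values.
Frame = Dict[str, List[object]]


def infer_type(values: List[object]) -> str:
    """Toy inference: numeric columns are 'Double', everything else 'Categorical'."""
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "Double"
    return "Categorical"


@dataclass
class TypedTable:
    """Hypothetical stand-in for a woodwork datatable: data plus per-column types."""

    data: Frame
    logical_types: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Infer a type for every column the caller did not type explicitly.
        for name, values in self.data.items():
            self.logical_types.setdefault(name, infer_type(values))


def transform_preserving_types(
    table: TypedTable, fn: Callable[[Frame], Frame]
) -> TypedTable:
    """Unwrap to the plain frame, transform it, then re-wrap with the original types."""
    new_data = fn(dict(table.data))
    # Carry types over only for columns that survived the transformation;
    # newly created columns fall back to inference in __post_init__.
    kept = {c: t for c, t in table.logical_types.items() if c in new_data}
    return TypedTable(new_data, kept)


def add_length_feature(frame: Frame) -> Frame:
    """Example component logic: drop 'id' and add a feature derived from 'text'."""
    out = {c: v for c, v in frame.items() if c != "id"}
    out["text_len"] = [len(str(v)) for v in frame["text"]]
    return out


table = TypedTable(
    {"id": [1, 2], "zip": [2139, 94105], "text": ["a b", "hello"]},
    logical_types={"zip": "Categorical", "text": "NaturalLanguage"},  # user overrides
)
result = transform_preserving_types(table, add_length_feature)
print(result.logical_types)
# {'zip': 'Categorical', 'text': 'NaturalLanguage', 'text_len': 'Double'}
```

Re-wrapping the transformed data without passing the original types, as in `TypedTable(add_length_feature(dict(table.data)))`, would re-infer `zip` as `Double` and `text` as `Categorical`, which is the loss of user-overridden settings described above.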

Status: @angela97lin is currently pursuing option 3 in #1668

Plan: we’ll continue with that strategy, watching for increased runtime caused by repeated ww datatable instantiations. We’ll also consider whether any feature requests to woodwork would make this easier, and keep an eye out for compelling options we may have missed so far.

@chukarsten @gsheni
