Deep refactoring
Overview
@pwalsh wrote:
goodtables is little more than tabulator + jsontableschema + data-quality-spec + an output report.
Of course there will be issues when you port, but the bottom line is that we’ll probably remove hundreds to thousands of lines of code and some inconsistent internal APIs, and you’ll likely also learn edge cases of tabular data handling that can feed back into tabulator.
The important thing is: goodtables is the basis for a product, and right now it is more important to keep shipping goodtables than to get the other libs to v1, etc.
Getting them to v1 can be a by-product of goodtables, not the other way around.
There is lots of interest in goodtables, and before we start the test pilots around GT from the beginning of October onwards, it would be better to be on a stabler base; GT needs serious refactoring by now, given all we have learned.
---
`goodtables.datatable`: replace completely with `tabulator`, but the datatable code may contain some useful edge-case handling, so check it first.
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/datatable
- Georgiana’s branch also surfaces more error-handling edge cases here: https://github.com/georgiana-b/goodtables/commit/7e9da1e480938b9c197e91e5d9c79ccbc5c12fa8
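For reference, the kind of streaming row interface tabulator exposes can be sketched with the stdlib alone. Everything below (the function name, the blank-row handling) is illustrative and not tabulator’s actual API; it just shows the shape of the replacement for datatable:

```python
import csv
import io

def stream_rows(text):
    """Yield (row_number, headers, row) for each non-blank row,
    mirroring the streaming interface a tabulator-style reader exposes."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    for number, row in enumerate(reader, start=1):
        # Blank-row handling is one of the edge cases datatable covers today.
        if not any(cell.strip() for cell in row):
            continue
        yield number, headers, row

sample = "id,name\n1,english\n\n2,german\n"
rows = list(stream_rows(sample))
```

Note the row numbers track the physical source rows, so skipped blank rows still advance the count, which matters for error reporting.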
---
`goodtables.cli`: should be easy, and I assume you will have good ideas
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/cli
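A rebuilt CLI could be sketched with stdlib `argparse`; the flag names below are hypothetical and not the actual goodtables CLI options, they only illustrate the surface area such a command would need:

```python
import argparse

def build_parser():
    # Hypothetical flags; the real CLI would define its own options.
    parser = argparse.ArgumentParser(prog="goodtables")
    parser.add_argument("source", help="path or URL of the table to validate")
    parser.add_argument("--schema", help="optional JSON Table Schema to validate against")
    parser.add_argument("--format", default="csv", choices=["csv", "xls", "json"],
                        help="source format hint passed through to the reader")
    return parser

args = build_parser().parse_args(["data.csv", "--schema", "schema.json"])
```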
---
`goodtables.pipeline.Pipeline`: pretty easy once datatable is replaced by tabulator; jsontableschema-py can now replace lots of the hardcoded stuff.
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py
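The idea of swapping hardcoded casting for schema-driven casting (what jsontableschema-py’s row casting provides) can be sketched as follows; `cast_row` and the `CASTS` table are stand-ins, not the library’s actual API:

```python
# Stand-in cast table; jsontableschema-py supports the full set of field types.
CASTS = {"integer": int, "number": float, "string": str}

def cast_row(schema, row):
    """Cast each cell per its field type, collecting errors instead of
    hardcoding per-format logic inside the pipeline."""
    out, errors = [], []
    for field, cell in zip(schema["fields"], row):
        try:
            out.append(CASTS[field["type"]](cell))
        except ValueError:
            errors.append({"field": field["name"], "value": cell})
    return out, errors

schema = {"fields": [{"name": "id", "type": "integer"},
                     {"name": "score", "type": "number"}]}
casted, errors = cast_row(schema, ["1", "3.5"])
```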
---
`goodtables.pipeline.Batch`: *very* useful batch pipeline runner. Needs proper parallelization
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/batch.py
- Georgiana attempted parallelization here but overlooked the shared state: https://github.com/georgiana-b/goodtables/commit/5514c3cb769c54a5dbe415df61cf406812600c83
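One way to parallelize the batch runner without the shared-state problem is to have each task return its own report and collect them only at the join point. A minimal sketch with `concurrent.futures`; the `run_pipeline` stub is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(source):
    # Hypothetical stand-in for Pipeline(source).run(); returns its own report.
    return {"source": source, "valid": not source.endswith("bad.csv")}

def run_batch(sources, workers=4):
    """Run pipelines in parallel. Each task returns a report rather than
    appending to a shared structure, so there is no state to race on."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, sources))

reports = run_batch(["a.csv", "bad.csv"])
```

`pool.map` preserves input order, so reports line up with their sources regardless of which pipeline finishes first.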
---
`goodtables.processors`: these will essentially become tabulator processors, I guess. Again, there is some hardcoded stuff from before we had the libraries, or really understood the desirable API design. A `pipeline` calls processors; note we also have an API for custom processors, and tabulator’s API for that is better.
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/processors
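In that style, a custom processor can be just a callable that takes data and returns error objects; a hypothetical sketch, where the function name and error shape are illustrative rather than the existing processor API:

```python
def required_columns(headers, required):
    """A hypothetical custom processor: checks the header row once and
    returns an error object per missing column."""
    missing = [name for name in required if name not in headers]
    return [{"code": "missing-header", "name": name} for name in missing]

errors = required_columns(["id", "name"], required=["id", "email"])
```

Keeping processors as plain callables makes them trivial to register, compose in a pipeline, and unit-test in isolation.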
---
Reports: GoodTables generates reports, has a few output formats, and contains some tricky (ugly) code to shape the output. I absolutely encourage you to break the reports API for something better, preferably a format that can be streamed (emitting report objects as they occur, if possible…)
- `tellme`: Rufus wanted me to make this a distinct lib; IMHO it is not really worth it (https://github.com/okfn/tellme), but it is what is currently used for report objects
- “report results”: you’ll see report result definitions in various places (e.g. https://github.com/frictionlessdata/goodtables/blob/master/goodtables/processors/schema.py#L11 ), including recent ones that Georgiana added in unusual places. These should be replaced by the new, generic `data-quality-spec`
- data-quality-spec: https://github.com/frictionlessdata/data-quality-spec
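A streamable report format could be a generator that yields one report object per error as rows pass through, instead of shaping one big report at the end. A sketch of that design, with illustrative error codes rather than codes taken from data-quality-spec:

```python
def validate(rows, width):
    """Yield a report object for each error as it is found, so callers can
    stream results instead of waiting for a fully shaped report."""
    for number, row in enumerate(rows, start=1):
        if not any(cell.strip() for cell in row):
            yield {"code": "blank-row", "row-number": number}
        elif len(row) != width:
            yield {"code": "ragged-row", "row-number": number}

errors = list(validate([["1", "a"], [""], ["2"]], width=2))
```

Because it is a generator, a CLI or web frontend can report errors incrementally while the source is still being read.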
---
- note `self.report` and `def set_report_meta` here: https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L280
Inspect them with ipdb on some sample datasets and you’ll see what goes on there; a better design can come out of seeing what we currently have.
---
Lastly! “Hooks”! I built a simple hook system into `pipeline.Pipeline` and `pipeline.Batch`: both take a `post_task` argument. We use this, for example, in `data-quality-cli` and in `goodtables-web` to modify reports. At the time I needed it, but I am honestly no longer sure it is required… there may be a better way to solve the same problem:
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L79
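For context, the `post_task` idea amounts to roughly the following (a simplified stand-in, not the real `Pipeline` API). The alternative hinted at above is that callers could equally just transform the returned report themselves, which may make the hook unnecessary:

```python
def run_pipeline(source, post_task=None):
    """Minimal sketch of a post_task hook: the runner invokes the callback
    with the finished report before returning it."""
    report = {"source": source, "valid": True}
    if post_task is not None:
        post_task(report)  # callers mutate or inspect the report here
    return report

# e.g. a web frontend tagging reports as they are produced
report = run_pipeline("data.csv", post_task=lambda r: r.update(app="web"))
```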
Issue Analytics
- State:
- Created 7 years ago
- Comments: 11 (10 by maintainers)
Top GitHub Comments
Forgot one thing: 100% streaming. We read every row from the source only once (using an HTTP stream, with no additional requests, etc.)
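Chaining generators is one way to guarantee that single pass: the source iterator is consumed exactly once, and checks are applied as rows flow through, with nothing buffered or re-requested. A minimal illustrative sketch (names and error shape are hypothetical):

```python
def rows_from(lines):
    """Single source generator: each input line is read exactly once."""
    for line in lines:
        yield line.rstrip("\n").split(",")

def checked(rows):
    """Checks chained as a generator, so the whole pipeline is one pass
    over the source iterator."""
    for row in rows:
        yield row, ([] if all(row) else [{"code": "missing-value"}])

source = iter(["1,a", "2,", "3,c"])
results = list(checked(rows_from(source)))
```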
Fixed in https://github.com/frictionlessdata/goodtables-py/tree/next. The `next` branch will now be tested for several weeks before we merge onto `master`.