Deep refactoring
Overview
@pwalsh wrote:
goodtables is little more than tabulator + jsontableschema + data-quality-spec + an output report.
Of course there will be issues when you port, but the bottom line is that we’ll probably remove hundreds to thousands of lines of code and some inconsistent internal APIs, and you’ll likely also learn edge cases of tabular data handling that can feed back into tabulator.
The important thing is: goodtables is the basis for a product, and right now it is more important to keep shipping goodtables than to get the other libs to v1, etc.
Getting them to v1 can be a by-product of goodtables, not the other way around.
There is lots of interest in goodtables, and before we start the test pilots around GT from the beginning of October onwards, it would be better to be on a stabler base; GT needs serious refactoring by now, given all we have learned.
---
`goodtables.datatable`: replace completely with `tabulator`, but the datatable code may contain some useful edge-case handling, so check it first.
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/datatable
- Georgiana’s branch also surfaces more error-handling edge cases here: https://github.com/georgiana-b/goodtables/commit/7e9da1e480938b9c197e91e5d9c79ccbc5c12fa8
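For reference, the kind of streaming row interface tabulator exposes can be sketched with the stdlib alone. Everything below (the function name, the blank-row handling) is illustrative and not tabulator’s actual API; it just shows the shape of the replacement for datatable:

```python
import csv
import io

def stream_rows(text):
    """Yield (row_number, headers, row) for each non-blank row,
    mirroring the streaming interface a tabulator-style reader exposes."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    for number, row in enumerate(reader, start=1):
        # Blank-row handling is one of the edge cases datatable covers today.
        if not any(cell.strip() for cell in row):
            continue
        yield number, headers, row

sample = "id,name\n1,english\n\n2,german\n"
rows = list(stream_rows(sample))
```

Note the row numbers track the physical source rows, so skipped blank rows still advance the count, which matters for error reporting.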
---
`goodtables.cli`: should be easy, and I assume you will have good ideas
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/cli
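A rebuilt CLI could be sketched with stdlib `argparse`; the flag names below are hypothetical and not the actual goodtables CLI options, they only illustrate the surface area such a command would need:

```python
import argparse

def build_parser():
    # Hypothetical flags; the real CLI would define its own options.
    parser = argparse.ArgumentParser(prog="goodtables")
    parser.add_argument("source", help="path or URL of the table to validate")
    parser.add_argument("--schema", help="optional JSON Table Schema to validate against")
    parser.add_argument("--format", default="csv", choices=["csv", "xls", "json"],
                        help="source format hint passed through to the reader")
    return parser

args = build_parser().parse_args(["data.csv", "--schema", "schema.json"])
```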
---
`goodtables.pipeline.Pipeline`: pretty easy once datatable is replaced by tabulator; jsontableschema-py can now replace lots of the hardcoded stuff.
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py
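The idea of swapping hardcoded casting for schema-driven casting (what jsontableschema-py’s row casting provides) can be sketched as follows; `cast_row` and the `CASTS` table are stand-ins, not the library’s actual API:

```python
# Stand-in cast table; jsontableschema-py supports the full set of field types.
CASTS = {"integer": int, "number": float, "string": str}

def cast_row(schema, row):
    """Cast each cell per its field type, collecting errors instead of
    hardcoding per-format logic inside the pipeline."""
    out, errors = [], []
    for field, cell in zip(schema["fields"], row):
        try:
            out.append(CASTS[field["type"]](cell))
        except ValueError:
            errors.append({"field": field["name"], "value": cell})
    return out, errors

schema = {"fields": [{"name": "id", "type": "integer"},
                     {"name": "score", "type": "number"}]}
casted, errors = cast_row(schema, ["1", "3.5"])
```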
---
`goodtables.pipeline.Batch`: *very* useful batch pipeline runner. Needs proper parallelization
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/batch.py
- Georgiana attempted parallelization here but overlooked the shared state: https://github.com/georgiana-b/goodtables/commit/5514c3cb769c54a5dbe415df61cf406812600c83
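One way to parallelize the batch runner without the shared-state problem is to have each task return its own report and collect them only at the join point. A minimal sketch with `concurrent.futures`; the `run_pipeline` stub is hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(source):
    # Hypothetical stand-in for Pipeline(source).run(); returns its own report.
    return {"source": source, "valid": not source.endswith("bad.csv")}

def run_batch(sources, workers=4):
    """Run pipelines in parallel. Each task returns a report rather than
    appending to a shared structure, so there is no state to race on."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_pipeline, sources))

reports = run_batch(["a.csv", "bad.csv"])
```

`pool.map` preserves input order, so reports line up with their sources regardless of which pipeline finishes first.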
---
`goodtables.processors`: these will essentially become tabulator processors, I guess. Again, there is some hardcoded stuff from before we had the libraries, or really understood the desirable API design. A `pipeline` calls processors; note we also have an API for custom processors, and tabulator’s API for that is better.
- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/processors
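In that style, a custom processor can be just a callable that takes data and returns error objects; a hypothetical sketch, where the function name and error shape are illustrative rather than the existing processor API:

```python
def required_columns(headers, required):
    """A hypothetical custom processor: checks the header row once and
    returns an error object per missing column."""
    missing = [name for name in required if name not in headers]
    return [{"code": "missing-header", "name": name} for name in missing]

errors = required_columns(["id", "name"], required=["id", "email"])
```

Keeping processors as plain callables makes them trivial to register, compose in a pipeline, and unit-test in isolation.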
---
Reports: GoodTables generates reports, has a few output formats, and contains some tricky (ugly) code to shape the output. I absolutely encourage you to break the reports API for something better, preferably a format that can be streamed (emitting report objects as they occur, if possible…)
- `tellme`: Rufus wanted me to make this a distinct lib; IMHO it is not really worth it (https://github.com/okfn/tellme), but it is what is currently used for report objects
- “report results”: you’ll see report result definitions in various places (e.g. https://github.com/frictionlessdata/goodtables/blob/master/goodtables/processors/schema.py#L11 ), including recent ones that Georgiana added in unusual places. These should be replaced by the new, generic `data-quality-spec`
- data-quality-spec: https://github.com/frictionlessdata/data-quality-spec
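A streamable report format could be a generator that yields one report object per error as rows pass through, instead of shaping one big report at the end. A sketch of that design, with illustrative error codes rather than codes taken from data-quality-spec:

```python
def validate(rows, width):
    """Yield a report object for each error as it is found, so callers can
    stream results instead of waiting for a fully shaped report."""
    for number, row in enumerate(rows, start=1):
        if not any(cell.strip() for cell in row):
            yield {"code": "blank-row", "row-number": number}
        elif len(row) != width:
            yield {"code": "ragged-row", "row-number": number}

errors = list(validate([["1", "a"], [""], ["2"]], width=2))
```

Because it is a generator, a CLI or web frontend can report errors incrementally while the source is still being read.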
---
- note `self.report` and `def set_report_meta` here: https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L280
Inspect them with ipdb on some sample datasets and you’ll see what goes on there; a better design can come out of seeing what we currently have.
---
Lastly! “Hooks”! I built a simple hook system into `pipeline.Pipeline` and `pipeline.Batch`: both take a `post_task` argument. We use this, for example, in `data-quality-cli` and in `goodtables-web` to modify reports. At the time I needed it, but I am honestly no longer sure it is required… there may be a better way to solve the same problem:
- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L79
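For context, the `post_task` idea amounts to roughly the following (a simplified stand-in, not the real `Pipeline` API). The alternative hinted at above is that callers could equally just transform the returned report themselves, which may make the hook unnecessary:

```python
def run_pipeline(source, post_task=None):
    """Minimal sketch of a post_task hook: the runner invokes the callback
    with the finished report before returning it."""
    report = {"source": source, "valid": True}
    if post_task is not None:
        post_task(report)  # callers mutate or inspect the report here
    return report

# e.g. a web frontend tagging reports as they are produced
report = run_pipeline("data.csv", post_task=lambda r: r.update(app="web"))
```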
Issue Analytics
- State:
- Created 7 years ago
- Comments: 11 (10 by maintainers)
Top GitHub Comments
Forgot one thing: 100% streaming. We read every row from the source only once (using an HTTP stream, with no additional requests, etc.)
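Chaining generators is one way to guarantee that single pass: the source iterator is consumed exactly once, and checks are applied as rows flow through, with nothing buffered or re-requested. A minimal illustrative sketch (names and error shape are hypothetical):

```python
def rows_from(lines):
    """Single source generator: each input line is read exactly once."""
    for line in lines:
        yield line.rstrip("\n").split(",")

def checked(rows):
    """Checks chained as a generator, so the whole pipeline is one pass
    over the source iterator."""
    for row in rows:
        yield row, ([] if all(row) else [{"code": "missing-value"}])

source = iter(["1,a", "2,", "3,c"])
results = list(checked(rows_from(source)))
```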
Fixed in https://github.com/frictionlessdata/goodtables-py/tree/next. The `next` branch will now be tested for several weeks before we merge onto `master`.