Overview

@pwalsh wrote:

goodtables is very little apart from tabulator + jsontableschema + data-quality-spec + an output report.

Of course there will be issues when you port, but the bottom line is we’ll probably reduce 100s to 1000s of lines of code, some inconsistent internal APIs, and you’ll likely also learn some edge cases and stuff for tabular data handling that can feed back into tabulator

the important thing is: goodtables is the basis for a product, and it is more important, right now, to move on shipping goodtables than to get the other libs to v1, etc

getting them to v1 can be a by-product of goodtables, not the other way around

just because there is lots of interest in goodtables, and before we start the test pilots around GT from beg. October onwards, it would be better to be on a stabler base, as the GT needs serious refactoring by now with all we learned

---
`goodtables.datatable`: completely replace with `tabulator`, but in the code of datatable, might be some edge case handling that is useful, so check it.

- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/datatable
- also in georgiana’s branch she found some more error handling edge cases here ( https://github.com/georgiana-b/goodtables/commit/7e9da1e480938b9c197e91e5d9c79ccbc5c12fa8 )  
----
`goodtables.cli`: should be easy, and I assume you will have good ideas

- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/cli
----
`goodtables.pipeline.Pipeline`: Pretty easy once datatable is replaced by tabulator, and now using jsontableschema-py to replace lots of hardcoded stuff

- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py
----

`goodtables.pipeline.Batch`: *very* useful batch pipeline runner. Needs proper parallelization

- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/batch.py
- georgiana tried to parallelize here but forgot about shared state: https://github.com/georgiana-b/goodtables/commit/5514c3cb769c54a5dbe415df61cf406812600c83
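One way to sidestep the shared-state problem mentioned above is to have each task build and return its own report instead of writing into an object owned by the batch runner. The following is only a minimal sketch under that assumption: `validate_source`, `run_batch`, the source dict layout, and the `blank-cell` code are hypothetical names for illustration, not the goodtables API.

```python
from concurrent.futures import ThreadPoolExecutor


def validate_source(source):
    """Hypothetical stand-in for running one pipeline on one source.

    The report list is created inside the call, so no state is shared
    between workers.
    """
    errors = []
    for row_number, row in enumerate(source["rows"], start=1):
        if any(cell == "" for cell in row):
            errors.append({"row": row_number, "code": "blank-cell"})
    return {"name": source["name"], "valid": not errors, "errors": errors}


def run_batch(sources, max_workers=4):
    """Validate sources in parallel and return reports in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() hands each worker one source and preserves input order.
        return list(pool.map(validate_source, sources))


if __name__ == "__main__":
    demo = [
        {"name": "a.csv", "rows": [["1", "2"], ["3", ""]]},
        {"name": "b.csv", "rows": [["x", "y"]]},
    ]
    for report in run_batch(demo):
        print(report)
```

Threads suit this sketch because fetching remote sources is mostly I/O-bound; for CPU-heavy checks a process pool may be the better fit, which is a design choice the real runner would need to weigh.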
---- 

`goodtables.processors`: they will essentially be tabulator processors I guess. Again, some hard coded stuff from before we had libs, or really understood the desirable API design. A `pipeline` calls processors, and note we also have an API for custom processors - the tabulator API for that is better

- https://github.com/frictionlessdata/goodtables/tree/master/goodtables/processors
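As a rough illustration of what a simpler custom-processor API could look like, a processor can be a plain generator that takes a stream of rows and yields a (possibly filtered or transformed) stream. This is a sketch under that assumption only: `skip_blank_rows`, `strip_cells`, and `run_processors` are hypothetical names and do not reproduce the goodtables or tabulator interfaces.

```python
def strip_cells(rows):
    """Processor: trim surrounding whitespace from every cell."""
    for row_number, row in rows:
        yield row_number, [cell.strip() for cell in row]


def skip_blank_rows(rows):
    """Processor: drop rows whose cells are all empty."""
    for row_number, row in rows:
        if any(cell != "" for cell in row):
            yield row_number, row


def run_processors(rows, processors):
    """Chain processors lazily over (row_number, row) pairs."""
    stream = enumerate(rows, start=1)
    for processor in processors:
        # Each processor wraps the previous stream; nothing runs until
        # the caller iterates over the result.
        stream = processor(stream)
    return stream


if __name__ == "__main__":
    data = [[" a ", "b"], ["", " "], ["c", " d"]]
    for row_number, row in run_processors(data, [strip_cells, skip_blank_rows]):
        print(row_number, row)
```

Because every processor is lazy, the chain never holds more than one row at a time, which keeps custom processors compatible with streaming sources.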

---
Reports: GoodTables generates reports, and has a few output formats, as well as some tricky (ugly) stuff to shape the output. I absolutely encourage you to break the API of reports for something better, preferably a format that can be streamed (emit report objects as they occur - if possible…)

- `tellme`: rufus wanted me to make that a distinct lib: IMHO it is not really worth it: https://github.com/okfn/tellme but it is what is used for report objects
- “report results”: you’ll see various places with report result definitions ( eg: https://github.com/frictionlessdata/goodtables/blob/master/goodtables/processors/schema.py#L11 ), including recent ones that georgiana added in unusual places. these should be replaced by use of the new, generic `data-quality-spec`
- data-quality-spec: https://github.com/frictionlessdata/data-quality-spec
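To make the "emit report objects as they occur" idea concrete, a report can be modelled as a generator that yields one entry per problem while the source is read in a single pass. This is only a sketch: `iter_report` is a hypothetical name, and the error codes are placeholders rather than codes taken from `data-quality-spec`.

```python
import csv
import io


def iter_report(lines):
    """Yield one report entry per problem, as soon as it is found.

    `lines` is any iterable of CSV text lines and is consumed only once.
    """
    reader = csv.reader(lines)
    headers = next(reader, None)
    if headers is None:
        yield {"code": "empty-source", "row": None}
        return
    # Row 1 is the header row, so data rows are numbered from 2.
    for row_number, row in enumerate(reader, start=2):
        if len(row) != len(headers):
            yield {"code": "wrong-column-count", "row": row_number}
        elif all(cell.strip() == "" for cell in row):
            yield {"code": "blank-row", "row": row_number}


if __name__ == "__main__":
    sample = io.StringIO("a,b\n1\n,\n3,4\n")
    for entry in iter_report(sample):
        print(entry)
```

A consumer can write each entry out (for example as one JSON object per line) without waiting for the whole table to be validated, and can still collect everything with `list(iter_report(...))` when a complete report is wanted.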

---

- note `self.report` and `def set_report_meta` here: https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L280

inspect with ipdb on some sample datasets and you’ll see what goes on there

a better design can come out of seeing what we have there

------

Lastly! “Hooks”! I built a simple hook system into `pipeline.Pipeline` and `pipeline.Batch`: both take a `post_task` argument. We use this, for example, in `data-quality-cli` and also in `goodtables-web` to modify reports. At the time, I kind of needed it, but I am honestly not sure any more if it is required… maybe there is a better way to solve the same problem:

- https://github.com/frictionlessdata/goodtables/blob/master/goodtables/pipeline/pipeline.py#L79
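For comparison, here is a minimal sketch of the hook pattern next to one possible hook-free alternative. Everything in it is an assumption for illustration: `MiniPipeline` is a hypothetical stand-in, and the idea that the `post_task` callable receives the pipeline instance is a guess, not the confirmed goodtables signature.

```python
class MiniPipeline:
    """Hypothetical, stripped-down stand-in for a pipeline with a hook."""

    def __init__(self, rows, post_task=None):
        self.rows = rows
        self.post_task = post_task
        self.report = {"errors": [], "meta": {}}

    def run(self):
        for row_number, row in enumerate(self.rows, start=1):
            if "" in row:
                self.report["errors"].append(
                    {"row": row_number, "code": "blank-cell"}
                )
        if self.post_task is not None:
            # Assumed contract: the hook gets the pipeline and may mutate it.
            self.post_task(self)
        return self.report


def add_error_count(pipeline):
    """Hook style: modify the report in place after the run."""
    pipeline.report["meta"]["error_count"] = len(pipeline.report["errors"])


def with_error_count(report):
    """Hook-free style: return a new, enriched report and leave the input alone."""
    meta = {**report["meta"], "error_count": len(report["errors"])}
    return {**report, "meta": meta}


if __name__ == "__main__":
    rows = [["a", "b"], ["", "c"]]
    print(MiniPipeline(rows, post_task=add_error_count).run())
    print(with_error_count(MiniPipeline(rows).run()))
```

The second style keeps the pipeline unaware of its consumers: callers such as a CLI or a web app transform the returned report themselves, which may remove the need for a hook argument altogether.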

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

roll commented, Sep 26, 2016 (1 reaction)

Forgot one thing - 100% streaming. Every row from the source is read only once (using an HTTP stream, without additional requests, etc.)

pwalsh commented, Oct 17, 2016 (0 reactions)

Fixed in https://github.com/frictionlessdata/goodtables-py/tree/next

`next` will be tested for several weeks before we merge it into master.
