Implement `resource.analyze` function and CLI command

Overview

https://frictionlessdata.slack.com/archives/C0369HZ2SLT/p1651844750785019

Is there any tooling that goes beyond describe and actually analyses the data? For example, tools that would give you distributions for number fields, most common word statistics for text fields, distinct counts, counts of non-blank values, most common categories used, and many others? Essentially, statistics that are currently not in the Tabular Data Package resource specification. This could also help detect more kinds of data formats. Some of this can be done with the pandas describe function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html), but there is a lot more potential than that. Also, is it OK to add such extra statistics to resources without causing validation errors? I think such tooling would be interesting, as it could give greater insight into the data before the need to analyse it.
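To make the request concrete, here is a minimal sketch of the kind of per-field statistics described above, written against plain Python dictionaries rather than any frictionless API. The function name `analyze_rows`, the output keys (`notBlank`, `distinct`, `mostCommon`) and the sample rows are all assumptions made for illustration; they are not part of the issue or of the library.

```python
from collections import Counter
from statistics import mean, median


def analyze_rows(rows, top_n=3):
    """Hypothetical helper: descriptive statistics per field for a list of dict rows."""
    stats = {}
    fields = list(rows[0].keys()) if rows else []
    for field in fields:
        values = [row.get(field) for row in rows]
        # Treat None and empty strings as blank cells
        present = [v for v in values if v is not None and v != ""]
        field_stats = {
            "notBlank": len(present),
            "distinct": len(set(present)),
            "mostCommon": Counter(present).most_common(top_n),
        }
        # Add numeric summaries only when every non-blank value is a number
        numbers = [
            v for v in present
            if isinstance(v, (int, float)) and not isinstance(v, bool)
        ]
        if numbers and len(numbers) == len(present):
            field_stats["min"] = min(numbers)
            field_stats["max"] = max(numbers)
            field_stats["mean"] = mean(numbers)
            field_stats["median"] = median(numbers)
        stats[field] = field_stats
    return stats


# Made-up sample data, for illustration only
rows = [
    {"city": "Paris", "population": 2.1},
    {"city": "Rome", "population": 2.8},
    {"city": "Paris", "population": None},
]
report = analyze_rows(rows)
```

Text fields get counts and most common values, while numeric fields additionally get min, max, mean and median, which roughly mirrors what pandas describe reports for each kind of column.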

Plan

@shashigharti @aivuk We need to brainstorm the analytics output format and contents. We can probably use the Stats class (resource.stats) as a target (alternatively, might be a validate/report part)
implement resource.analyze
implement package.analyze (reusing the above)
expose in the CLI (reusing above)
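
The layering in this plan (package-level analysis and the CLI both reusing the resource-level function) could look roughly like the sketch below. It is only an illustration of the reuse pattern: `analyze_resource`, `analyze_package` and `render_cli_output` are hypothetical names operating on plain dictionaries, and the output format is an assumption, since the plan explicitly leaves it open.

```python
import json


def analyze_resource(rows):
    """Hypothetical resource-level analysis: row count and non-blank count per field."""
    fields = list(rows[0].keys()) if rows else []
    return {
        "rows": len(rows),
        "notBlank": {
            field: sum(1 for row in rows if row.get(field) not in (None, ""))
            for field in fields
        },
    }


def analyze_package(resources):
    """Hypothetical package-level analysis: reuse analyze_resource for each resource."""
    return {name: analyze_resource(rows) for name, rows in resources.items()}


def render_cli_output(resources):
    """What a CLI command could print: the same report serialized as JSON."""
    return json.dumps(analyze_package(resources), indent=2, sort_keys=True)


# Made-up package with two resources, for illustration only
package = {
    "cities": [{"name": "Paris", "country": "FR"}, {"name": "Rome", "country": ""}],
    "empty": [],
}
package_report = analyze_package(package)
cli_text = render_cli_output(package)
```

Keeping all the logic in the resource-level function means the package and CLI layers stay thin wrappers, so a change to the report contents only has to be made in one place.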

Issue Analytics

Created a year ago
Comments: 16 (16 by maintainers)

Top GitHub Comments

2 reactions

fjuniorr commented, Jun 2, 2022

We need to brainstorm the analytics output format and contents. We can probably use the Stats class (resource.stats) as a target (alternatively, might be a validate/report part)

It would be great if, alongside the implementation in frictionless-py, at least a pattern were added to the specs so that data resources with descriptive statistics generated by other tools (such as frictionless-r) stay consistent. Maybe push this discussion https://github.com/frictionlessdata/specs/issues/364 forward?

Also, since you mentioned pandas, pandas-profiling might be worth a look (at least for inspiration).

1 reaction

roll commented, Jun 29, 2022

@shashigharti We don’t need actions for now, just resource.analyze and package.analyze (it will be too hard to merge into v5 if we work on actions now)