Implement `resource.analyze` function and CLI command
Overview
https://frictionlessdata.slack.com/archives/C0369HZ2SLT/p1651844750785019
Is there any tooling around that does more than describe the data, that is, tooling that analyses it? For example, tools that would give you distributions for number fields, most common word statistics for text fields, distinct counts, counts for fields that are not blank, the most common categories used, and many others? Essentially, statistics that are currently not in the tabular data package resource specification. It could also help detect more kinds of data formats. Some of this could be done by the pandas describe function (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html); however, there is a lot more potential than that. Also, is it OK to add such extra statistics to resources without causing validation errors? I am thinking such tooling would be interesting, as it could give greater insight into the data before the need to analyse it.
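To make the kinds of statistics listed above concrete, here is a minimal sketch using only the Python standard library. It is not part of frictionless-py: the helper names `describe_column` and `describe_rows`, the output keys, and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev


def describe_column(values, top_n=3):
    """Summarize one column of raw cell values (hypothetical helper)."""
    # Treat None and empty or whitespace-only strings as blank cells.
    present = [v for v in values if v is not None and str(v).strip() != ""]
    summary = {
        "count": len(present),  # cells that are not blank
        "blank": len(values) - len(present),
        "distinct": len(set(present)),
        "most_common": Counter(present).most_common(top_n),
    }
    # Add distribution figures only when every non-blank cell is numeric.
    numbers = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numbers and len(numbers) == len(present):
        summary.update(
            min=min(numbers),
            max=max(numbers),
            mean=mean(numbers),
            median=median(numbers),
        )
        if len(numbers) > 1:
            summary["stdev"] = stdev(numbers)
    return summary


def describe_rows(rows):
    """Summarize every field found in a list of dict rows."""
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)
    return {
        name: describe_column([row.get(name) for row in rows])
        for name in fields
    }


# Made-up sample data, for illustration only.
rows = [
    {"city": "Paris", "population": 2100000},
    {"city": "Lyon", "population": 500000},
    {"city": "Paris", "population": None},
    {"city": "", "population": 900000},
]
report = describe_rows(rows)
```

Here `report["city"]` carries the non-blank count, distinct count, and most common values, while `report["population"]` additionally gets `min`, `max`, `mean`, `median`, and `stdev`. A real implementation would likely lean on the field types from the resource schema instead of inspecting Python types.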
Plan
- @shashigharti @aivuk We need to brainstorm the analytics output format and contents. We can probably use the `Stats` class (`resource.stats`) as a target (alternatively, it might be a `validate/report` part)
- implement `resource.analyze`
- implement `package.analyze` (reusing the above)
- expose in the CLI (reusing above)
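The layering in this plan (resource-level analysis, a package-level function that reuses it, and a CLI that reuses both) could be sketched roughly as follows. This is an illustration under stated assumptions, not the frictionless-py implementation: the functions take plain CSV paths instead of `Resource`/`Package` objects, and the names `analyze_resource`, `analyze_package`, `main`, the `analyze-sketch` program name, and the output keys are all hypothetical.

```python
import argparse
import csv
import json


def analyze_resource(path):
    """Hypothetical resource-level analysis of one CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields = reader.fieldnames or []
        not_blank = {name: 0 for name in fields}
        rows = 0
        for row in reader:
            rows += 1
            for name in fields:
                # Missing trailing cells come back as None; count them as blank.
                if (row.get(name) or "").strip():
                    not_blank[name] += 1
    return {
        "rows": rows,
        "fields": {name: {"notBlank": not_blank[name]} for name in fields},
    }


def analyze_package(paths):
    """Hypothetical package-level analysis: reuse the resource-level function."""
    return {path: analyze_resource(path) for path in paths}


def main(argv=None):
    """Hypothetical CLI entry point that reuses analyze_package."""
    parser = argparse.ArgumentParser(prog="analyze-sketch")
    parser.add_argument("paths", nargs="+", help="CSV files to analyze")
    args = parser.parse_args(argv)
    result = analyze_package(args.paths)
    print(json.dumps(result, indent=2))
    return result
```

Calling `main(["data.csv"])` would print a JSON report keyed by file path. Keeping the per-resource logic in one function means the package and CLI layers stay thin wrappers, which is what "reusing the above" suggests.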
Issue Analytics
- State:
- Created a year ago
- Comments: 16 (16 by maintainers)
Top GitHub Comments
It would be great if, alongside the implementation in `frictionless-py`, at least a pattern is added to the specs to allow the creation of data resources with descriptive statistics generated through other tools (such as `frictionless-r`) but still consistent. Maybe push this discussion https://github.com/frictionlessdata/specs/issues/364 forward?
Also, since you mentioned pandas, pandas-profiling might be worth taking a look at (at least for inspiration).
@shashigharti We don’t need actions for now, just `resource.analyze` and `package.analyze` (it will be too hard to merge into v5 if we work on actions now)