[Discuss] Defining source data modifications in the FDP descriptor file
The new draft FDP spec defines the concept of “virtual columns”: columns that don’t exist in the source file, only in the descriptor. Tools that support the FDP spec would then parse the descriptor and add these columns based on their definitions in the descriptor file. This can range from something as simple as adding a constant column (see https://github.com/frictionlessdata/specs/issues/529) to a more complex “normalisation column” that splits rows based on a column’s content.
Sorry if this was already discussed, but I’m opening this issue to talk about the general question: should the descriptor file define modifications to the resource files? This would mean that the resource contents of a datapackage are a function of both the CSV files’ contents and the descriptor file.
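To make the question concrete, here is a rough sketch of the kind of descriptor-defined modification being debated, using the `extraFields`/`constant` properties from the draft spec quoted further down in this thread (the resource and field names are invented for illustration):

```json
{
  "name": "budget",
  "path": "budget.csv",
  "schema": {
    "fields": [
      {"name": "amount", "type": "number"}
    ],
    "extraFields": [
      {"name": "currency", "type": "string", "constant": "MXN"}
    ]
  }
}
```

An FDP-aware reader would emit a `currency` column with the value `MXN` on every row, even though `budget.csv` itself contains no such column.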
The advantage I see is that we can provide FDP authors with a few more tools, allowing them to do common modifications to their data files. The disadvantages are:
- The FDP specification gets more complicated
- Implementing the FDP spec gets harder
- A user can’t simply ignore the `datapackage.json` file and get the full data: they need FDP-aware tools to be able to read the data as intended by the FDP author
Personally, I think the descriptor file should only describe the resources, not modify them. Any data wrangling operation should be performed before the FDP is created, not as part of it.
This is the current spec: https://hackmd.io/BwNgpgrCDGDsBMBaAhtALARkWsPEE5posR8RxgAzffWfDIA=
The relevant bits from the spec (although I do recommend reading the entire thing, it’s super interesting!):
Fiscal Modelling
This specification allows modelling of fiscal data in two distinct levels: the structural level and the semantic level.
Structural Level
We want to properly describe the structure of a dataset - so that data consumers are able to restructure the dataset based on their own needs and requirements.
The method for describing a dataset’s structure is to detail the difference between the provided form of the dataset and its fully denormalised form. Essentially, we’re listing a set of transformations that, when applied, would convert the dataset from the former to the latter.
Using the knowledge of what the denormalised data looks like, consumers can then better understand how to read, store or otherwise manipulate the data so it fits their existing systems and processes.
A denormalised presentation of a data set needs to fulfill these conditions:
The specification provides 3 possible transformations that might be used to describe how the dataset could be denormalised:
The `extraFields` property
The main vehicle for the structural modelling is the `extraFields` property - a property added to a tabular resource schema (as a sibling to the `fields` property), similarly containing field definitions.
All the fields listed in the `extraFields` property are ones that appear in the denormalised form but not in the original data. The contents of these columns are derived from the dataset itself (or from the descriptor). Each of these fields also specifies how its content relates to the original dataset.
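As a minimal illustration of the shape being described (field names invented here; only the `fields`/`extraFields` layout comes from the text above), a schema would look roughly like:

```json
{
  "schema": {
    "fields": [
      {"name": "admin_unit", "type": "string"},
      {"name": "amount", "type": "number"}
    ],
    "extraFields": [
      {"name": "fiscal_year", "type": "integer"}
    ]
  }
}
```

How the content of each extra field is derived (e.g. via the `normalize` or `constant` mechanisms described below) is omitted from this fragment.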
Denormalising Measures
In many cases, publishers will prefer to have the Approved, Modified and Executed values of a budget as separate columns, instead of duplicating the same line just to provide 3 figures. It is more readable to humans and more concise (i.e. it creates a smaller file size).
In other cases, the budget figures for the current, next and after-next years will appear as separate columns instead of in separate rows. This allows readers to more easily compare the budget figures across consecutive years.
In fact, we might even encounter a dataset where both the phase and year columns were reduced in this way.
This practice is very common as a simple form of normalisation applied to a published dataset. However, some data is lost along the way - in our examples, we’ve lost the ‘Budget Phase’ column in the former and the ‘Fiscal Year’ column in the latter.
We want to describe this process to allow data consumers to potentially undo it - or, at the very least, to resurrect the data that was lost in the process.
In order to do so we need to:
- Add to the `extraFields` property a field definition for each column that was reduced (budget phase or fiscal year in our scenario)
- Add a `normalize` property to each measure in the schema; the value of this property is a mapping from every ‘reduced column’ name to a value
- Add to the `extraFields` property a field definition for the target column for the measures’ values (see the sketch below)
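The spec’s own examples are not reproduced in this excerpt. A hedged sketch of how the three steps might fit together for the budget-phase scenario (column names invented; the excerpt doesn’t show how the target column - `amount` here - is flagged as such, so it is left as a plain field definition):

```json
{
  "schema": {
    "fields": [
      {"name": "admin_unit", "type": "string"},
      {"name": "approved", "type": "number", "normalize": {"budget_phase": "approved"}},
      {"name": "executed", "type": "number", "normalize": {"budget_phase": "executed"}}
    ],
    "extraFields": [
      {"name": "budget_phase", "type": "string"},
      {"name": "amount", "type": "number"}
    ]
  }
}
```

Reading this, a consumer could undo the reduction: each source row would be split into one row per measure, with `budget_phase` set from the `normalize` mapping and `amount` set to that measure’s value.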
Constant Fields
In order to complement missing information in the dataset, it’s possible to add columns with ‘constant’ values to the schema.
We can do so by adding field definitions to the `extraFields` property. Each of these field objects must also contain a `constant` property, holding the constant value. The value may be provided either in its logical representation or in its physical representation.
Examples:
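The original examples are not included in this excerpt; a minimal illustrative sketch (field names and values invented, only the `constant` property itself comes from the text above) might be:

```json
{
  "extraFields": [
    {"name": "currency", "type": "string", "constant": "MXN"},
    {"name": "fiscal_year", "type": "integer", "constant": 2015}
  ]
}
```

Per the text above, the `fiscal_year` value could presumably also be given in its physical representation, i.e. as the string "2015" rather than the integer 2015.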
Is there really a conceptual difference between virtual columns and foreign keys? Both describe very similar things.
Consider a data set that can be represented in several equivalent ways: in its full, denormalised form, ‘compressed’ using foreign keys, or ‘compressed’ using virtual columns. All of them describe exactly the same data. The foreign key method ‘compresses’ the data by removing columns from it (and supplying the values in a separate, connected table). The virtual columns method does it by combining rows together (and supplying the missing column values in the metadata).
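The original side-by-side example isn’t reproduced above, but a toy sketch of the two ways of ‘compressing’ the same data might look like this (all names invented; the `phases` lookup resource referenced by the foreign key is not shown):

```json
{
  "resources": [
    {
      "name": "budget-foreign-key-style",
      "schema": {
        "fields": [
          {"name": "phase_id", "type": "integer"},
          {"name": "amount", "type": "number"}
        ],
        "foreignKeys": [
          {"fields": "phase_id", "reference": {"resource": "phases", "fields": "id"}}
        ]
      }
    },
    {
      "name": "budget-virtual-columns-style",
      "schema": {
        "fields": [
          {"name": "approved", "type": "number", "normalize": {"phase": "approved"}},
          {"name": "executed", "type": "number", "normalize": {"phase": "executed"}}
        ],
        "extraFields": [
          {"name": "phase", "type": "string"},
          {"name": "amount", "type": "number"}
        ]
      }
    }
  ]
}
```

In the first resource the phase information is factored out into a separate table via `foreignKeys`; in the second it is folded into column names and resurrected through `normalize` and `extraFields`.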
Why does one method belong to the physical model and the other to the logical model?