
[Discuss] Defining source data modifications in the FDP descriptor file

See original GitHub issue

The new draft FDP spec defines the concept of “virtual columns”: columns that don’t exist in the source file, only in the descriptor. Tools that support the FDP spec would then parse the descriptor and add these columns based on their definitions in the descriptor file. This can range from something as simple as adding a constant column (see https://github.com/frictionlessdata/specs/issues/529) to a more complex “normalisation column” that splits rows based on a column’s content.

Sorry if this was already discussed, but I’m opening this issue to talk about the general question: should the descriptor file define modifications to the resource files? If it does, the resource contents of a data package become a function of both the CSV files’ contents and the descriptor file.
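
To make that concrete, here is a minimal sketch (my own illustration, not taken from the spec or any existing library) of what “resource contents as a function of the CSV and the descriptor” looks like for the simplest case, a constant virtual column:

import csv
import io

# Hypothetical descriptor fragment: one real field plus one constant virtual column.
descriptor = {
    "fields": [{"name": "Amount", "type": "number"}],
    "extraFields": [{"name": "Currency", "type": "string", "constant": "USD"}],
}

source_csv = "Amount\n1000\n1200\n"

def read_resource(csv_text, schema):
    # What a consumer sees depends on both the CSV and the descriptor.
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for extra in schema.get("extraFields", []):
            row[extra["name"]] = extra["constant"]
        rows.append(row)
    return rows

print(read_resource(source_csv, descriptor))
# [{'Amount': '1000', 'Currency': 'USD'}, {'Amount': '1200', 'Currency': 'USD'}]

Without an FDP-aware reader the Currency column simply isn’t there, which is exactly the trade-off in the disadvantages listed below.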

The advantage I see is that we can provide FDP authors with a few more tools, allowing them to do common modifications to their data files. The disadvantages are:

  • The FDP specification gets more complicated
  • Implementing the FDP spec gets harder
  • A user can’t simply ignore the datapackage.json file and get the full data: they need FDP-aware tools to be able to read the data as intended by the FDP author

Personally, I think the descriptor file should only describe the resources, not modify them. Any data wrangling operation should be performed before the FDP is created, not as part of it.

/cc @pwalsh @akariv @rufuspollock

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
akariv commented, Dec 13, 2017

This is the current spec: https://hackmd.io/BwNgpgrCDGDsBMBaAhtALARkWsPEE5posR8RxgAzffWfDIA=

The relevant bits from the spec (although I do recommend reading the entire thing, it’s super interesting!):

Fiscal Modelling

This specification allows modelling of fiscal data at two distinct levels: the structural level and the semantic level.

Structural Level

We want to properly describe the structure of a dataset - so that data consumers are able to restructure the dataset based on their own needs and requirements.

The method for describing a dataset’s structure is to detail the difference between the provided form of the dataset and its fully denormalised form. Essentially, we list a set of transformations that, when applied, convert the dataset from the former to the latter.

Knowing what the denormalised data looks like, consumers can then better understand how to read, store or otherwise manipulate the data so it fits their existing systems and processes.

A denormalised presentation of a data set needs to fulfill these conditions:

  • all data is contained in a single data table
  • each row contains just one single data point with a single value
  • all data and metadata are provided within the data table
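
For illustration (my own example, not taken from the spec), a single row of such a fully denormalised fiscal table would carry one value together with all of its describing metadata:

# Hypothetical fully denormalised row: a single data point with a single value,
# and all data and metadata carried inside the same table.
denormalised_row = {
    "Department": "A",
    "Fiscal Year": 2015,
    "Budget Phase": "approved",
    "Fiscal Amount": 1000,
}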

The specification provides 3 possible transformations that might be used to describe how the dataset could be denormalised:

  1. Foreign Keys - connect separate data tables via an ID column present in both tables. This method is already part of the Tabular Data Package specification and will not be covered here.
  2. Denormalising Measures - convert a row with multiple measures in the source data into multiple rows, each with a single value.
  3. Constant Fields - represent metadata as constant columns in the data table

The extraFields property

The main vehicle for the structural modelling is the extraFields property - a property added to a tabular resource schema (as a sibling to the fields property), similarly containing field definitions.

All the fields listed in the extraFields property are ones that appear in the denormalised form but not in the original data. The contents of these columns are derived from the dataset itself (or from the descriptor). Each of these field definitions also specifies how its content relates to the original dataset.
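
As a rough illustration only (a hypothetical fragment using the property names above, not a normative example), the sibling relationship looks like this:

# Hypothetical resource schema fragment: extraFields is a sibling of fields.
schema = {
    "fields": [
        {"name": "Department", "type": "string"},
        {"name": "Approved 2015", "type": "number"},  # measure column in the source file
    ],
    "extraFields": [
        # These columns exist only in the denormalised form; their values are
        # derived from the descriptor (e.g. a constant or a normalize mapping).
        {"name": "Budget Phase", "type": "string"},
        {"name": "Fiscal Year", "type": "integer"},
    ],
}

# Columns present in the source file vs. columns added by the descriptor.
print([f["name"] for f in schema["fields"]])       # ['Department', 'Approved 2015']
print([f["name"] for f in schema["extraFields"]])  # ['Budget Phase', 'Fiscal Year']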

Denormalising Measures

In many cases, publishers will prefer to have Approved, Modified and Executed values of a budget as separate columns, instead of duplicating the same line just to provide 3 figures. It is more readable to humans and more concise (i.e. creates a smaller file size).

In other cases, the budget figures for the current, next and after next years will appear as separate columns instead of in separate rows. This allows readers to more easily compare the budget figures across consecutive years.

In fact, we might even encounter datasets where both the phase and the year columns were reduced in the same way.

This practice is very common as a simple form of normalisation applied to a published dataset. However, some data is lost along the way - in our examples, we’ve lost the ‘Budget Phase’ column in the former and the ‘Fiscal Year’ column in the latter.

We want to describe this process to allow data consumers to potentially undo it - or, at the very least, to resurrect the data that was lost in the process.

In order to do so we need to:

  • Add to the extraFields property a field definition for each column that was reduced (budget phase or fiscal year in our scenario), for example:
"extraFields": [
   { "name": "Budget Phase", "type": "string", ... },
   { "name": "Fiscal Year", "type": "integer", ... },
   ...
]
  • We add a normalize property to each measure in the schema. The value of this property is a mapping from each ‘reduced column’ name to a value, for example:
...
"schema": {
  "fields": [
     ...
   { 
      "name": "Approved 2015", 
      "type": "number", 
      "normalize": {
          "Budget Phase": "approved",
          "Fiscal Year": 2015
      },
      ... 
   },
   { 
      "name": "Executed 2015", 
      "type": "number", 
      "normalize": {
          "Budget Phase": "executed",
          "Fiscal Year": 2015
      },
      ... 
   },
   { 
      "name": "Approved 2016", 
      "type": "number", 
      "normalize": {
          "Budget Phase": "approved",
          "Fiscal Year": 2016
      },
      ... 
   },
   { 
      "name": "Executed 2016", 
      "type": "number", 
      "normalize": {
          "Budget Phase": "executed",
          "Fiscal Year": 2016
      },
      ... 
   },
 ]  
}
...
  • Finally, we add to the extraFields property a field definition for the target column that will hold the measures’ values, like so:
"extraFields": [
  ...
  {
    "name": "Fiscal Amount",
    "type": "number",
    "columnType": "value",
    "normalizationTarget": true
  }
]
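
The spec excerpt doesn’t prescribe an algorithm, but to make the mechanics concrete, here is a minimal consumer-side sketch (plain Python, hypothetical function names, reusing the descriptor fragments above) of how the normalize mappings and the normalizationTarget field could be used to rebuild the denormalised rows:

# Hypothetical source rows matching the 'Approved/Executed 2015/2016' schema above.
source_rows = [
    {"Department": "A",
     "Approved 2015": 1000, "Executed 2015": 900,
     "Approved 2016": 1100, "Executed 2016": 1050},
]

schema = {
    "fields": [
        {"name": "Department", "type": "string"},
        {"name": "Approved 2015", "type": "number",
         "normalize": {"Budget Phase": "approved", "Fiscal Year": 2015}},
        {"name": "Executed 2015", "type": "number",
         "normalize": {"Budget Phase": "executed", "Fiscal Year": 2015}},
        {"name": "Approved 2016", "type": "number",
         "normalize": {"Budget Phase": "approved", "Fiscal Year": 2016}},
        {"name": "Executed 2016", "type": "number",
         "normalize": {"Budget Phase": "executed", "Fiscal Year": 2016}},
    ],
    "extraFields": [
        {"name": "Budget Phase", "type": "string"},
        {"name": "Fiscal Year", "type": "integer"},
        {"name": "Fiscal Amount", "type": "number",
         "columnType": "value", "normalizationTarget": True},
    ],
}

def denormalise(rows, schema):
    # Expand each source row into one output row per measure column.
    measures = [f for f in schema["fields"] if "normalize" in f]
    plain = [f for f in schema["fields"] if "normalize" not in f]
    target = next(f["name"] for f in schema["extraFields"]
                  if f.get("normalizationTarget"))
    for row in rows:
        for measure in measures:
            out = {f["name"]: row[f["name"]] for f in plain}
            out.update(measure["normalize"])     # resurrect Budget Phase / Fiscal Year
            out[target] = row[measure["name"]]   # a single value per output row
            yield out

for out_row in denormalise(source_rows, schema):
    print(out_row)
# {'Department': 'A', 'Budget Phase': 'approved', 'Fiscal Year': 2015, 'Fiscal Amount': 1000}
# {'Department': 'A', 'Budget Phase': 'executed', 'Fiscal Year': 2015, 'Fiscal Amount': 900}
# ... and likewise for 2016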

Constant Fields

In order to fill in information missing from the dataset, it’s possible to add columns with ‘constant’ values to the schema.

We can do so by adding field definitions to the extraFields property. Each of these field objects must also contain a constant property, holding the constant value.

The constant value may be provided either in its logical representation or in its physical representation.

Examples:

"extraFields": [
  ...
  {
    "name": "A String",
    "type": "string",
    "constant": "a value"
  },
  {
    "name": "A Number",
    "type": "number",
    "constant": 5
  },
  {
    "name": "Another Number",
    "type": "number",
    "constant": "5,4",
    "decimalChar": ","
  },
  {
    "name": "A Date",
    "type": "date",
    "constant": "10/1/2015",
    "format": "%m/%d/%Y"
  },
  {
    "name": "An Array",
    "type": "array",
    "constant": "[3.14, 2.78]"
  },
  {
    "name": "Last Example",
    "type": "array",
    "constant": [3.14, 2.78]
  }
]
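
As with the previous sketch, this is only an illustration; the handling of physical representations below is my assumption about how a consumer might parse the examples above, not something the excerpt spells out:

from datetime import datetime

def resolve_constant(field):
    # Hypothetical resolver: return the constant's logical value, parsing the
    # physical representations used in the examples above where necessary.
    value = field["constant"]
    if field["type"] == "number" and isinstance(value, str):
        # e.g. "5,4" with "decimalChar": ","
        value = float(value.replace(field.get("decimalChar", "."), "."))
    elif field["type"] == "date" and isinstance(value, str):
        # e.g. "10/1/2015" with "format": "%m/%d/%Y"
        value = datetime.strptime(value, field["format"]).date()
    return value

print(resolve_constant({"name": "Another Number", "type": "number",
                        "constant": "5,4", "decimalChar": ","}))          # 5.4
print(resolve_constant({"name": "A Date", "type": "date",
                        "constant": "10/1/2015", "format": "%m/%d/%Y"}))  # 2015-10-01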
1 reaction
akariv commented, Oct 5, 2017

Is there really a conceptual difference between virtual columns and foreign keys? Both describe very similar things.


Consider this data set:

Department   Phase   Phase Name   Amount
A            1       PLANNING     1000
A            2       EXECUTION    1200
B            1       PLANNING     2000
B            2       EXECUTION    1800
C            1       PLANNING     3000
C            2       EXECUTION    3500
  • One publisher might publish it as is.
  • Another would publish it using a code list, with two resources connected via a foreign key relation:
Department   Phase   Amount
A            1       1000
A            2       1200
B            1       2000
B            2       1800
C            1       3000
C            2       3500

Phase   Phase Name
1       PLANNING
2       EXECUTION
  • A third publisher would opt for another way to compress the data:
Department   Planning Amount   Execution Amount
A            1000              1200
B            2000              1800
C            3000              3500

All of them describe exactly the same data. The foreign key method ‘compresses’ the data by removing columns from it (and supplying the values in a separate connected table). The virtual columns method does it by combining rows together (and supplying the missing column values in the metadata).

Why does one method belong to the physical model and the other to the logical model?
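
To illustrate the point (my own sketch using the tables above, not anything from the spec), both ‘compressed’ publications can be expanded back into the first publisher’s table:

# First publisher: the fully expanded table.
expanded = [
    {"Department": d, "Phase": p, "Phase Name": n, "Amount": a}
    for d, p, n, a in [
        ("A", 1, "PLANNING", 1000), ("A", 2, "EXECUTION", 1200),
        ("B", 1, "PLANNING", 2000), ("B", 2, "EXECUTION", 1800),
        ("C", 1, "PLANNING", 3000), ("C", 2, "EXECUTION", 3500),
    ]
]

# Second publisher: fact table plus code list; expanding is a join on Phase.
facts = [{"Department": d, "Phase": p, "Amount": a}
         for d, p, a in [("A", 1, 1000), ("A", 2, 1200), ("B", 1, 2000),
                         ("B", 2, 1800), ("C", 1, 3000), ("C", 2, 3500)]]
code_list = {1: "PLANNING", 2: "EXECUTION"}
via_foreign_key = [{**row, "Phase Name": code_list[row["Phase"]]} for row in facts]

# Third publisher: one row per department; expanding splits each row using
# the phase metadata that the descriptor would carry as virtual columns.
compressed = [{"Department": "A", "Planning Amount": 1000, "Execution Amount": 1200},
              {"Department": "B", "Planning Amount": 2000, "Execution Amount": 1800},
              {"Department": "C", "Planning Amount": 3000, "Execution Amount": 3500}]
normalize = {"Planning Amount": {"Phase": 1, "Phase Name": "PLANNING"},
             "Execution Amount": {"Phase": 2, "Phase Name": "EXECUTION"}}
via_virtual_columns = [
    {"Department": row["Department"], **meta, "Amount": row[measure]}
    for row in compressed
    for measure, meta in normalize.items()
]

print(via_foreign_key == expanded)      # True
print(via_virtual_columns == expanded)  # True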

Read more comments on GitHub >
