Proposal: Table Dialect Spec

Overview

For now, we have:

  • a resource describes a data entity
  • a resource can be tabular
    • a tabular resource has a schema property that must be a Table Schema
    • a tabular resource has a dialect property that must be a CSV Dialect

This means we have only two mechanisms for adding tabular information to a resource, the schema and dialect properties:

  • schema: what is the data
  • dialect: how to extract the data

Maybe at some point this list can be extended, e.g. with a table filtering ability, but for now I think we can definitely generalize the dialect property. Instead of keeping it CSV-only, we can have a general Table Dialect spec that describes the details of any tabular format.

The proposed Table Dialect spec would create a nice symmetry with the existing Table Schema spec. Here is a quick overview of the proposal. The spec is hierarchical, so e.g. Csv Table Dialect inherits all the properties of Table Dialect.

Table Dialect

The core Table Dialect spec will handle header management.

header (bool)

default: true

Whether the table has one or more header rows.

headerRows (int[])

default: [1]

An array of header row numbers. Can describe a multiline header.

headerJoin (str)

default: ' ' (one space)

A string used to join the rows of a multiline header. Has no effect on a single-row header.
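To make the three header properties concrete, here is a minimal sketch of how a reader might apply them to a table held as a list of rows. The function name `build_header` and its snake_case parameters are hypothetical. Two behaviours are assumptions not stated in the proposal: row numbers are treated as 1-based (suggested by the default of [1]), and empty cells are skipped when joining a multiline header.

```python
def build_header(rows, header=True, header_rows=(1,), header_join=" "):
    """Return (labels, data_rows) for a table given as a list of lists.

    Hypothetical helper: row numbers are assumed to be 1-based.
    """
    if not header:
        # No header row at all: every row is data.
        return None, list(rows)
    picked = [rows[n - 1] for n in header_rows]
    # Join the picked rows column by column. Skipping empty cells is an
    # assumption made here so that partially filled header rows join cleanly.
    labels = [
        header_join.join(str(cell) for cell in column if cell not in ("", None))
        for column in zip(*picked)
    ]
    header_set = set(header_rows)
    data = [row for n, row in enumerate(rows, start=1) if n not in header_set]
    return labels, data


rows = [
    ["id", "name", "name"],
    ["", "first", "last"],
    [1, "Ada", "Lovelace"],
]
labels, data = build_header(rows, header_rows=[1, 2])
# labels == ["id", "name first", "name last"]
```

With the defaults (`header_rows=(1,)`), `header_join` never comes into play, which matches the note that it has no effect on a single-row header.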

Csv Table Dialect

@amercader hints that we also need to re-review the CSVW spec in case we missed something - https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-descriptions

It will support all the header options plus the options below, which are standard for CSV.

delimiter (str)

default: ,

lineTerminator (str)

default: \r\n

quoteChar (str)

default: "

doubleQuote (bool)

default: true

escapeChar (str)

default: not set

nullSequence (str)

default: not set

skipInitialSpace (bool)

default: false
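As a rough illustration, most of these properties map directly onto parameters of Python's standard `csv` module. The sketch below assumes a hypothetical `read_csv` helper that takes a dialect as a plain dict. Two caveats: `csv.reader` ignores the line terminator when reading, and the `csv` module has no counterpart to `nullSequence`, so that property is handled here as a post-processing step (an assumption about how it could work, not something the proposal specifies).

```python
import csv
import io

# Defaults as listed in the proposal.
DEFAULTS = {
    "delimiter": ",",
    "lineTerminator": "\r\n",
    "quoteChar": '"',
    "doubleQuote": True,
    "escapeChar": None,       # "not set"
    "nullSequence": None,     # "not set"
    "skipInitialSpace": False,
}


def read_csv(text, dialect=None):
    """Parse CSV text using a dialect dict (hypothetical helper)."""
    d = {**DEFAULTS, **(dialect or {})}
    # lineTerminator is not passed on: csv.reader recognises "\r" and "\n"
    # on its own and ignores the lineterminator parameter when reading.
    reader = csv.reader(
        io.StringIO(text, newline=""),
        delimiter=d["delimiter"],
        quotechar=d["quoteChar"],
        doublequote=d["doubleQuote"],
        escapechar=d["escapeChar"],
        skipinitialspace=d["skipInitialSpace"],
    )
    null = d["nullSequence"]
    # Assumption: cells equal to nullSequence become None after parsing.
    return [
        [None if null is not None and cell == null else cell for cell in row]
        for row in reader
    ]


table = read_csv("a;b\r\n1;NA\r\n", {"delimiter": ";", "nullSequence": "NA"})
# table == [["a", "b"], ["1", None]]
```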

I propose the following changes to the current Csv Dialect spec:

  • make skipInitialSpace=False by default to sync with Python/Pandas/JS/etc behaviour
  • remove caseSensitiveHeader: I guess it should be an option for some infer function, but for general data description I’m not sure what it does
  • review the commentChar option: part of its role will be handled by headerRows and, at the same time, software already supports the more functional skipRows. In software, I’ve moved all the skip/pick/limit/offset_fields/rows functionality to a separate group called Table Query (previously Table Discovery), which should probably exist only in software because we don’t want to turn the specs into ETL, although I think there are options to consider.

Excel Table Dialect

It will support all the header options and:

sheet (str|int)

default: 1

A string or integer that addresses an Excel sheet, e.g. 2 or Sheet 2.

Options to consider:

  • fillMergedCells
  • preserveFormatting
  • adjustFloatingPointError
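The sheet property above accepts either a number or a name, so a reader has to resolve both forms. A minimal sketch, assuming a hypothetical `resolve_sheet` helper that works on a list of sheet names and assuming integers are 1-based (suggested by the default of 1, but not stated explicitly in the proposal):

```python
def resolve_sheet(sheet_names, sheet=1):
    """Return the name of the addressed sheet (hypothetical helper).

    Integers are assumed to be 1-based positions; strings are sheet names.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(sheet, bool):
        raise ValueError("sheet must be a string or an integer")
    if isinstance(sheet, int):
        if not 1 <= sheet <= len(sheet_names):
            raise ValueError(f"sheet number {sheet} is out of range")
        return sheet_names[sheet - 1]
    if sheet not in sheet_names:
        raise ValueError(f"no sheet named {sheet!r}")
    return sheet


names = ["Sheet 1", "Sheet 2"]
first = resolve_sheet(names)                 # default: first sheet
by_number = resolve_sheet(names, 2)
by_name = resolve_sheet(names, "Sheet 2")
```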

Json Table Dialect

It will support all the header options and:

keyed (bool)

default: false

Whether a source is keyed, i.e. an array of dictionaries instead of an array of arrays.

keys (str[])

default: not set

For a keyed source, an array of keys to use as a header row.

Options to consider:

  • property (path to the data within json e.g. dogs/data)
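To show how keyed and keys could interact, here is a minimal sketch of a reader for both JSON layouts. The function name `read_json_table` is hypothetical. When keys is not set, the sketch falls back to collecting keys in order of first appearance, which is an assumption and not part of the proposal. The non-keyed branch assumes the default header settings (first row is the header).

```python
import json


def read_json_table(data, keyed=False, keys=None):
    """Return (header, rows) from parsed JSON data (hypothetical helper)."""
    if not keyed:
        # Array of arrays: with the default header options, row 1 is the header.
        rows = [list(row) for row in data]
        if not rows:
            return [], []
        return rows[0], rows[1:]
    if keys is None:
        # Assumption: without explicit keys, use order of first appearance.
        keys = []
        for record in data:
            for key in record:
                if key not in keys:
                    keys.append(key)
    # Missing keys in a record become None.
    return list(keys), [[record.get(key) for key in keys] for record in data]


plain = json.loads('[["id", "name"], [1, "a"]]')
records = json.loads('[{"id": 1, "name": "a"}, {"name": "b", "id": 2}]')

header_plain, rows_plain = read_json_table(plain)
header_keyed, rows_keyed = read_json_table(records, keyed=True, keys=["name", "id"])
# header_keyed == ["name", "id"], rows_keyed == [["a", 1], ["b", 2]]
```

Passing keys explicitly fixes the column order, which an array of objects does not guarantee on its own.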

In conclusion, the idea is:

  • CSV is not the only tabular format; let’s describe others, most importantly Excel
  • have one hierarchical spec that helps standardize the dialects of different formats
  • new formats and properties should be added gradually, based on user demand

Issue Analytics

  • State: open
  • Created 3 years ago
  • Reactions: 3
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

1 reaction
nichtich commented, Dec 18, 2022

Json Table Dialect requires further discussion as many ways exist to encode tabular data in JSON.

  1. array of arrays
  2. array of objects (aka keyed), so the order of columns is unknown
  3. object with property for rows and property for header

See https://www.w3.org/TR/csv2json/ (CSVW) for an example of a specification that supports 1 (simple) and 3 (slightly complicated). A simplified form of this uses an object with property rows for rows and property columns with an array of objects, each having property label at least.

So keyed is probably fine but keys is more complex.

Moreover, cells in JSON Tables and Excel Tables can have data types other than plain strings.

  • Excel: text, number, logical, error.
  • JSON: any JSON data type except object and possibly array (string, number, boolean, null)

Data types can be defined per column, as done in CSVW but in a less complex way (e.g. only string, number, logical).
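As a small illustration of the JSON cell types listed above, the following sketch classifies a parsed JSON value and rejects objects and arrays. The function name `json_cell_type` and the returned type labels are hypothetical, and rejecting arrays is one reading of "possibly array", not a settled rule.

```python
import json


def json_cell_type(value):
    """Classify a parsed JSON cell value (hypothetical helper)."""
    if value is None:
        return "null"
    # Check bool before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    # Objects (dict) and arrays (list) are treated as invalid cells here.
    raise TypeError(f"unsupported cell value: {value!r}")


cells = json.loads('["a", 1, 2.5, true, null]')
types = [json_cell_type(cell) for cell in cells]
# types == ["string", "number", "number", "boolean", "null"]
```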

0 reactions
roll commented, Dec 23, 2022

Thanks, @nichtich!

I think it should not be a blocker: in specs like this we have the privilege of starting from a small core and extending it once other properties are discussed and justified.
