Adding a field inferencing "subrule", otherwise very often I would have wrong extracted values


Hi, I work daily with data in which some columns contain only numeric values in their cells but are actually string fields.

A typical case is administrative codes for regions, provinces and cities (the codice_regione, codice_provincia and codice_comune fields in the table below). In Italy, the city of Losine has the code 0870; it is a string and must remain a string, otherwise I lose the ability to JOIN this data with other data.

| comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
| --- | --- | --- | --- | --- | --- | --- |
| 030151360 | 03 | 015 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
| 030120530 | 03 | 012 | 0530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
| 020040580 | 02 | 004 | 0580 | SAINT-DENIS | AO | 23/09/2020 |
| 030150870 | 03 | 015 | 0870 | LOSINE | BS | 04/10/2021 |
| 190480090 | 19 | 048 | 0090 | CAPO D’ORLANDO | ME | 27/10/2021 |

But if I run extract on it

frictionless extract "https://gist.githubusercontent.com/aborruso/076e5fad847b658a535b16cbcf3abdfd/raw/887edc88ab75b01d15ea1ac4bb052ffaf8d5ef9c/tmp.csv" --csv

the codice_regione, codice_provincia and codice_comune fields are no longer strings: they are all numbers, and the cell values have changed (for example, the code of Losine becomes 870).

| comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
| --- | --- | --- | --- | --- | --- | --- |
| 30151360 | 3 | 15 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
| 30120530 | 3 | 12 | 530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
| 20040580 | 2 | 4 | 580 | SAINT-DENIS | AO | 23/09/2020 |
| 30150870 | 3 | 15 | 870 | LOSINE | BS | 04/10/2021 |
| 190480090 | 19 | 48 | 90 | CAPO D’ORLANDO | ME | 27/10/2021 |
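
To illustrate why the coerced values break a JOIN, here is a minimal sketch in plain Python (no frictionless involved). The lookup table and the `join_names` helper are hypothetical; the codes and names are taken from the tables above.

```python
# Reference data keyed by the original string codes (codice_comune).
names_by_code = {
    "1360": "POLPENAZZE DEL GARDA",
    "0580": "SAINT-DENIS",
    "0870": "LOSINE",
}

# The same codes after they were inferred as integers.
extracted_codes = [1360, 580, 870]


def join_names(codes, lookup):
    """Return the matching name for each code, or None when the key is missing."""
    return [lookup.get(str(code)) for code in codes]


# Only 1360 still matches; 580 and 870 no longer equal "0580" and "0870".
naive = join_names(extracted_codes, names_by_code)

# Re-padding recovers the keys, but only if the fixed width (4 here) is known.
repaired = join_names([str(code).zfill(4) for code in extracted_codes], names_by_code)
```

Re-padding with `zfill` is a workaround rather than a fix: it assumes every code in the column has the same known width, which the extracted data alone cannot tell you.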

I know octal numbers exist, but in my experience, in 99% of cases fields whose cells start with a zero contain string codes, not octal numbers.

If possible, I would add a subrule for all fields inferred as numbers: if any cell starts with a zero that is not followed by a , or ., infer the field as a string rather than a number.
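
The subrule could look roughly like the sketch below. This is not frictionless code: `infer_field_type`, the two regular expressions and the returned type names are assumptions made for illustration only.

```python
import re

# Cells that look numeric: optional sign, digits, optional "." or "," fraction.
NUMERIC = re.compile(r"[+-]?\d+(?:[.,]\d+)?")
# A leading zero followed by a character other than "." or ",".
LEADING_ZERO = re.compile(r"0[^.,]")


def infer_field_type(cells):
    """Guess a field type from its cells, applying the leading-zero subrule."""
    values = [cell.strip() for cell in cells if cell.strip()]
    # Anything that is not entirely numeric-looking is a string anyway.
    if not values or not all(NUMERIC.fullmatch(value) for value in values):
        return "string"
    # Subrule: one leading-zero cell is enough to keep the whole field a string.
    if any(LEADING_ZERO.match(value) for value in values):
        return "string"
    if all(value.lstrip("+-").isdigit() for value in values):
        return "integer"
    return "number"


codes = infer_field_type(["1360", "0530", "0580", "0870", "0090"])
plain = infer_field_type(["13", "12", "4"])
decimals = infer_field_type(["0.5", "1.25"])
```

With this rule a bare `0` and decimals such as `0.5` or `0,5` stay numeric, while a single cell like `0530` is enough to keep the whole column as a string.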

Thank you.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

roll commented, Aug 29, 2022 (2 reactions)

@shashigharti I think we need to return None if it starts with ‘0’ here:

I guess it’s the right solution, as leading zeros can be really confusing; e.g. Python doesn’t allow them:

SyntaxError: leading zeros in decimal integer literals are not permitted; use an 0o prefix for octal integers
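
A small sketch of both points, assuming a hypothetical `parse_integer_cell` helper (it is not the actual frictionless parser): `int()` silently drops the leading zero, the Python parser rejects the same text as a literal, and returning None for leading-zero cells lets the caller fall back to another type.

```python
import ast


def parse_integer_cell(cell):
    """Return an int, or None when the cell should not be read as an integer."""
    text = cell.strip()
    # Leading zero on a multi-character cell: refuse to treat it as an integer.
    if len(text) > 1 and text.startswith("0"):
        return None
    try:
        return int(text)
    except ValueError:
        return None


# int() accepts the text and loses the zero.
lossy = int("0870")

# The Python parser rejects the same text when it is read as a literal.
try:
    ast.literal_eval("0870")
    literal_accepted = True
except SyntaxError:
    literal_accepted = False
```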

aborruso commented, Nov 1, 2022 (1 reaction)

Hi @roll, it seems this problem is still present in the version 5 release.

If I run

frictionless validate --buffer-size 25000000 "https://gist.githubusercontent.com/aborruso/c99d74e88d0f8037219c958a40fe744c/raw/6d8b602f5692306bbc16ee6b05210363be911c63/tmp.csv"

I get a wrong type error. The second column contains a lot of 01 values (and so on); it’s a string column, but it’s mapped as integer.

I’m using 5.0.0b10.

Thank you
