Add a field inference "subrule", otherwise I very often get wrong extracted values
Hi, every day I work with data where some columns contain only numerical values in the cells but are actually string fields.
A typical case is administrative codes for regions, provinces and cities (the `codice_regione`, `codice_provincia` and `codice_comune` fields in the table below). In Italy, the city of Losine has the code `0870`; it is obviously a string and must remain a string, otherwise I lose the ability to JOIN this data with other data.
comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
---|---|---|---|---|---|---|
030151360 | 03 | 015 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
030120530 | 03 | 012 | 0530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
020040580 | 02 | 004 | 0580 | SAINT-DENIS | AO | 23/09/2020 |
030150870 | 03 | 015 | 0870 | LOSINE | BS | 04/10/2021 |
190480090 | 19 | 048 | 0090 | CAPO D’ORLANDO | ME | 27/10/2021 |
But if I run `extract` on it:
frictionless extract "https://gist.githubusercontent.com/aborruso/076e5fad847b658a535b16cbcf3abdfd/raw/887edc88ab75b01d15ea1ac4bb052ffaf8d5ef9c/tmp.csv" --csv
the `codice_regione`, `codice_provincia` and `codice_comune` fields are no longer strings: they are all numbers and the cell values have changed (for example, the code of Losine becomes `870`).
comune | codice_regione | codice_provincia | codice_comune | denominazione_comune | sigla_provincia | data_entrata_in_carica |
---|---|---|---|---|---|---|
30151360 | 3 | 15 | 1360 | POLPENAZZE DEL GARDA | BS | 13/10/2021 |
30120530 | 3 | 12 | 530 | CAROBBIO DEGLI ANGELI | BG | 04/10/2021 |
20040580 | 2 | 4 | 580 | SAINT-DENIS | AO | 23/09/2020 |
30150870 | 3 | 15 | 870 | LOSINE | BS | 04/10/2021 |
190480090 | 19 | 48 | 90 | CAPO D’ORLANDO | ME | 27/10/2021 |
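
To make the JOIN problem concrete, here is a minimal sketch in plain Python (not frictionless code). The lookup table is built from the rows above, and `coerce_like_integer_inference` is a hypothetical helper that only mimics what happens to a cell once the column is treated as an integer.

```python
# Lookup keyed by the original string codes, taken from the table above.
names_by_code = {
    "1360": "POLPENAZZE DEL GARDA",
    "0530": "CAROBBIO DEGLI ANGELI",
    "0870": "LOSINE",
}

def coerce_like_integer_inference(cell: str) -> str:
    """Hypothetical helper: parse as integer, then write back as text."""
    return str(int(cell))

original_codes = ["1360", "0530", "0870"]
coerced_codes = [coerce_like_integer_inference(code) for code in original_codes]

# Codes that still find a partner in the lookup table.
matched_original = [code for code in original_codes if code in names_by_code]
matched_coerced = [code for code in coerced_codes if code in names_by_code]

print(coerced_codes)
print(matched_original)
print(matched_coerced)
```

Only the code without a leading zero still matches after coercion, which is exactly the case where the JOIN silently drops rows.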
I know, octal numbers exist, but in 99% of cases (I’m talking about my experience), if I have fields with cells starting with zero they are not octal numbers, but string codes.
If it were possible, I would add a subrule for all fields mapped as numbers: if there are cells that start with a zero not followed by a `,` or `.`, set the inferred field type to `string` and not `number`.
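
The subrule described above could be sketched roughly as follows. This is plain Python and not the frictionless implementation: `infer_field_type` and `starts_with_code_like_zero` are hypothetical names, and treating a bare `0` as a normal number is my own assumption, since the rule only talks about a zero followed by something other than `,` or `.`.

```python
import re

# A zero followed by any character other than "," or "." (so "0.5" and "0,5"
# stay numeric, and a bare "0" does not match because nothing follows it).
LEADING_ZERO = re.compile(r"0[^.,]")
INTEGER = re.compile(r"[+-]?\d+")

def starts_with_code_like_zero(cell: str) -> bool:
    """Hypothetical check for the proposed subrule on a single cell."""
    return LEADING_ZERO.match(cell) is not None

def infer_field_type(cells: list[str]) -> str:
    """Hypothetical, simplified inference: returns "string" or "integer"."""
    values = [cell.strip() for cell in cells if cell.strip()]
    if not values:
        return "string"
    # Proposed subrule: one code-like cell makes the whole field a string.
    if any(starts_with_code_like_zero(value) for value in values):
        return "string"
    if all(INTEGER.fullmatch(value) for value in values):
        return "integer"
    return "string"

print(infer_field_type(["03", "03", "02", "19"]))   # codice_regione
print(infer_field_type(["1360", "0530", "0870"]))   # codice_comune
print(infer_field_type(["1360", "90"]))             # no leading zeros
```

Checking every sampled cell, and not just the first one, matters here: `codice_comune` starts with `1360`, which looks like an ordinary integer on its own.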
Thank you.
Top GitHub Comments
@shashigharti I think we need to return None if the value starts with ‘0’ here:
I guess it’s the right solution as leading zeros might be really confusing e.g. Python doesn’t allow them:
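
A minimal check of that claim, assuming Python 3: a decimal integer literal with a leading zero is rejected by the parser, while `int()` applied to the same text silently drops the zero. `compiles_as_expression` is a hypothetical helper written only for this illustration.

```python
def compiles_as_expression(source: str) -> bool:
    """Return True if Python 3 accepts `source` as an expression."""
    try:
        compile(source, "<literal>", "eval")
    except SyntaxError:
        return False
    return True

print(compiles_as_expression("870"))    # plain decimal literal
print(compiles_as_expression("0870"))   # leading zero in a decimal literal
print(compiles_as_expression("0o17"))   # explicit octal prefix
print(int("0870"))                      # string parsing drops the zero
```

So the language itself refuses the ambiguous form in source code, but parsing from text does not, which is where the inference goes wrong.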
Hi @roll, it seems this problem is still present in the v5 release.
If I run
I get a wrong type error. The second column contains a lot of `01` values (and so on); it’s a string column, but it’s mapped as integer. I’m using 5.0.0b10.
Thank you