Handle uppercase letters in imported data column names
Column names in imported data are normalized as lower-case identifiers by removing upper case letters and other characters, then adding numerical suffixes if necessary to avoid name collisions. For example, when importing a table with columns `ID_X`, `ID_Y`, `IC_gral`, they get renamed as `_`, `_1`, `_gral`.
It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break existing use cases.
For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).
In any case there are some questions I think we should consider for discussion:
- Should we change this behaviour globally (and hope we don’t break too many cases)?
- Are uppercase letters common in BQ column names? (same for other data sources)
cc @alejandrohall @alonsogarciapablo @ilbambino
Note that a table with all columns in uppercase would be imported as `_`, `_1`, `_2`, `_3`, ... which is very unfortunate.
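To make the renaming concrete, the following Ruby sketch reproduces the behaviour described above next to the lowercase-preserving alternative. The helper names (`unique_name`, `strip_uppercase`, `downcase_first`) and the exact collision-suffix rule are assumptions chosen to match the examples in this issue; this is not the actual CartoDB code.

```ruby
# Hypothetical helpers that only illustrate the two behaviours discussed here.

# Returns base, or base plus a numeric suffix if base is already taken.
def unique_name(base, used)
  candidate = base
  suffix = 0
  while used.include?(candidate)
    suffix += 1
    candidate = "#{base}#{suffix}"
  end
  used << candidate
  candidate
end

# Current behaviour as described: uppercase letters and other characters are dropped.
def strip_uppercase(names)
  used = []
  names.map do |name|
    base = name.gsub(/[^a-z0-9_]/, '')
    base = '_' if base.empty?
    unique_name(base, used)
  end
end

# Proposed behaviour: downcase first, so uppercase letters are preserved as lowercase.
def downcase_first(names)
  used = []
  names.map do |name|
    base = name.downcase.gsub(/[^a-z0-9_]/, '')
    base = '_' if base.empty?
    unique_name(base, used)
  end
end

columns = %w[ID_X ID_Y IC_gral]
p strip_uppercase(columns) # => ["_", "_1", "_gral"]
p downcase_first(columns)  # => ["id_x", "id_y", "ic_gral"]
```

With the first variant, a table whose columns are all uppercase collapses to `_`, `_1`, `_2`, and so on, which is the case the issue calls out.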
Top GitHub Comments
I was wrong about `Column` using `StringSanitizer` (there’s a method `#sanitize` but it is not used; `Column` is used by `Georeferencer`, but not the sanitize method), and the whole sanitization picture is more complicated in our current database, but I’ve taken a deep look into it and have survived to tell you about it.
The problem
Imported and synced tables have `Table.sanitize_columns` called on the new tables, which eliminates uppercase and special letters.
For file imports this has no effect because `ogr2ogr` has already converted column names to downcase.
For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names.
The detailed process:
- `DataImport#run` (dispatch/new_importer/new_importer_with_connector etc) uses a `TableRegistrar` object to create a Table (setting `migrate_existing_table`; the presence of it in `Table#before_create` makes it call `Table.sanitize_columns`).
- `Synchronization::Adapter#run` explicitly calls `Table.sanitize_columns` twice (before cartodbfication through `#import_cleanup` and after through `#setup_table`).
- Manually created tables are registered without `migrate_existing_table`, so no sanitization.
- Table copy and query imports call `Table.sanitize_columns` explicitly (from `DataImport#from_table`).
Proposed Versioning Implementation
Let’s assume we already have some method that applies normalization for any version that will replace `Table.sanitize_columns`. (I’m using normalization here but we may end up calling it `sanitization` or some other nonsense.)
In all cases except for manually created tables (i.e. all cases that sanitize columns) the UserTable register has a non NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the `extra_options` JSON field) so that if the table is synchronized, the same normalization can be applied to newly imported data.
Normalization can be applied through a Table instance method:
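A minimal sketch of how a versioned normalization and the `Table` instance method could fit together follows. `ColumnNormalization`, `normalize_columns` and `extra_options` are names used in this thread; everything else (the version numbers, the `column_normalization_version` key, the `FakeDataImport` and `FakeTable` stand-ins, and treating a missing version as the legacy behaviour) is an assumption for illustration. Name collisions are not handled here.

```ruby
# Hypothetical sketch, not the CartoDB implementation.
module ColumnNormalization
  LEGACY_VERSION = 1   # drops uppercase letters and special characters
  CURRENT_VERSION = 2  # downcases first, so uppercase letters survive

  def self.normalize(name, version)
    source = version >= CURRENT_VERSION ? name.downcase : name
    cleaned = source.gsub(/[^a-z0-9_]/, '')
    cleaned.empty? ? '_' : cleaned
  end
end

# Stand-in for the DataImport record; extra_options mimics the JSON field.
class FakeDataImport
  attr_reader :extra_options

  def initialize(extra_options = {})
    @extra_options = extra_options
  end
end

# Stand-in for Table, holding column names in memory instead of in a database.
class FakeTable
  attr_reader :columns

  def initialize(columns, data_import = nil)
    @columns = columns
    @data_import = data_import
  end

  # Version recorded at import time; assumed to default to the legacy one.
  def normalization_version
    options = @data_import ? @data_import.extra_options : {}
    options.fetch('column_normalization_version', ColumnNormalization::LEGACY_VERSION)
  end

  # Applies the recorded version to every column name.
  def normalize_columns
    version = normalization_version
    @columns = @columns.map { |name| ColumnNormalization.normalize(name, version) }
  end
end

old_import = FakeDataImport.new
new_import = FakeDataImport.new('column_normalization_version' => 2)
p FakeTable.new(%w[ID_X name], old_import).normalize_columns # => ["_", "name"]
p FakeTable.new(%w[ID_X name], new_import).normalize_columns # => ["id_x", "name"]
```

Storing the version next to the import means a later synchronization can look it up and produce the same column names as the first import did.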
Now, `Table#import_to_cartodb` (called on Table creation for imported tables) should call the new instance method (`self.normalize_columns`) without arguments instead of `Table.sanitize_columns`.
Synchronizations (`Adapter`) must call `table.normalize(result.table_name)` with `table = ::Table.new(name: @table_name, user_id: @user.id)` or `@user.tables.where(name: @table_name)` [the latter produces a `UserTable`, not a `Table`].
We’re doing an additional query to get the Table here, which is also done in other places like `TableSetup`. In the `Synchronization::Member` class this operation is available and memoized, but not in `Adapter`. We could refactor to avoid multiple queries for this. (Note that putting the data import or version reference in the synchronizations table would not save doing the query at least once.)
Then we should make `Table#add_column!` and `#modify_column!` consistent with the new normalization and change them to call `#normalize_columns` on self instead of the String methods used now.
Internally the normalization implementation used in `ColumnNormalization` could use the String methods, or, better, `StringSanitizer`, which should probably replace all use of the String methods and perhaps be integrated in `DB::Sanitize`.
The idea is having a single source for (low level) sanitization (e.g. `DB::Sanitize`), then versioned sanitization provided through `Table`.
Note normalization should also handle reserved words/column names, which is done inconsistently now.
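As a rough illustration of the single-source idea with reserved names handled in one place, here is a minimal sketch. The module name `ColumnSanitizer`, its short word list and the underscore-prefix rule are assumptions for illustration only; the real lists live in the `RESERVED_WORDS` and `RESERVED_COLUMN_NAMES` constants discussed in this thread.

```ruby
# Hypothetical single entry point for low-level column-name sanitization.
module ColumnSanitizer
  # Illustrative list only; not the actual CartoDB constants.
  RESERVED = %w[select table order].freeze

  def self.sanitize(name)
    # Downcase, then replace anything outside [a-z0-9_] with an underscore.
    clean = name.downcase.gsub(/[^a-z0-9_]/, '_')
    # Identifiers should not be empty or start with a digit.
    clean = "_#{clean}" if clean.empty? || clean.match?(/\A[0-9]/)
    # Reserved names get an underscore prefix.
    clean = "_#{clean}" if RESERVED.include?(clean)
    clean
  end
end

p ColumnSanitizer.sanitize('ID_X')  # => "id_x"
p ColumnSanitizer.sanitize('Order') # => "_order"
p ColumnSanitizer.sanitize('2020')  # => "_2020"
```

If every caller (imports, syncs, `add_column!`, `modify_column!`) went through one function like this, the reserved-name handling could not drift between code paths.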
Details
The method `Table.sanitize_columns` is used in:
- `DataImport#from_table`, in turn from `DataImport#dispatch` (for table copy & query imports) and passing Column::RESERVED_WORDS (!). This doesn’t happen in syncs, nor user (API) imports, only in internal imports. In any case we have the DataImport to obtain the correct version.
- `Table#import_to_cartodb`, from `Table#before_create` (i.e. for all registered tables) and also from `Synchronization::Adapter#setup_table`. When called from `setup_table` we should get the version from the DataImport corresponding to the Sync, but it should match the table’s DataImport… so… When called from Table creation, we should have assigned a version previously to the table’s data import if it exists… [check how DataImport creates Table, to see if DI is available at construction]
Now `Synchronization::Adapter#run` uses `Table.sanitize_columns` twice:
- `Synchronization::Adapter#import_cleanup`
- `Synchronization::Adapter#setup_table`
Imports call it through the Table constructor (`before_create` with `migrate_existing_table` option set by `TableRegistrar`) (if the import uses ogr2ogr, i.e. is file-based, the table will already be downcase).
Note that we have foreign keys `user_tables.data_import_id` and `data_import.table_id` (so `Table <-> DataImport` navigation is possible), but we have only `data_imports.synchronization_id`, which is not indexed, so `Synchronization::Member -> DataImport` is not possible efficiently.
We have `Synchronization::Member#table` that finds the table by user and name, but there’s no such thing in the `Adapter` (the importer). The Adapter is instantiated with a table_name: the name of the synchronization (which is the table name); then it uses result.table_name (temp table) to normalize columns, cartodbfy, etc. So we can’t rely on Table doing the normalization with previously registered info, since we must normalize the imported table (unregistered) with info from the sync name table (registered).
Note that syncs can’t handle multiple files: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/app/models/synchronization/adapter.rb#L29-L30
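The constraint described above (normalize the unregistered temp table using information from the registered table that carries the sync name) can be sketched as follows. All class and method names here (`RegisteredTable`, `SyncAdapterSketch`, `normalize_imported`, `normalize_name`) and the in-memory registry are hypothetical stand-ins, and name collisions are ignored.

```ruby
# Hypothetical stand-in for the registered table and its recorded version.
RegisteredTable = Struct.new(:name, :normalization_version)

# Version 1 drops characters outside [a-z0-9_]; version 2 downcases first.
def normalize_name(name, version)
  source = version >= 2 ? name.downcase : name
  cleaned = source.gsub(/[^a-z0-9_]/, '')
  cleaned.empty? ? '_' : cleaned
end

class SyncAdapterSketch
  # registry is a Hash of table name => RegisteredTable, standing in for the database.
  def initialize(sync_table_name, registry)
    @sync_table_name = sync_table_name
    @registry = registry
  end

  # Memoized lookup, so the registered table is fetched only once per adapter.
  def registered_table
    @registered_table ||= @registry.fetch(@sync_table_name)
  end

  # Normalizes the temp table's columns with the registered table's version.
  def normalize_imported(temp_columns)
    version = registered_table.normalization_version
    temp_columns.map { |name| normalize_name(name, version) }
  end
end

registry = { 'my_sync' => RegisteredTable.new('my_sync', 1) }
adapter = SyncAdapterSketch.new('my_sync', registry)
p adapter.normalize_imported(%w[ID_X IC_gral]) # => ["_", "_gral"]
```

Because the version comes from the registered table, a sync of a table first imported under the legacy rules keeps producing the legacy column names.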
Sanitation Bonanza
There’s a plethora of other sanitization methods in addition to `Table.sanitize_columns`:
- `String#sanitize_column_name`: It uses `DB::Sanitize::RESERVED_WORDS` and `CartoDB::RESERVED_COLUMN_NAMES`. It calls `String#sanitize` -> `String#normalize`. It used to be called by `Migrator`, but that is currently disabled: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/lib/importer/lib/cartodb-migrator/migrator.rb#L44-L46
- `Table#add_column!` and `#modify_column!` use `String#sanitize`, and use `Table::RESERVED_COLUMN_NAMES` to rename reserved names to _xxx. Also `Table#create_table_in_database!` calls String#sanitize, as do some rake tasks, etc.
- `DB::Sanitize#sanitize_identifier`: It is used for table names only, not columns. Called by `APiKey#create_db_config`, `FDW#server_name`, `ValidTableNameProposer#propose_valid_table_name`, `ConnectorRunner#result_table_name`, … Uses `DB::Sanitize::RESERVED_WORDS`.
- `StringSanitizer#sanitize`: Is used by `Column#sanitized_name` (uses `Column::RESERVED_WORDS`). Used by `Column#sanitize`, which seems unused.
Other Redundancies
Two `RESERVED_WORDS`:
- `DB::Sanitize`, used by `String#sanitize_column_name`
- `Column`, used by `DataImport#from_table` and `Column#sanitized_name`
Two `RESERVED_COLUMN_NAMES`:
- `CartoDB`, used by `String#sanitize_column_name`
- `Table`, used by `Table#add_column!`, `Table#modify_column!` (the latter through `rename_column`)
Close via #15326