Handle uppercase letters in imported data column names

Column names in imported data are normalized to lowercase identifiers by removing uppercase letters and other disallowed characters, then adding numerical suffixes where necessary to avoid name collisions. For example, when importing a table with columns ID_X, ID_Y, IC_gral, they are renamed to _, _1, _gral.
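
The renaming described above can be reproduced with a small sketch. This is not the actual Table.sanitize_columns code: the helper name `strip_normalize`, the fallback to `_` for empty results, and the suffix scheme are assumptions chosen only to match the examples given in this issue.

```ruby
# Hypothetical illustration of the behaviour described in this issue;
# this is NOT the real Table.sanitize_columns implementation.
def strip_normalize(names)
  taken = []
  names.map do |name|
    # Drop everything that is not a lowercase letter, digit or underscore.
    base = name.gsub(/[^a-z0-9_]/, '')
    base = '_' if base.empty? # assumption: empty results fall back to "_"
    candidate = base
    suffix = 0
    # Assumption: collisions are resolved by appending 1, 2, 3, ...
    while taken.include?(candidate)
      suffix += 1
      candidate = "#{base}#{suffix}"
    end
    taken << candidate
    candidate
  end
end

p strip_normalize(%w[ID_X ID_Y IC_gral]) # => ["_", "_1", "_gral"]
p strip_normalize(%w[A B C D])           # => ["_", "_1", "_2", "_3"]
```

The second call shows why a table with all-uppercase column names ends up with names that carry no information at all.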

It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break existing use cases.

For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).

In any case there are some questions I think we should consider for discussion:

Should we change this behaviour globally (and hope we don’t break too many cases)?
Are uppercase letters common in BQ column names? (same for other data sources)

cc @alejandrohall @alonsogarciapablo @ilbambino

Note that a table with all columns in uppercase would be imported as _, _1, _2, _3, ... which is very unfortunate.

Top GitHub Comments

jgoizueta commented, Dec 18, 2019

I was wrong about Column using StringSanitizer (there’s a #sanitize method, but it is not used; Column is used by Georeferencer, but not its sanitize method), and the whole sanitization picture is more complicated in our current database, but I’ve taken a deep look into it and have survived to tell you about it.

The problem

Imported and synced tables have Table.sanitize_columns called on the new tables, which eliminates uppercase letters and special characters.

For file imports this has no effect because ogr2ogr has already downcased the column names.

For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names.

The detailed process:

Imported tables: DataImport#run (dispatch/new_importer/new_importer_with_connector etc.) uses a TableRegistrar object to create a Table (setting migrate_existing_table; its presence makes Table#before_create call Table.sanitize_columns).
Sync tables: no new UserTable record is created, so before_create is not called, but Synchronization::Adapter#run explicitly calls Table.sanitize_columns twice (before cartodbfication through #import_cleanup and after it through #setup_table).
Manually cartodbfied tables are registered by GTM; the table is created without migrate_existing_table, so there is no sanitization.
Special imports (table_copy, from_query) call Table.sanitize_columns explicitly (from DataImport#from_table).

Proposed Versioning Implementation

Let’s assume we already have some method that applies normalization for any version that will replace Table.sanitize_columns: (I’m using normalization here but we may end up calling it sanitization or some other nonsense)

module ColumnNormalization
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1
  def self.normalize_columns(table_name, version=CURRENT_NORMALIZATION)
    # ...
  end
end
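
As a rough sketch of what a versioned implementation could look like, the following works on an array of column names rather than on a database table, so it can run standalone. The module name `ColumnNormalizationSketch`, the method `normalize_names`, the two rules and the collision suffix scheme are all assumptions made for illustration; only the two version constants come from the proposal above.

```ruby
# Hypothetical sketch: operates on an array of names, not on a real table.
module ColumnNormalizationSketch
  INITIAL_NORMALIZATION = 0 # legacy behaviour: uppercase letters are stripped
  CURRENT_NORMALIZATION = 1 # proposed behaviour: downcase first

  # One rule per version; each maps a raw name to a candidate identifier.
  RULES = {
    INITIAL_NORMALIZATION => ->(name) { name.gsub(/[^a-z0-9_]/, '') },
    CURRENT_NORMALIZATION => ->(name) { name.downcase.gsub(/[^a-z0-9_]/, '') }
  }.freeze

  def self.normalize_names(names, version = CURRENT_NORMALIZATION)
    rule = RULES.fetch(version) { raise ArgumentError, "unknown version #{version}" }
    taken = []
    names.map do |name|
      base = rule.call(name)
      base = '_' if base.empty? # assumption: never produce an empty name
      candidate = base
      suffix = 0
      # Assumption: collisions get a numeric suffix (1, 2, 3, ...).
      while taken.include?(candidate)
        suffix += 1
        candidate = "#{base}#{suffix}"
      end
      taken << candidate
      candidate
    end
  end
end

columns = %w[ID_X ID_Y IC_gral]
p ColumnNormalizationSketch.normalize_names(columns, 0) # => ["_", "_1", "_gral"]
p ColumnNormalizationSketch.normalize_names(columns)    # => ["id_x", "id_y", "ic_gral"]
```

Keeping each version as an entry in a lookup table means old versions stay frozen while new ones are added, which is what lets existing syncs keep producing the same column names.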

In all cases except for manually created tables (i.e. all cases that sanitize columns) the UserTable record has a non-NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field) so that if the table is synchronized, the same normalization can be applied to newly imported data.

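class DataImport
  # Both reader and writer are needed: before_create reads the value it may have just set.
  attr_accessor :normalization
  def before_create
    # ... (keep existing before create stuff here)
    self.normalization ||= ColumnNormalization::CURRENT_NORMALIZATION
    self.extra_options.merge! normalization: normalization
  end
  def applied_normalization
    # Imports created before versioning have no stored value, so use the initial one.
    extra_options[:normalization] || ColumnNormalization::INITIAL_NORMALIZATION
  end
end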
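
The fallback for imports that predate versioning can be checked in isolation with a stand-in class. Everything here is hypothetical: `DataImportSketch`, `record_normalization` and `reload` do not exist in the codebase, and whether the real extra_options field comes back with string or symbol keys is not confirmed here. The sketch uses string keys because that is what a plain JSON round trip yields.

```ruby
require 'json'

# Hypothetical stand-in for DataImport; the real model is database-backed.
class DataImportSketch
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1

  attr_reader :extra_options

  def initialize(extra_options = {})
    @extra_options = extra_options
  end

  # Mimics the before_create hook: record the version used for this import.
  def record_normalization(version = CURRENT_NORMALIZATION)
    @extra_options = extra_options.merge('normalization' => version)
  end

  # Imports created before versioning have no key and fall back to the
  # initial (legacy) normalization.
  def applied_normalization
    extra_options.fetch('normalization', INITIAL_NORMALIZATION)
  end

  # Round-trip through JSON to imitate persistence; keys come back as strings.
  def self.reload(import)
    new(JSON.parse(JSON.generate(import.extra_options)))
  end
end

legacy = DataImportSketch.new
fresh = DataImportSketch.new
fresh.record_normalization

p legacy.applied_normalization                          # => 0
p DataImportSketch.reload(fresh).applied_normalization  # => 1
```

If the stored hash is read back with string keys, a lookup by symbol would silently return nil and every import would look like a legacy one, so the key type is worth verifying against the real model.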

Normalization can be applied through a Table instance method:

class Table
  # Use without arguments to normalize the registered table,
  # use with a different table name to apply the same normalization
  # to other (unregistered) table
  def normalize_columns(table_name=name)
    if data_import
      version = data_import.applied_normalization
      ColumnNormalization.normalize_columns(table_name, version)
    end
  end
end

Now, Table#import_to_cartodb (called on Table creation for imported tables) should call the new instance method (self.normalize_columns) without arguments instead of Table.sanitize_columns.

Synchronizations (Adapter) must call table.normalize_columns(result.table_name) with table = ::Table.new(name: @table_name, user_id: @user.id) ~~or @user.tables.where(name: @table_name)~~ [the latter produces a UserTable, not a Table].
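
The sync case, where the registered table supplies the version but a different (temporary) table gets normalized, can be sketched with stubs. `ColumnNormalizationStub`, `DataImportStub`, `TableSketch` and the table names are hypothetical and exist only so the flow runs without the CartoDB models.

```ruby
# Hypothetical stub that records calls instead of touching a database.
module ColumnNormalizationStub
  def self.calls
    @calls ||= []
  end

  def self.normalize_columns(table_name, version)
    calls << [table_name, version]
  end
end

# Hypothetical stand-in exposing only the method the table needs.
DataImportStub = Struct.new(:applied_normalization)

class TableSketch
  attr_reader :name, :data_import

  def initialize(name:, data_import: nil)
    @name = name
    @data_import = data_import
  end

  # Same shape as the proposed Table#normalize_columns: the version comes
  # from the registered table's data import, the target name may differ.
  def normalize_columns(table_name = name)
    return unless data_import

    ColumnNormalizationStub.normalize_columns(table_name, data_import.applied_normalization)
  end
end

# Sync: the registered table was imported with version 0, so the freshly
# imported temporary table must be normalized with version 0 as well.
registered = TableSketch.new(name: 'cities', data_import: DataImportStub.new(0))
registered.normalize_columns('importer_temp_123')

# Manually created table: no data import, so nothing is normalized.
TableSketch.new(name: 'handmade').normalize_columns

p ColumnNormalizationStub.calls # => [["importer_temp_123", 0]]
```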

We’re doing an additional query to get the Table here, which is done also in other places like TableSetup. In the Synchronization::Member class this operation is available and memoized, but not in Adapter. We could refactor to avoid multiple queries for this. (note that putting the data import or version reference in the synchronizations table would not save doing the query at least once).

Then we should make Table#add_column! and #modify_column! consistent with the new normalization and change them to call #normalize_columns on self instead of the String methods used now.

Internally the normalization implementation used in ColumnNormalization could use the String methods, or, better, StringSanitizer, which should probably replace all use of the String methods and perhaps be integrated in DB::Sanitize.

The idea is to have a single source for (low-level) sanitization (e.g. DB::Sanitize), then versioned sanitization provided through Table.

Note normalization should also handle reserved words/column names which is done inconsistently now.

Unified sanitization: remove Column & String sanitization methods, move into DB::Sanitize? Remove reserved words redundancy/inconsistency
Implement versioned normalization module
Implement per data-import versioned normalization
Add tests
Add documentation

Details

The method Table.sanitize_columns is used in:

DataImport#from_table in turn from DataImport#dispatch (for table copy & query imports) and passing Column::RESERVED_WORDS (!). This doesn’t happen in syncs, nor user (API) imports, only in internal imports. In any case we have the DataImport to obtain the correct version.
Table#import_to_cartodb from Table#before_create (i.e. for all registered tables) and also from Synchronization::Adapter#setup_table. When called from setup_table we should get the version from the DataImport corresponding to the Sync, but it should match the table’s DataImport… so… When called from Table creation, we should have assigned a version previously to the table’s data import if it exists… [check how DataImport creates Table, to see if DI is available at construction]
Synchronization::Adapter#import_cleanup

Now Synchronization::Adapter#run uses Table.sanitize_columns twice:

Before cartodbification it calls Synchronization::Adapter#import_cleanup
After it, it calls Synchronization::Adapter#setup_table

Imports call it through the Table constructor (before_create with migrate_existing_table option set by TableRegistrar) (if import uses ogr2ogr, i.e. is file-based the table will already be downcase)

Note that we have foreign keys user_tables.data_import_id and data_import.table_id, (so Table <-> DataImport navigation is possible), but we have only data_imports.synchronization_id which is not indexed so Synchronization::Member -> DataImport is not possible efficiently.

We have Synchronization::Member#table that finds the table by user and name, but there’s no such thing in the Adapter (the importer). The Adapter is instantiated with a table_name: the name of the synchronization (which is the table name) then it uses result.table_name (temp table) to normalize columns, cartodbfy, etc. So we can’t rely on Table doing the normalization with previously registered info, since we must normalize the imported table (unregistered) with info from the sync name table (registered).

Note that syncs can’t handle multiple files: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/app/models/synchronization/adapter.rb#L29-L30

Sanitation Bonanza

There’s a plethora of other sanitization methods in addition to Table.sanitize_columns:

`String#sanitize_column_name`

It uses DB::Sanitize::RESERVED_WORDS and CartoDB::RESERVED_COLUMN_NAMES. It calls String#sanitize -> String#normalize. It used to be called by Migrator, but that is currently disabled: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/lib/importer/lib/cartodb-migrator/migrator.rb#L44-L46

Table#add_column! and #modify_column! use String#sanitize and use Table::RESERVED_COLUMN_NAMES to rename reserved names to _xxx. Table#create_table_in_database!, some rake tasks, etc. also call String#sanitize.
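
The `_xxx` renaming of reserved names mentioned above amounts to a prefixing step, which could be sketched as below. The list `RESERVED_NAMES_SKETCH` and the helper `escape_reserved` are hypothetical; the real constants (Table::RESERVED_COLUMN_NAMES, DB::Sanitize::RESERVED_WORDS, etc.) are not reproduced here and their contents differ.

```ruby
# Hypothetical list for illustration only; not the real reserved names.
RESERVED_NAMES_SKETCH = %w[oid xmin select].freeze

# Prefix a reserved name with an underscore, leave other names untouched.
def escape_reserved(name, reserved = RESERVED_NAMES_SKETCH)
  reserved.include?(name) ? "_#{name}" : name
end

p escape_reserved('xmin')       # => "_xmin"
p escape_reserved('population') # => "population"
```

Passing the list as a parameter makes it visible that the result depends on which reserved-word list is used, which is the inconsistency between the different sanitization methods noted in this comment.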

`DB::Sanitize#sanitize_identifier`

It is used for table names only, not columns. It is called by ApiKey#create_db_config, FDW#server_name, ValidTableNameProposer#propose_valid_table_name, ConnectorRunner#result_table_name, … It uses DB::Sanitize::RESERVED_WORDS.