Handle uppercase letters in imported data column names

Column names in imported data are normalized as lower-case identifiers by removing upper case letters and other characters, then adding numerical suffixes if necessary to avoid name collisions. For example, when importing a table with columns ID_X, ID_Y, IC_gral they get renamed as _, _1, _gral.

It would be nicer to preserve uppercase letters by converting them to lowercase, but changing this behaviour for existing import sources may break use cases that depend on the current names.
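To make the difference concrete, here is a minimal Ruby sketch of the two behaviours. The helper names (`legacy_name`, `downcased_name`, `dedupe`) are hypothetical, and the regular expression and suffix rule are assumptions that merely reproduce the examples in this issue; the actual sanitizer may differ in its details.

```ruby
# Hypothetical helpers approximating the behaviour described in this issue;
# they are not taken from the CartoDB code base.

# Current behaviour (assumed): every run of characters outside [a-z0-9_]
# collapses into a single underscore, so uppercase letters are lost.
def legacy_name(name)
  name.gsub(/[^a-z0-9_]+/, '_').squeeze('_')
end

# Proposed behaviour: downcase first, so uppercase letters survive.
def downcased_name(name)
  legacy_name(name.downcase)
end

# Adds numeric suffixes to repeated names to avoid collisions.
def dedupe(names)
  used = {}
  names.map do |name|
    candidate = name
    n = 0
    while used[candidate]
      n += 1
      candidate = name.end_with?('_') ? "#{name}#{n}" : "#{name}_#{n}"
    end
    used[candidate] = true
    candidate
  end
end

columns = %w[ID_X ID_Y IC_gral]
p dedupe(columns.map { |c| legacy_name(c) })     # => ["_", "_1", "_gral"]
p dedupe(columns.map { |c| downcased_name(c) })  # => ["id_x", "id_y", "ic_gral"]
```

With downcasing, the three columns keep distinct, recognizable names and no suffixes are needed.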

For new sources, e.g. the BQ connector, we could change it without risk of breaking anything, but it might not be easy (is the name change part of cartodbfication?).

In any case there are some questions I think we should consider for discussion:

  • Should we change this behaviour globally (and hope we don’t break too many cases)?
  • Are uppercase letters common in BQ column names? (same for other data sources)

cc @alejandrohall @alonsogarciapablo @ilbambino

Note that a table with all columns in uppercase would be imported as _, _1, _2, _3, ..., which is very unfortunate.

Issue Analytics

  • State:closed
  • Created 4 years ago
  • Comments:20 (20 by maintainers)

jgoizueta commented, Dec 18, 2019

I was wrong about Column using StringSanitizer (there’s a #sanitize method, but it is not used; Column is used by Georeferencer, but not its sanitize method), and the whole sanitization picture is more complicated in our current database, but I’ve taken a deep look into it and have survived to tell you about it.

The problem

Imported and synced tables have Table.sanitize_columns called on the new tables, which eliminates uppercase letters and special characters.

For file imports this has no effect because ogr2ogr has already converted the column names to lowercase.

For tables created manually (SQL API) and cartodbfied this is not executed when the table is registered, so they can have uppercase letters, etc. in the column names.

The detailed process:

  • Imported tables: DataImport#run (dispatch/new_importer/new_importer_with_connecto etc) uses a TableRegistrar object to create a Table (setting migrate_existing_table; the presence of it in Table#before_create makes it call Table.sanitize_columns).
  • Sync tables: no new UserTable record is created, so before_create is not called, but Synchronization::Adapter#run explicitly calls Table.sanitize_columns twice (before cartodbfication through #import_cleanup and after through #setup_table).
  • Manually cartodbfied tables are registered by GTM, table is created without migrate_existing_table, so no sanitization.
  • Special imports (table_copy, from_query) call Table.sanitize_columns explicitly (from DataImport#from_table).

Proposed Versioning Implementation

Let’s assume we already have some method that applies normalization for any version that will replace Table.sanitize_columns: (I’m using normalization here but we may end up calling it sanitization or some other nonsense)

module ColumnNormalization
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1
  def self.normalize_columns(table_name, version=CURRENT_NORMALIZATION)
    # ...
  end
end
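
One way the elided body could dispatch on the version is sketched below. To stay runnable it works on a list of column names rather than on a database table, so `ColumnNormalizationSketch`, `NORMALIZERS` and `normalize_names` are hypothetical, and the per-version rules are assumptions based on the behaviour described in this issue. Collision handling is left out.

```ruby
# Sketch only: works on a list of column names instead of a database table,
# and the per-version rules are assumptions based on this issue.
module ColumnNormalizationSketch
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1

  # One callable per version; old versions stay frozen so that existing
  # tables keep getting the column names they already have.
  NORMALIZERS = {
    INITIAL_NORMALIZATION => ->(name) { name.gsub(/[^a-z0-9_]+/, '_').squeeze('_') },
    CURRENT_NORMALIZATION => ->(name) { name.downcase.gsub(/[^a-z0-9_]+/, '_').squeeze('_') }
  }.freeze

  def self.normalize_names(names, version = CURRENT_NORMALIZATION)
    normalizer = NORMALIZERS.fetch(version) do
      raise ArgumentError, "unknown normalization version: #{version.inspect}"
    end
    names.map { |name| normalizer.call(name) }
  end
end

p ColumnNormalizationSketch.normalize_names(%w[ID_X IC_gral])     # => ["id_x", "ic_gral"]
p ColumnNormalizationSketch.normalize_names(%w[ID_X IC_gral], 0)  # => ["_", "_gral"]
```

Keeping each version as an immutable entry means a future version 2 can be added without changing what versions 0 and 1 produce.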

In all cases except for manually created tables (i.e. all cases that sanitize columns) the UserTable record has a non-NULL data_import_id. We can use the DataImport record to store the kind of normalization performed (using the extra_options JSON field) so that if the table is synchronized, the same normalization can be applied to newly imported data.

class DataImport
  # Reader and writer are both needed for the ||= below
  attr_accessor :normalization
  def before_create
    # ... (keep existing before create stuff here)
    self.normalization ||= ColumnNormalization::CURRENT_NORMALIZATION
    self.extra_options.merge! normalization: normalization
  end
  # Imports created before versioning have no stored value,
  # so they fall back to the initial normalization
  def applied_normalization
    extra_options[:normalization] || ColumnNormalization::INITIAL_NORMALIZATION
  end
end
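
The fallback for pre-existing imports can be checked with a plain-Ruby stand-in. This is only a sketch: `DataImportSketch`, `dump_options` and `load_options` are hypothetical, and persistence of the extra_options JSON field is simulated with a JSON round trip rather than the real model.

```ruby
require 'json'

# Plain-Ruby stand-in for the DataImport model; names other than those
# appearing in the issue are hypothetical.
class DataImportSketch
  INITIAL_NORMALIZATION = 0
  CURRENT_NORMALIZATION = 1

  attr_accessor :normalization
  attr_reader :extra_options

  def initialize(extra_options = {})
    @extra_options = extra_options
  end

  # Mirrors the proposed before_create hook: record the version in use.
  def before_create
    self.normalization ||= CURRENT_NORMALIZATION
    extra_options[:normalization] = normalization
  end

  # Records created before versioning have no key, so they fall back
  # to the initial normalization.
  def applied_normalization
    extra_options[:normalization] || INITIAL_NORMALIZATION
  end

  # Simulates saving the JSON field.
  def dump_options
    JSON.generate(extra_options)
  end

  # Simulates reloading; symbolize_names keeps the :normalization lookup working.
  def self.load_options(json)
    new(JSON.parse(json, symbolize_names: true))
  end
end

legacy = DataImportSketch.load_options('{}')
fresh = DataImportSketch.new
fresh.before_create
reloaded = DataImportSketch.load_options(fresh.dump_options)

p legacy.applied_normalization    # => 0
p reloaded.applied_normalization  # => 1
```

Whether the real extra_options field comes back with string or symbol keys is something to verify before relying on the `:normalization` lookup.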

Normalization can be applied through a Table instance method:

class Table
  # Use without arguments to normalize the registered table,
  # use with a different table name to apply the same normalization
  # to other (unregistered) table
  def normalize_columns(table_name=name)
    if data_import
      version = data_import.applied_normalization
      ColumnNormalization.normalize_columns(table_name, version)
    end
  end
end

Now, Table#import_to_cartodb (called on Table creation for imported tables) should call the new instance method (self.normalize_columns) without arguments instead of Table.sanitize_columns.

Synchronizations (Adapter) must call table.normalize_columns(result.table_name) with table = ::Table.new(name: @table_name, user_id: @user.id) or @user.tables.where(name: @table_name) [the latter produces a UserTable, not a Table].

We’re doing an additional query to get the Table here, which is also done in other places like TableSetup. In the Synchronization::Member class this operation is available and memoized, but not in Adapter. We could refactor to avoid multiple queries for this. (Note that putting the data import or version reference in the synchronizations table would not avoid doing the query at least once.)
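
A memoized lookup in the adapter could look roughly like the toy model below. Everything here is a stand-in: `RegisteredTable`, `AdapterSketch`, the `registry` hash and the temp table name are hypothetical, and `normalize_columns` returns a description instead of touching a database.

```ruby
# Toy stand-ins; none of these classes are the real CartoDB ones.
RegisteredTable = Struct.new(:name, :normalization_version) do
  # Applies this table's recorded version to any table name,
  # returning a description instead of touching a database.
  def normalize_columns(table_name = name)
    "normalize #{table_name} with version #{normalization_version}"
  end
end

class AdapterSketch
  attr_reader :lookups

  def initialize(table_name, registry)
    @table_name = table_name
    @registry = registry # hypothetical name => RegisteredTable lookup
    @lookups = 0
  end

  # Memoized so that import_cleanup and setup_table share one lookup.
  def table
    @table ||= begin
      @lookups += 1
      @registry.fetch(@table_name)
    end
  end

  # Both steps normalize the unregistered temp table using the version
  # recorded for the registered table.
  def import_cleanup(temp_table_name)
    table.normalize_columns(temp_table_name)
  end

  def setup_table(temp_table_name)
    table.normalize_columns(temp_table_name)
  end
end

registry = { 'cities' => RegisteredTable.new('cities', 0) }
adapter = AdapterSketch.new('cities', registry)
puts adapter.import_cleanup('importer_tmp_1') # normalize importer_tmp_1 with version 0
puts adapter.setup_table('importer_tmp_1')    # normalize importer_tmp_1 with version 0
puts adapter.lookups                          # 1
```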

Then we should make Table#add_column! and #modify_column! consistent with the new normalization and change them to call #normalize_columns on self instead of the String methods used now.

Internally the normalization implementation used in ColumnNormalization could use the String methods, or, better, StringSanitizer, which should probably replace all use of the String methods and perhaps be integrated in DB::Sanitize.

The idea is having a single source for (low level) sanitization (e.g. DB::Sanitize) then versioned sanitization provided through Table.

Note that normalization should also handle reserved words/column names, which is done inconsistently now.

  • Unified sanitization: Remove Column & String sanitization methods, move into DB::Sanitize? Remove reserved words redundancy/inconsistency
  • Implement versioned normalization module
  • Implement per data-import versioned normalization
  • Add tests
  • Add documentation

Details

The method Table.sanitize_columns is used in:

  • DataImport#from_table in turn from DataImport#dispatch (for table copy & query imports) and passing Column::RESERVED_WORDS (!). This doesn’t happen in syncs, nor user (API) imports, only in internal imports. In any case we have the DataImport to obtain the correct version.
  • Table#import_to_cartodb from Table#before_create (i.e. for all registered tables) and also from Synchronization::Adapter#setup_table. When called from setup_table we should get the version from the DataImport corresponding to the sync, but it should match the table’s DataImport… so… When called from Table creation, we should have previously assigned a version to the table’s data import if it exists… [check how DataImport creates Table, to see if DI is available at construction]
  • Synchronization::Adapter#import_cleanup

Now Synchronization::Adapter#run uses Table.sanitize_columns twice:

  • Before cartodbification it calls Synchronization::Adapter#import_cleanup
  • After it, it calls Synchronization::Adapter#setup_table

Imports call it through the Table constructor (before_create with the migrate_existing_table option set by TableRegistrar). If the import uses ogr2ogr, i.e. is file-based, the column names will already be lowercase.

Note that we have foreign keys user_tables.data_import_id and data_import.table_id, (so Table <-> DataImport navigation is possible), but we have only data_imports.synchronization_id which is not indexed so Synchronization::Member -> DataImport is not possible efficiently.

We have Synchronization::Member#table that finds the table by user and name, but there’s no such thing in the Adapter (the importer). The Adapter is instantiated with a table_name: the name of the synchronization (which is the table name) then it uses result.table_name (temp table) to normalize columns, cartodbfy, etc. So we can’t rely on Table doing the normalization with previously registered info, since we must normalize the imported table (unregistered) with info from the sync name table (registered).

Note that syncs can’t handle multiple files: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/app/models/synchronization/adapter.rb#L29-L30

Sanitation Bonanza

There’s a plethora of other sanitization methods in addition to Table.sanitize_columns:

String#sanitize_column_name

It uses DB::Sanitize::RESERVED_WORDS and CartoDB::RESERVED_COLUMN_NAMES. It calls String#sanitize -> String#normalize. It used to be called by Migrator, but that is currently disabled: https://github.com/CartoDB/cartodb/blob/9391e76933cfc4028aba3675891f32284350244d/lib/importer/lib/cartodb-migrator/migrator.rb#L44-L46

Table#add_column! and #modify_column! use String#sanitize, and use Table::RESERVED_COLUMN_NAMES to rename reserved names to _xxx. Table#create_table_in_database!, some rake tasks, etc. also call String#sanitize.

DB::Sanitize#sanitize_identifier

It is used for table names only, not columns. It is called by APiKey#create_db_config, FDW#server_name, ValidTableNameProposer#propose_valid_table_name, ConnectorRunner#result_table_name, … and uses DB::Sanitize::RESERVED_WORDS.

StringSanitizer#sanitize

It is used by Column#sanitized_name (uses Column::RESERVED_WORDS). Used by Column#sanitize, which seems unused.

Other Redundancies

Two RESERVED_WORDS:

  • In DB::Sanitize used by String#sanitize_column_name
  • In Column used by DataImport#from_table and Column#sanitized_name

Two RESERVED_COLUMN_NAMES:

  • In CartoDB used by String#sanitize_column_name
  • In Table used by Table#add_column!, Table#modify_column! (the latter through rename_column)
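
As a rough illustration of what a single source for reserved names could look like, here is a small sketch. `SanitizeSketch`, `reserved?` and `escape_reserved` are hypothetical, and the word lists are an illustrative subset, not the contents of any of the four real constants.

```ruby
# Sketch of a single source of truth for reserved names. The word lists are
# an illustrative subset, not the contents of the real constants.
module SanitizeSketch
  RESERVED_WORDS = %w[select table order].freeze
  RESERVED_COLUMN_NAMES = %w[oid ctid].freeze

  def self.reserved?(name)
    RESERVED_WORDS.include?(name) || RESERVED_COLUMN_NAMES.include?(name)
  end

  # Renames reserved names to _xxx, as Table#add_column! is described as doing.
  def self.escape_reserved(name)
    reserved?(name) ? "_#{name}" : name
  end
end

p SanitizeSketch.escape_reserved('select')  # => "_select"
p SanitizeSketch.escape_reserved('height')  # => "height"
```

With one module owning both lists, every caller (imports, syncs, add_column!, modify_column!) would treat the same names as reserved.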

jgoizueta commented, Jan 20, 2020

Closed via #15326
