SQL with Japanese/Chinese characters not applied correctly in Databricks
NB: A redirect of this issue. I’ve copied the whole thing, hope you don’t mind.
### Current Behavior
We have a model like this (division list reduced for brevity):
```sql
{{ config(materialized='view') }}
{% set divisions = [
('JP-23', 'Aichi', 'Aichi', '愛知県', 'Kanjii', 'Chūbu', 'JP'),
('JP-05', 'Akita', 'Akita', '秋田県', 'Kanjii', 'Tōhoku', 'JP'),
]
%}
{% for state_iso_code, state_name, state_name_2, state_name_local, state_name_local_type, region, country_code in divisions %}
select
'{{ state_iso_code }}' as state_iso_code
, '{{ state_name }}' as state_name
, '{{ state_name_2 }}' as state_name_2
, '{{ state_name_local }}' as state_name_local
, '{{ state_name_local_type }}' as state_name_local_type
, '{{ region }}' as region
, '{{ country_code }}' as country_code
union all
{% endfor %}
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
When `dbt compile` is run, everything seems fine:
```sql
select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '愛知県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chūbu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '秋田県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tōhoku' as region
, 'JP' as country_code
union all
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
Now when `dbt run --model administrative_divisions` is executed against a Databricks profile (`type: databricks`), the resulting view is this:
```sql
CREATE VIEW `auto_replenishment_george_test`.`administrative_divisions` (
`state_iso_code`,
`state_name`,
`state_name_2`,
`state_name_local`,
`state_name_local_type`,
`region`,
`country_code`)
TBLPROPERTIES (
'transient_lastDdlTime' = '1645023802')
AS select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '???' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chubu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '???' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tohoku' as region
, 'JP' as country_code
union all
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
The `???` is not what we expected to see 😃
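A quick way to check where the characters get lost is to ask the metastore for the stored view text; if `SHOW CREATE TABLE` already shows `?`, the DDL was mangled when it was persisted, not by dbt. A minimal sketch, assuming the schema and view names from above (run in a Databricks notebook or SQL endpoint):

```sql
-- A plain literal round-trips fine if the session handles UTF-8 end to end:
SELECT '愛知県' AS state_name_local;

-- Inspect the view definition as stored by the (external) Hive metastore.
-- If the Japanese characters are already '???' here, the loss happens when
-- the DDL text is written to the metastore database.
SHOW CREATE TABLE auto_replenishment_george_test.administrative_divisions;

-- DESCRIBE TABLE EXTENDED also surfaces the stored view text for views.
DESCRIBE TABLE EXTENDED auto_replenishment_george_test.administrative_divisions;
```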
### Expected Behavior
Japanese/Chinese characters are sent correctly to the actual database, without being replaced with question marks.
### Steps To Reproduce
Described in Current Behavior.
### Relevant log output
The log seems to be fine:
```text
15:03:22.101320 [debug] [Thread-1 ]: On model.auto_replenishment.administrative_divisions: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "auto_replenishment", "target_name": "dev", "node_id": "model.auto_replenishment.administrative_divisions"} */
create or replace view auto_replenishment_george_test.administrative_divisions
as
select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '愛知県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chūbu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '秋田県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tōhoku' as region
, 'JP' as country_code
union all
...
```
### Environment
```markdown
- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1
```
### What database are you using dbt with?
other (mention it in “Additional Context”)
### Additional Context
```text
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - databricks: 1.0.1
  - spark: 1.0.0
```
### Top GitHub Comments
Hmm guys, I need to correct myself here. @allisonwang-db was right - I missed the fact that we have this entity created as a table from our current project (not dbt). This works exactly as he explained: creating a view in any way (SQL endpoint, notebook, dbt) results in `???`. I’ll check our hive conf. Just for reference, this is our cluster configuration with external metastore: (configuration screenshot not included in this copy)

I experimented with different encoding settings on the JDBC driver (encoding, characterEncoding, etc.) and neither affects how the view outputs the data. The SQL database we use is Azure SQL with its weird SQL_Latin1_… collation, which is apparently a mix of UTF-8 and some other encoding. I’m convinced this is not a dbt issue, but rather some ancient Hive problem that has a workaround. Thus closing this one.
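For anyone landing here with the same setup: the thread doesn’t spell out the workaround, but since the problem was traced to an external Hive metastore on Azure SQL with a non-Unicode collation, one commonly suggested fix is to make the metastore columns that hold view definition text Unicode-capable. The sketch below assumes the standard Hive metastore schema for SQL Server, where that text lives in `TBLS.VIEW_ORIGINAL_TEXT` and `TBLS.VIEW_EXPANDED_TEXT`; verify the names and types against your metastore version and back up the metastore database first — this is a sketch, not a tested migration.

```sql
-- T-SQL against the *external Hive metastore database* (Azure SQL),
-- not against Databricks itself. Under a Latin1 collation, non-Unicode
-- columns silently turn Japanese characters into '?', matching the symptom above.

-- Check what the columns currently are:
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TBLS'
  AND COLUMN_NAME IN ('VIEW_ORIGINAL_TEXT', 'VIEW_EXPANDED_TEXT');

-- If they are varchar(max), switching to a Unicode type preserves the characters.
-- (Older metastore schema versions declare these as legacy `text`; those may need
-- an intermediate conversion step.)
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_ORIGINAL_TEXT nvarchar(max) NULL;
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_EXPANDED_TEXT nvarchar(max) NULL;
```

After a change like this, re-running `dbt run --model administrative_divisions` recreates the view, and the Japanese literals should survive in the stored definition.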