
SQL with Japanese/Chinese characters not applied correctly in Databricks

See original GitHub issue

NB

This is a redirected issue; I've copied the whole thing over, hope you don't mind.

Current Behavior

I have a model like this (the division list is reduced for brevity):

```sql
{{ config(materialized='view') }}

{% set divisions = [
    ('JP-23', 'Aichi', 'Aichi', '愛知県', 'Kanjii', 'Chūbu', 'JP'),
    ('JP-05', 'Akita', 'Akita', '秋田県', 'Kanjii', 'Tōhoku', 'JP'),
]
%}

{% for state_iso_code, state_name, state_name_2, state_name_local, state_name_local_type, region, country_code in divisions %}
select
    '{{ state_iso_code }}'        as state_iso_code
  , '{{ state_name }}'            as state_name
  , '{{ state_name_2 }}'          as state_name_2
  , '{{ state_name_local }}'      as state_name_local
  , '{{ state_name_local_type }}' as state_name_local_type
  , '{{ region }}'                as region
  , '{{ country_code }}'          as country_code
union all
{% endfor %}
select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code
```
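
As an aside on the template itself: the trailing `union all` emitted on the last loop iteration is only valid SQL because the hard-coded '(Blank)' row follows it. A variant that does not rely on a sentinel row could guard the separator with Jinja's `loop.last` instead; a minimal sketch, reusing the `divisions` list above:

```sql
{% for state_iso_code, state_name, state_name_2, state_name_local,
      state_name_local_type, region, country_code in divisions %}
select
    '{{ state_iso_code }}'   as state_iso_code
  , '{{ state_name_local }}' as state_name_local
  -- ... remaining columns as in the model above ...
{% if not loop.last %}union all{% endif %}
{% endfor %}
```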

When `dbt compile` is run, everything looks fine:

```sql
select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '愛知県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chūbu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '秋田県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tōhoku'                as region
  , 'JP'          as country_code

union all

select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code
```

Now when `dbt run --model administrative_divisions` is executed against a Databricks profile (`type: databricks`), the resulting view is this:

```sql
CREATE VIEW `auto_replenishment_george_test`.`administrative_divisions` (
  `state_iso_code`,
  `state_name`,
  `state_name_2`,
  `state_name_local`,
  `state_name_local_type`,
  `region`,
  `country_code`)
TBLPROPERTIES (
  'transient_lastDdlTime' = '1645023802')
AS select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '???'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chubu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '???'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tohoku'                as region
  , 'JP'          as country_code

union all

select
    '-1'      as state_iso_code
  , '(Blank)' as state_name
  , '(Blank)' as state_name_2
  , '(Blank)' as state_name_local
  , '(Blank)' as state_name_local_type
  , '(Blank)' as region
  , '-1'      as country_code
```

The ??? is not what we expected to see 😃
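
One way to confirm the characters are lost when the view is persisted, rather than merely rendered badly by a client, is to compare the selected value with its raw bytes ('?' is 0x3F). A small diagnostic sketch, assuming the view above exists:

```sql
-- If the view really stores '???', raw_bytes comes back as '3F3F3F';
-- intact Japanese text would show a multi-byte UTF-8 sequence instead.
select state_name_local
     , hex(state_name_local) as raw_bytes
from auto_replenishment_george_test.administrative_divisions
where state_iso_code = 'JP-23';
```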

Expected Behavior

Japanese/Chinese characters are sent to the database unchanged, without being replaced with question marks.

Steps To Reproduce

Described in Current Behavior above.

Relevant log output

The log looks fine: the compiled SQL still contains the original characters at the point dbt submits the statement:

```
15:03:22.101320 [debug] [Thread-1  ]: On model.auto_replenishment.administrative_divisions: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "auto_replenishment", "target_name": "dev", "node_id": "model.auto_replenishment.administrative_divisions"} */
create or replace view auto_replenishment_george_test.administrative_divisions
  
  as
    




select
    'JP-23'        as state_iso_code
  , 'Aichi'            as state_name
  , 'Aichi'          as state_name_2
  , '愛知県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Chūbu'                as region
  , 'JP'          as country_code
union all

select
    'JP-05'        as state_iso_code
  , 'Akita'            as state_name
  , 'Akita'          as state_name_2
  , '秋田県'      as state_name_local
  , 'Kanjii' as state_name_local_type
  , 'Tōhoku'                as region
  , 'JP'          as country_code
union all
...
```


Environment

```markdown
- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1
```

What database are you using dbt with?

other (mention it in “Additional Context”)

Additional Context

```
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - databricks: 1.0.1
  - spark: 1.0.0
```

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 6

Top GitHub Comments

1 reaction
george-zubrienko commented, Feb 28, 2022

Hmm guys, I need to correct myself here. @allisonwang-db was right: I missed the fact that we have this entity created as a table from our current project (not dbt):

```sql
insert into auto_replenishment.administrative_divisions
    values
    ('JP-23', 'Aichi', 'Aichi', '愛知県', 'Kanjii', 'Chūbu', 'JP'),
    ('JP-05', 'Akita', 'Akita', '秋田県', 'Kanjii', 'Tōhoku', 'JP'),
   ...
```

This works exactly as he explained: creating a view in any way (SQL endpoint, notebook, dbt) results in ???. I'll check our Hive conf. Just for reference, this is our cluster configuration with an external metastore:

```
spark.hadoop.javax.jdo.option.ConnectionDriverName com.microsoft.sqlserver.jdbc.SQLServerDriver
spark.hadoop.javax.jdo.option.ConnectionURL {{secrets/ds_workspace/hiveMetastoreUrl}}
spark.hadoop.javax.jdo.option.ConnectionUserName delamain
spark.hadoop.javax.jdo.option.ConnectionPassword {{secrets/ds_workspace/hiveMetastoreSecret}}
spark.sql.hive.metastore.jars builtin
spark.sql.hive.metastore.version 2.3.7
```
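
This would also explain why the table populated via `insert into` is fine while every view is not: table rows live in the underlying data files (UTF-8), whereas a view's definition is stored as text inside the metastore database. A way to check the metastore itself, assuming the stock Hive schema (table and column names as in the standard hive-schema scripts; adjust if yours differs):

```sql
-- Run against the external Hive metastore database (Azure SQL), not Spark.
-- In the stock schema the view SQL is kept in TBLS.VIEW_ORIGINAL_TEXT and
-- VIEW_EXPANDED_TEXT, which are non-Unicode varchar columns; SQL Server
-- silently replaces characters outside the database collation with '?'.
SELECT TBL_NAME, VIEW_ORIGINAL_TEXT
FROM dbo.TBLS
WHERE TBL_NAME = 'administrative_divisions';
```
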
0 reactions
george-zubrienko commented, Mar 3, 2022

I experimented with different encoding settings on the JDBC driver (encoding, characterEncoding, etc.) and none of them affects how the view outputs the data. The SQL database we use is Azure SQL with its weird SQL_Latin1_… collation, which is apparently a mix of UTF-8 and some other encoding. I'm convinced this is not a dbt issue, but rather some ancient Hive problem that has a workaround, so I'm closing this one.
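
For completeness, the usual workaround for this class of metastore problem is to make the affected columns Unicode-capable on the database side. A hypothetical sketch against the same stock Hive schema (untested here, and it assumes nothing else depends on the exact column types):

```sql
-- Hypothetical: widen the view-text columns to NVARCHAR so SQL Server
-- stops substituting '?' for characters outside the Latin1 collation.
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_ORIGINAL_TEXT NVARCHAR(MAX);
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_EXPANDED_TEXT NVARCHAR(MAX);
```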

Read more comments on GitHub.

