SQL with Japanese/Chinese characters not applied correctly in Databricks
NB: A redirect of this issue. I’ve copied the whole thing, hope you don’t mind.
### Current Behavior
We have a model like this (division list reduced for brevity):
```sql
{{ config(materialized='view') }}
{% set divisions = [
('JP-23', 'Aichi', 'Aichi', '愛知県', 'Kanjii', 'Chūbu', 'JP'),
('JP-05', 'Akita', 'Akita', '秋田県', 'Kanjii', 'Tōhoku', 'JP'),
]
%}
{% for state_iso_code, state_name, state_name_2, state_name_local, state_name_local_type, region, country_code in divisions %}
select
'{{ state_iso_code }}' as state_iso_code
, '{{ state_name }}' as state_name
, '{{ state_name_2 }}' as state_name_2
, '{{ state_name_local }}' as state_name_local
, '{{ state_name_local_type }}' as state_name_local_type
, '{{ region }}' as region
, '{{ country_code }}' as country_code
union all
{% endfor %}
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
When `dbt compile` is run, everything seems fine:
```sql
select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '愛知県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chūbu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '秋田県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tōhoku' as region
, 'JP' as country_code
union all
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
Now when `dbt run --model administrative_divisions` is executed against a Databricks profile (`type: databricks`), the resulting view is this:
```sql
CREATE VIEW `auto_replenishment_george_test`.`administrative_divisions` (
`state_iso_code`,
`state_name`,
`state_name_2`,
`state_name_local`,
`state_name_local_type`,
`region`,
`country_code`)
TBLPROPERTIES (
'transient_lastDdlTime' = '1645023802')
AS select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '???' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chubu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '???' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tohoku' as region
, 'JP' as country_code
union all
select
'-1' as state_iso_code
, '(Blank)' as state_name
, '(Blank)' as state_name_2
, '(Blank)' as state_name_local
, '(Blank)' as state_name_local_type
, '(Blank)' as region
, '-1' as country_code
```
The `???` is not what we expected to see 😃
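A quick way to check where the characters get lost is to ask the metastore for the stored view text; if `SHOW CREATE TABLE` already shows `?`, the DDL was mangled when it was persisted, not by dbt. A minimal sketch, assuming the schema and view names from above (run in a Databricks notebook or SQL endpoint):

```sql
-- A plain literal round-trips fine if the session handles UTF-8 end to end:
SELECT '愛知県' AS state_name_local;

-- Inspect the view definition as stored by the (external) Hive metastore.
-- If the Japanese characters are already '???' here, the loss happens when
-- the DDL text is written to the metastore database.
SHOW CREATE TABLE auto_replenishment_george_test.administrative_divisions;

-- DESCRIBE TABLE EXTENDED also surfaces the stored view text for views.
DESCRIBE TABLE EXTENDED auto_replenishment_george_test.administrative_divisions;
```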
### Expected Behavior
Japanese/Chinese characters are sent correctly to the actual database, without being replaced with question marks.
### Steps To Reproduce
Described in Current Behavior.
### Relevant log output
The log seems to be fine:
```text
15:03:22.101320 [debug] [Thread-1 ]: On model.auto_replenishment.administrative_divisions: /* {"app": "dbt", "dbt_version": "1.0.1", "profile_name": "auto_replenishment", "target_name": "dev", "node_id": "model.auto_replenishment.administrative_divisions"} */
create or replace view auto_replenishment_george_test.administrative_divisions
as
select
'JP-23' as state_iso_code
, 'Aichi' as state_name
, 'Aichi' as state_name_2
, '愛知県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Chūbu' as region
, 'JP' as country_code
union all
select
'JP-05' as state_iso_code
, 'Akita' as state_name
, 'Akita' as state_name_2
, '秋田県' as state_name_local
, 'Kanjii' as state_name_local_type
, 'Tōhoku' as region
, 'JP' as country_code
union all
...
```
### Environment
```markdown
- OS: Ubuntu 20.04
- Python: 3.8.10
- dbt: 1.0.1
```
### What database are you using dbt with?
other (mention it in “Additional Context”)
### Additional Context
```text
installed version: 1.0.1
   latest version: 1.0.1

Up to date!

Plugins:
  - databricks: 1.0.1
  - spark: 1.0.0
```
### Top GitHub Comments
Hmm guys, I need to correct myself here. @allisonwang-db was right - I missed the fact that we have this entity created as a table from our current project (not dbt). This works exactly as he explained: creating a view in any way (SQL endpoint, notebook, dbt) results in `???`. I’ll check our hive conf. Just for reference, this is our cluster configuration with external metastore: (configuration screenshot not included in this copy)

I experimented with different encoding settings on the JDBC driver (encoding, characterEncoding, etc.) and neither affects how the view outputs the data. The SQL database we use is Azure SQL with its weird SQL_Latin1_… collation, which is apparently a mix of UTF-8 and some other encoding. I’m convinced this is not a dbt issue, but rather some ancient Hive problem that has a workaround. Thus closing this one.
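For anyone landing here with the same setup: the thread doesn’t spell out the workaround, but since the problem was traced to an external Hive metastore on Azure SQL with a non-Unicode collation, one commonly suggested fix is to make the metastore columns that hold view definition text Unicode-capable. The sketch below assumes the standard Hive metastore schema for SQL Server, where that text lives in `TBLS.VIEW_ORIGINAL_TEXT` and `TBLS.VIEW_EXPANDED_TEXT`; verify the names and types against your metastore version and back up the metastore database first — this is a sketch, not a tested migration.

```sql
-- T-SQL against the *external Hive metastore database* (Azure SQL),
-- not against Databricks itself. Under a Latin1 collation, non-Unicode
-- columns silently turn Japanese characters into '?', matching the symptom above.

-- Check what the columns currently are:
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'TBLS'
  AND COLUMN_NAME IN ('VIEW_ORIGINAL_TEXT', 'VIEW_EXPANDED_TEXT');

-- If they are varchar(max), switching to a Unicode type preserves the characters.
-- (Older metastore schema versions declare these as legacy `text`; those may need
-- an intermediate conversion step.)
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_ORIGINAL_TEXT nvarchar(max) NULL;
ALTER TABLE dbo.TBLS ALTER COLUMN VIEW_EXPANDED_TEXT nvarchar(max) NULL;
```

After a change like this, re-running `dbt run --model administrative_divisions` recreates the view, and the Japanese literals should survive in the stored definition.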