LanguageDetector returns Unknown for long text [BUG]
SynapseML version
com.microsoft.azure:synapseml_2.12:0.9.5
System information
- Language : pyspark
- Spark Platform: Databricks Runtime version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)
Describe the problem
In my reproduction code, LanguageDetector detects the language only for very short inputs; longer text returns Unknown.
Test results:
日本国 -> Japanese
にほんこく -> Japanese
日本国(にほんこく、にっぽんこく -> Unknown
日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3 -> Unknown
I have used this code before, and it could detect the language of a long paragraph. (The environment versions have not changed at all.)
This bug started occurring a few days ago.
Could you check what happened? And how can I solve this issue?
Code to reproduce issue
import synapse.ml
from synapse.ml.cognitive import *
from pyspark.sql.functions import col
print(f"synapse.ml.cognitive version:{synapse.ml.cognitive.__version__}") # synapse.ml.cognitive version:0.9.5
# Set key
key = '' # API key
location = 'japaneast' # Location
language = (LanguageDetector()
    .setSubscriptionKey(key)
    .setLocation(location)
    .setTextCol("text")
    .setOutputCol("language")
    .setErrorCol("error"))
# Test Text Analytics
test_data = spark.createDataFrame([
    (1, 'Japan'),
    (2, '日本国'),
    (3, 'にほんこく'),
    (4, '日本国(にほんこく、にっぽんこく'),
    (5, '日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。'),
], ["id", "text"])
# display(test_data)
test_data2 = language.transform(test_data)
display(test_data2)
### RETURN
# synapse.ml.cognitive version:0.9.5
# 1
# Japan
# null
# [{"detectedLanguage": {"name": "English", "iso6391Name": "en", "confidenceScore": 0.98}, "warnings": [], "statistics": null, "error-message": null}]
# 2
# 日本国
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 3
# にほんこく
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 4
# 日本国(にほんこく、にっぽんこく
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
# 5
# 日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
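Because the failing rows come back with a null error column, they are easy to miss. A minimal sketch of a check for them, assuming the result is available as the JSON text shown in the output above (in a real Spark DataFrame the column is an array of structs, so the access pattern would differ). The helper name `is_unknown` is hypothetical and not part of SynapseML:

```python
import json


def is_unknown(result_json: str) -> bool:
    """Return True when every detection in the result is '(Unknown)'.

    Hypothetical helper: it assumes the JSON layout shown in the
    output above, i.e. a list of objects with a 'detectedLanguage' key.
    """
    results = json.loads(result_json)
    # An empty result list is treated as "nothing detected", not as unknown.
    return bool(results) and all(
        (r.get("detectedLanguage") or {}).get("iso6391Name") == "(Unknown)"
        for r in results
    )


# Sample rows copied from the output above (ids 2 and 4).
known = ('[{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", '
         '"confidenceScore": 1}, "warnings": [], "statistics": null, '
         '"error-message": null}]')
unknown = ('[{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", '
           '"confidenceScore": 0}, "warnings": [], "statistics": null, '
           '"error-message": null}]')

print(is_unknown(known), is_unknown(unknown))
```

A check like this only makes the silent failures visible; it does not work around the detection problem itself.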
Other info / logs
No response
What component(s) does this bug affect?
- area/cognitive: Cognitive project
- area/core: Core project
- area/deep-learning: DeepLearning project
- area/lightgbm: Lightgbm project
- area/opencv: Opencv project
- area/vw: VW project
- area/website: Website
- area/build: Project build system
- area/notebooks: Samples under notebooks folder
- area/docker: Docker usage
- area/models: models related issue
What language(s) does this bug affect?
- language/scala: Scala source code
- language/python: Pyspark APIs
- language/r: R APIs
- language/csharp: .NET APIs
- language/new: Proposals for new client languages
What integration(s) does this bug affect?
- integrations/synapse: Azure Synapse integrations
- integrations/azureml: Azure ML integrations
- integrations/databricks: Databricks integrations
Issue Analytics
- State:
- Created 10 months ago
- Comments:9 (4 by maintainers)
Top GitHub Comments
Hi @jingwora, I can confirm that I am able to repro this issue.
The issue comes from the Cognitive Service itself; I can repro it without using SynapseML.
I have opened a ticket with the Cognitive Service Language Detection team and will keep you updated.
Hi @jingwora, this will be fixed automatically when the Cognitive Service team releases a new version of the language detection model.
If you were hoping to fix it right now by manually setting a previous version, note that the previous build won't work.
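For anyone who wants to check the "repro without SynapseML" point themselves, here is a rough sketch that calls the Text Analytics language detection REST endpoint directly using only the Python standard library. The thread does not say how the maintainer reproduced it, so the regional endpoint format and the `v3.1` API version used here are assumptions, and the function names `build_language_request` and `detect_languages` are made up for illustration:

```python
import json
import urllib.request


def build_language_request(key, location, texts):
    """Build (but do not send) a language detection request.

    Assumption: the resource is reachable through the regional endpoint
    https://<location>.api.cognitive.microsoft.com and API version v3.1.
    """
    url = (f"https://{location}.api.cognitive.microsoft.com"
           "/text/analytics/v3.1/languages")
    # One document per input text, with ids "1", "2", ...
    body = {"documents": [{"id": str(i), "text": t}
                          for i, t in enumerate(texts, start=1)]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Ocp-Apim-Subscription-Key": key,
                 "Content-Type": "application/json"},
        method="POST",
    )


def detect_languages(key, location, texts):
    """Send the request and return the parsed JSON response."""
    request = build_language_request(key, location, texts)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (needs a valid key, so it is left commented out):
# print(detect_languages(key, 'japaneast', ['日本国', '日本国(にほんこく、にっぽんこく']))
```

If the longer string also comes back as "(Unknown)" here, that would be consistent with the problem being on the service side rather than in SynapseML.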