LanguageDetector returns Unknown for long text [BUG]

SynapseML version

com.microsoft.azure:synapseml_2.12:0.9.5

System information

  • Language : pyspark
  • Spark Platform: Databricks Runtime version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

Describe the problem

With my reproduction code below, LanguageDetector detects the language only for very short strings. For longer text it returns Unknown.

Test results:

  • 日本国 -> Japanese
  • にほんこく -> Japanese
  • 日本国(にほんこく、にっぽんこく -> Unknown
  • 日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3 -> Unknown

This same code used to detect the language of long paragraphs correctly, and nothing in my environment (including versions) has changed.

The bug started occurring a few days ago.

Could you check what happened, and how can I solve this issue?

Code to reproduce issue

import synapse.ml
from synapse.ml.cognitive import *
from pyspark.sql.functions import col

print(f"synapse.ml.cognitive version:{synapse.ml.cognitive.__version__}")  # synapse.ml.cognitive version:0.9.5

# Set key
key = ''  # API key
location = 'japaneast' # Location

language = (LanguageDetector()
    .setSubscriptionKey(key)
    .setLocation(location)
    .setTextCol("text")
    .setOutputCol("language")
    .setErrorCol("error"))

# Test Text Analytics
test_data = spark.createDataFrame([(1, 'Japan'),
                                   (2, '日本国'),
                                   (3, 'にほんこく'),
                                   (4, '日本国(にほんこく、にっぽんこく'),
                                   (5, '日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。'),
                                  ], ["id", "text"])
# display(test_data)
test_data2 =  language.transform(test_data)
display(test_data2)

### RETURN
# synapse.ml.cognitive version:0.9.5
# 1
# Japan
# null
# [{"detectedLanguage": {"name": "English", "iso6391Name": "en", "confidenceScore": 0.98}, "warnings": [], "statistics": null, "error-message": null}]
# 2
# 日本国
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 3
# にほんこく
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 4
# 日本国(にほんこく、にっぽんこく
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
# 5
# 日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
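A quick way to see which rows are affected is to scan the results for the "(Unknown)" sentinel. The following is only a minimal sketch: it assumes each `language` value has been exported as a JSON string shaped like the output above, and `unknown_ids` is a hypothetical helper name, not part of SynapseML.

```python
import json

UNKNOWN = "(Unknown)"  # sentinel name seen in the output above


def unknown_ids(rows):
    """Return ids of rows whose detected language is the (Unknown) sentinel.

    rows: iterable of (id, language_json) pairs, where language_json is a
    JSON string shaped like the `language` column printed above (assumption).
    """
    flagged = []
    for row_id, language_json in rows:
        results = json.loads(language_json)
        if any(r["detectedLanguage"]["name"] == UNKNOWN for r in results):
            flagged.append(row_id)
    return flagged


# Two rows copied from the output above (ids 2 and 4)
rows = [
    (2, '[{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]'),
    (4, '[{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]'),
]
print(unknown_ids(rows))
```

Note that the error column is null for the affected rows, so checking the detected language name is the only signal available in this output.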

Other info / logs

No response

What component(s) does this bug affect?

  • area/cognitive: Cognitive project
  • area/core: Core project
  • area/deep-learning: DeepLearning project
  • area/lightgbm: Lightgbm project
  • area/opencv: Opencv project
  • area/vw: VW project
  • area/website: Website
  • area/build: Project build system
  • area/notebooks: Samples under notebooks folder
  • area/docker: Docker usage
  • area/models: models related issue

What language(s) does this bug affect?

  • language/scala: Scala source code
  • language/python: Pyspark APIs
  • language/r: R APIs
  • language/csharp: .NET APIs
  • language/new: Proposals for new client languages

What integration(s) does this bug affect?

  • integrations/synapse: Azure Synapse integrations
  • integrations/azureml: Azure ML integrations
  • integrations/databricks: Databricks integrations

Issue Analytics

  • State: open
  • Created 10 months ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
JessicaXYWang commented, Dec 1, 2022

Hi @jingwora, I can confirm that I can reproduce this issue.

The issue comes from the Cognitive Service itself. I can reproduce it without using SynapseML:

key = '' #cognitive service key
endpoint = "" #cognitive service endpoint, eg: https://{yourworkspacename}.cognitiveservices.azure.com/

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint 
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

# Example method for detecting the language of text
def language_detection_example(client):
    try:
        documents = ["日本国(にほんこく、にっぽんこく、英: Japan)、または日本(にほん、にっぽん)は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。"]
        response = client.detect_language(documents = documents, country_hint = 'us')[0]
        print("Language: ", response.primary_language.name)

    except Exception as err:
        print("Encountered exception. {}".format(err))
language_detection_example(client)

I have opened a ticket with the Cognitive Service Language Detection team and will keep you updated.
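To check all of the test strings from the issue against the service in one pass, a small harness along these lines might help. It is only a sketch: `find_unknown` is a hypothetical helper, and the stand-in detector simply replays the results reported in this issue. With a real client, `detect` could be something like `lambda t: client.detect_language(documents=[t])[0].primary_language.name`, assuming the SDK reports the same "(Unknown)" name that appears in the SynapseML output.

```python
def find_unknown(detect, texts, sentinel="(Unknown)"):
    """Return the texts for which detect(text) yields the sentinel name."""
    return [t for t in texts if detect(t) == sentinel]


# Stand-in detector replaying the results reported in this issue,
# so the sketch runs without a Cognitive Service key.
reported = {
    "Japan": "English",
    "日本国": "Japanese",
    "にほんこく": "Japanese",
    "日本国(にほんこく、にっぽんこく": "(Unknown)",
}

failing = find_unknown(reported.get, list(reported))
print(failing)
```

Passing the detector in as a callable keeps the check identical whether the results come from SynapseML or from the service directly.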

1 reaction
JessicaXYWang commented, Dec 6, 2022

Hi @jingwora, this will be fixed automatically when the Cognitive Service team releases a new version of the language detection model.

If you want to work around the issue now by manually setting a previous model version, note that the previous build won’t work.
