LanguageDetector returns Unknown for long text [BUG]

SynapseML version

com.microsoft.azure:synapseml_2.12:0.9.5

System information

Language : pyspark
Spark Platform: Databricks Runtime version 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12)

Describe the problem

With the reproduction code below, LanguageDetector identifies the language only for very short strings. For longer text it returns "(Unknown)".

Test results:

日本国 -> Japanese
にほんこく -> Japanese
日本国（にほんこく、にっぽんこく -> Unknown
日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3 -> Unknown

I have used this same code before, and it detected the language of long paragraphs correctly. Nothing in my environment or library versions has changed.

The bug first appeared a few days ago.

Could you check what happened, and how can I solve this issue?

Code to reproduce issue

import synapse.ml
from synapse.ml.cognitive import *
from pyspark.sql.functions import col

print(f"synapse.ml.cognitive version:{synapse.ml.cognitive.__version__}")  # synapse.ml.cognitive version:0.9.5

# Set key
key = ''  # API key
location = 'japaneast' # Location

language = (LanguageDetector()
    .setSubscriptionKey(key)
    .setLocation(location)
    .setTextCol("text")
    .setOutputCol("language")
    .setErrorCol("error"))

# Test Text Analytics
test_data = spark.createDataFrame([(1, 'Japan'),
                                   (2, '日本国'),
                                   (3, 'にほんこく'),
                                   (4, '日本国（にほんこく、にっぽんこく'),
                                   (5, '日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。'),
                                  ], ["id", "text"])
# display(test_data)
test_data2 = language.transform(test_data)
display(test_data2)

### RETURN
# synapse.ml.cognitive version:0.9.5
# 1
# Japan
# null
# [{"detectedLanguage": {"name": "English", "iso6391Name": "en", "confidenceScore": 0.98}, "warnings": [], "statistics": null, "error-message": null}]
# 2
# 日本国
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 3
# にほんこく
# null
# [{"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}, "warnings": [], "statistics": null, "error-message": null}]
# 4
# 日本国（にほんこく、にっぽんこく
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]
# 5
# 日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。
# null
# [{"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}, "warnings": [], "statistics": null, "error-message": null}]

Other info / logs

No response


Issue Analytics

Created 10 months ago
Comments: 9 (4 by maintainers)

Top GitHub Comments

JessicaXYWang commented, Dec 1, 2022

Hi @jingwora, I can confirm that I can reproduce this issue.

The issue comes from the Cognitive Service itself. I can reproduce it without using SynapseML:

key = ''  # Cognitive Service key
endpoint = ""  # Cognitive Service endpoint, e.g. https://{yourworkspacename}.cognitiveservices.azure.com/

from azure.ai.textanalytics import TextAnalyticsClient
from azure.core.credentials import AzureKeyCredential

# Authenticate the client using your key and endpoint 
def authenticate_client():
    ta_credential = AzureKeyCredential(key)
    text_analytics_client = TextAnalyticsClient(
            endpoint=endpoint, 
            credential=ta_credential)
    return text_analytics_client

client = authenticate_client()

# Example method for detecting the language of text
def language_detection_example(client):
    try:
        documents = ["日本国（にほんこく、にっぽんこく、英: Japan）、または日本（にほん、にっぽん）は、東アジアに位置する民主制国家 [1]。首都は東京都[注 2][2][3]。"]
        # One input document, so take the first (and only) result
        response = client.detect_language(documents=documents, country_hint='us')[0]
        print("Language: ", response.primary_language.name)
    except Exception as err:
        print("Encountered exception. {}".format(err))


language_detection_example(client)

I have opened a ticket with the Cognitive Service Language Detection team and will keep you updated.

JessicaXYWang commented, Dec 6, 2022

Hi @jingwora, this will be fixed automatically when the Cognitive Service team releases a new version of the language detection model.

If you were hoping to fix the issue now by manually setting a previous version, note that the previous build won't work.
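
Until a fixed model is released, one option is to flag the rows the service could not identify so they can be retried later or handled another way. The following is a minimal sketch in plain Python, not part of SynapseML or the Azure SDK. It assumes each result has the shape shown in the reproduction output above (a `detectedLanguage` object with `iso6391Name` and `confidenceScore`). The helper names `is_unknown_detection` and `partition_results` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

# Sentinel the service returned in the reproduction output for undetected text
UNKNOWN = "(Unknown)"


def is_unknown_detection(result: Dict[str, Any]) -> bool:
    """Return True when the result carries no usable language.

    A missing `detectedLanguage` entry is also treated as unknown.
    """
    detected = result.get("detectedLanguage") or {}
    return detected.get("iso6391Name", UNKNOWN) == UNKNOWN


def partition_results(
    rows: List[Tuple[int, Dict[str, Any]]]
) -> Tuple[List[int], List[int]]:
    """Split (id, result) pairs into ids with a language and ids without one."""
    detected_ids: List[int] = []
    unknown_ids: List[int] = []
    for row_id, result in rows:
        if is_unknown_detection(result):
            unknown_ids.append(row_id)
        else:
            detected_ids.append(row_id)
    return detected_ids, unknown_ids


# Two results copied from the reproduction output (rows 2 and 4)
sample = [
    (2, {"detectedLanguage": {"name": "Japanese", "iso6391Name": "ja", "confidenceScore": 1}}),
    (4, {"detectedLanguage": {"name": "(Unknown)", "iso6391Name": "(Unknown)", "confidenceScore": 0}}),
]
detected_ids, unknown_ids = partition_results(sample)
print("detected:", detected_ids, "unknown:", unknown_ids)
```

The ids collected in `unknown_ids` could then be re-submitted once the service-side fix lands, which avoids silently treating "(Unknown)" as a real language code downstream.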
