Some operations/functions usage need more clarification please

Greetings,

I appreciate the effort that went into such a grand project and into making it accessible to other researchers.

I’m currently trying out the tools to clean my data and extract features from it, and I’m having trouble with some of the functionality because its documentation isn’t published yet or isn’t clear enough.

“camel_arclean”: I couldn’t find its class or how to invoke it.

The arclean utility cleans Arabic text by:

  • Deleting characters that are not in Arabic, ASCII, or Latin-1.
  • Converting all spacing characters to an ASCII space character.
  • Converting Indic digits into Arabic digits.
  • Converting extended Arabic letters into basic Arabic letters.
  • Converting 1-char presentation forms into simple basic forms.
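The steps quoted above can be approximated in plain Python. The sketch below is not the camel_tools implementation; `rough_clean` is a hypothetical name, and the extended-letter map is a tiny, incomplete sample chosen for illustration only.

```python
import unicodedata

# Arabic-Indic (U+0660..U+0669) and Extended Arabic-Indic (U+06F0..U+06F9)
# digits mapped to ASCII digits.
_DIGITS = {0x0660 + i: str(i) for i in range(10)}
_DIGITS.update({0x06F0 + i: str(i) for i in range(10)})

# Assumption: a small, incomplete sample of extended-to-basic letter mappings
# (peh->beh, tcheh->jeem, veh->feh, gaf->kaf).
_EXTENDED = {0x067E: '\u0628', 0x0686: '\u062C', 0x06A4: '\u0641', 0x06AF: '\u0643'}


def _is_presentation_form(ch):
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def _is_allowed(ch):
    # Keep ASCII and Latin-1 (U+0000..U+00FF) and the basic Arabic block.
    cp = ord(ch)
    return cp <= 0xFF or 0x0600 <= cp <= 0x06FF


def rough_clean(text):
    out = []
    for ch in text:
        if _is_presentation_form(ch):
            folded = unicodedata.normalize('NFKC', ch)
            # Only fold presentation forms that map to exactly one character.
            if len(folded) == 1:
                ch = folded
        if ch.isspace():
            ch = ' '
        ch = ch.translate(_DIGITS).translate(_EXTENDED)
        if _is_allowed(ch):
            out.append(ch)
    return ''.join(out)
```

For real data, the library’s own utility is the better choice; this only illustrates what each listed step does to a string.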

"dialectid " I’ve tried to run the example provided but I’m getting this error

from camel_tools.dialectid import DialectIdentifier

did = DialectIdentifier.pretrained()

sentences = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]

predictions = did.predict(sentences)
top_dialects = [p.top for p in predictions]
File "Anaconda3\lib\site-packages\camel_tools\dialectid\__init__.py", line 34, in <module>
    import kenlm

ModuleNotFoundError: No module named 'kenlm'
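The traceback shows that `import kenlm` fails inside camel_tools.dialectid, i.e. the kenlm package is not installed in this environment. A quick way to check for a missing dependency before importing the module is sketched below; `missing_modules` is a hypothetical helper name and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    # 'kenlm' is the module named in the traceback above.
    missing = missing_modules(['kenlm'])
    if missing:
        print('Missing optional dependencies:', ', '.join(missing))
```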

“CalimaStarAnalyzer”: I’m getting pos=noun_prop for every word and never getting a stem. I’m relying on the first list in the list of lists that the functions return, although I checked the others and found no correct analysis there either. I tried it on my own data and on the example provided, but couldn’t figure out what’s wrong. For example, analyzing the verb ‘مشيت’ returns a number of possible tags, but none of them is ‘verb’.

text = 'مشيت في الشارع' #example provided in doc
text2 = 'مقتل ضابط وجندي إسرائيليين في عملية دهس بالضفة الغربية'
from camel_tools.calima_star.database import CalimaStarDB
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer

db = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db', 'a')
# Create analyzer with no backoff
analyzer = CalimaStarAnalyzer(db)
# Create analyzer with NOAN_ALL backoff
#analyzer = CalimaStarAnalyzer(db, 'NOAN_ALL')
# or
analyzer = CalimaStarAnalyzer(db, backoff='NOAN_ALL')

# To analyze a word, we can use the analyze() method
analyses1 = analyzer.analyze_words(text.split())
analyses = analyzer.analyze('مقتل') # All results=مقتل/NOUN_PROP

A snippet of the returned analysis:

{'diac': 'مقتل',
 'lex': 'مقتل_0',
 'bw': 'مقتل/NOUN_PROP',
 'gloss': 'NO_ANALYSIS',
 'pos': 'noun_prop',
 'prc3': '0',
 'prc2': '0',
 'prc1': '0',
 'prc0': '0',
 'per': 'na',
 'asp': 'na',
 'vox': 'na',
 'mod': 'na',
 'gen': 'm',
 'num': 's',
 'stt': 'd',
 'cas': 'u',
 'enc0': '0',
 'rat': 'i',
 'source': 'backoff',
 'form_gen': 'm',
 'form_num': 's',
 'catib6': '+NOM+',
 'ud': '+PROPN+',
 'pos_freq': -1.047404,
 'pos_lex_freq': -99.0,
 'lex_freq': -99.0,
 'root': '',
 'pattern': '',
 'caphi': 'm_q_t_l',
 'atbtok': 'مقتل',
 'd2tok': 'مقتل',
 'd1tok': 'مقتل',
 'atbseg': 'مقتل',
 'd3tok': 'مقتل',
 'd3seg': 'مقتل',
 'd2seg': 'مقتل',
 'd1seg': 'مقتل',
 'stem': 'مقتل',
 'stemgloss': 'NO_ANALYSIS',
 'stemcat': 'N0'}
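In the snippet above, 'source': 'backoff' together with 'gloss': 'NO_ANALYSIS' indicates that the entry was produced by the backoff mode rather than found in the database. A minimal sketch for separating the two kinds of results, assuming each analysis is a plain dict with a 'source' key as shown (`split_by_source` is a hypothetical helper, not part of camel_tools):

```python
def split_by_source(analyses):
    """Split analysis dicts into (non-backoff, backoff) lists by their 'source' key."""
    found = [a for a in analyses if a.get('source') != 'backoff']
    backoff = [a for a in analyses if a.get('source') == 'backoff']
    return found, backoff


if __name__ == '__main__':
    sample = [
        {'pos': 'noun_prop', 'source': 'backoff', 'gloss': 'NO_ANALYSIS'},
        {'pos': 'noun', 'source': 'lex', 'gloss': 'food'},
    ]
    found, backoff = split_by_source(sample)
    print(len(found), 'database analyses;', len(backoff), 'backoff analyses')
```

If every analysis for every word lands in the backoff list, the problem is more likely the database or data files than the individual words.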

“Generate lemma and features (CalimaStarReinflector)”: I couldn’t find the lemma database file, and it wasn’t clear how to construct the features dictionary.

“CalimaStarGenerator”: Same issue as above.

“Morphological Analyzer”: I’m not getting any analysis results, and the morphological tokenizer’s ‘tokenize’ gives the same results as ‘simple_word_tokenize’ in tokenizers.

from camel_tools.tokenizers import morphological
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer
from camel_tools.calima_star.database import CalimaStarDB

# Initialize database in reinflection mode
db_disa = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\morphology_db\\almor-msa-ext\\morphology.db','r')
disa = MLEDisambiguator(CalimaStarAnalyzer(db_disa, backoff='NONE', norm_map='<camel_tools.utils.charmap.CharMapper object>', strict_digit=False, cache_size=0), mle_path=None)

disa_sentence = disa.disambiguate(text_token)#,top=1)

disa_word = disa.disambiguate_word(text_token, word_ndx =0) #,top=1)

res_morph = morphological.MorphologicalTokenizer(disa, scheme='atbtok', split=True, diac=False) #res_morph.scheme_set() #{'atbtok', 'd3tok'}

tokenized_morph = res_morph.tokenize(text_token)  #
text_token = ['مقتل',
 'ضابط',
 'وجندي',
 'إسرائيليين',
 'في',
 'عملية',
 'دهس',
 'بالضفة',
 'الغربية']

DisambiguatedWord(word='مقتل', analyses=[]),
 DisambiguatedWord(word='ضابط', analyses=[]),
 DisambiguatedWord(word='و', analyses=[]),
 DisambiguatedWord(word='جندي', analyses=[]),
 DisambiguatedWord(word='إسرائيليين', analyses=[]),
 DisambiguatedWord(word='في', analyses=[]),
 DisambiguatedWord(word='عملية', analyses=[]),
 DisambiguatedWord(word='دهس', analyses=[]),
 DisambiguatedWord(word='بالضفة', analyses=[]),
 DisambiguatedWord(word='الغربية', analyses=[])]

" MLEDisambiguator "

from camel_tools.disambig.mle import MLEDisambiguator

mle = MLEDisambiguator.pretrained()

sentence = 'الطفلان أكلا الطعام معاً وأخذا 5 تفاحات'.split()
disambig = mle.disambiguate(sentence)

# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-zero list of analyses.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
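The note in the comments above matters here, because `d.analyses[0]` raises an IndexError for any word whose analyses list is empty. A hedged sketch of a safer extraction follows; the two namedtuples are stand-ins that only mirror the field names visible in the printed output, and `top_diac` is a hypothetical helper, not a camel_tools function.

```python
from collections import namedtuple

# Stand-ins for illustration only; the real classes come from camel_tools.
ScoredAnalysis = namedtuple('ScoredAnalysis', ['score', 'analysis'])
DisambiguatedWord = namedtuple('DisambiguatedWord', ['word', 'analyses'])


def top_diac(disambiguated):
    """Return the top 'diac' per word, falling back to the surface word."""
    result = []
    for d in disambiguated:
        if d.analyses:
            result.append(d.analyses[0].analysis.get('diac', d.word))
        else:
            # No analyses at all: keep the original token unchanged.
            result.append(d.word)
    return result


if __name__ == '__main__':
    words = [
        DisambiguatedWord('abc', [ScoredAnalysis(1.0, {'diac': 'aBc'})]),
        DisambiguatedWord('xyz', []),
    ]
    print(' '.join(top_diac(words)))
```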

I’m getting results for some nouns, but so far I’ve had no luck with POS or other features such as form_num, gen, and mod when it comes to plurals, words with attached pronouns, verbs, etc.

#print
الطفلان اكلا الطَعامِ مَعاً واخذا 5 تفاحات

#Analysis
[DisambiguatedWord(word='الطفلان', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطفلان', 'lex': 'الطفلان_0', 'bw': 'الطفلان/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_l_t._f_l_aa_n', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'الطفلان', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='أكلا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'اكلا', 'lex': 'اكلا_0', 'bw': 'اكلا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_k_l_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'اكلا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='الطعام', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطَعامِ', 'lex': 'طَعام_1', 'bw': 'ال/DET+طَعام/NOUN+ِ/CASE_DEF_GEN', 'gloss': 'the+food+[def.gen.]', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': 'Al_det', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': 'm', 'gen': 'm', 'form_num': 's', 'num': 's', 'stt': 'd', 'cas': 'g', 'enc0': '0', 'rat': 'i', 'source': 'lex', 'stem': 'طَعام', 'stemcat': 'N', 'stemgloss': 'food', 'caphi': '2_a_t._t._a_3_aa_m_i', 'catib6': 'PRT+NOM+', 'ud': 'DET+NOUN+', 'root': 'ط.ع.م', 'pattern': 'ال1َ2ا3ِ', 'd3seg': 'ال+_طَعامِ', 'atbseg': 'الطَعامِ', 'd2seg': 'الطَعامِ', 'd1seg': 'الطَعامِ', 'd1tok': 'الطَّعامِ', 'd2tok': 'الطَّعامِ', 'atbtok': 'الطَّعامِ', 'd3tok': 'ال+_طَعامِ', 'pos_freq': '-0.4344233', 'lex_freq': '-4.660188', 'pos_lex_freq': '-4.660188'})]),
 DisambiguatedWord(word='معاً', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'مَعاً', 'lex': 'مَعاً_1', 'bw': 'مَعاً/ADV', 'gloss': 'together', 'pos': 'adv', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': '-', 'gen': '-', 'form_num': '-', 'num': '-', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'y', 'source': 'lex', 'stem': 'مَعاً', 'stemcat': 'FW-Wa', 'stemgloss': 'together', 'caphi': 'm_a_3_a_n', 'catib6': '++', 'ud': '++', 'root': 'مع', 'pattern': '1َ2اً', 'd3seg': 'مَعاً', 'atbseg': 'مَعاً', 'd2seg': 'مَعاً', 'd1seg': 'مَعاً', 'd1tok': 'مَعاً', 'd2tok': 'مَعاً', 'atbtok': 'مَعاً', 'd3tok': 'مَعاً', 'pos_freq': '-99.0', 'lex_freq': '-99.0', 'pos_lex_freq': '-99.0'})]),
 DisambiguatedWord(word='وأخذا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'واخذا', 'lex': 'واخذا_0', 'bw': 'واخذا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 'w_aa_kh_dh_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'واخذا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
 DisambiguatedWord(word='5', analyses=[ScoredAnalysis(score=1.0, analysis={'pos': 'digit', 'diac': '5', 'lex': '5_0', 'bw': '5/NOUN_NUM', 'gloss': '5', 'prc3': 'na', 'prc2': 'na', 'prc1': 'na', 'prc0': 'na', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'gen': 'na', 'num': 'na', 'stt': 'na', 'cas': 'na', 'enc0': 'na', 'rat': 'na', 'source': 'digit', 'form_gen': 'na', 'form_num': 'na', 'catib6': 'NOM', 'ud': 'NUM', 'd3seg': '5', 'atbseg': '5', 'd2seg': '5', 'd1seg': '5', 'd1tok': '5', 'd2tok': '5', 'atbtok': '5', 'd3tok': '5', 'pos_freq': -99.0, 'pos_lex_freq': -99.0, 'lex_freq': -99.0, 'root': 'DIGIT', 'pattern': 'DIGIT', 'caphi': 'DIGIT', 'stem': '5', 'stemgloss': '5', 'stemcat': None})]),
 DisambiguatedWord(word='تفاحات', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'تفاحات', 'lex': 'تفاحات_0', 'bw': 'تفاحات/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 't_f_aa_7_aa_t', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'تفاحات', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})])]

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
owo commented, Sep 28, 2020

Hi @Sue-Fwl ,

I’m glad everything worked out.

“On a side note, I would like to suggest creating a notice of some sort when modifying the data.”

Yes, we will definitely do so on official releases (those installed from pip), and we will indicate the minimum camel_tools version the current data files are compatible with.

However, we are moving very quickly with development for the next official release at the moment, so you should expect that both code and data may change at any time. The master branch is not an official release but represents the current state of development.

As a rule of thumb, please reinstall the data files whenever you reinstall camel_tools from master (or, at the very least, whenever you experience any issues).
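Once a minimum compatible version is published alongside the data files, a user-side check could look like the sketch below. This is purely illustrative: the thread does not say how the minimum version will be exposed, `meets_minimum` is a hypothetical helper, and it only handles purely numeric dotted versions (no 'dev' or 'rc' suffixes).

```python
def _as_tuple(version):
    # Assumption: versions are purely numeric, e.g. '1.0.0'.
    return tuple(int(part) for part in version.split('.'))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    a, b = _as_tuple(installed), _as_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


if __name__ == '__main__':
    # Both version strings here are made-up placeholders.
    print(meets_minimum('1.2.0', '1.10.0'))
```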

0 reactions
Sue-Fwl commented, Sep 27, 2020

Greetings, and many thanks for the prompt replies.

Morphology: I’m still not familiar with all the features, but as far as the ones I know go, the results are accurate. Many thanks. Examples:

'لقي قاتل الجندي مصرعه'
'قاتل الجندي في المعركة'
 'صمتٌ قاتل'
analyzer.analyze('قاتل')
analyses[0]:
{'diac': 'قاتَلَ', 'lex': 'قاتَل_1', 'bw': 'قاتَل/PV+َ/PVSUFF_SUBJ:3MS', 'gloss': 'fight+he;it_<verb>', 'pos': 'verb', 'prc3': '0', 'prc2': '0',
 'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'p', 'vox': 'a', 'mod': 'i', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm',
 'form_num': 's', 'catib6': '+VRB+', 'pos_freq': -1.023208, 'd3tok': 'قاتَلَ', 'd2seg': 'قاتَلَ', 'root': 'ق.ت.ل', 'd1seg': 'قاتَلَ', 'gen': 'm',
 'd1tok': 'قاتَلَ', 'caphi': 'q_aa_t_a_l_a', 'd3seg': 'قاتَلَ', 'lex_freq': -4.497461, 'ud': '+VERB+', 'pattern': '1ا2َ3َ', 'atbtok': 'قاتَلَ',
 'pos_lex_freq': -4.497461, 'd2tok': 'قاتَلَ', 'atbseg': 'قاتَلَ', 'num': 's', 'stem': 'قاتَل', 'stemgloss': 'fight', 'stemcat': 'PV'}
analyses[1]:
{'diac': 'قاتِل', 'lex': 'قاتِل_1', 'bw': 'قاتِل/ADJ', 'gloss': 'deadly;fatal', 'pos': 'adj', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.9868824, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل', 'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.660188, 'ud': '+ADJ+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.660188,
 'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'deadly;fatal', 'stemcat': 'N-ap'}
analyses[12]:
{'diac': 'قاتِل', 'lex': 'قاتِل_2', 'bw': 'قاتِل/NOUN', 'gloss': 'murderer;assassin', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'r', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'catib6': '+NOM+', 'pos_freq': -0.4344233, 'd3tok': 'قاتِل', 'd2seg': 'قاتِل', 'root': 'ق.ت.ل', 'd1seg': 'قاتِل', 'gen': 'm', 'd1tok': 'قاتِل',
 'caphi': 'q_aa_t_i_l', 'd3seg': 'قاتِل', 'lex_freq': -4.497461, 'ud': '+NOUN+', 'pattern': '1ا2ِ3', 'atbtok': 'قاتِل', 'pos_lex_freq': -4.497461,
 'd2tok': 'قاتِل', 'atbseg': 'قاتِل', 'num': 's', 'stem': 'قاتِل', 'stemgloss': 'murderer;assassin', 'stemcat': 'Nall'}
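Since a single word such as the one above can come back with verb, adjective, and noun readings spread across the list, grouping the analyses by their 'pos' value makes them easier to inspect. A minimal sketch, assuming each analysis is a plain dict with a 'pos' key as in the output above (`group_by_pos` is a hypothetical helper, not part of camel_tools):

```python
from collections import defaultdict


def group_by_pos(analyses):
    """Group analysis dicts by their 'pos' value, preserving the original order."""
    groups = defaultdict(list)
    for analysis in analyses:
        groups[analysis.get('pos', 'unknown')].append(analysis)
    return dict(groups)


if __name__ == '__main__':
    sample = [
        {'pos': 'verb', 'gloss': 'fight+he;it_<verb>'},
        {'pos': 'adj', 'gloss': 'deadly;fatal'},
        {'pos': 'noun', 'gloss': 'murderer;assassin'},
    ]
    for pos, items in group_by_pos(sample).items():
        print(pos, [a['gloss'] for a in items])
```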

On a side note, I would like to suggest creating a notice of some sort when modifying the data. While testing the new updates I got an error concerning the database (I can’t remember it exactly since I forgot to copy the message), and I figured that since the project went through major changes, it’s normal for the database to change too. So I reinstalled the data files, replacing the old ones (downloaded on the 28th of August), and the modules worked fine, as did the MorphologicalTokenizer.
