The usage of some operations/functions needs more clarification, please
Greetings,
I appreciate the effort put into such a grand project and into making it accessible to other researchers.
I'm currently trying out the tools to clean my data and extract features from it, and I'm having trouble using some of the functionality because its documentation either isn't published yet or isn't clear enough.
“camel_arclean”: I couldn't find its class or how to invoke it.
According to the docs, the arclean utility cleans Arabic text by: deleting characters that are not Arabic, ASCII, or Latin-1; converting all spacing characters to an ASCII space character; converting Indic digits into Arabic digits; converting extended Arabic letters into basic Arabic letters; and converting one-character presentation forms into simple basic forms.
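For what it's worth, the same cleaning also seems to be reachable programmatically through the builtin character mappers; this is a minimal sketch assuming the CLI utility wraps the builtin 'arclean' mapper:
from camel_tools.utils.charmap import CharMapper
# Assumption: camel_arclean wraps the builtin 'arclean' mapper.
arclean = CharMapper.builtin_mapper('arclean')
cleaned = arclean.map_string('نص عربي ١٢٣')  # Indic digits should come back as Arabic digits
print(cleaned)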
"dialectid " I’ve tried to run the example provided but I’m getting this error
from camel_tools.dialectid import DialectIdentifier
did = DialectIdentifier.pretrained()
sentences = [
'مال الهوى و مالي شكون اللي جابني ليك ما كنت انايا ف حالي بلاو قلبي يانا بيك',
'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]
predictions = did.predict(sentences)
top_dialects = [p.top for p in predictions]
File "Anaconda3\lib\site-packages\camel_tools\dialectid\__init__.py", line 34, in <module>
import kenlm
ModuleNotFoundError: No module named 'kenlm'
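In case it helps narrow things down, this is a small sketch of a guarded import; the assumption here is that kenlm is an optional dependency that has to be installed separately (e.g. via pip):
# Sketch of a guarded import; assumes kenlm can be installed separately via pip.
try:
    from camel_tools.dialectid import DialectIdentifier
except ModuleNotFoundError as err:
    raise SystemExit(
        'dialectid needs the optional dependency {!r}; '
        'try installing it first (e.g. pip install kenlm).'.format(err.name))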
“CalimaStarAnalyzer”: I'm getting pos=noun_prop for all words and never getting a stem. I'm relying on the first list of the list of lists returned by the function, though I checked the rest as well and didn't find any correct analysis. I used it on my data and on the provided example but couldn't figure out what's wrong. For example, analyzing the verb ‘مشيت’ gives a number of possible tags, but none of them is ‘verb’.
text = 'مشيت في الشارع' #example provided in doc
text2 = 'مقتل ضابط وجندي إسرائيليين في عملية دهس بالضفة الغربية'
from camel_tools.calima_star.database import CalimaStarDB
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer
db = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db', 'a')
# Create an analyzer with no backoff
analyzer = CalimaStarAnalyzer(db)
# Or create an analyzer with NOAN_ALL backoff
analyzer = CalimaStarAnalyzer(db, backoff='NOAN_ALL')
# To analyze a list of words, use analyze_words(); analyze() takes a single word
analyses1 = analyzer.analyze_words(text.split())
analyses = analyzer.analyze('مقتل')  # All results = مقتل/NOUN_PROP
A snippet of the returned analysis:
{'diac': 'مقتل',
'lex': 'مقتل_0',
'bw': 'مقتل/NOUN_PROP',
'gloss': 'NO_ANALYSIS',
'pos': 'noun_prop',
'prc3': '0',
'prc2': '0',
'prc1': '0',
'prc0': '0',
'per': 'na',
'asp': 'na',
'vox': 'na',
'mod': 'na',
'gen': 'm',
'num': 's',
'stt': 'd',
'cas': 'u',
'enc0': '0',
'rat': 'i',
'source': 'backoff',
'form_gen': 'm',
'form_num': 's',
'catib6': '+NOM+',
'ud': '+PROPN+',
'pos_freq': -1.047404,
'pos_lex_freq': -99.0,
'lex_freq': -99.0,
'root': '',
'pattern': '',
'caphi': 'm_q_t_l',
'atbtok': 'مقتل',
'd2tok': 'مقتل',
'd1tok': 'مقتل',
'atbseg': 'مقتل',
'd3tok': 'مقتل',
'd3seg': 'مقتل',
'd2seg': 'مقتل',
'd1seg': 'مقتل',
'stem': 'مقتل',
'stemgloss': 'NO_ANALYSIS',
'stemcat': 'N0'}
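For completeness, this is how I'm checking for a verb reading once the analyses come back; a sketch that only relies on the 'pos' and 'source' keys shown above:
# Sketch: keep only verb readings and flag words that never left the backoff.
analyses = analyzer.analyze('مشيت')
verb_readings = [a for a in analyses if a['pos'] == 'verb']
only_backoff = all(a['source'] == 'backoff' for a in analyses)
print(len(verb_readings), only_backoff)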
“Generate lemma and features (CalimaStarReinflector)”: I couldn't find the lemma DB file, and the way to construct the features dictionary wasn't clear.
“CalimaStarGenerator”: same issue as above.
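For reference, this is my current understanding of how the features dictionary is supposed to be built; a sketch assuming the same database file can be opened with the 'g' (generation) and 'r' (reinflection) flags, and that the feature keys/values match the ones in the analysis output above (the example lemma, word, and feature values are my own):
from camel_tools.calima_star.database import CalimaStarDB
from camel_tools.calima_star.generator import CalimaStarGenerator
from camel_tools.calima_star.reinflector import CalimaStarReinflector

# Assumption: the same .db file is reused, opened in generation ('g') and
# reinflection ('r') mode respectively.
db_path = 'E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\calima-msa-1.0.db'
generator = CalimaStarGenerator(CalimaStarDB(db_path, 'g'))
reinflector = CalimaStarReinflector(CalimaStarDB(db_path, 'r'))

# The feature keys/values are the same ones that appear in the analyses
# (pos, asp, per, gen, num, ...).
feats = {'pos': 'verb', 'asp': 'p', 'per': '3', 'gen': 'm', 'num': 's'}

# generate() takes a lemma (the 'lex' value without the trailing _N index,
# if I understand correctly) plus the target features.
generated = generator.generate('مَشَى', feats)

# reinflect() takes a surface word plus the target features.
reinflected = reinflector.reinflect('مشيت', feats)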
" Morphological Analyzer " I’m not getting any analysis results, and the morphological tokenizer ‘tokenize’ is giving the same results as the ‘simple_word_tokenize’ in tokenizers
from camel_tools.tokenizers import morphological
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.calima_star.analyzer import CalimaStarAnalyzer
from camel_tools.calima_star.database import CalimaStarDB

text_token = ['مقتل',
              'ضابط',
              'وجندي',
              'إسرائيليين',
              'في',
              'عملية',
              'دهس',
              'بالضفة',
              'الغربية']

# Initialize the database in reinflection mode
db_disa = CalimaStarDB('E:\\Anaconda3\\Lib\\site-packages\\camel_tools\\calima_star\\databases\\morphology_db\\almor-msa-ext\\morphology.db', 'r')

# Build an MLE disambiguator on top of the analyzer (norm_map left at its default)
disa = MLEDisambiguator(CalimaStarAnalyzer(db_disa, backoff='NONE', strict_digit=False, cache_size=0), mle_path=None)

disa_sentence = disa.disambiguate(text_token)  # ,top=1)
disa_word = disa.disambiguate_word(text_token, word_ndx=0)  # ,top=1)

# Morphological tokenizer driven by the disambiguator
res_morph = morphological.MorphologicalTokenizer(disa, scheme='atbtok', split=True, diac=False)  # res_morph.scheme_set() -> {'atbtok', 'd3tok'}
tokenized_morph = res_morph.tokenize(text_token)
[DisambiguatedWord(word='مقتل', analyses=[]),
DisambiguatedWord(word='ضابط', analyses=[]),
DisambiguatedWord(word='و', analyses=[]),
DisambiguatedWord(word='جندي', analyses=[]),
DisambiguatedWord(word='إسرائيليين', analyses=[]),
DisambiguatedWord(word='في', analyses=[]),
DisambiguatedWord(word='عملية', analyses=[]),
DisambiguatedWord(word='دهس', analyses=[]),
DisambiguatedWord(word='بالضفة', analyses=[]),
DisambiguatedWord(word='الغربية', analyses=[])]
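As a sanity check, this is the variant I would expect to split clitics once the analyses are non-empty; a sketch using the pretrained MLE disambiguator instead of the hand-built analyzer (it assumes the pretrained model/data files are installed):
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tokenizers.morphological import MorphologicalTokenizer

# Sketch: drive the tokenizer with the pretrained disambiguator instead.
mle = MLEDisambiguator.pretrained()
tokenizer = MorphologicalTokenizer(mle, scheme='atbtok', split=True)
print(tokenizer.tokenize(['وجندي', 'بالضفة']))
# With non-empty analyses the clitics should be split (e.g. the و+ and ب+
# proclitics); with empty analyses the tokenizer can only echo the input back.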
" MLEDisambiguator "
from camel_tools.disambig.mle import MLEDisambiguator
mle = MLEDisambiguator.pretrained()
sentence = 'الطفلان أكلا الطعام معاً وأخذا 5 تفاحات'.split()
disambig = mle.disambiguate(sentence)
# Let's, for example, use the top disambiguations to generate a diacritized
# version of the above sentence.
# Note that, in practice, you'll need to make sure that each word has a
# non-zero list of analyses.
diacritized = [d.analyses[0].analysis['diac'] for d in disambig]
print(' '.join(diacritized))
I'm getting results for some nouns, but so far I've had no luck with POS or other features such as form_num, gen, and mod when it comes to plurals, words attached to a pronoun, verbs, etc.
#print
الطفلان اكلا الطَعامِ مَعاً واخذا 5 تفاحات
#Analysis
[DisambiguatedWord(word='الطفلان', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطفلان', 'lex': 'الطفلان_0', 'bw': 'الطفلان/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_l_t._f_l_aa_n', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'الطفلان', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
DisambiguatedWord(word='أكلا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'اكلا', 'lex': 'اكلا_0', 'bw': 'اكلا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': '2_k_l_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'اكلا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
DisambiguatedWord(word='الطعام', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'الطَعامِ', 'lex': 'طَعام_1', 'bw': 'ال/DET+طَعام/NOUN+ِ/CASE_DEF_GEN', 'gloss': 'the+food+[def.gen.]', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': 'Al_det', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': 'm', 'gen': 'm', 'form_num': 's', 'num': 's', 'stt': 'd', 'cas': 'g', 'enc0': '0', 'rat': 'i', 'source': 'lex', 'stem': 'طَعام', 'stemcat': 'N', 'stemgloss': 'food', 'caphi': '2_a_t._t._a_3_aa_m_i', 'catib6': 'PRT+NOM+', 'ud': 'DET+NOUN+', 'root': 'ط.ع.م', 'pattern': 'ال1َ2ا3ِ', 'd3seg': 'ال+_طَعامِ', 'atbseg': 'الطَعامِ', 'd2seg': 'الطَعامِ', 'd1seg': 'الطَعامِ', 'd1tok': 'الطَّعامِ', 'd2tok': 'الطَّعامِ', 'atbtok': 'الطَّعامِ', 'd3tok': 'ال+_طَعامِ', 'pos_freq': '-0.4344233', 'lex_freq': '-4.660188', 'pos_lex_freq': '-4.660188'})]),
DisambiguatedWord(word='معاً', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'مَعاً', 'lex': 'مَعاً_1', 'bw': 'مَعاً/ADV', 'gloss': 'together', 'pos': 'adv', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'form_gen': '-', 'gen': '-', 'form_num': '-', 'num': '-', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'y', 'source': 'lex', 'stem': 'مَعاً', 'stemcat': 'FW-Wa', 'stemgloss': 'together', 'caphi': 'm_a_3_a_n', 'catib6': '++', 'ud': '++', 'root': 'مع', 'pattern': '1َ2اً', 'd3seg': 'مَعاً', 'atbseg': 'مَعاً', 'd2seg': 'مَعاً', 'd1seg': 'مَعاً', 'd1tok': 'مَعاً', 'd2tok': 'مَعاً', 'atbtok': 'مَعاً', 'd3tok': 'مَعاً', 'pos_freq': '-99.0', 'lex_freq': '-99.0', 'pos_lex_freq': '-99.0'})]),
DisambiguatedWord(word='وأخذا', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'واخذا', 'lex': 'واخذا_0', 'bw': 'واخذا/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 'w_aa_kh_dh_aa', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'واخذا', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})]),
DisambiguatedWord(word='5', analyses=[ScoredAnalysis(score=1.0, analysis={'pos': 'digit', 'diac': '5', 'lex': '5_0', 'bw': '5/NOUN_NUM', 'gloss': '5', 'prc3': 'na', 'prc2': 'na', 'prc1': 'na', 'prc0': 'na', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'gen': 'na', 'num': 'na', 'stt': 'na', 'cas': 'na', 'enc0': 'na', 'rat': 'na', 'source': 'digit', 'form_gen': 'na', 'form_num': 'na', 'catib6': 'NOM', 'ud': 'NUM', 'd3seg': '5', 'atbseg': '5', 'd2seg': '5', 'd1seg': '5', 'd1tok': '5', 'd2tok': '5', 'atbtok': '5', 'd3tok': '5', 'pos_freq': -99.0, 'pos_lex_freq': -99.0, 'lex_freq': -99.0, 'root': 'DIGIT', 'pattern': 'DIGIT', 'caphi': 'DIGIT', 'stem': '5', 'stemgloss': '5', 'stemcat': None})]),
DisambiguatedWord(word='تفاحات', analyses=[ScoredAnalysis(score=1.0, analysis={'diac': 'تفاحات', 'lex': 'تفاحات_0', 'bw': 'تفاحات/NOUN_PROP', 'gloss': 'NO_ANALYSIS', 'pos': 'noun_prop', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'i', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'backoff', 'form_gen': '-', 'form_num': '-', 'gen': '-', 'ud': '+PROPN+', 'catib6': '+NOM+', 'pos_lex_freq': -99.0, 'num': '-', 'pos_freq': -99.0, 'lex_freq': -99.0, 'caphi': 't_f_aa_7_aa_t', 'atbseg': 'NOAN', 'd3seg': 'NOAN', 'd2tok': 'NOAN', 'root': 'O', 'pattern': 'N1AN', 'd2seg': 'NOAN', 'atbtok': 'NOAN', 'd1tok': 'NOAN', 'd3tok': 'NOAN', 'd1seg': 'NOAN', 'stem': 'تفاحات', 'stemgloss': 'NO_ANALYSIS', 'stemcat': 'N0'})])]
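Here is a small sketch of how I'm pulling POS and the other features out of the top analysis, guarding against the empty-analyses case mentioned in the example:
# Sketch: read POS and a few features off the top analysis of each word.
for d in disambig:
    if d.analyses:
        top = d.analyses[0].analysis
        print(d.word, top['pos'], top.get('gen'), top.get('num'),
              top.get('form_num'), top.get('mod'))
    else:
        print(d.word, 'no analysis')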
Top GitHub Comments
Hi @Sue-Fwl ,
I’m glad everything worked out.
Yes, we will definitely do so on official releases (those installed from pip), and we will indicate the minimum camel_tools version the current data files are compatible with.
However, we are moving very quickly with development at the moment ahead of the next official release, so you should expect both code and data to change at any moment. The master branch is not an official release but represents the current state of development.
As a rule of thumb, please reinstall the data files whenever you reinstall camel_tools from master, or at the very least whenever you run into issues.
Greetings, and many thanks for the prompt replies.
Morphology: While I'm not yet familiar with all the features, the ones I do know are now giving accurate results. Many thanks. Examples:
On a side note, I would like to suggest adding a notice of some sort when the data is modified. While testing the new updates I got an error concerning the database (I can't remember it exactly since I forgot to copy the message), and I figured that since the project went through major changes, it's normal for the database to change too. So I reinstalled the data files, replaced the old ones (downloaded on the 28th of August), and the modules worked fine, as did the MorphologicalTokenizer.