CLTK Arabic support
This list tracks Arabic support issues and todos.
General Improvements
- Make the code look good and clean (optimization/performance/quality) [WIP].
- Remove duplicated code [WIP].
1. Romanization Systems
- Improve Arabic Romanization system [WIP].
- Add Buckwalter transliteration.
- Add ISO233-2 transliteration.
- Add ArabTex transliteration [WIP].
- ArabTex and ISO 8859-6 need individual handling because in some cases they use a one-to-two mapping.
- Add ISO 8859-6 [WIP].
- Add ASMO 449 [WIP].
- Add Arabic Windows [WIP].
- Add phonetic transcription based on the scheme found in Arabic Through the Quran by Alan Jones (Islamic Texts Society, 2008) [WIP].
- Add a function to guess the romanization system.
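As a rough illustration of the Buckwalter item above, here is a minimal sketch of a one-to-one transliteration table built from Unicode code points. The names `AR2BW`, `BW2AR`, `to_buckwalter`, and `from_buckwalter` are hypothetical and are not CLTK's API. The simple dictionary inversion only works because standard Buckwalter is one-to-one; schemes with one-to-two mappings (as noted for ArabTex) would need a different approach.

```python
def _build_table():
    """Build an Arabic -> Buckwalter table from contiguous code point runs."""
    runs = (
        (0x0621, "'|>&<}AbptvjHxd*rzs$SDTZEg"),  # hamza .. ghain
        (0x0640, "_fqklmnhwYyFNKaui~o"),         # tatweel .. sukun
        (0x0670, "`{"),                          # dagger alif, alif wasla
    )
    table = {}
    for start, latin in runs:
        for offset, symbol in enumerate(latin):
            table[chr(start + offset)] = symbol
    return table


AR2BW = _build_table()
# Invertible only because every Arabic character maps to exactly one symbol.
BW2AR = {latin: arabic for arabic, latin in AR2BW.items()}


def to_buckwalter(text):
    """Transliterate Arabic script; characters outside the table pass through."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


def from_buckwalter(text):
    """Convert Buckwalter back to Arabic script; unknown characters pass through."""
    return "".join(BW2AR.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "\u0643\u062A\u0627\u0628"  # kitab, "book"
    print(to_buckwalter(word))
    print(from_buckwalter(to_buckwalter(word)) == word)
```

Because unmapped characters pass through unchanged, Latin text that is already inside a Buckwalter string cannot be told apart from transliterated text, so callers would need to track which spans are transliterated.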
2. Arabic Stop words project
- Improve the list of Arabic stop words [WIP].
3. Support remote libs
- Added pyarabic to CLTK, without using the pip package, in the namespace cltk.corpus.utils.arabic.pyarabic. I suggest this solution to avoid problems with remote libs during installation or usage; most users don't have time to install extra packages [WIP].
- Add number function: transform Arabic numbers <-> Arabic strings [WIP].
- Add Araby_Statistics: a module to calculate different statistics on Quranic text [WIP].
- Add Arabic Normalizer [WIP].
- Add documentation for pyarabic lib [WIP].
- Add unit testing for pyarabic lib [WIP].
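To make the normalizer item above more concrete, here is a minimal sketch using only the standard library. It is not pyarabic's implementation, and the function names `strip_tashkeel` and `normalize` are used here only for illustration. The choice of which characters to fold is an assumption.

```python
import re

# Short vowels, tanwin, shadda, sukun, and dagger alif.
TASHKEEL_RE = re.compile("[\u064B-\u0652\u0670]")
# Alif with madda, hamza above, hamza below, and alif wasla.
ALEF_VARIANTS_RE = re.compile("[\u0622\u0623\u0625\u0671]")
TATWEEL = "\u0640"
BARE_ALEF = "\u0627"


def strip_tashkeel(text):
    """Remove diacritic marks, leaving the consonantal skeleton."""
    return TASHKEEL_RE.sub("", text)


def normalize(text):
    """Strip diacritics and tatweel, then fold alif variants to bare alif."""
    text = strip_tashkeel(text).replace(TATWEEL, "")
    return ALEF_VARIANTS_RE.sub(BARE_ALEF, text)


if __name__ == "__main__":
    vocalized = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F"  # Ahmad, fully vocalized
    print(normalize(vocalized))
```

Folding hamza forms is lossy, which may matter for Classical Arabic texts where the hamza carries grammatical information, so a real normalizer would probably make each step optional.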
4. Arabic Tokenization
- Add Arabic word Tokenization.
- Add Arabic Sentence Tokenization.
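The two tokenization items above could start from something like the following regex-based sketch. It is a simplified illustration under stated assumptions, not the tokenizer CLTK ships: words are taken to be runs of Arabic letters and diacritics, and sentences are assumed to end at `.`, `!`, `?`, or the Arabic question mark followed by whitespace. The names `word_tokenize` and `sent_tokenize` are chosen only for this example.

```python
import re

# Runs of Arabic letters, tatweel, diacritics, dagger alif, and alif wasla.
WORD_RE = re.compile("[\u0621-\u063A\u0640-\u0652\u0670\u0671]+")
# Split on whitespace that follows sentence-final punctuation (incl. U+061F).
SENT_BOUNDARY_RE = re.compile("(?<=[.!?\u061F])\\s+")


def word_tokenize(text):
    """Return Arabic word tokens, dropping punctuation and non-Arabic text."""
    return WORD_RE.findall(text)


def sent_tokenize(text):
    """Split text into sentences, keeping the final punctuation mark."""
    return [s for s in SENT_BOUNDARY_RE.split(text.strip()) if s]


if __name__ == "__main__":
    text = "\u0642\u0627\u0644 \u0632\u064A\u062F. \u0645\u0646\u061F \u0646\u0639\u0645"
    print(word_tokenize(text))
    print(sent_tokenize(text))
```

Quranic text often marks verse boundaries with verse-end signs rather than Latin punctuation, so the boundary pattern would likely need extending for that corpus.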
5. Arabic Stemming
- Add Tashaphyne Arabic light stemming [WIP] and make it compatible with Python 3.6 [WIP].
- Add and rewrite the Snowball Arabic stemmer to support both light stemming and root-based stemming in Python 3.6, and adapt it to Classical Arabic [WIP].
- Re-implement Khoja's Arabic stemmer (root-based stemmer) in Python 3.6. WIP via @ibrahimsharaf.
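For readers unfamiliar with the distinction, light stemming strips common affixes without trying to recover the root. The sketch below shows the general idea only; it is not the Tashaphyne, Snowball, or Khoja algorithm, and the affix lists, the `min_len` threshold, and the name `light_stem` are all assumptions made for this example. Input is assumed to be undiacritized.

```python
# Hypothetical affix lists, longest prefixes first so they match before "al".
PREFIXES = (
    "\u0648\u0627\u0644",  # wa + al
    "\u0628\u0627\u0644",  # bi + al
    "\u0643\u0627\u0644",  # ka + al
    "\u0641\u0627\u0644",  # fa + al
    "\u0627\u0644",        # al (definite article)
)
SUFFIXES = (
    "\u0627\u062A",  # feminine plural -at
    "\u0648\u0646",  # masculine plural -un
    "\u064A\u0646",  # masculine plural / dual -in
    "\u0629",        # ta marbuta
)


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(light_stem("\u0648\u0627\u0644\u0643\u062A\u0627\u0628"))        # wa-al-kitab
    print(light_stem("\u0627\u0644\u0645\u0633\u0644\u0645\u0648\u0646"))  # al-muslimun
```

A root-based stemmer such as Khoja's goes further by matching the stripped word against morphological patterns to extract the root, which is where Classical and Modern Arabic rules are most likely to diverge.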
6. Arabic IR
- Add the Alfanous Quranic search engine lib and make it compatible with Python 3.6.
- Make the whoosh integration support Classical Arabic as well.
7. Arabic Corpus
- Arabic Alphabet.
- arabic_text_perseus.
- arabic_morphology_quranic-corpus.
- arabic_morphology_quranic-corpus xml format.
- arabic_text_quranic_corpus.
- Add cltk/sql_db_quranic to the cltk/corpus/arabic/corpora.py file.
- Add Shakkala.
Issue Analytics
- State:
- Created 6 years ago
- Reactions:6
- Comments:6 (3 by maintainers)
Hi @LBenzahia, @kylepjohnson, I am a native Arabic speaker with an intermediate Python background, and I am willing to contribute to this. Where can I start?
Hi @ibrahimsharaf, yes you can. It will be a great contribution. Khoja's stemmer fits Classical Arabic, but please remember that we are working on Classical Arabic, not Modern Arabic, so you will have to remove some rules from the Khoja stemmer. For example, Classical Arabic uses the interrogative alif (أ) far more often than هل, especially in the text of the Holy Quran. Some borrowed patterns cannot be handled, such as the pattern فعالة for the noun of instrument, which does not exist in Classical Arabic. These are some of the differences; please take them into account. Let me know if you want to solve any of the issues above and I'll mention you there! For the implementation of the stemmer, I would like you to take a look at the stem module, so that your work follows CLTK's code style and respects its conventions. Let me know if you have any questions. Good luck!