[QUESTION] Information on NER component

Describe what you would like to know about CAMeL Tools.

Hello, I wanted to know if you could provide some information regarding the NER component of the library.

In the catalog JSON file, you mention that you are using a fine-tuned AraBERT model, with the specified version being 1.0.0. So from here, I wanted to know:

whether the base model was indeed AraBERTv1 from this repo?
which dataset you used for fine-tuning?
whether you used the FARASA preprocessing for fine-tuning, or your own, given that AraBERT used FARASA for pretraining?

I ask because, while doing some research, I saw that your lab has produced multiple Arabic BERT models, which:

use the camel_tools preprocessing rather than FARASA for both pretraining and fine-tuning
have dialect-specific variants, which may be interesting in some cases
seem to outperform AraBERTv1 on NER tasks according to your paper

I was wondering whether you would consider making these models available for use in this library? I know you have released the code and the pretrained models, and I am planning on experimenting with them, but I thought it would be a nice addition.

Top GitHub Comments

balhafni commented, Oct 3, 2021

For the NER component, we didn’t do any preprocessing before fine-tuning. We used aubmindlab/bert-base-arabertv01 as the base model, which was pretrained without FARASA segmentation.

We also just released new NER models on Hugging Face’s model hub, fine-tuned from our own CAMeLBERT models. Here’s an example of how to use the CAMeLBERT NER MSA model. Disclaimer: although the example uses the NER component from CAMeL Tools to load the model directly from the hub, this is still a work in progress, so please use it with caution.
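For readers who want something concrete, here is a minimal sketch of loading that model through the CAMeL Tools NER component. It is not the maintainers' linked example. It assumes the hub identifier `CAMeL-Lab/bert-base-arabic-camelbert-msa-ner`, assumes a camel-tools version whose `NERecognizer` accepts a hub model name, and assumes the predicted labels follow a BIO scheme (for example `B-LOC`, `I-LOC`, `O`). The helper names `tag_sentence` and `group_entities` are hypothetical and not part of the library.

```python
from typing import List, Tuple

# Assumed Hugging Face hub identifier for the CAMeLBERT MSA NER model.
MODEL_ID = "CAMeL-Lab/bert-base-arabic-camelbert-msa-ner"


def tag_sentence(text: str, model_id: str = MODEL_ID) -> Tuple[List[str], List[str]]:
    """Tokenize `text` and predict one NER label per token."""
    # Imported lazily so group_entities() works without camel-tools installed.
    from camel_tools.ner import NERecognizer
    from camel_tools.tokenizers.word import simple_word_tokenize

    tokens = simple_word_tokenize(text)
    labels = NERecognizer(model_id).predict_sentence(tokens)
    return tokens, labels


def group_entities(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-labelled tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    for token, label in zip(tokens, labels):
        prefix, _, entity_type = label.partition("-")
        if prefix == "I" and current_tokens and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other label closes the open entity, if there is one.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # "B-X" starts a new entity; a stray "I-X" is treated the same way.
            current_tokens, current_type = [token], entity_type

    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities
```

Calling `tokens, labels = tag_sentence(...)` on an Arabic sentence and then `group_entities(tokens, labels)` yields the recognized spans. The first call downloads the model from the hub, so it needs network access, and camel-tools must be installed.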

owo commented, Oct 1, 2021

Hi @rom1K,

The version numbers in catalogue.json are our own internal versioning for datasets and have nothing to do with the AraBERT version used. @balhafni could tell you the exact AraBERT version we used in our current model.

We fine-tune using the ANERcorp dataset (you can read more about that in our paper), but we don’t use FARASA for preprocessing. Again, @balhafni can tell you exactly what preprocessing we perform.

We definitely plan to incorporate the new BERT models in a future release of camel-tools 😃
