Training data issue
Consider my code:
# nlp is the spaCy pipeline trained on the data shown below
test = nlp(u"b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015")
for entity in test.ents:
    print(entity.label_, ' | ', entity.text)
I am training with the example below:
(u"Unitnumber Level Buildingname Streetname Sublocality Locality City State ",{'entities':[(0,11,'UNITNUMBER'),(12,17,'Level'),(18,30,'Building Name'),(31,41,'STREET'),(42,53,'SUBLOCALITY'),(54,62,'LOCALITY'),(63,67,'CITY'),(68,73,'STATE')]}),
But I get no results. When I use the set below instead, I get some results:
(u"504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ", {'entities': [(0,3,'UNITNUMBER'),(4,16,'Building'),(17,31,'Locality'),(32,42,'Road'),(43,48,'Suburb'),(49,53,'City'),(54,60,'Pincode')]}),
Result:
UNITNUMBER | b-602
Building | tower-3 mantri
Road | mayanagar road
Suburb | pune
City | maharashtra
Pincode | 400015
What is wrong with the first training data? Please guide. regards NK
Your Environment
- Operating System:
- Python Version Used:
- spaCy Version Used:
- Environment Information:
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (2 by maintainers)
Top GitHub Comments
The index spans need not include the commas, because you are tagging only the entities. The first character has index zero.
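One way to avoid counting characters by hand is to let Python compute the offsets. The sketch below is only an illustration: `find_spans` is a hypothetical helper, not part of spaCy, and the "CITY" label is an assumed name. It searches for each substring from left to right and returns zero-based `(start, end, label)` tuples with an exclusive end, so commas and spaces around an entity are never included.

```python
def find_spans(text, labelled_parts):
    """Return (start, end, label) tuples for (substring, label) pairs.

    Hypothetical helper: substrings are searched left to right, so they
    must be listed in the order they appear in the text.
    """
    spans = []
    cursor = 0
    for part, label in labelled_parts:
        start = text.find(part, cursor)
        if start == -1:
            raise ValueError(f"{part!r} not found in {text!r}")
        end = start + len(part)  # end offset is exclusive
        spans.append((start, end, label))
        cursor = end
    return spans


text = "B-602,Tower 3, Mantri Apartments, Baner, Pune,India"
# "CITY" is an assumed label name used only for illustration.
spans = find_spans(text, [("B-602", "UNIT"), ("Pune", "CITY")])

# Sanity check: every span must slice back to exactly the intended text.
for start, end, label in spans:
    print(label, "|", repr(text[start:end]))
```

Slicing the text with each span, as in the final loop, is a quick way to spot offsets that accidentally include a neighbouring space or comma.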
The indexes are not the main issue; your choice of entities to label is. Make sure your entities are clearly distinguishable. Since your examples are mostly unstructured, you have to ensure your entities can be differentiated.
Try thinking from the perspective of the model. For instance, from your example, on simply seeing
Nitya-Nilayam Sri Venkatesa Mills
it's possible to tag it in more than one way due to the lack of contextual information. Also, some entities like State would be much more easily extracted by comparing against a list of states (or you could use spaCy's rule-based matcher). Pincode can also be extracted using pattern matching. Maybe 'Unit' can be handled the same way. The remaining entities (Suburb, Locality, Roads) are all basically 'LOC' entities in spaCy's provided models. Try applying the model to tag these as such, and then write some logic to separate them.
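As a rough sketch of the list-lookup and pattern-matching idea, using only Python's standard `re` module rather than spaCy's matcher: the `STATES` set here is a hypothetical, deliberately tiny list, `extract_rule_based` is an illustrative name, and the pattern assumes a pincode is a standalone run of six digits. Multi-word state names would need extra handling.

```python
import re

# Hypothetical, deliberately short list; a real one would cover every state.
STATES = {"maharashtra", "karnataka", "gujarat"}

# Assumption: a pincode is a standalone run of exactly six digits.
PINCODE_RE = re.compile(r"\b\d{6}\b")
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_rule_based(address):
    """Return sorted (start, end, label) spans found by simple rules."""
    entities = []
    for match in PINCODE_RE.finditer(address):
        entities.append((match.start(), match.end(), "PINCODE"))
    for match in WORD_RE.finditer(address):
        if match.group().lower() in STATES:
            entities.append((match.start(), match.end(), "STATE"))
    return sorted(entities)


address = "b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015"
found = extract_rule_based(address)
for start, end, label in found:
    print(label, "|", address[start:end])
```

Entities handled this way do not need to be learned by the statistical model at all, which leaves fewer and more distinguishable labels for training.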
It seems your model is not trained sufficiently. This is inferred from the poor performance of the model on the training data. To get better output, you need a sufficient number of examples. Also try increasing the number of training iterations. But with so few training examples, there is a high chance the model will overfit. Also, your last training example (which you also use as the test string) seems to be tagged incorrectly.
("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",{"entities": [(0, 5, "UNIT")]})