Training data issue


Consider this code:

# nlp is the trained spaCy pipeline
test = nlp(u"b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015")
for entity in test.ents:
  print(entity.label_, ' | ', entity.text)

I am trying the training example below.

(u"Unitnumber Level Buildingname Streetname Sublocality Locality City State ", {'entities': [(0, 11, 'UNITNUMBER'), (12, 17, 'Level'), (18, 30, 'Building Name'), (31, 41, 'STREET'), (42, 53, 'SUBLOCALITY'), (54, 62, 'LOCALITY'), (63, 67, 'CITY'), (68, 73, 'STATE')]}),

But I get no results. When I use the example below instead, I get some results.

(u"504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ", {'entities': [(0, 3, 'UNITNUMBER'), (4, 16, 'Building'), (17, 31, 'Locality'), (32, 42, 'Road'), (43, 48, 'Suburb'), (49, 53, 'City'), (54, 60, 'Pincode')]}),

Result:

UNITNUMBER  |  b-602
Building  |  tower-3 mantri
Road  |  mayanagar road
Suburb  |  pune
City  |  maharashtra
Pincode  |  400015

What is wrong with the first training example? Please guide. Regards, NK

Your Environment

  • Operating System:
  • Python Version Used:
  • spaCy Version Used:
  • Environment Information:

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 15 (2 by maintainers)

Top GitHub Comments

1 reaction
abinpaul1 commented, Apr 21, 2020

The indexes need not cover the commas, because you are tagging only the entities. The first character is at index zero.
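As a quick sanity check on the indexes, the sketch below slices each training string with its (start, end) offsets and prints the text each label actually covers. It uses only plain Python slicing, not spaCy; the helper name `show_spans` is hypothetical, and the two examples are copied from the question.

```python
def show_spans(text, annotations):
    """Return (label, covered_text) for every (start, end, label) offset."""
    return [(label, text[start:end])
            for start, end, label in annotations["entities"]]


# The two training examples from the question, copied as posted.
first = (
    "Unitnumber Level Buildingname Streetname Sublocality Locality City State ",
    {"entities": [(0, 11, "UNITNUMBER"), (12, 17, "Level"),
                  (18, 30, "Building Name"), (31, 41, "STREET"),
                  (42, 53, "SUBLOCALITY"), (54, 62, "LOCALITY"),
                  (63, 67, "CITY"), (68, 73, "STATE")]},
)
second = (
    "504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ",
    {"entities": [(0, 3, "UNITNUMBER"), (4, 16, "Building"),
                  (17, 31, "Locality"), (32, 42, "Road"),
                  (43, 48, "Suburb"), (49, 53, "City"),
                  (54, 60, "Pincode")]},
)

for text, annotations in (first, second):
    for label, covered in show_spans(text, annotations):
        # repr() makes stray leading or trailing spaces visible
        print(f"{label:15} -> {covered!r}")
    print()
```

Python slices are zero-based and exclude the end index, so a span such as (0, 3) covers exactly the three characters "504". Any span that prints with a trailing space or a missing first letter does not line up with a whole word.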

The indexes are not the main issue; your choice of entities to label is. Make sure your entities are clearly distinguishable. Since your examples are mostly unstructured, you have to ensure your entities can be differentiated.

Try thinking from the perspective of the model. For instance, from your example, on simply seeing Nitya-Nilayam Sri Venkatesa Mills, it is possible to tag it in more than one way due to the lack of contextual information.

Also, some entities like State would be much more easily extracted by comparing against a list of states (or you could make use of spaCy’s rule-based matcher too). Pincode can also be extracted using pattern matching. Maybe ‘Unit’ can be done in the same way. The remaining entities (Suburb, Locality, Roads) are all basically ‘LOC’ entities in spaCy’s provided models. Try applying the model to tag these as such, and then try writing some logic to separate them.
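The rule-based idea above can be sketched with the standard library alone. This is a minimal illustration, not spaCy's matcher: the state list is a deliberately tiny hypothetical sample, the six-digit pincode pattern and the letter-dash-digits unit pattern are assumptions about the address format, and `extract_by_rules` is a made-up helper name.

```python
import re

# Hypothetical, deliberately tiny lookup list; a real one would cover every state.
KNOWN_STATES = {"maharashtra", "karnataka", "gujarat"}

# Assumed formats: a pincode is exactly six digits, a unit looks like "b-602".
PINCODE_RE = re.compile(r"\b\d{6}\b")
UNIT_RE = re.compile(r"\b[a-z]-\d+\b", re.IGNORECASE)
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_by_rules(address):
    """Return sorted (start, end, label) spans found by simple rules."""
    spans = []
    for match in PINCODE_RE.finditer(address):
        spans.append((match.start(), match.end(), "PINCODE"))
    for match in UNIT_RE.finditer(address):
        spans.append((match.start(), match.end(), "UNIT"))
    # Single-word lookup only; multi-word state names would need phrase matching.
    for match in WORD_RE.finditer(address):
        if match.group().lower() in KNOWN_STATES:
            spans.append((match.start(), match.end(), "STATE"))
    return sorted(spans)


address = "b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015"
for start, end, label in extract_by_rules(address):
    print(label, "|", address[start:end])
```

Entities caught this way do not need to be learned by the statistical model at all, which leaves the training examples free to focus on the ambiguous location-like parts of the address.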

1 reaction
abinpaul1 commented, Apr 21, 2020

It seems your model is not trained sufficiently. This is inferred from the poor performance of the model on the training data. To get better output, it is imperative that you have a sufficient number of examples. Also try increasing the number of training iterations. But with so few training examples, there is a high chance the model will overfit. Also, your last training example (which you also use as the test string) seems to be tagged incorrectly: ("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",{"entities": [(0, 5, "UNIT")]})
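One way to put a number on "performance on the training data" is to compare the spans the model predicts for a training string with the spans annotated for it. The sketch below does this with exact-match precision and recall in plain Python; `span_scores` is a hypothetical helper, and the predicted spans are invented for illustration rather than real model output.

```python
def span_scores(gold, predicted):
    """Exact-match precision and recall over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


text = "B-602,Tower 3, Mantri Apartments, Baner, Pune,India"
gold = [(0, 5, "UNIT")]                            # annotation quoted above
predicted = [(0, 5, "UNIT"), (6, 13, "BUILDING")]  # invented for illustration

precision, recall = span_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real output, the predicted list could be built from `doc.ents` as `(ent.start_char, ent.end_char, ent.label_)`. Note that an example annotated with only one entity, like the one above, counts every other predicted span as an error, so incomplete annotations also lower these scores.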
