Training data issue

This is my code:

test = nlp(u"b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015")
for entity in test.ents:
  print(entity.label_, ' | ', entity.text)

I am training with the example below.

(u"Unitnumber Level Buildingname Streetname Sublocality Locality City State ", {'entities': [(0,11,'UNITNUMBER'), (12,17,'Level'), (18,30,'Building Name'), (31,41,'STREET'), (42,53,'SUBLOCALITY'), (54,62,'LOCALITY'), (63,67,'CITY'), (68,73,'STATE')]}),

But I get no results. When I use the example below instead, I get some results.

(u"504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 ", {'entities': [(0,3,'UNITNUMBER'), (4,16,'Building'), (17,31,'Locality'), (32,42,'Road'), (43,48,'Suburb'), (49,53,'City'), (54,60,'Pincode')]}),

Result:

UNITNUMBER  |  b-602
Building  |  tower-3 mantri
Road  |  mayanagar road
Suburb  |  pune
City  |  maharashtra
Pincode  |  400015

What is wrong with the first training example? Please guide me. Regards, NK

Your Environment

Operating System:
Python Version Used:
spaCy Version Used:
Environment Information:

Issue Analytics

State:
Created 3 years ago
Comments: 15 (2 by maintainers)

Top GitHub Comments

abinpaul1 commented, Apr 21, 2020

The index ranges need not include the commas, because you are tagging only the entities. The first character has index zero.
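Since offsets are counted from zero and the end index is exclusive, slicing the text with each (start, end) pair shows exactly what a span covers. The sketch below is plain Python, not a spaCy API; check_offsets is a hypothetical helper, and the two examples are copied from the question. When run on them, every span of the first example is flagged (for instance, (0, 11) covers "Unitnumber " including the trailing space), while every span of the second example lines up with whole words, so it may be worth ruling this out before looking at the label choices.

```python
def check_offsets(text, entities):
    """Return (start, end, label, covered_text, ok) for each annotated span."""
    report = []
    for start, end, label in entities:
        covered = text[start:end]
        # Flag spans that are empty, touch whitespace at an edge, or cut through a word.
        starts_mid_word = start > 0 and not text[start - 1].isspace()
        ends_mid_word = end < len(text) and not text[end].isspace()
        ok = (bool(covered) and covered == covered.strip()
              and not starts_mid_word and not ends_mid_word)
        report.append((start, end, label, covered, ok))
    return report

# The two training examples from the question.
first_text = "Unitnumber Level Buildingname Streetname Sublocality Locality City State "
first_entities = [(0, 11, "UNITNUMBER"), (12, 17, "Level"), (18, 30, "Building Name"),
                  (31, 41, "STREET"), (42, 53, "SUBLOCALITY"), (54, 62, "LOCALITY"),
                  (63, 67, "CITY"), (68, 73, "STATE")]
second_text = "504 Purple Pride Accord IT Park Baner Road Baner Pune 411045 "
second_entities = [(0, 3, "UNITNUMBER"), (4, 16, "Building"), (17, 31, "Locality"),
                   (32, 42, "Road"), (43, 48, "Suburb"), (49, 53, "City"),
                   (54, 60, "Pincode")]

for name, text, ents in [("first", first_text, first_entities),
                         ("second", second_text, second_entities)]:
    for start, end, label, covered, ok in check_offsets(text, ents):
        print(name, (start, end), label, repr(covered), "OK" if ok else "MISALIGNED")
```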

It’s not the indexes that are the main issue; it’s your choice of entities to label. Make sure your entities are clearly distinguishable. Since your examples are mostly unstructured, you have to ensure your entities can be differentiated.

Try thinking from the perspective of the model. For instance, from your example, on simply seeing Nitya-Nilayam Sri Venkatesa Mills, it’s possible to tag it in more than one way due to the lack of contextual information.

Also, some entities like State would be extracted much more easily by comparing them with a list of states (or you could make use of spaCy’s rule-based matcher too). Pincode can also be extracted using pattern matching. Maybe ‘Unit’ can be done in the same way. The remaining entities (Suburb, Locality, Roads) are all basically ‘LOC’ entities in spaCy’s provided models. Try applying the model to tag these as such, and then try writing some logic to separate them.
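A minimal sketch of the list lookup and pattern matching idea, using only Python's re module rather than spaCy's matcher. Everything named here is an assumption made for illustration: KNOWN_STATES is a deliberately tiny hypothetical list, PINCODE_RE assumes a pincode is a standalone run of six digits that does not start with zero, and extract_rule_based is a hypothetical helper.

```python
import re

# Hypothetical, deliberately tiny lookup list; a real one would cover all states.
KNOWN_STATES = {"maharashtra", "karnataka", "kerala"}

# Assumption: a pincode is a standalone run of six digits not starting with 0.
PINCODE_RE = re.compile(r"\b[1-9]\d{5}\b")
WORD_RE = re.compile(r"[A-Za-z]+")

def extract_rule_based(text):
    """Return sorted (start, end, label) spans found by lookup and pattern matching."""
    spans = []
    for match in PINCODE_RE.finditer(text):
        spans.append((match.start(), match.end(), "Pincode"))
    # Single-word lookup only; multi-word state names would need phrase matching.
    for match in WORD_RE.finditer(text):
        if match.group().lower() in KNOWN_STATES:
            spans.append((match.start(), match.end(), "State"))
    return sorted(spans)

address = "b-602 tower-3 mantri apartments mayanagar road pune maharashtra 400015"
for start, end, label in extract_rule_based(address):
    print(label, "|", address[start:end])
```

The spans come back in the same (start, end, label) shape as the training annotations, so they could be merged with whatever the statistical model predicts for the harder entities.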

abinpaul1 commented, Apr 21, 2020

It seems your model is not trained sufficiently. This is inferred from the poor performance of the model on the training data. To get better output, it is imperative that you have a sufficient number of examples. Also try increasing the number of training iterations. But with so few training examples, there is a high chance the model will overfit. Also, your last training example (which you also use as the test string) seems to be tagged incorrectly: ("B-602,Tower 3, Mantri Apartments, Baner, Pune,India",{"entities": [(0, 5, "UNIT")]})
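One way to avoid hand-counted offsets when adding more examples is to compute them from the substrings themselves. The sketch below is plain Python; annotate is a hypothetical helper, and every label other than UNIT is a placeholder chosen for illustration, not the tagging the asker intended.

```python
def annotate(text, parts):
    """Build a (text, {"entities": [...]}) tuple from (substring, label) pairs.

    Each substring is searched from where the previous one ended, so parts
    must be listed in the order they appear in the text.
    """
    entities = []
    cursor = 0
    for substring, label in parts:
        start = text.find(substring, cursor)
        if start == -1:
            raise ValueError(f"{substring!r} not found after position {cursor}")
        end = start + len(substring)
        entities.append((start, end, label))
        cursor = end
    return (text, {"entities": entities})

# Labels other than UNIT are hypothetical placeholders.
example = annotate(
    "B-602,Tower 3, Mantri Apartments, Baner, Pune,India",
    [("B-602", "UNIT"), ("Tower 3", "BUILDING"), ("Mantri Apartments", "BUILDING"),
     ("Baner", "LOCALITY"), ("Pune", "CITY")],
)
print(example)
```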