v7 Named Entity Recognition does not group #FirstName and #LastName together, and the .topics() function, which in v6 provided more generic keywords describing what a piece of text is about, is missing

@spencermountain

Let’s take this text as an example:

Manchester United have rejected a £19 million bid from Everton for Morgan Schneiderlin, as the January transfer window continues to heat-up.

As reported by the Guardian, United are looking to recoup the £24 million they paid for Schneiderlin back in the summer of 2015, with the French midfielder struggling badly for game time under Jose Mourinho this season. With that in mind, the club turned down Everton’s first official offer just days into the winter transfer window.

Get 24 Hours of Sky Sports and watch all of the action live with NOW TV for £6.99. Sign up and get a 10% Discount.

The newspaper reports that the Toffees’ interest in Schneiderlin has been known for several weeks, as Ronald Koeman looks to strengthen his midfield options going into the second half of the season. The Dutchman worked with the 27-year-old during his spell in charge of Southampton, so knows what he’s capable of at this level.

With Idrissa Gueye now heading off to the African Cup of Nations and James McCarthy’s fitness still a cause for concern, Everton are eyeing several additions to their squad this month, while West Bromwich Albion are also believed to be interested in Schneiderlin’s services, according to the paper.

The French international was a consistent figure under Louis van Gaal last season, featuring alongside the likes of Michael Carrick, Ander Herrera and Bastian Schweinsteiger in midfield, but his situation has drastically changed this term, with Paul Pogba arriving at Old Trafford for a world-record fee in the summer.

Schneiderlin has won 33% of his average duels and registered an average pass accuracy of 87% in three Premier League appearances this season, as United sit sixth on 39 points, one behind the top four and nine clear of Everton in seventh.

Next up for them is a home clash with Reading in the FA Cup on Saturday, before taking on Hull City at Old Trafford.

64.3% of Morgan Schneiderlin’s passes have been forward in the Premier League this season.

right now r.match('(#Person|#Place|#Organization)').data() returns pretty good results:

 [ { normal: 'manchester united', text: 'Manchester United' },
  { normal: 'morgan', text: ' Morgan' },
  { normal: 'jose', text: ' Jose' },
  { normal: 'mourinho', text: ' Mourinho' },
  { normal: 'ronald', text: ' Ronald' },
  { normal: 'koeman', text: ' Koeman' },
  { normal: 'james', text: ' James' },
  { normal: 'mccarthy\'s', text: ' McCarthy\'s' },
  { normal: 'louis', text: ' Louis' },
  { normal: 'van', text: ' van' },
  { normal: 'gaal', text: ' Gaal' },
  { normal: 'michael', text: ' Michael' },
  { normal: 'carrick', text: ' Carrick,' },
  { normal: 'paul', text: ' Paul' },
  { normal: 'pogba', text: ' Pogba' },
  { normal: 'morgan', text: ' Morgan' } ]

However, it would be nicer if, instead of first names and last names being scattered all over the place, they were grouped: Jose Mourinho, Ronald Koeman, James McCarthy, Louis van Gaal, Michael Carrick, Paul Pogba, etc.

Right now I have to write extra code on top of nlp_compromise to group first and last names together. I use very simple logic that assumes the names are returned in order (and in this case they are, lucky me); otherwise there is no way to tell which first name goes with which last name. That is still not 100% reliable (nlp_compromise could return them out of order and I wouldn't know), just good enough 😃
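
For illustration, a workaround of the kind described above could look roughly like this. It is only a sketch, not the poster's actual code: `groupAdjacent` is a hypothetical helper, and it assumes the matches arrive in document order and carry the `text` field shown in the output above. It merges two consecutive matches only when nothing but whitespace separates them in the source text.

```javascript
// Hypothetical helper (not part of compromise): merge consecutive matches
// that sit next to each other in the source text.
// Assumption: `matches` is in document order and each item has a `text` field.
function groupAdjacent(sourceText, matches) {
  const groups = [];
  let cursor = 0;   // where to start searching for the next match
  let lastEnd = -1; // end offset of the previous match, -1 if none yet

  for (const m of matches) {
    // Drop surrounding whitespace and trailing punctuation such as "Carrick,"
    const word = m.text.trim().replace(/[.,;:!?]+$/, '');
    const start = sourceText.indexOf(word, cursor);
    if (start === -1) continue; // not found in order; skip rather than guess

    const gap = lastEnd === -1 ? null : sourceText.slice(lastEnd, start);
    if (gap !== null && /^\s+$/.test(gap)) {
      // Only whitespace between the two matches: treat them as one name
      groups[groups.length - 1] += ' ' + word;
    } else {
      groups.push(word);
    }
    lastEnd = start + word.length;
    cursor = lastEnd;
  }
  return groups;
}

const sample =
  'struggling under Jose Mourinho this season, alongside Michael Carrick, ' +
  'Ander Herrera and Paul Pogba';
const matches = [
  { text: ' Jose' }, { text: ' Mourinho' },
  { text: ' Michael' }, { text: ' Carrick,' },
  { text: ' Paul' }, { text: ' Pogba' },
];
console.log(groupAdjacent(sample, matches));
// → [ 'Jose Mourinho', 'Michael Carrick', 'Paul Pogba' ]
```

The same weakness mentioned above still applies: if the matches are not in document order, the search cursor skips them instead of pairing them correctly.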

the old v6 .topics() function would have returned this (for the same text):

text topics  [ { count: 4, text: 'everton' },
  { count: 3, text: 'schneiderlin' },
  { count: 2, text: 'summer' },
  { count: 2, text: 'trafford' },
  { count: 2, text: 'morgan schneiderlin' },
  { count: 1, text: 'albion' },
  { count: 1, text: 'cup of nations' },
  { count: 1, text: 'premier league' },
  { count: 1, text: 'idrissa gueye' },
  { count: 1, text: 'guardian' },
  { count: 1, text: 'manchester' },
  { count: 1, text: 'discount' },
  { count: 1, text: 'paul pogba' },
  { count: 1, text: 'toffees\' interest' },
  { count: 1, text: 'bastian schweinsteiger' },
  { count: 1, text: 'southampton' },
  { count: 1, text: 'ander herrera' },
  { count: 1, text: 'jose mourinho' },
  { count: 1, text: 'michael carrick' },
  { count: 1, text: 'james mccarthy' },
  { count: 1, text: 'ronald koeman' },
  { count: 1, text: 'fa cup' },
  { count: 1, text: 'hull city' },
  { count: 1, text: 'premier league appearance' },
  { count: 1, text: 'louis-van gaal' },
  { count: 1, text: 'he\'s' },
  { count: 1, text: 'tv' },
  { count: 1, text: 'west bromwich' },
  { count: 1, text: 'dutchman' } ]

That is clearly better at grouping first and last names together. v7 doesn't do it at all, even though the v7 documentation examples show that it does.

So what would make v7 named entity recognition perfect is grouping first and last names into full names, plus the ability to extract more generic topic keywords like the old v6 .topics() function did (e.g. 'football', 'premier league'), but without the pointless keywords that v6 .topics() sometimes extracted, like 'guardian', 'he's', 'tv' or 'summer'.
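
Until the library filters such keywords itself, one stopgap is to post-filter the topics array. The sketch below is an assumption-laden illustration: `filterTopics` and the `NOISE` list are hypothetical, the list simply reuses the unwanted keywords named above, and the input is assumed to have the `{ count, text }` shape of the v6 output.

```javascript
// Hypothetical stoplist built from the unwanted keywords named in this issue
const NOISE = new Set(['guardian', "he's", 'tv', 'summer']);

// Hypothetical helper: drop topics whose text is in the stoplist.
// Assumption: each topic looks like { count: number, text: string }.
function filterTopics(topics, noise = NOISE) {
  return topics.filter((t) => !noise.has(t.text.toLowerCase()));
}

const topics = [
  { count: 4, text: 'everton' },
  { count: 2, text: 'summer' },
  { count: 1, text: 'premier league' },
  { count: 1, text: 'tv' },
];
console.log(filterTopics(topics).map((t) => t.text));
// → [ 'everton', 'premier league' ]
```

A fixed stoplist only covers words that are already known to be noise, so it does not replace better topic extraction in the library.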

Top GitHub Comments

spencermountain commented, Jan 9, 2017

thanks for the good issue. shouldn’t be too bad. Can do it this week

spencermountain commented, Jan 13, 2017

hey @codepreneur this should be fixed now in compromise@7.0.2

nlp(myText).topics().data()
