v7 Named Entity Recognition does not group #FirstName and #LastName together, and the .topics() function is missing (in v6 it provided more generic keywords describing what a piece of text is about)
Let’s take this text as an example:
Manchester United have rejected a £19 million bid from Everton for Morgan Schneiderlin, as the January transfer window continues to heat-up.
As reported by the Guardian, United are looking to recoup the £24 million they paid for Schneiderlin back in the summer of 2015, with the French midfielder struggling badly for game time under Jose Mourinho this season. With that in mind, the club turned down Everton’s first official offer just days into the winter transfer window.
Get 24 Hours of Sky Sports and watch all of the action live with NOW TV for £6.99. Sign up and get a 10% Discount.
The newspaper reports that the Toffees’ interest in Schneiderlin has been known for several weeks, as Ronald Koeman looks to strengthen his midfield options going into the second half of the season. The Dutchman worked with the 27-year-old during his spell in charge of Southampton, so knows what he’s capable of at this level.
With Idrissa Gueye now heading off to the African Cup of Nations and James McCarthy’s fitness still a cause for concern, Everton are eyeing several additions to their squad this month, while West Bromwich Albion are also believed to be interested in Schneiderlin’s services, according to the paper.
The French international was a consistent figure under Louis van Gaal last season, featuring alongside the likes of Michael Carrick, Ander Herrera and Bastian Schweinsteiger in midfield, but his situation has drastically changed this term, with Paul Pogba arriving at Old Trafford for a world-record fee in the summer.
Schneiderlin has won 33% of his average duels and registered an average pass accuracy of 87% in three Premier League appearances this season, as United sit sixth on 39 points, one behind the top four and nine clear of Everton in seventh.
Next up for them is a home clash with Reading in the FA Cup on Saturday, before taking on Hull City at Old Trafford.
64.3% of Morgan Schneiderlin’s passes have been forward in the Premier League this season.
Right now, r.match('(#Person|#Place|#Organization)').data() returns pretty good results:
[ { normal: 'manchester united', text: 'Manchester United' },
{ normal: 'morgan', text: ' Morgan' },
{ normal: 'jose', text: ' Jose' },
{ normal: 'mourinho', text: ' Mourinho' },
{ normal: 'ronald', text: ' Ronald' },
{ normal: 'koeman', text: ' Koeman' },
{ normal: 'james', text: ' James' },
{ normal: 'mccarthy\'s', text: ' McCarthy’s' },
{ normal: 'louis', text: ' Louis' },
{ normal: 'van', text: ' van' },
{ normal: 'gaal', text: ' Gaal' },
{ normal: 'michael', text: ' Michael' },
{ normal: 'carrick', text: ' Carrick,' },
{ normal: 'paul', text: ' Paul' },
{ normal: 'pogba', text: ' Pogba' },
{ normal: 'morgan', text: ' Morgan' } ]
However, it would be nicer if first and last names were grouped instead of being scattered all over the place: Jose Mourinho, Ronald Koeman, James McCarthy’s, Louis van Gaal, Michael Carrick, Paul Pogba, etc.
Right now I have to write extra code on top of nlp_compromise to group first and last names together. My logic is very simple: it assumes the names are returned in document order (in this case they are, lucky me). Without that assumption there is no way to tell which first name goes with which last name, and even with it the approach is not 100% reliable, because nlp_compromise could return them out of order and I wouldn’t know. It’s just good enough 😃
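For illustration, the kind of workaround described above could look like the sketch below. It is not the reporter's actual code and not part of nlp_compromise: `groupAdjacentEntities` is a hypothetical helper that only assumes the `{ text }` shape shown in the output above and that tokens arrive in document order. Instead of pairing names blindly, it looks each token up in the source text and merges two consecutive tokens only when nothing but whitespace separates them.

```javascript
// Hypothetical helper (not part of nlp_compromise): merges consecutive
// entity tokens that sit next to each other in the source text.
// Assumes `entities` is in document order, as in the output above.
function groupAdjacentEntities(sourceText, entities) {
  const groups = [];
  let lastEnd = -1; // end offset of the previously located token

  for (const entity of entities) {
    // Drop the leading space and trailing punctuation seen in `text`
    // values such as ' Carrick,'.
    const token = entity.text.trim().replace(/[.,;:!?]+$/, '');
    if (token === '') continue;

    // Search forward from the previous token so repeated names resolve
    // to successive occurrences.
    const start = sourceText.indexOf(token, Math.max(lastEnd, 0));
    if (start === -1) continue; // not found in order: skip this token

    const gap = lastEnd === -1 ? '' : sourceText.slice(lastEnd, start);
    if (groups.length > 0 && /^\s+$/.test(gap)) {
      // Only whitespace between the two tokens: treat as one name.
      groups[groups.length - 1] += ' ' + token;
    } else {
      groups.push(token);
    }
    lastEnd = start + token.length;
  }
  return groups;
}

const sample = 'Under Jose Mourinho he met Ronald Koeman, Michael Carrick, Paul Pogba.';
const tokens = [' Jose', ' Mourinho', ' Ronald', ' Koeman',
  ' Michael', ' Carrick,', ' Paul', ' Pogba'].map((text) => ({ text }));
console.log(groupAdjacentEntities(sample, tokens));
```

This still breaks if the library returns tokens out of order, or if a name also occurs earlier in the text without being tagged, so it reduces the guesswork rather than removing it.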
The old v6 .topics() function would have returned this for the same text:
text topics [ { count: 4, text: 'everton' },
{ count: 3, text: 'schneiderlin' },
{ count: 2, text: 'summer' },
{ count: 2, text: 'trafford' },
{ count: 2, text: 'morgan schneiderlin' },
{ count: 1, text: 'albion' },
{ count: 1, text: 'cup of nations' },
{ count: 1, text: 'premier league' },
{ count: 1, text: 'idrissa gueye' },
{ count: 1, text: 'guardian' },
{ count: 1, text: 'manchester' },
{ count: 1, text: 'discount' },
{ count: 1, text: 'paul pogba' },
{ count: 1, text: 'toffees\' interest' },
{ count: 1, text: 'bastian schweinsteiger' },
{ count: 1, text: 'southampton' },
{ count: 1, text: 'ander herrera' },
{ count: 1, text: 'jose mourinho' },
{ count: 1, text: 'michael carrick' },
{ count: 1, text: 'james mccarthy' },
{ count: 1, text: 'ronald koeman' },
{ count: 1, text: 'fa cup' },
{ count: 1, text: 'hull city' },
{ count: 1, text: 'premier league appearance' },
{ count: 1, text: 'louis-van gaal' },
{ count: 1, text: 'he\'s' },
{ count: 1, text: 'tv' },
{ count: 1, text: 'west bromwich' },
{ count: 1, text: 'dutchman' } ]
This is clearly better at grouping first and last names together. v7 doesn’t do it at all, even though the examples in the v7 documentation show that it does.
So what would make v7 named entity recognition perfect is grouping first and last names into full names, plus the ability to extract more generic topic keywords like the old v6 .topics() function did (e.g. ‘football’, ‘premier league’), but without the pointless keywords that the old v6 .topics() function sometimes extracted, like ‘guardian’, ‘he's’, ‘tv’, or ‘summer’.
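Until the library filters these itself, the noisy keywords can be dropped after the fact. The sketch below is only an assumption about how that might be done: `filterTopics` is a hypothetical helper, not an nlp_compromise API, and both the blocklist and the minimum length are values the caller has to choose. It works on the `{ count, text }` objects shown in the v6 output above.

```javascript
// Hypothetical post-filter (not part of nlp_compromise) for the
// { count, text } objects returned by the v6 .topics() call above.
// `blocklist` and `minLength` are caller-chosen assumptions.
function filterTopics(topics, blocklist, minLength = 3) {
  const blocked = new Set(blocklist.map((word) => word.toLowerCase()));
  return topics.filter((topic) => {
    const text = topic.text.toLowerCase();
    // Drop very short keywords (e.g. 'tv') and anything blocklisted.
    return text.length >= minLength && !blocked.has(text);
  });
}

const topics = [
  { count: 4, text: 'everton' },
  { count: 2, text: 'summer' },
  { count: 1, text: 'guardian' },
  { count: 1, text: 'he\'s' },
  { count: 1, text: 'tv' },
  { count: 1, text: 'fa cup' },
];
console.log(filterTopics(topics, ['summer', 'guardian', 'he\'s']));
```

A fixed blocklist is crude: whether ‘guardian’ counts as noise depends on the text, so this is a stopgap and not a substitute for better topic extraction.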
Issue Analytics
- Created 7 years ago
- Comments: 7 (2 by maintainers)
Top GitHub Comments
Thanks for the good issue. Shouldn’t be too bad; I can do it this week.
Hey @codepreneur, this should be fixed now in compromise@7.0.2