Emoji with multiple code units not detected
First: apologies if I provide insufficient information or use the wrong terminology. This is my first GitHub issue ever, so please be kind.
The demo code works fine for me, including emojis. However, the demo emojis are each described by a single code unit. Emojis with more than one, e.g. “red heart” (2764 FE0F), are not detected, despite being in the lexicon.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentences = [
    "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
    "Not bad at all",  # Capitalized negation
    "Me and Fay are 4 years old today ❤️ (ft Grumio)…",
]
analyzer = SentimentIntensityAnalyzer()
for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<65} {}".format(sentence, str(vs)))
returns
Catch utf-8 emoji such as such as 💘 and 💋 and 😁------------------ {'neg': 0.0, 'neu': 0.615, 'pos': 0.385, 'compound': 0.875}
Not bad at all--------------------------------------------------- {'neg': 0.0, 'neu': 0.513, 'pos': 0.487, 'compound': 0.431}
Me and Fay are 4 years old today ❤️ (ft Grumio)… {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
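For anyone reproducing this, here is a minimal sketch of how to check what a given emoji consists of in Python. It only uses the standard library; the variable names are mine and nothing here comes from VADER itself.

```python
import unicodedata

# "red heart" as written above: 2764 FE0F
heart = "\u2764\ufe0f"
# One of the demo emojis that is detected correctly (U+1F498)
cupid = "\U0001F498"

# List every code point in the string, with its official Unicode name
code_points = [f"U+{ord(ch):04X}" for ch in heart]
for ch in heart:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Python strings are sequences of code points, so the lengths differ
print(len(heart), len(cupid))
```

The second character of the heart is the variation selector, which asks for the colorful emoji presentation; the demo emojis do not carry one.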
Issue Analytics
- State:
- Created 3 years ago
- Comments:5
Thanks for getting back about this. After a hiatus in research, I plan to get back to sentiment analysis soon, so I’ll watch this space and see whether there is anything I can contribute.
Found the problem, but I'm not sure about a fix.
I understand now that the sentiment scores are derived from the words of the emoji description, just like normal text, in the sentiment_valence function. So for ☹️ (“frowning face”) the word found in the lexicon is “frowning”, and for ❤️ (“red heart”) it is “heart”. I tested this extensively: the former works (so the problem does not occur with all emojis that have multiple code points), the latter does not.
The problem only occurs when the lexicon word is the last word of the emoji description. The loop (see my previous comment) only looks at the first code point to find the description, but then appends the second code point to the last word of that description. This changes the last character of that unigram, so the sentiment_valence look-up misses it. The change is barely visible in debug printouts (e.g. the letter “t” becomes a tiny bit smaller), which is why it took me so long to figure out what was going on.
How to fix it? Dealing properly with emojis with multiple code points would require some serious changes to the loop. I have opted for a quick and dirty fix: since the main culprit for sentiment-relevant emojis is “FE0F”, I changed the “else:” statement into “elif ord(chr) != 65039:” so that it is ignored completely. Seems to work.
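To make the above easier to follow, here is a minimal, self-contained sketch of the behaviour as I understand it. It is not the actual VADER source: the function replace_emojis, the flag skip_vs16 and the two-entry EMOJI_DESCRIPTIONS dictionary are hypothetical stand-ins I made up for illustration, and the loop is only a simplified imitation of the real one.

```python
# Hypothetical stand-in for the emoji lexicon, keyed by the first code point only
EMOJI_DESCRIPTIONS = {"\u2764": "red heart", "\u2639": "frowning face"}


def replace_emojis(text, skip_vs16=False):
    """Replace each known emoji character with its description (toy version)."""
    out = ""
    prev_space = True
    for ch in text:
        if ch in EMOJI_DESCRIPTIONS:
            if not prev_space:
                out += " "
            out += EMOJI_DESCRIPTIONS[ch]
            prev_space = False
        # With skip_vs16=True this is the quick fix: drop U+FE0F (65039)
        elif not skip_vs16 or ord(ch) != 65039:
            out += ch
            prev_space = ch == " "
    return out.strip()


broken = replace_emojis("today \u2764\ufe0f")
fixed = replace_emojis("today \u2764\ufe0f", skip_vs16=True)

# Without the fix, U+FE0F is glued onto the last description word
print(broken.split()[-1] == "heart")
# With the fix, the last word is a clean lexicon match
print(fixed.split()[-1] == "heart")
```

In this sketch "frowning face" is unaffected either way, because the selector ends up on "face" and "frowning" stays intact, which matches what I saw in my tests.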