
Feature request: pretrained English sentiment model

See original GitHub issue

Having pretrained GloVe vectors easily accessible is great for quick experimentation. It would be great if there were a pretrained sentiment model I could use too. Right now, all the lexemes in the English vocabulary have their sentiment set to 0.

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

3 reactions
honnibal commented, Jan 22, 2017

Hi Pete,

First, thanks for your work on React 😃. I remember seeing your talks right after it came out and thinking it really made sense.

Pre-trained models for a variety of languages, genres and use-cases are actually the main commercial offering we’re working on for spaCy. You’ll be able to download a data pack for a small one-time fee, and you’ll get 12 months of upgrades as they’re published.

After you download the data, it’s yours — you can run the model however you like, without pinging an external service. Crucially, you’ll also be able to backpropagate into it, something that no cloud provider will be able to offer you.

Timelines are always tricky, but think weeks, not months 😃.

Our data packs will have sentiment models you’ll be able to use out-of-the-box. However, the model will get much, much better on your use case if you “fine tune” it on your own data. There’s not really any such thing as “sentiment” in general: the exact behaviours you need from the model will be specific to your application. The design we’re going for is that the pre-trained model gives you the basic knowledge about the language and the world, and your own data programs the system to do what you need.

To get you moving for now, the code in this example is a pretty good sentiment model for long texts: https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py. It projects the labels down from the document level to the sentence, and then uses a bidirectional LSTM model to encode position-sensitive features onto the words. This means that the model is capable of seeing that “charge” is positive in some contexts as a noun, but “charge back” is almost always negative. The position-sensitive features are then pooled, and a model predicts over the resulting vector for each sentence. The document prediction is a simple sum of the sentence predictions.

I would recommend using this “bag of sentences” approach in most long-document scenarios. Current models don’t get useful signal between sentences, and predicting the sentences in parallel is a huge improvement for tractability.
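To make the “bag of sentences” idea concrete, here is a toy sketch in plain Python. Everything in it is invented for illustration: the helper names are hypothetical, and the per-sentence scores are made-up numbers standing in for the Bi-LSTM’s output in the linked example.

```python
# Toy sketch of the "bag of sentences" scheme described above.
# All names and numbers here are illustrative, not from the spaCy example.

def project_labels(doc_label, sentences):
    """Training trick: every sentence inherits its document's label,
    turning one document-level example into many (noisy) sentence-level ones."""
    return [(sent, doc_label) for sent in sentences]

def score_document(sentence_scores):
    """Prediction: score each sentence independently (in the real example,
    with a bidirectional LSTM), then sum the per-sentence predictions."""
    return sum(sentence_scores)

# Hypothetical per-sentence scores for a three-sentence review:
print(round(score_document([0.8, -0.1, 0.4]), 2))  # → 1.1
```

Because each sentence is scored independently, the sentence predictions can be computed in parallel, which is where the tractability win on long documents comes from.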

1 reaction
honnibal commented, Mar 10, 2018

@vcovo There’s no update on this, sorry — we haven’t been able to find a publicly available dataset that we were happy with, and we didn’t want to put out something that we didn’t think would be useful.

This applies more generally to the idea of the data store mentioned above: for almost anything we wanted to do, we found we wanted to annotate fresh data. We therefore put our annotation tool project, Prodigy, ahead of the data store in our work queue – now that Prodigy’s out and being used, the data store is back on the agenda.

We do have the text classifier in spaCy, so training it yourself on any of the publicly available datasets should be quite easy. See here for the example on training on IMDB: https://github.com/explosion/spaCy/blob/master/examples/training/train_textcat.py. Training should complete in a few hours on a CPU.

Be sure to also benchmark the models you trained against the text classification functions in other open-source libraries, especially Vowpal Wabbit, scikit-learn and FastText. I’ve set up spaCy’s text classifier in a way that I’ve found to be generally good on the problems I’ve been working on, and it’s particularly well suited for short texts. However, one model isn’t best across the board — so you’ll do well to check the other open-source solutions as well, which are faster due to different algorithmic choices.
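As a concrete example of the kind of simple baseline worth including in such a comparison, here is a minimal bag-of-words Naive Bayes classifier in pure Python. The training data below is made up for illustration; a real benchmark would use one of the libraries named above on a real dataset such as IMDB.

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes over bag-of-words features: a toy
# baseline to sanity-check fancier classifiers against.

def train_nb(examples):
    """examples: list of (text, label). Returns (priors, word_counts, vocab)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    priors = {label: n / len(examples) for label, n in label_counts.items()}
    return priors, word_counts, vocab

def predict_nb(model, text):
    priors, word_counts, vocab = model
    best_label, best_score = None, float("-inf")
    for label, prior in priors.items():
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only:
train = [
    ("great fantastic loved it", "pos"),
    ("wonderful great acting", "pos"),
    ("terrible boring hated it", "neg"),
    ("awful waste boring", "neg"),
]
model = train_nb(train)
print(predict_nb(model, "boring and terrible"))  # → neg
```

Naive Bayes is often a surprisingly strong baseline on short texts, so if a trained neural classifier can’t clearly beat it on your data, that’s a useful warning sign.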

If you need to annotate data as well as use existing resources, do have a look at Prodigy — it’s very efficient at training a new model.
