Low accuracy of POS tagger trained on Universal Dependencies French corpus
Your Environment
- Operating System: Ubuntu 16.04
- Python Version Used: Python 3.5 (Miniconda)
- spaCy Version Used: 1.6.0 & 1.5.0
Hi,
I am trying to train a POS tagger for French. I started by downloading the Universal Dependencies corpus for French, then followed the spaCy POS tagger tutorial. However, the results are quite disappointing: the accuracy obtained is low:
precision recall f1-score support
ADJ 0.28 0.22 0.24 517
ADP 0.30 0.43 0.35 713
ADV 0.10 0.36 0.16 98
AUX 0.09 0.15 0.11 100
CONJ 0.03 0.07 0.05 87
DET 0.66 0.39 0.49 1719
INTJ 0.00 0.00 0.00 0
NOUN 0.62 0.74 0.67 1054
NUM 0.12 0.21 0.15 82
PART 0.06 0.08 0.07 51
PRON 0.17 0.21 0.19 354
PROPN 0.39 0.16 0.23 728
PUNCT 0.35 0.42 0.38 713
SCONJ 0.19 0.27 0.22 79
SYM 0.13 0.27 0.18 11
VERB 0.37 0.37 0.37 713
X 0.00 0.00 0.00 1
avg / total 0.44 0.39 0.39 7020
I saw issue #773 about NER post-training and I suspect it may be related. Is there any step I missed? Thanks in advance for your help,
Thomas
Here is the process I followed:
$ head ./UD_French/fr-ud-train.conllu
# sentid: fr-ud-train_00001
# sentence-text: Les commotions cérébrales sont devenu si courantes dans ce sport qu'on les considére presque comme la routine.
1 Les le DET _ Definite=Def|Gender=Fem|Number=Plur|PronType=Art 2 det _ _
2 commotions commotion NOUN _ Gender=Fem|Number=Plur 5 nsubj _ _
3 cérébrales cérébral ADJ _ Gender=Fem|Number=Plur 2 amod _ _
4 sont être AUX _ Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 5 aux _ _
5 devenu devenir VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root _ _
6 si si ADV _ _ 7 advmod _ _
7 courantes courant ADJ _ Gender=Fem|Number=Plur 5 xcomp _ _
8 dans dans ADP _ _ 10 case _ _
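For reference, each token line in a CoNLL-U file carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal sketch of extracting the word form and universal POS tag from one such line (`parse_token_line` is a hypothetical helper, not part of spaCy):

```python
# Hypothetical helper: pull ID, FORM, and UPOS out of one CoNLL-U token line.
# Fields are tab-separated in the order:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
def parse_token_line(line):
    fields = line.rstrip("\n").split("\t")
    token_id, form, upos = fields[0], fields[1], fields[3]
    return token_id, form, upos

line = "2\tcommotions\tcommotion\tNOUN\t_\tGender=Fem|Number=Plur\t5\tnsubj\t_\t_"
print(parse_token_line(line))  # ('2', 'commotions', 'NOUN')
```

Splitting on tabs (rather than arbitrary whitespace) is safer here, since a FORM field can in principle contain a space.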
import random
from collections import Counter
from pathlib import Path

from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse
from sklearn.metrics import classification_report

TAG_MAP = {
    'X': {'pos': "X"},
    'NOUN': {'pos': "NOUN"},
    'DET': {'pos': "DET"},
    'ADV': {'pos': "ADV"},
    'AUX': {'pos': "AUX"},
    'PRON': {'pos': "PRON"},
    'SCONJ': {'pos': "SCONJ"},
    'CONJ': {'pos': "CONJ"},
    'VERB': {'pos': "VERB"},
    'PROPN': {'pos': "PROPN"},
    'PUNCT': {'pos': "PUNCT"},
    'INTJ': {'pos': "INTJ"},
    'ADJ': {'pos': "ADJ"},
    'ADP': {'pos': "ADP"},
    'NUM': {'pos': "NUM"},
    'SYM': {'pos': "SYM"},
    'PART': {'pos': "PART"}
}

def gen_corpus(path):
    """Yield (words, tags) pairs, one per sentence, from a CoNLL-U file."""
    doc = []
    tagset = set()
    with open(path) as file:
        for line in file:
            if line[0].isdigit():
                features = line.split("\t")  # CoNLL-U fields are tab-separated
                word, pos = features[1], features[3]
                if pos != "_":  # skip multiword-token range lines (UPOS is "_")
                    tagset.add(pos)
                    doc.append((word, pos))
            elif len(line.strip()) == 0:
                # Blank line ends a sentence.
                if len(doc) > 0:
                    words, tags = zip(*doc)
                    yield (list(words), list(tags))
                doc = []

def evaluation(test_data, tagger, vocab):
    """Tag the test sentences and collect predicted/gold tag sequences."""
    counter = Counter()
    y_pred, y_true = [], []
    for words, tags in test_data:
        doc = Doc(vocab, words=words)
        tagger(doc)
        for i, word in enumerate(doc):
            counter[word.pos_ == tags[i]] += 1
            y_pred.append(word.pos_)
            y_true.append(tags[i])
    print(counter)
    return y_pred, y_true

def ensure_dir(path):
    if not path.exists():
        path.mkdir()

def gen_tagger(train_data, output_dir=None):
    if output_dir is not None:
        output_dir = Path(output_dir)
        ensure_dir(output_dir)
        ensure_dir(output_dir / "pos")
        ensure_dir(output_dir / "vocab")
    vocab = Vocab(tag_map=TAG_MAP)
    tagger = Tagger(vocab)
    for i in range(50):
        print(i)
        for words, tags in train_data:
            doc = Doc(vocab, words=words)
            gold = GoldParse(doc, tags=tags)
            tagger.update(doc, gold)
        random.shuffle(train_data)
    # tagger.model.end_training()
    if output_dir is not None:
        tagger.model.dump(str(output_dir / 'pos' / 'model'))
        with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
            tagger.vocab.strings.dump(file_)
    return tagger, vocab

if __name__ == '__main__':
    train_path = "./UD_French/fr-ud-train.conllu"
    test_path = "./UD_French/fr-ud-test.conllu"
    TRAIN_DATA = list(gen_corpus(train_path))  # was gen_corpus(path): NameError
    TEST_DATA = list(gen_corpus(test_path))
    tagger, vocab = gen_tagger(TRAIN_DATA, "./spacy_postagger")
    y_pred, y_true = evaluation(TEST_DATA, tagger, vocab)
    # sklearn's signature is classification_report(y_true, y_pred)
    print(classification_report(y_true, y_pred))
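As a sanity check on a 0.39 overall accuracy, it helps to compare against a majority-class baseline (always predicting the most frequent training tag). A hedged sketch, assuming the (words, tags) sentence format produced by gen_corpus above (`majority_baseline` is a hypothetical helper, not part of the original script):

```python
from collections import Counter

def majority_baseline(train_data, test_data):
    """Accuracy on test_data of always predicting the most frequent training tag."""
    tag_counts = Counter(tag for _, tags in train_data for tag in tags)
    majority_tag, _ = tag_counts.most_common(1)[0]
    total = correct = 0
    for _, tags in test_data:
        for tag in tags:
            total += 1
            correct += (tag == majority_tag)
    return majority_tag, correct / total

# Toy data in the same (words, tags) format as gen_corpus output:
train = [(["le", "chat"], ["DET", "NOUN"]),
         (["un", "chien"], ["DET", "NOUN"]),
         (["chats"], ["NOUN"])]
test = [(["la", "souris"], ["DET", "NOUN"])]
print(majority_baseline(train, test))  # ('NOUN', 0.5)
```

If the trained tagger barely beats this floor, the model is effectively not learning, which points at a training-setup problem rather than a hard dataset.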
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 1
- Comments: 11 (3 by maintainers)
Top GitHub Comments
@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger. I've worked on spaCy tokenization for French (it is available on master, and should be available in the next release), and it should (hopefully) improve the tagger accuracy.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.