
Low accuracy of POS tagger trained on Universal Dependency French corpus

See original GitHub issue

Your Environment

  • Operating System: Ubuntu 16.04
  • Python Version Used: Python 3.5 (Miniconda)
  • spaCy Version Used: 1.6.0 & 1.5.0

Hi,

I am trying to train a POS tagger for French. I started by getting the Universal Dependencies corpus for French. Then I followed the spaCy POS tagger tutorial. However, the results are quite disappointing, as the accuracy obtained is low:

             precision    recall  f1-score   support
        ADJ       0.28      0.22      0.24       517
        ADP       0.30      0.43      0.35       713
        ADV       0.10      0.36      0.16        98
        AUX       0.09      0.15      0.11       100
       CONJ       0.03      0.07      0.05        87
        DET       0.66      0.39      0.49      1719
       INTJ       0.00      0.00      0.00         0
       NOUN       0.62      0.74      0.67      1054
        NUM       0.12      0.21      0.15        82
       PART       0.06      0.08      0.07        51
       PRON       0.17      0.21      0.19       354
      PROPN       0.39      0.16      0.23       728
      PUNCT       0.35      0.42      0.38       713
      SCONJ       0.19      0.27      0.22        79
        SYM       0.13      0.27      0.18        11
       VERB       0.37      0.37      0.37       713
          X       0.00      0.00      0.00         1
avg / total       0.44      0.39      0.39      7020

I saw issue #773 about NER post-training and I suppose it may be related. Is there any step I missed? Thanks in advance for your help,

Thomas

Here is the process I followed:

$ head  ./UD_French/fr-ud-train.conllu

    # sentid: fr-ud-train_00001
    # sentence-text: Les commotions cérébrales sont devenu si courantes dans ce sport qu'on les considére presque comme la routine.
    1	Les	le	DET	_	Definite=Def|Gender=Fem|Number=Plur|PronType=Art	2	det	_	_
    2	commotions	commotion	NOUN	_	Gender=Fem|Number=Plur	5	nsubj	_	_
    3	cérébrales	cérébral	ADJ	_	Gender=Fem|Number=Plur	2	amod	_	_
    4	sont	être	AUX	_	Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin	5	aux	_	_
    5	devenu	devenir	VERB	_	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_	_
    6	si	si	ADV	_	_	7	advmod	_	_
    7	courantes	courant	ADJ	_	Gender=Fem|Number=Plur	5	xcomp	_	_
    8	dans	dans	ADP	_	_	10	case	_	_
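
Each token line in the CoNLL-U format carries ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the training script below only reads FORM and UPOS. As a minimal, self-contained sketch of that extraction, using the second token line from the sample above:

    # One token line from fr-ud-train.conllu, with the ten CoNLL-U columns tab-separated.
    sample = "2\tcommotions\tcommotion\tNOUN\t_\tGender=Fem|Number=Plur\t5\tnsubj\t_\t_"
    columns = sample.split("\t")
    form, upos = columns[1], columns[3]  # FORM and UPOS are the 2nd and 4th columns
    print(form, upos)                    # -> commotions NOUN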


    import random
    from collections import Counter

    from pathlib import Path
    from spacy.vocab import Vocab
    from spacy.tagger import Tagger
    from spacy.tokens import Doc
    from spacy.gold import GoldParse

    from sklearn.metrics import accuracy_score, classification_report


    TAG_MAP = {
        'X':{'pos':"X"},
        'NOUN':{'pos':"NOUN"},
        'DET':{'pos':"DET"},
        'ADV':{'pos':"ADV"},
        'AUX':{'pos':"AUX"},
        'PRON':{'pos':"PRON"},
        'SCONJ':{'pos':"SCONJ"},
        'CONJ':{'pos':"CONJ"},
        'VERB':{'pos':"VERB"},
        'PROPN':{'pos':"PROPN"},
        'PUNCT':{'pos':"PUNCT"},
        'INTJ':{'pos':"INTJ"},
        'ADJ':{'pos':"ADJ"},
        'ADP':{'pos':"ADP"},
        'NUM':{'pos':"NUM"},
        'SYM':{'pos':"SYM"},
        'PART':{'pos':"PART"}
    }


    def gen_corpus(path):
        doc = []
        tagset = set()
        with open(path) as file:
            for line in file:
                if line[0].isdigit():
                    features = line.split()
                    word, pos = features[1], features[3]
                    # Multiword-token range lines (e.g. "3-4") carry "_" as UPOS, so they are skipped.
                    if pos != "_":
                        tagset.add(pos)
                        doc.append((word, pos))
                elif len(line.strip()) == 0:
                    if len(doc) > 0:
                        words, tags = zip(*doc)
                        yield (list(words), list(tags))
                    doc = []


    def evaluation(TEST_DATA):
        counter = Counter()
        y_pred, y_true = [], []
        for words, tags in TEST_DATA:
            # Tag a Doc built from the gold tokens, using the module-level tagger and vocab
            # created in the __main__ block below.
            doc = Doc(vocab, words=words)
            tagger(doc)
            for i, word in enumerate(doc):
                counter[word.pos_ == tags[i]] += 1
                y_pred.append(word.pos_)
                y_true.append(tags[i])
        print(counter)
        return y_pred, y_true


    def ensure_dir(path):
        if not path.exists():
            path.mkdir()


    def gen_tagger(TRAIN_DATA, output_dir=None):
        if output_dir is not None:
            output_dir = Path(output_dir)
            ensure_dir(output_dir)
            ensure_dir(output_dir / "pos")
            ensure_dir(output_dir / "vocab")

        vocab = Vocab(tag_map=TAG_MAP)
        tagger = Tagger(vocab)
        for i in range(50):
            print(i)
            for words, tags in TRAIN_DATA:
                doc = Doc(vocab, words=words)
                gold = GoldParse(doc, tags=tags)
                tagger.update(doc, gold)
            random.shuffle(TRAIN_DATA)
        # tagger.model.end_training()
        
        if output_dir is not None:
            tagger.model.dump(str(output_dir / 'pos' / 'model'))
            with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
                tagger.vocab.strings.dump(file_)

        return tagger, vocab

    if __name__ == '__main__':

        train_path = "./UD_French/fr-ud-train.conllu"
        test_path = "./UD_French/fr-ud-test.conllu"

        TRAIN_DATA = list(gen_corpus(train_path))
        TEST_DATA = list(gen_corpus(test_path))
        tagger, vocab = gen_tagger(TRAIN_DATA, "./spacy_postagger")
        y_pred, y_true = evaluation(TEST_DATA)
        print(classification_report(y_true, y_pred))
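
To see which tags get mixed up rather than only the aggregated report, a per-tag confusion matrix can help. A minimal sketch with scikit-learn, assuming y_true, y_pred and TAG_MAP from the script above are still in scope:

    from sklearn.metrics import confusion_matrix

    # Fix the tag order so rows/columns of the matrix are readable.
    labels = sorted(TAG_MAP.keys())

    # Rows are gold tags, columns are predicted tags.
    matrix = confusion_matrix(y_true, y_pred, labels=labels)

    for label, row in zip(labels, matrix):
        # For each gold tag, show the three predicted tags it is most often mapped to.
        top = sorted(zip(labels, row), key=lambda item: item[1], reverse=True)[:3]
        print(label, top)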

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Reactions: 1
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

1 reaction
raphael0202 commented, Mar 18, 2017

@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger. I've worked on spaCy tokenization for French (it is available on master and should be included in the next release), and it should (hopefully) improve the tagger accuracy.
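
As a rough way to see how much French-specific tokenization matters for this corpus, one can count the gold tokens that naive whitespace splitting of the raw text would never produce (apostrophe clitics such as qu' or l'). A small standalone sketch, reusing gen_corpus from the script above:

    # Count gold tokens containing an apostrophe: whitespace-splitting the raw sentence
    # text would keep e.g. "qu'on" as one blob instead of the gold tokens "qu'" + "on",
    # so a French-aware tokenizer matters once the tagger is applied to raw text.
    clitic_like = total = 0
    for words, _tags in gen_corpus("./UD_French/fr-ud-test.conllu"):
        for word in words:
            total += 1
            if "'" in word or "’" in word:
                clitic_like += 1
    print("tokens with apostrophes: %d / %d" % (clitic_like, total))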

0 reactions
lock[bot] commented, May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
