Low accuracy of POS tagger trained on Universal Dependencies French corpus
Your Environment
- Operating System: Ubuntu 16.04
- Python Version Used: Python 3.5 (Miniconda)
- spaCy Version Used: 1.6.0 & 1.5.0
Hi,
I am trying to train a POS tagger for French. I started by downloading the Universal Dependencies corpus for French, then followed the spaCy POS tagger tutorial. However, the results are quite disappointing: the accuracy obtained is low:
precision recall f1-score support
ADJ 0.28 0.22 0.24 517
ADP 0.30 0.43 0.35 713
ADV 0.10 0.36 0.16 98
AUX 0.09 0.15 0.11 100
CONJ 0.03 0.07 0.05 87
DET 0.66 0.39 0.49 1719
INTJ 0.00 0.00 0.00 0
NOUN 0.62 0.74 0.67 1054
NUM 0.12 0.21 0.15 82
PART 0.06 0.08 0.07 51
PRON 0.17 0.21 0.19 354
PROPN 0.39 0.16 0.23 728
PUNCT 0.35 0.42 0.38 713
SCONJ 0.19 0.27 0.22 79
SYM 0.13 0.27 0.18 11
VERB 0.37 0.37 0.37 713
X 0.00 0.00 0.00 1
avg / total 0.44 0.39 0.39 7020
I saw issue #773 about NER post-training and I suspect it may be related. Is there any step I missed? Thanks in advance for your help,
Thomas
Here is the process I followed:
$ head ./UD_French/fr-ud-train.conllu
# sentid: fr-ud-train_00001
# sentence-text: Les commotions cérébrales sont devenu si courantes dans ce sport qu'on les considére presque comme la routine.
1 Les le DET _ Definite=Def|Gender=Fem|Number=Plur|PronType=Art 2 det _ _
2 commotions commotion NOUN _ Gender=Fem|Number=Plur 5 nsubj _ _
3 cérébrales cérébral ADJ _ Gender=Fem|Number=Plur 2 amod _ _
4 sont être AUX _ Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 5 aux _ _
5 devenu devenir VERB _ Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part 0 root _ _
6 si si ADV _ _ 7 advmod _ _
7 courantes courant ADJ _ Gender=Fem|Number=Plur 5 xcomp _ _
8 dans dans ADP _ _ 10 case _ _
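For reference, each token line in a CoNLL-U file carries ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal sketch of extracting the word form and universal POS tag from one such line (`parse_token_line` is a hypothetical helper, not part of spaCy):

```python
# Hypothetical helper: pull ID, FORM, and UPOS out of one CoNLL-U token line.
# Fields are tab-separated in the order:
# ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
def parse_token_line(line):
    fields = line.rstrip("\n").split("\t")
    token_id, form, upos = fields[0], fields[1], fields[3]
    return token_id, form, upos

line = "2\tcommotions\tcommotion\tNOUN\t_\tGender=Fem|Number=Plur\t5\tnsubj\t_\t_"
print(parse_token_line(line))  # ('2', 'commotions', 'NOUN')
```

Splitting on tabs (rather than arbitrary whitespace) is safer here, since a FORM field can in principle contain a space.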
import random
from collections import Counter
from pathlib import Path

from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse
from sklearn.metrics import classification_report

TAG_MAP = {
    'X': {'pos': "X"},
    'NOUN': {'pos': "NOUN"},
    'DET': {'pos': "DET"},
    'ADV': {'pos': "ADV"},
    'AUX': {'pos': "AUX"},
    'PRON': {'pos': "PRON"},
    'SCONJ': {'pos': "SCONJ"},
    'CONJ': {'pos': "CONJ"},
    'VERB': {'pos': "VERB"},
    'PROPN': {'pos': "PROPN"},
    'PUNCT': {'pos': "PUNCT"},
    'INTJ': {'pos': "INTJ"},
    'ADJ': {'pos': "ADJ"},
    'ADP': {'pos': "ADP"},
    'NUM': {'pos': "NUM"},
    'SYM': {'pos': "SYM"},
    'PART': {'pos': "PART"}
}

def gen_corpus(path):
    """Yield (words, tags) pairs, one per sentence, from a CoNLL-U file."""
    doc = []
    tagset = set()
    with open(path) as file:
        for line in file:
            if line[0].isdigit():
                features = line.split("\t")  # CoNLL-U fields are tab-separated
                word, pos = features[1], features[3]
                if pos != "_":  # skip multiword-token range lines (UPOS is "_")
                    tagset.add(pos)
                    doc.append((word, pos))
            elif len(line.strip()) == 0:
                # Blank line ends a sentence.
                if len(doc) > 0:
                    words, tags = zip(*doc)
                    yield (list(words), list(tags))
                doc = []

def evaluation(test_data, tagger, vocab):
    """Tag the test sentences and collect predicted/gold tag sequences."""
    counter = Counter()
    y_pred, y_true = [], []
    for words, tags in test_data:
        doc = Doc(vocab, words=words)
        tagger(doc)
        for i, word in enumerate(doc):
            counter[word.pos_ == tags[i]] += 1
            y_pred.append(word.pos_)
            y_true.append(tags[i])
    print(counter)
    return y_pred, y_true

def ensure_dir(path):
    if not path.exists():
        path.mkdir()

def gen_tagger(train_data, output_dir=None):
    if output_dir is not None:
        output_dir = Path(output_dir)
        ensure_dir(output_dir)
        ensure_dir(output_dir / "pos")
        ensure_dir(output_dir / "vocab")
    vocab = Vocab(tag_map=TAG_MAP)
    tagger = Tagger(vocab)
    for i in range(50):
        print(i)
        for words, tags in train_data:
            doc = Doc(vocab, words=words)
            gold = GoldParse(doc, tags=tags)
            tagger.update(doc, gold)
        random.shuffle(train_data)
    # tagger.model.end_training()
    if output_dir is not None:
        tagger.model.dump(str(output_dir / 'pos' / 'model'))
        with (output_dir / 'vocab' / 'strings.json').open('w') as file_:
            tagger.vocab.strings.dump(file_)
    return tagger, vocab

if __name__ == '__main__':
    train_path = "./UD_French/fr-ud-train.conllu"
    test_path = "./UD_French/fr-ud-test.conllu"
    TRAIN_DATA = list(gen_corpus(train_path))  # was gen_corpus(path): NameError
    TEST_DATA = list(gen_corpus(test_path))
    tagger, vocab = gen_tagger(TRAIN_DATA, "./spacy_postagger")
    y_pred, y_true = evaluation(TEST_DATA, tagger, vocab)
    # sklearn's signature is classification_report(y_true, y_pred)
    print(classification_report(y_true, y_pred))
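As a sanity check on a 0.39 overall accuracy, it helps to compare against a majority-class baseline (always predicting the most frequent training tag). A hedged sketch, assuming the (words, tags) sentence format produced by gen_corpus above (`majority_baseline` is a hypothetical helper, not part of the original script):

```python
from collections import Counter

def majority_baseline(train_data, test_data):
    """Accuracy on test_data of always predicting the most frequent training tag."""
    tag_counts = Counter(tag for _, tags in train_data for tag in tags)
    majority_tag, _ = tag_counts.most_common(1)[0]
    total = correct = 0
    for _, tags in test_data:
        for tag in tags:
            total += 1
            correct += (tag == majority_tag)
    return majority_tag, correct / total

# Toy data in the same (words, tags) format as gen_corpus output:
train = [(["le", "chat"], ["DET", "NOUN"]),
         (["un", "chien"], ["DET", "NOUN"]),
         (["chats"], ["NOUN"])]
test = [(["la", "souris"], ["DET", "NOUN"])]
print(majority_baseline(train, test))  # ('NOUN', 0.5)
```

If the trained tagger barely beats this floor, the model is effectively not learning, which points at a training-setup problem rather than a hard dataset.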
Issue Analytics
- State:
- Created 7 years ago
- Reactions: 1
- Comments: 11 (3 by maintainers)
Top GitHub Comments
@thomasgirault Hi, I've noticed you're using spaCy 1.5/1.6 to train the tagger. I've worked on spaCy tokenization for French (it is available on master, and should be available in the next release), and it should (hopefully) improve the tagger accuracy.
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.