Reproduction of experimental results

First of all, thanks for sharing this clean, object-oriented code! I have learned a lot from this repo. I even want to say: wow, you can really code! ^_^

I trained the model on the CoNLL04 dataset with the default configuration, following the README, and the test results are as follows:

--- Entities (NER) ---

                type    precision       recall     f1-score      support
                 Org        79.43        83.84        81.57          198
                 Loc        91.51        90.87        91.19          427
               Other        76.61        71.43        73.93          133
                Peop        92.17        95.33        93.72          321

               micro        87.70        88.51        88.10         1079
               macro        84.93        85.37        85.10         1079

--- Relations ---

Without NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        74.36        61.70        67.44           94
                Live        74.04        77.00        75.49          100

               micro        72.84        68.01        70.34          422
               macro        73.72        69.35        71.30          422

With NER
                type    precision       recall     f1-score      support
                Kill        84.78        82.98        83.87           47
               OrgBI        73.86        61.90        67.36          105
                Work        61.54        63.16        62.34           76
               LocIn        73.08        60.64        66.28           94
                Live        74.04        77.00        75.49          100

               micro        72.59        67.77        70.10          422
               macro        73.46        69.14        71.07          422

The test results are worse than those reported in the original paper, especially for the macro-averaged metrics.
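As a sanity check on how the macro row is derived, here is a minimal sketch that recomputes it from the per-type rows of the NER table above. It assumes the macro average is the unweighted mean of each column over the four entity types; the names `ner_scores` and `macro_average` are made up for this illustration and are not taken from the repository.

```python
# Per-type scores copied from the "Entities (NER)" table above:
# type -> (precision, recall, f1-score)
ner_scores = {
    "Org": (79.43, 83.84, 81.57),
    "Loc": (91.51, 90.87, 91.19),
    "Other": (76.61, 71.43, 73.93),
    "Peop": (92.17, 95.33, 93.72),
}


def macro_average(scores):
    """Return the unweighted mean of each metric over all types (assumed definition)."""
    n = len(scores)
    # zip(*...) turns the per-type rows into per-metric columns.
    columns = zip(*scores.values())
    return tuple(round(sum(column) / n, 2) for column in columns)


print(macro_average(ner_scores))
```

Under that assumption the result agrees with the macro row of the table (84.93, 85.37, 85.10), which suggests the macro F1 is the mean of the per-type F1 scores rather than an F1 computed from the macro precision and recall. Because every type counts equally, a weak small class such as Other pulls the macro score down more than the micro score.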

Is it possible that the random seed is different? I only set seed=42 in example_train.conf.

Thanks!
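For readers wondering what fixing a seed usually involves in a Python training script, the following is a generic sketch and not the repository's own code. The helper `set_seed` is hypothetical, and how the `seed` value from example_train.conf is actually consumed is not shown in this thread. NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def set_seed(seed):
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        # Safe to call without a GPU; it is ignored when CUDA is unavailable.
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(42)
first = random.random()
set_seed(42)
second = random.random()
print(first == second)  # the same seed gives the same draw
```

Even with identical seeds, results can still differ across hardware, library versions, and non-deterministic GPU kernels, so a fixed seed alone may not reproduce a published number.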

Top GitHub Comments

markus-eberts commented, Jul 15, 2020

Yes, you should evaluate the provided model on the test set. However, the provided model is the best out of 5 runs, whereas we report the average of 5 runs in our paper (…and due to random weight initialization and sampling the performance varies between runs). That’s why you get a better performance compared to the results we reported in our paper.

Thanks 😃!
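The difference between a best-of-5 checkpoint and a 5-run average can be made concrete with a small sketch. The scores below are invented purely for illustration and are not SpERT results; `run_f1` and `summarize_runs` are hypothetical names.

```python
from statistics import mean, stdev

# Hypothetical micro-F1 scores from five runs with different seeds.
# These numbers are made up for illustration; they are NOT SpERT results.
run_f1 = [70.1, 71.4, 69.8, 72.0, 70.9]


def summarize_runs(scores):
    """Return the mean, sample standard deviation, and best score of the runs."""
    return {
        "mean": round(mean(scores), 2),
        "std": round(stdev(scores), 2),
        "best": max(scores),
    }


summary = summarize_runs(run_f1)
print(summary)
```

The best run is always at least as high as the mean, so a single released checkpoint can score above the averaged figure in a paper, while any one fresh training run can land on either side of that average.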

JackySnake commented, Jul 15, 2020

I understand. Thanks a lot.
