
Cannot retrain full text model


Dear,

I am trying to retrain the fulltext model with new PDF files, using Docker (image lfoppiano/grobid:0.7.1). Everything works quite smoothly when creating the training data. I am only interested in bibliographical reference markers, so I have mostly added <ref type="biblio">....</ref> tags and deleted some wrong ref tags from the TEI files.

Next, I put the newly created tei.xml files under C:\grobid\grobid-trainer\resources\dataset\fulltext\corpus\tei. There are also a few files under the evaluation folder.

However, when I run the command below:

java -Xmx4G -jar grobid-trainer/build/libs/grobid-trainer-0.7.1-onejar.jar 2 fulltext -gH grobid-home -s 0.75

I got:

path2GbdHome=grobid-home path2GbdProperties=grobid-home/config/grobid.properties
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
sourceTEIPathLabel: /opt/grobid/retrain/resources/dataset/fulltext/corpus/tei
sourceRawPathLabel: /opt/grobid/retrain/resources/dataset/fulltext/corpus/raw
trainingOutputPath: /opt/grobid/retrain/…/grobid-home/tmp/fulltext3115104342537113071.train
evalOutputPath: /opt/grobid/retrain/…/grobid-home/tmp/fulltext15117863942333625584.test
epsilon: 1.0E-5
window: 20
nb max iterations: 2000
nb threads: 16
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

I also tried it with Gradle (./gradlew train_fulltext --stacktrace --info) and got:

Watching 8 directories to track changes
Watching 8 directories to track changes
Build cache key for task ':grobid-core:compileJava' is 9c13110ecb6cd4046c97c9c27de4c2dd
Task ':grobid-core:compileJava' is not up-to-date because:
Task has failed previously.
Watching 6 directories to track changes
Watching 4 directories to track changes
Watching 2 directories to track changes
Watching 1 directories to track changes
The input changes require a full rebuild for incremental task ':grobid-core:compileJava'.
Full recompilation is required because no incremental change information is available. This is usually caused by clean builds or changing compiler arguments.
Compiling with toolchain '/usr/local/openjdk-11'.
Compiling with JDK Java compiler API.
Watching 3 directories to track changes
Watching 5 directories to track changes
Watching 7 directories to track changes
Watching 8 directories to track changes

Task :grobid-core:compileJava FAILED
:grobid-core:compileJava (Thread[Daemon worker,5,main]) completed. Took 2.651 secs.

FAILURE: Build failed with an exception.

What went wrong: Execution failed for task ':grobid-core:compileJava'.

java.lang.NullPointerException (no error message)

Apparently, the Java code cannot be compiled. The Docker image ships Java 11, and I am aware that Java 8, 9, or 10 is expected. Could that be the issue? If so, how can I downgrade the Java version in the container?

Thanks in advance!


Top GitHub Comments


kermitt2 commented, Oct 13, 2022

However

It takes quite long.

Yes but time is relative 😄

I think Wapiti training takes 5-6 hours for the 40 XML files; training the segmentation model takes about 12 hours, if I remember correctly (8 threads).

It does not really depend on the XML/PDF files used for training. The reasons are that CRF training is slow (a GPU won’t help; it’s a probabilistic graphical model, not tensor computation) and that with 40 or 35 “training examples” there is no parallelization. So whether we use 2 or 100 threads, the runtime will be the same, with very little memory used. Each parallel fold needs a minimum number of training examples, and 40 is not enough. With a few hundred examples, multi-threading becomes efficient.

Plans to improve this are:

remove figures and tables from the fulltext model (ongoing in branch fix-vector-graphics), to have fewer labels and much less content
maybe at some point separate the fulltext backbone and the paragraph content into 2 specific models (this would also make it possible to use deep learning models for the second one, the problem being the length of the input sequence here)

And, once the training is finished, it will replace the one under grobid-home/models/fulltext, right?

Yes!
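Since the retrained model replaces the one under grobid-home/models/fulltext, it may be worth copying that directory aside before starting a long training run. The following is only a minimal sketch, not part of GROBID: the function name `backup_model_dir` and the `.bak.<timestamp>` naming are assumptions made for illustration, and the path in the usage comment is the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not a GROBID command): copy a model directory to a
# timestamped sibling directory and print the path of the backup.
backup_model_dir() {
    src=${1%/}                                   # strip a trailing slash, if any
    [ -d "$src" ] || { echo "not a directory: $src" >&2; return 1; }
    dest="$src.bak.$(date +%Y%m%d%H%M%S)"        # assumed naming scheme
    cp -R "$src" "$dest" && echo "$dest"
}

# Usage (path as mentioned in this thread; adjust to your installation):
#   backup_model_dir grobid-home/models/fulltext
```

If the new model turns out worse than the previous one, the backup directory can be copied back in place of the retrained one.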


kermitt2 commented, Oct 12, 2022

To train with the docker image, use the training web API - it was developed for this and the batch commands are not expected to be used with the docker image.

Be sure to have the modified/added training data at the right place in the docker container (so in the container under /opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/tei and /opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/raw).

From the error trace, the problem apparently is that the trainer finds no training files (it should indicate the number of correctly parsed training files to be used) and the CRF Wapiti JNI process fails. You could first try to retrain with the existing training files as a test.
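A quick way to rule out the "no training file" case is to count the files in the two corpus directories before launching a training. The sketch below is an illustration under stated assumptions, not a GROBID tool: the function names `count_files` and `check_corpus` and the `DATASET_ROOT` variable are hypothetical, and the default path is the container path quoted above. It only counts files; it does not check that they parse correctly.

```shell
#!/bin/sh
# Hypothetical helper: count regular files directly inside a directory
# (prints 0 if the directory does not exist).
count_files() {
    dir=$1
    if [ -d "$dir" ]; then
        find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' '
    else
        echo 0
    fi
}

# Hypothetical helper: report how many files sit in corpus/tei and corpus/raw
# under a dataset root; returns non-zero if either directory is empty.
check_corpus() {
    root=$1
    tei=$(count_files "$root/corpus/tei")
    raw=$(count_files "$root/corpus/raw")
    echo "tei=$tei raw=$raw"
    [ "$tei" -gt 0 ] && [ "$raw" -gt 0 ]
}

# Default is the container path quoted in this thread; override DATASET_ROOT
# to check a different location.
DATASET_ROOT=${DATASET_ROOT:-/opt/grobid/grobid-trainer/resources/dataset/fulltext}
check_corpus "$DATASET_ROOT" || echo "No training files found under $DATASET_ROOT" >&2
```

Run inside the container, this prints the two counts and warns when one of the directories is empty or missing, which would match the symptom described above.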

(The amount of memory in the command line should not be a problem: the Wapiti training will use memory as needed, but enough memory should be available for the container. There should normally also be no problem with Java 11; it’s more a problem for the Deep Learning training, which is hard to integrate beyond Java 10.)