Cannot retrain full text model
Dear all,
I am trying to retrain the fulltext model with new PDF files, using Docker (image lfoppiano/grobid:0.7.1). Everything works quite smoothly when creating the training data. I am only interested in bibliographical reference markers, so I have mostly added `<ref type="biblio">...</ref>` tags and deleted some wrong ref tags from the TEI files.
Next, I put the newly created tei.xml files under `C:\grobid\grobid-trainer\resources\dataset\fulltext\corpus\tei`. There are also a few files under the evaluation folder.
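For reference, this is roughly how I copy those files into the running container (a sketch; it assumes the container is named `grobid`, which may differ on your setup):

```
# Copy the corrected TEI files from the Windows host into the dataset
# folder inside the running GROBID container (container name assumed).
docker cp C:\grobid\grobid-trainer\resources\dataset\fulltext\corpus\tei\. grobid:/opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/tei/

# List the files to confirm they arrived where the trainer expects them.
docker exec grobid ls /opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/tei
```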
However, when I run the command below:
```
java -Xmx4G -jar grobid-trainer/build/libs/grobid-trainer-0.7.1-onejar.jar 2 fulltext -gH grobid-home -s 0.75
```
I got:
```
path2GbdHome=grobid-home
path2GbdProperties=grobid-home/config/grobid.properties
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
sourceTEIPathLabel: /opt/grobid/retrain/resources/dataset/fulltext/corpus/tei
sourceRawPathLabel: /opt/grobid/retrain/resources/dataset/fulltext/corpus/raw
trainingOutputPath: /opt/grobid/retrain/…/grobid-home/tmp/fulltext3115104342537113071.train
evalOutputPath: /opt/grobid/retrain/…/grobid-home/tmp/fulltext15117863942333625584.test
epsilon: 1.0E-5
window: 20
nb max iterations: 2000
nb threads: 16
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
```
I also tried it with Gradle (`./gradlew train_fulltext --stacktrace --info`) and got:
```
Watching 8 directories to track changes
Watching 8 directories to track changes
Build cache key for task ':grobid-core:compileJava' is 9c13110ecb6cd4046c97c9c27de4c2dd
Task ':grobid-core:compileJava' is not up-to-date because:
  Task has failed previously.
Watching 6 directories to track changes
Watching 4 directories to track changes
Watching 2 directories to track changes
Watching 1 directories to track changes
The input changes require a full rebuild for incremental task ':grobid-core:compileJava'.
Full recompilation is required because no incremental change information is available.
This is usually caused by clean builds or changing compiler arguments.
Compiling with toolchain '/usr/local/openjdk-11'.
Compiling with JDK Java compiler API.
Watching 3 directories to track changes
Watching 5 directories to track changes
Watching 7 directories to track changes
Watching 8 directories to track changes

> Task :grobid-core:compileJava FAILED
:grobid-core:compileJava (Thread[Daemon worker,5,main]) completed. Took 2.651 secs.

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':grobid-core:compileJava'.
> java.lang.NullPointerException (no error message)
```
Apparently, the Java sources cannot be compiled. The Docker image ships Java 11, and I am aware that Java versions 8, 9, and 10 are expected. Could that be the issue? If so, how can I downgrade the Java version in the container?
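For completeness, the Java version inside the container can be checked with a standard Docker command (again assuming the container is named `grobid`):

```
# Print the JVM version used inside the GROBID container.
docker exec grobid java -version
```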
Thanks in advance!
Top GitHub Comments
Yes but time is relative 😄
I think it takes 5-6 hours of Wapiti training for the 40 XML files; training the segmentation model takes 12 hours, if I remember well (8 threads).
It does not really depend on the XML/PDF files used for training. The reasons are that CRF training is slow (a GPU won't help: it's a probabilistic graphical model, not tensor stuff) and that with 35 or 40 "training examples" there is no parallelization. So whether we use 2 or 100 threads, we will have the same runtime, with very little memory used. There is a minimum number of training examples required for every parallel fold, and 40 is not enough. With a few hundred examples, the multi-threading is efficient.
Plans to improve this include the fix-vector-graphics branch, to have fewer labels and much less content.

yes!
To train with the Docker image, use the training web API: it was developed for this, and the batch commands are not expected to be used with the Docker image.
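As a minimal sketch of what such a call could look like with curl; the endpoint name `/api/modelTraining` and its parameters here are assumptions from memory of the GROBID documentation, so verify them against the docs of your exact version:

```
# Hypothetical call to the GROBID training web API; check the exact
# endpoint and form fields in the documentation of your GROBID version.
curl -X POST \
  -F model=fulltext \
  http://localhost:8070/api/modelTraining
```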
Be sure to have the modified/added training data at the right place in the Docker container, i.e. under `/opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/tei` and `/opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/raw`.
From the error trace, the problem apparently is that the trainer finds no training files (it should indicate the number of correctly parsed training files to be used), and the CRF Wapiti JNI process then fails. You could first try to retrain with the existing training files as a test.
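A quick way to verify this, using only standard shell commands inside the container (assuming it is named `grobid`):

```
# Count the TEI training files the trainer will see.
docker exec grobid sh -c 'ls /opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/tei | wc -l'

# The corresponding raw feature files must be present as well.
docker exec grobid sh -c 'ls /opt/grobid/grobid-trainer/resources/dataset/fulltext/corpus/raw | wc -l'
```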
(The amount of memory in the command line should not be a problem, as the Wapiti training will use memory as needed, but enough memory should be available for the container. Normally there should also be no problem with Java 11; it is more of an issue for the Deep Learning trainings, which are hard to integrate beyond Java 10.)
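If the container's memory is the constraint, it can be raised with Docker's standard memory flag when starting the image (a sketch; the port mapping follows the usual GROBID setup):

```
# Start the GROBID image with an explicit container memory limit;
# -m/--memory is a standard Docker run flag.
docker run -t --rm --init -p 8070:8070 -m 8g lfoppiano/grobid:0.7.1
```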