Exception during runtime TrainMnist.java
Description
Expected Behavior
build success
Error Message
`"C:\Program Files\Java\jdk1.8.0_251\bin\java.exe" "-javaagent:D:\JetBrains\IntelliJ IDEA 2020.1.2\lib\idea_rt.jar=54222:D:\JetBrains\IntelliJ IDEA 2020.1.2\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_251\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_251\jre\lib\rt.jar;D:\demo\target\classes;C:\Users\Administrator.m2\repository\com\vmware\vijava\5.1\vijava-5.1.jar;C:\Users\Administrator.m2\repository\dom4j\dom4j\1.6.1\dom4j-1.6.1.jar;C:\Users\Administrator.m2\repository\xml-apis\xml-apis\1.0.b2\xml-apis-1.0.b2.jar;C:\Users\Administrator.m2\repository\commons-cli\commons-cli\1.4\commons-cli-1.4.jar;C:\Users\Administrator.m2\repository\org\apache\logging\log4j\log4j-slf4j-impl\2.12.1\log4j-slf4j-impl-2.12.1.jar;C:\Users\Administrator.m2\repository\org\slf4j\slf4j-api\1.7.25\slf4j-api-1.7.25.jar;C:\Users\Administrator.m2\repository\org\apache\logging\log4j\log4j-api\2.12.1\log4j-api-2.12.1.jar;C:\Users\Administrator.m2\repository\org\apache\logging\log4j\log4j-core\2.12.1\log4j-core-2.12.1.jar;C:\Users\Administrator.m2\repository\com\google\code\gson\gson\2.8.5\gson-2.8.5.jar;C:\Users\Administrator.m2\repository\ai\djl\api\0.6.0\api-0.6.0.jar;C:\Users\Administrator.m2\repository\net\java\dev\jna\jna\5.3.0\jna-5.3.0.jar;C:\Users\Administrator.m2\repository\org\apache\commons\commons-compress\1.20\commons-compress-1.20.jar;C:\Users\Administrator.m2\repository\ai\djl\basicdataset\0.6.0\basicdataset-0.6.0.jar;C:\Users\Administrator.m2\repository\ai\djl\model-zoo\0.6.0\model-zoo-0.6.0.jar;C:\Users\Administrator.m2\repository\ai\djl\mxnet\mxnet-model-zoo\0.6.0\mxnet-model-zoo-0.6.0.jar;C:\Users\Administrator.m2\repository\ai\djl\mxnet\mxnet-engine\0.6.0\mxnet-engine-0.6.0.jar;C:\Users\Administrator.m2\repository\ai\djl\mxnet\mxnet-native-auto\1.7.0-b\mxnet-native-auto-1.7.0-b.jar" com.zhaowei.training.TrainMnist
[INFO ] - Training on: cpu().
[INFO ] - Load MXNet Engine Version 1.7.0 in 0.211 ms.
Training: 17% |███████ | Accuracy: 0.86, SoftmaxCrossEntropyLoss: 0.50, speed: 1416.17 items/sec
[INFO ] - train P50: 23.255 ms, P90: 30.021 ms
[INFO ] - forward P50: 0.874 ms, P90: 1.021 ms
[INFO ] - training-metrics P50: 0.027 ms, P90: 0.035 ms
[INFO ] - backward P50: 1.305 ms, P90: 1.576 ms
[INFO ] - step P50: 1.676 ms, P90: 2.138 ms
Exception in thread "main" ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: can't alloc
    at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1788)
    at ai.djl.mxnet.jna.JnaUtils.syncCopyToCPU(JnaUtils.java:473)
    at ai.djl.mxnet.engine.MxNDArray.toByteBuffer(MxNDArray.java:294)
    at ai.djl.ndarray.NDArray.toLongArray(NDArray.java:300)
    at ai.djl.ndarray.NDArray.getLong(NDArray.java:558)
    at ai.djl.training.evaluator.AbstractAccuracy.lambda$updateAccumulator$1(AbstractAccuracy.java:85)
    at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
    at ai.djl.training.evaluator.AbstractAccuracy.updateAccumulator(AbstractAccuracy.java:85)
    at ai.djl.training.listener.EvaluatorTrainingListener.updateEvaluators(EvaluatorTrainingListener.java:153)
    at ai.djl.training.listener.EvaluatorTrainingListener.onTrainingBatch(EvaluatorTrainingListener.java:112)
    at ai.djl.training.EasyTrain.lambda$trainBatch$1(EasyTrain.java:86)
    at java.util.ArrayList.forEach(ArrayList.java:1257)
    at ai.djl.training.Trainer.notifyListeners(Trainer.java:249)
    at ai.djl.training.EasyTrain.trainBatch(EasyTrain.java:86)
    at ai.djl.training.EasyTrain.fit(EasyTrain.java:39)
    at com.zhaowei.training.TrainMnist.runExample(TrainMnist.java:84)
    at com.zhaowei.training.TrainMnist.main(TrainMnist.java:49)
    Suppressed: java.lang.NullPointerException
        at com.zhaowei.training.TrainMnist.lambda$setupTrainingConfig$0(TrainMnist.java:98)
        at ai.djl.training.listener.CheckpointsTrainingListener.saveModel(CheckpointsTrainingListener.java:144)
        at ai.djl.training.listener.CheckpointsTrainingListener.onTrainingEnd(CheckpointsTrainingListener.java:102)
        at ai.djl.training.Trainer.lambda$close$2(Trainer.java:295)
        at java.util.ArrayList.forEach(ArrayList.java:1257)
        at ai.djl.training.Trainer.notifyListeners(Trainer.java:249)
        at ai.djl.training.Trainer.close(Trainer.java:295)
        at com.zhaowei.training.TrainMnist.runExample(TrainMnist.java:87)
        ... 1 more
    Suppressed: ai.djl.engine.EngineException: MXNet engine call failed: MXNetError: can't alloc
        at ai.djl.mxnet.jna.JnaUtils.checkCall(JnaUtils.java:1788)
        at ai.djl.mxnet.jna.JnaUtils.waitAll(JnaUtils.java:466)
        at ai.djl.mxnet.engine.MxModel.close(MxModel.java:176)
        at com.zhaowei.training.TrainMnist.runExample(TrainMnist.java:88)
        ... 1 more
Process finished with exit code 1 `
Environment Info
JDK 8
Windows 10 x64
CPU: i5, 8 GB memory
Maven compile (pom.xml):
`<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>demo</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>demo</name>
    <description>Demo project for Spring Boot</description>
    <properties>
        <java.version>8</java.version>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <djl.version>0.6.0</djl.version>
    </properties>
    <repositories>
        <repository>
            <id>djl.ai</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots/</url>
        </repository>
    </repositories>
    <dependencies>
        <dependency>
            <groupId>com.vmware</groupId>
            <artifactId>vijava</artifactId>
            <version>5.1</version>
        </dependency>
        <dependency>
            <groupId>commons-cli</groupId>
            <artifactId>commons-cli</artifactId>
            <version>1.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>2.12.1</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.8.5</version>
        </dependency>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>api</artifactId>
            <version>${djl.version}</version>
        </dependency>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>basicdataset</artifactId>
            <version>${djl.version}</version>
        </dependency>
        <dependency>
            <groupId>ai.djl</groupId>
            <artifactId>model-zoo</artifactId>
            <version>${djl.version}</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.mxnet</groupId>
            <artifactId>mxnet-model-zoo</artifactId>
            <version>${djl.version}</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.mxnet</groupId>
            <artifactId>mxnet-engine</artifactId>
            <version>${djl.version}</version>
        </dependency>
        <dependency>
            <groupId>ai.djl.mxnet</groupId>
            <artifactId>mxnet-native-auto</artifactId>
            <version>1.7.0-b</version>
            <scope>runtime</scope>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
        </plugins>
    </build>
</project>`
`package com.zhaowei.training;

import ai.djl.Device;
import ai.djl.Model;
import ai.djl.basicdataset.Mnist;
import ai.djl.basicmodelzoo.basic.Mlp;
import com.zhaowei.training.util.Arguments;
import ai.djl.metric.Metrics;
import ai.djl.ndarray.types.Shape;
import ai.djl.nn.Block;
import ai.djl.training.DefaultTrainingConfig;
import ai.djl.training.EasyTrain;
import ai.djl.training.Trainer;
import ai.djl.training.TrainingResult;
import ai.djl.training.dataset.Dataset;
import ai.djl.training.dataset.RandomAccessDataset;
import ai.djl.training.evaluator.Accuracy;
import ai.djl.training.listener.CheckpointsTrainingListener;
import ai.djl.training.listener.TrainingListener;
import ai.djl.training.loss.Loss;
import ai.djl.training.util.ProgressBar;
import java.io.IOException;
import org.apache.commons.cli.ParseException;

public final class TrainMnist {

    private TrainMnist() {}

    public static void main(String[] args) throws IOException, ParseException {
        TrainMnist.runExample(args);
    }

    public static TrainingResult runExample(String[] args) throws IOException, ParseException {
        Arguments arguments = Arguments.parseArgs(args);

        // Construct neural network
        Block block =
                new Mlp(
                        Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH,
                        Mnist.NUM_CLASSES,
                        new int[] {128, 64});

        try (Model model = Model.newInstance("mlp")) {
            model.setBlock(block);

            // get training and validation dataset
            RandomAccessDataset trainingSet = getDataset(Dataset.Usage.TRAIN, arguments);
            RandomAccessDataset validateSet = getDataset(Dataset.Usage.TEST, arguments);

            // setup training configuration
            DefaultTrainingConfig config = setupTrainingConfig(arguments);

            try (Trainer trainer = model.newTrainer(config)) {
                trainer.setMetrics(new Metrics());

                /*
                 * MNIST is 28x28 grayscale image and pre processed into 28 * 28 NDArray.
                 * 1st axis is batch axis, we can use 1 for initialization.
                 */
                Shape inputShape = new Shape(1, Mnist.IMAGE_HEIGHT * Mnist.IMAGE_WIDTH);

                // initialize trainer with proper input shape
                trainer.initialize(inputShape);

                EasyTrain.fit(trainer, arguments.getEpoch(), trainingSet, validateSet);

                return trainer.getTrainingResult();
            }
        }
    }

    private static DefaultTrainingConfig setupTrainingConfig(Arguments arguments) {
        String outputDir = arguments.getOutputDir();
        CheckpointsTrainingListener listener = new CheckpointsTrainingListener(outputDir);
        listener.setSaveModelCallback(
                trainer -> {
                    TrainingResult result = trainer.getTrainingResult();
                    Model model = trainer.getModel();
                    float accuracy = result.getValidateEvaluation("Accuracy");
                    model.setProperty("Accuracy", String.format("%.5f", accuracy));
                    model.setProperty("Loss", String.format("%.5f", result.getValidateLoss()));
                });
        return new DefaultTrainingConfig(Loss.softmaxCrossEntropyLoss())
                .addEvaluator(new Accuracy())
                .optDevices(Device.getDevices(arguments.getMaxGpus()))
                .addTrainingListeners(TrainingListener.Defaults.logging(outputDir))
                .addTrainingListeners(listener);
    }

    private static RandomAccessDataset getDataset(Dataset.Usage usage, Arguments arguments)
            throws IOException {
        Mnist mnist =
                Mnist.builder()
                        .optUsage(usage)
                        .setSampling(arguments.getBatchSize(), true)
                        .optLimit(arguments.getLimit())
                        .build();
        mnist.prepare(new ProgressBar());
        return mnist;
    }
} `
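A side note on the suppressed `NullPointerException` at `TrainMnist.lambda$setupTrainingConfig$0` in the trace above: it looks consistent with the save-model callback unboxing a missing validation metric into a primitive `float` after training aborted before any validation ran. This is an assumption based on the trace, not a confirmed diagnosis, and it is secondary to the `can't alloc` failure. A minimal, DJL-free sketch of a null-safe way to format such a value follows; `NullSafeMetrics`, `formatMetric`, and the map keys are hypothetical names used only for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public final class NullSafeMetrics {

    private NullSafeMetrics() {}

    /**
     * Formats a boxed metric that may be null (hypothetical helper).
     * Returns "N/A" instead of unboxing, which would throw a NullPointerException.
     */
    public static String formatMetric(Float value) {
        return value == null ? "N/A" : String.format(Locale.ROOT, "%.5f", value);
    }

    public static void main(String[] args) {
        // Stand-in for a results map; the keys are illustrative assumptions.
        Map<String, Float> evaluations = new HashMap<>();
        evaluations.put("validate_loss", 0.5f);

        // A recorded metric is formatted normally.
        System.out.println(formatMetric(evaluations.get("validate_loss")));
        // A metric that was never recorded yields a placeholder, not an exception.
        System.out.println(formatMetric(evaluations.get("validate_Accuracy")));
    }
}
```

In the callback above, the same idea would mean keeping the result of `getValidateEvaluation("Accuracy")` in a boxed `Float` and checking it for null before calling `String.format`, assuming that method can return null when no validation has run.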
Issue Analytics
- State:
- Created 3 years ago
- Comments: 5 (5 by maintainers)
I have a suggestion: the official documentation could list recommended hardware requirements, such as memory and CPU. Thank you.
16 GB of memory should be sufficient.
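Since the failure comes from the native MXNet engine, total physical memory is probably more relevant here than the JVM heap size. As a rough sketch only, a program could check the machine's memory before training and warn when it falls short. `MemoryCheck` and its methods are hypothetical names, the 15 GiB threshold is an arbitrary margin below the figure mentioned in this thread (machines usually report slightly less than their nominal size), and `com.sun.management.OperatingSystemMXBean` is a JDK-specific extension that may not be available on every JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public final class MemoryCheck {

    public static final long GIB = 1024L * 1024L * 1024L;

    private MemoryCheck() {}

    /** Returns total physical memory in bytes, or -1 if this JVM does not expose it. */
    public static long totalPhysicalMemoryBytes() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.OperatingSystemMXBean) {
            return ((com.sun.management.OperatingSystemMXBean) os).getTotalPhysicalMemorySize();
        }
        return -1L;
    }

    /** True when the reported total is at least the required minimum. */
    public static boolean meetsMinimum(long totalBytes, long minimumBytes) {
        return totalBytes >= minimumBytes;
    }

    public static void main(String[] args) {
        // Arbitrary illustrative threshold, not an official requirement.
        long minimum = 15L * GIB;
        long total = totalPhysicalMemoryBytes();
        if (total < 0) {
            System.out.println("Physical memory size is not available on this JVM.");
        } else if (!meetsMinimum(total, minimum)) {
            System.err.printf(
                    "Warning: %.1f GiB of physical memory detected; training may fail to allocate.%n",
                    (double) total / GIB);
        } else {
            System.out.printf("%.1f GiB of physical memory detected.%n", (double) total / GIB);
        }
    }
}
```

The check only warns and does not stop training, because the amount actually needed depends on batch size, dataset limit, and what else is running on the machine.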