JVM crashes in native code when trying to run a Torch model

Description

Hi. I have a problem running a model from https://github.com/emiliantolo/pytorch_nsfw_model in DJL. The JVM crashes with an error in native code. I tried running it with OpenJDK 8, Zulu 8, and Zulu 13.

Expected Behavior

The model should load and run correctly.

Error Message

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000134a917ae, pid=12860, tid=9219
#
# JRE version: OpenJDK Runtime Environment (Zulu13.28+11-CA) (13.0.1+10) (build 13.0.1+10-MTS)
# Java VM: OpenJDK 64-Bit Server VM (13.0.1+10-MTS, mixed mode, sharing, tiered, compressed oops, g1 gc, bsd-amd64)
# Problematic frame:
# C  [libtorch_cpu.dylib+0x2a957ae]  torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /Users/evgenyzakharov/Workspace/pytorch_nsfw_model_jvm/hs_err_pid12860.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The stack trace in the crash dump is as follows:

Stack: [0x000070000ea99000,0x000070000eb99000],  sp=0x000070000eb95f50,  free space=1011k
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libtorch_cpu.dylib+0x2a957ae]  torch::jit::Expr::Expr(c10::intrusive_ptr<torch::jit::Tree, c10::detail::intrusive_target_default_null_type<torch::jit::Tree> > const&)+0x2e
C  [libtorch_cpu.dylib+0x2d21cb5]  torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&)+0xd5
C  [libtorch_cpu.dylib+0x2a9097a]  torch::jit::SourceImporterImpl::importClass(c10::QualifiedName const&, torch::jit::ClassDef const&, bool)+0x210a
C  [libtorch_cpu.dylib+0x2a8c61a]  torch::jit::SourceImporterImpl::importNamedType(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, torch::jit::ClassDef const&)+0x64a
C  [libtorch_cpu.dylib+0x2a88d2b]  torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&)+0xfb

Full error log: hs_err_pid12860.log
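When reading a crash log like the one above, it can help to pull out just the native frames to see which library and which functions were on the stack. The following is a minimal sketch, not part of the original report: the function name native_frames and the regular expression are illustrative assumptions, and the sample lines are copied from the stack trace above.

```python
import re

# Matches native frame lines of the form:
#   C  [<library>+<offset>]  <symbol>
FRAME_RE = re.compile(
    r"^C\s+\[(?P<lib>[^\]]+)\+(?P<offset>0x[0-9a-fA-F]+)\]\s+(?P<symbol>.+)$"
)


def native_frames(log_text):
    """Return (library, offset, symbol) tuples for each native frame line."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append(
                (match.group("lib"), match.group("offset"), match.group("symbol"))
            )
    return frames


# Two frames copied from the stack trace in this report.
SAMPLE = """\
Native frames: (J=compiled Java code, C=native code)
C  [libtorch_cpu.dylib+0x2d21cb5]  torch::jit::ScriptTypeParser::parseClassConstant(torch::jit::Assign const&)+0xd5
C  [libtorch_cpu.dylib+0x2a88d2b]  torch::jit::SourceImporterImpl::findNamedType(c10::QualifiedName const&)+0xfb
"""

for lib, offset, symbol in native_frames(SAMPLE):
    print(lib, offset, symbol)
```

Here every frame is in libtorch_cpu.dylib and inside the TorchScript source importer, which suggests the crash happens while the model file is being loaded rather than during inference.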

How to Reproduce?

Download the model from the repository https://github.com/emiliantolo/pytorch_nsfw_model and try to run it with DJL.

Steps to reproduce

Download the model.
Convert the model to “.pt” by scripting or tracing:

import torch
import torch.nn as nn
from PIL import Image
from torchvision import models

# Rebuild the architecture the checkpoint expects: ResNet-50 with a custom classifier head
model = models.resnet50()
model.fc = nn.Sequential(nn.Linear(2048, 512),
                         nn.ReLU(),
                         nn.Dropout(0.2),
                         nn.Linear(512, 10),
                         nn.LogSoftmax(dim=1))
model.load_state_dict(torch.load('ResNet50_nsfw_model.pth', map_location=torch.device('cpu')))
model.eval()

# Option 1: script
export = torch.jit.script(model)
torch.jit.save(export, "out.pt")

# Option 2: trace (data_dir and test_transforms must be defined beforehand; not shown here)
image = Image.open(data_dir + "1.jpg")
image_tensor = test_transforms(image).float().unsqueeze(0)  # add a batch dimension
net_trace = torch.jit.trace(model, image_tensor)
net_trace.save("out.pt")

Load the model in DJL with the following criteria:

val criteria = Criteria.builder()
        .setTypes(
            Image::class.java,
            Classifications::class.java
        )
        .optModelZoo(DefaultModelZoo("<path to folder with model>"))
        .optModelName("out.pt")
        .optTranslator(translator)
        .optProgress(ProgressBar())
        .build()

Versions in build.gradle.kts:

implementation("ai.djl:api:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-engine:0.6.0")
runtimeOnly("ai.djl.pytorch:pytorch-native-auto:1.5.0")

What have you tried to solve it?

Tried different JVM versions (OpenJDK 8, Zulu 8, Zulu 13)
Tried different Torch versions (1.6.0, 1.4.0)
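
One thing that may be worth ruling out is a mismatch between the PyTorch version that exported out.pt and the 1.5.0 native library pulled in by pytorch-native-auto, since a TorchScript file written by a newer PyTorch is not guaranteed to load in an older libtorch. This is only a hypothesis; nothing in the report confirms it as the cause. The sketch below shows one way to compare the two versions. The helper names major_minor and export_is_newer are made up for illustration, and the version strings are the ones mentioned in this report.

```python
def major_minor(version):
    """Return (major, minor) from a version string such as '1.6.0' or '1.5.0+cpu'."""
    core = version.split("+", 1)[0]  # drop a local build suffix such as '+cpu'
    parts = core.split(".")
    return int(parts[0]), int(parts[1])


def export_is_newer(export_version, runtime_version):
    """True if the exporting PyTorch is newer (major.minor) than the runtime library."""
    return major_minor(export_version) > major_minor(runtime_version)


# In a real export script, the first argument would come from torch.__version__.
# "1.5.0" is the pytorch-native-auto version listed in build.gradle.kts above.
for export_version in ("1.6.0", "1.4.0"):
    print(export_version, export_is_newer(export_version, "1.5.0"))
```

If the check reports that the export is newer, re-exporting the model with a PyTorch release that matches the native library would be a cheap experiment.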

Environment Info

OS:uname:Darwin 18.7.0 Darwin Kernel Version 18.7.0: Thu Jun 18 20:50:10 PDT 2020; root:xnu-4903.278.43~1/RELEASE_X86_64 x86_64
rlimit: STACK 8192k, CORE 0k, NPROC 1418, NOFILE 10240, AS infinity, DATA infinity, FSIZE infinity
load average:2.75 2.65 2.72

CPU:total 8 (initial active 8) (4 cores per cpu, 2 threads per core) family 6 model 158 stepping 9, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, 3dnowpref, lzcnt, ht, tsc, tscinvbit, bmi1, bmi2, adx, fma

Memory: 4k page, physical 16777216k(62596k free), swap 6291456k(874240k free)

vm_info: OpenJDK 64-Bit Server VM (13.0.1+10-MTS) for bsd-amd64 JRE (13.0.1+10-MTS) (Zulu13.28+11-CA), built on Oct  9 2019 12:07:25 by "zulu_re" with clang 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)

Top GitHub Comments

evgzakharov commented, Aug 7, 2020

I compared the image pixels and found that some of them differ slightly between the JVM (ImageIO) and Python (PIL). I also found a Stack Overflow discussion about this. It looks like the results differ only on OS X. I will check the results in a Docker image later.
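
A pixel comparison like the one described in this comment could be sketched as follows. This is an illustrative assumption about how such a check might look, not the code the commenter used: the function name pixel_diff_stats and the sample pixel values are hypothetical, and each input is assumed to be a flat sequence of (R, G, B) tuples exported from the JVM side and the Python side.

```python
def pixel_diff_stats(pixels_a, pixels_b):
    """Compare two equally sized sequences of (R, G, B) tuples.

    Returns (number of differing pixels, largest absolute channel difference).
    """
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of pixels")
    differing = 0
    max_delta = 0
    for a, b in zip(pixels_a, pixels_b):
        # Largest per-channel difference for this pixel
        delta = max(abs(ca - cb) for ca, cb in zip(a, b))
        if delta:
            differing += 1
            max_delta = max(max_delta, delta)
    return differing, max_delta


# Hypothetical values standing in for the decoded output of each side.
jvm_pixels = [(255, 0, 0), (12, 200, 31), (0, 0, 0)]
pil_pixels = [(254, 0, 0), (12, 200, 33), (0, 0, 0)]

print(pixel_diff_stats(jvm_pixels, pil_pixels))  # prints (2, 2)
```

A small maximum difference spread over many pixels would point to decoder differences rather than a preprocessing bug.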

frankfliu commented, Aug 6, 2020

@evgzakharov Thanks for reporting this issue. We will try to reproduce it.