
Training speed of mxnet-ssd slows down?

See original GitHub issue
I trained the old-style SSD from a record file (VOC07+12) at about 40 images/s. When I try the new train_ssd.py in GluonCV, the speed drops to about 25 images/s.
I replaced the original file-based datasets in the new SSD code with a rec dataset plus a transform. With **num-workers=4** the gdata.DetectionDataLoader fails, while with **num-workers=1** it works, but is almost as slow as the original data-reading method.
The error information is as follows:
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/gluon/data/dataloader.py", line 134, in worker_loop
    batch = batchify_fn([dataset[i] for i in samples])
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/gluon/data/dataset.py", line 126, in __getitem__
    item = self._data[idx]
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/gluon/data/vision/datasets.py", line 257, in __getitem__
    record = super(ImageRecordDataset, self).__getitem__(idx)
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/gluon/data/dataset.py", line 180, in __getitem__
    return self._record.read_idx(self._record.keys[idx])
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/recordio.py", line 265, in read_idx
    return self.read()
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/recordio.py", line 163, in read
    ctypes.byref(size)))
  File "/home/deep/workssd/mxnet/incubator-mxnet/python/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [16:12:48] src/recordio.cc:65: Check failed: header[0] == RecordIOWriter::kMagic Invalid RecordIO File

It seems to be a multi-process problem with the old rec file dataset?
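That diagnosis fits the traceback: forked DataLoader workers inherit the parent's record-file handle, so their reads interleave on one shared seek position and a reader ends up mid-record, failing the kMagic header check. A minimal single-process sketch of that failure mode (the magic constant and record layout here are illustrative, not MXNet's actual format):

```python
import io
import struct

MAGIC = 0xCED7230A  # stand-in for RecordIOWriter::kMagic; value is illustrative

def write_record(fh, payload):
    # Each record: 4-byte magic, 4-byte length, then the payload.
    fh.write(struct.pack("<II", MAGIC, len(payload)))
    fh.write(payload)

def read_record(fh):
    magic, size = struct.unpack("<II", fh.read(8))
    if magic != MAGIC:
        # Mirrors the header[0] == kMagic check in src/recordio.cc
        raise ValueError("Invalid RecordIO File")
    return fh.read(size)

buf = io.BytesIO()
for payload in (b"aaaa", b"bbbbbb"):
    write_record(buf, payload)

# Two readers sharing ONE handle: after another "worker" moves the shared
# cursor, the next read starts mid-record and the magic check fails.
shared = io.BytesIO(buf.getvalue())
first = read_record(shared)   # reads cleanly from offset 0
shared.seek(3)                # simulates another worker clobbering the position
err = None
try:
    read_record(shared)
except ValueError as e:
    err = str(e)
print("second reader:", err)
```

With separate handles per process, each reader keeps its own position and both records decode cleanly.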

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 16 (9 by maintainers)

Top GitHub Comments

1 reaction
zhreshold commented, Jun 21, 2018

@WalterMa Yes, this bug should be easy to fix, but we need to be careful not to change the current API, so we are still discussing it.

0 reactions
zhreshold commented, Aug 13, 2018

A temporary fix has been added to RecordFileDetection so that multiple workers can be enabled. I am closing this due to lack of activity. Feel free to ping me to reopen.
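The usual pattern behind this kind of fix is lazy per-process reopening: each worker checks whether it still holds the handle opened by its own process, and reopens the file after a fork. A minimal sketch of that pattern (the class and method names here are hypothetical, not GluonCV's actual implementation):

```python
import os
import tempfile

class PerProcessRecordReader:
    """Sketch of the lazy-reopen pattern: each process gets its own file
    handle, so forked DataLoader workers never share a seek position."""

    def __init__(self, path):
        self._path = path
        self._pid = None   # pid that opened the current handle
        self._fh = None

    def _ensure_open(self):
        # Reopen if we have no handle yet, or if we were forked into a new
        # process and merely inherited the parent's handle.
        if self._fh is None or self._pid != os.getpid():
            self._fh = open(self._path, "rb")
            self._pid = os.getpid()

    def read_at(self, offset, size):
        self._ensure_open()
        self._fh.seek(offset)
        return self._fh.read(size)

# Quick single-process check of the reader itself.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"record0record1")
reader = PerProcessRecordReader(tmp.name)
chunk = reader.read_at(7, 7)
print(chunk)
```

After a fork, the first `read_at` in each worker sees a pid mismatch and opens a fresh handle, so concurrent workers cannot clobber each other's read position.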


