Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Training Stuck at 0%

See original GitHub issue

I am trying to fit a model on my own image dataset: 1,000 images with 5 labels (image dimensions: 128x128 pixels).

I have no idea what is wrong with the output; the progress bar shows only 0% from start to finish.

# autokeras 0.x; in 0.3.x, load_image_dataset lives in image_supervised
from autokeras.image.image_supervised import load_image_dataset
from sklearn.model_selection import train_test_split
import autokeras as ak

train_path = './productV1.1/train_images/'
train_labels = './productV1.1/product.csv'

# Load the images listed in the CSV into NumPy arrays.
X, y = load_image_dataset(csv_file_path=train_labels,
                          images_path=train_path)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

seconds = 12 * 60 * 60  # search budget in seconds (original value not shown)

model = ak.ImageClassifier(verbose=True)
model.fit(X_train, y_train, time_limit=seconds)
model.final_fit(X_train, y_train, X_test, y_test, retrain=True)
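
Before suspecting the search itself, it can help to confirm the dataset actually loaded; a minimal sanity check, assuming X and y are the NumPy arrays returned by load_image_dataset above:

# Hypothetical check (not in the original report): verify shapes and labels.
import numpy as np

print(X.shape, y.shape)                  # expect about (1000, 128, 128, 3) and (1000,)
print(np.unique(y, return_counts=True))  # expect 5 labels, each reasonably represented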

————

Output:

Preprocessing the images.
Preprocessing finished.

Initializing search.
Initialization finished.

+----------------------------------------------+
|               Training model 0               |
+----------------------------------------------+
Using TensorFlow backend.

Epoch-1, Current Metric - 0:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-1, Current Metric - 0: 10 batch [00:00, 40.43 batch/s]

Epoch-1, Current Metric - 0:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-1, Current Metric - 0: 10 batch [00:00, 68.41 batch/s]

Epoch-2, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-2, Current Metric - 0.1111111111111111: 10 batch [00:00, 56.75 batch/s]

Epoch-2, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-2, Current Metric - 0.1111111111111111: 10 batch [00:00, 49.93 batch/s]

…

+----------------------------------------------+
|              Training model 13               |
+----------------------------------------------+

Epoch-1, Current Metric - 0:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-1, Current Metric - 0: 10 batch [00:00, 19.50 batch/s]

Epoch-1, Current Metric - 0:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-1, Current Metric - 0: 10 batch [00:00, 35.31 batch/s]

Epoch-2, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-2, Current Metric - 0.1111111111111111: 10 batch [00:00, 32.93 batch/s]

Epoch-2, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-2, Current Metric - 0.1111111111111111: 10 batch [00:00, 24.65 batch/s]

Epoch-3, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-3, Current Metric - 0.1111111111111111: 10 batch [00:00, 36.40 batch/s]

Epoch-3, Current Metric - 0.1111111111111111:   0%|          | 0/1 [00:00<?, ? batch/s]
Epoch-3, Current Metric - 0.1111111111111111: 10 batch [00:00, 26.43 batch/s]
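
Worth noting about these logs: the bar is created for a total of 1 batch while 10 batches actually run, and tqdm stops rendering a percentage once its count passes the total, so the display can sit at 0% even while batches are being processed. A minimal sketch reproducing just that display artifact (an assumption about the cause, not a confirmed diagnosis):

# Reproduce the progress-bar display seen above with plain tqdm.
from tqdm import tqdm

bar = tqdm(total=1, unit='batch', desc='Epoch-1, Current Metric - 0')
for _ in range(10):
    bar.update(1)  # once the count exceeds the total, tqdm drops the percentage
bar.close()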

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Reactions: 3
  • Comments: 10 (1 by maintainers)

Top GitHub Comments

2 reactions
kuba-machacek commented, Apr 3, 2019

I’m getting the same issue. It seems to be stuck completely, as the time_limit param is not working in this case.
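
One hedged workaround sketch (not from this thread): if time_limit is being ignored, a hard wall-clock budget can be enforced from outside by running the search in a separate process and terminating it. The names model, X_train, and y_train are from the snippet in the issue above; the two-hour budget is an arbitrary example, and a fork-based start method is assumed so the child process sees those variables.

import multiprocessing

def run_search():
    model.fit(X_train, y_train, time_limit=2 * 60 * 60)

if __name__ == '__main__':
    p = multiprocessing.Process(target=run_search)
    p.start()
    p.join(timeout=2 * 60 * 60 + 300)  # small grace period past the requested limit
    if p.is_alive():
        p.terminate()  # kill the stuck search instead of waiting forever
        p.join()

Since the searcher prints "Saving model." after each candidate (see the log further down), models found before the cutoff should already be on disk; only the model being trained at termination is lost.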

2 reactions
DGaffney commented, Feb 15, 2019

Adding my voice to this - I am also running into this issue on 0.3.7, with Python 3.6.7 on Ubuntu 18.04.1 on a fresh Amazon box, only autokeras + dependencies installed, running the text classifier as follows:

from autokeras import TextClassifier

import csv

# Read texts (column 0) and integer labels (column 1) from the CSV.
rows = []
labels = []
with open('labeled_data.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        rows.append(row[0])
        labels.append(int(row[1]))

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rows, labels, test_size=0.33, random_state=42)
clf = TextClassifier(verbose=True)
clf.fit(x=X_train, y=y_train, time_limit=12 * 60 * 60)  # 12-hour search budget
clf.final_fit(X_train, y_train, X_test, y_test, retrain=True)
y_out = clf.evaluate(X_test, y_test)  # accuracy on the held-out split

It gets stuck during fit like so:

Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           31           |   2.545453941822052    |   0.6641509433962264   |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|              Training model 32               |
+----------------------------------------------+

No loss decrease after 5 epochs.


Saving model.
+--------------------------------------------------------------------------+
|        Model ID        |          Loss          |      Metric Value      |
+--------------------------------------------------------------------------+
|           32           |   5.955252933502197    |  0.43018867924528303   |
+--------------------------------------------------------------------------+


+----------------------------------------------+
|              Training model 33               |
+----------------------------------------------+
Epoch-1, Current Metric - 0:   0%|                                        | 0/5 [00:00<?, ? batch/s]

Happy to provide the CSV I’m using off-list.
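
For anyone reproducing this, one way to tell a hard hang apart from slow-but-progressing training is to periodically dump the interpreter's thread stacks while fit() runs. A sketch using the standard-library faulthandler, reusing the names from the snippet above:

import faulthandler
import sys

# Dump every thread's stack to stderr every 60 s until cancelled; if the same
# frame shows up dump after dump, the search is genuinely wedged there.
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
clf.fit(x=X_train, y=y_train, time_limit=12 * 60 * 60)
faulthandler.cancel_dump_traceback_later()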

Read more comments on GitHub >

Top Results From Across the Web

Training stuck at 0% after few epochs while training with DDP
I recently updated to pytorch_lightning 1.1.7 and noticed that after a few epochs of training, the training % is stuck at 0% and...
Read more >
Training hangs at Epoch 0 / 0% on TPU - PyTorch Lightning
Hi, I am very new to PyTorch-Lightning and to Deep Learning as well! I am converting a PyTorch project into Lightning.
Read more >
PyTorch Lightning trainer.fit stuck at epoch 0 - Stack Overflow
I was trying to make a multi-input model using PyTorch and PyTorch Lightning, but I can't figure out why the trainer is stuck...
Read more >
Training stuck for hours in custom vision - Microsoft Q&A
We could be past our training limit - but how do i see that? If we are past that, shouldn't we just get...
Read more >
Distributed training got stuck every few seconds
Hi, everyone When I train my model with DDP, I observe that my training ... There seems always one GPU got stuck whose...
Read more >
