
F32 Example Training Gets Stuck after One Iteration of For Loop

See original GitHub issue
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H ./build.sh
[a bunch of output here]
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ IMAGE_NAME=intel-extension-for-pytorch:gpu
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ VIDEO=$(getent group video | sed -E 's,^video:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ test -z "$RENDER" || RENDER_GROUP="--group-add ${RENDER}"
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H docker run --rm -v /home/tedliosu/intel_pytorch_workspace:/workspace --group-add ${VIDEO} ${RENDER_GROUP} --device=/dev/dri --ipc=host -it $IMAGE_NAME bash
[sudo] password for tedliosu: 
groups: cannot find name for group ID 109
root@8e852a62c8b4:/# cd workspace/
root@d4958d53cb7c:/workspace# python3 -m trace -t ipex_f32_example.py 2>&1 | tee ipex_f32_example_py_trace.txt | grep ipex_f32_example
 --- modulename: ipex_f32_example, funcname: <module>
ipex_f32_example.py(1): import torch
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(2): import torchvision
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(4): import intel_extension_for_pytorch as ipex
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(7): LR = 0.001
ipex_f32_example.py(8): DOWNLOAD = True
ipex_f32_example.py(9): DATA = 'datasets/cifar10/'
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(12):     torchvision.transforms.Resize((224, 224)),
ipex_f32_example.py(13):     torchvision.transforms.ToTensor(),
ipex_f32_example.py(14):     torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(17):         root=DATA,
ipex_f32_example.py(18):         train=True,
ipex_f32_example.py(19):         transform=transform,
ipex_f32_example.py(20):         download=DOWNLOAD,
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(23):         dataset=train_dataset,
ipex_f32_example.py(24):         batch_size=128
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(27): model = torchvision.models.resnet50()
ipex_f32_example.py(28): criterion = torch.nn.CrossEntropyLoss().to("xpu")
ipex_f32_example.py(29): optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
ipex_f32_example.py(30): model.train()
ipex_f32_example.py(32): model = model.to("xpu")
ipex_f32_example.py(33): model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
ipex_f32_example.py(37):     print("Begin 1 loop iteration")
ipex_f32_example.py(39):     data = data.to("xpu")
ipex_f32_example.py(40):     print("Moved data onto XPU")
ipex_f32_example.py(41):     target = target.to("xpu")
ipex_f32_example.py(42):     print("Moved target onto XPU")
ipex_f32_example.py(44):     optimizer.zero_grad()
ipex_f32_example.py(45):     print("About to apply model to data")
ipex_f32_example.py(46):     output = model(data)
ipex_f32_example.py(47):     print("Finished applying model to data")
ipex_f32_example.py(48):     loss = criterion(output, target)
ipex_f32_example.py(49):     print("About to execute loss.backward()")
ipex_f32_example.py(50):     loss.backward()
ipex_f32_example.py(51):     print("About to execute optimizer.step()")
ipex_f32_example.py(52):     optimizer.step()
ipex_f32_example.py(53):     print("Current batch id : %d" % (batch_idx))
ipex_f32_example.py(54):     data = None
ipex_f32_example.py(55):     target = None
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
[I killed the process after 90 minutes of being stuck here]
root@d4958d53cb7c:/workspace# tail -n35 ipex_f32_example_py_trace.txt
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
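
The repeated collate.py(81) lines at the end of the trace come from PyTorch's default collate function (torch/utils/data/_utils/collate.py), which suggests the DataLoader was still assembling the next batch when the trace output stopped. A minimal sketch like the following (my own diagnostic, not from the issue) iterates the same loader on CPU only, with no model and no XPU transfers, to check whether batch collation itself ever stalls:

import time
import torch
import torchvision

# Hypothetical diagnostic: same dataset, transform, and batch size as the
# script below, but with no model and no "xpu" transfers involved.
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root='datasets/cifar10/', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

start = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    # If these prints keep appearing, batching is healthy and the hang
    # lies elsewhere (e.g. in the host-to-device copy).
    print(batch_idx, tuple(data.shape), "%.1fs" % (time.time() - start))
    if batch_idx == 9:
        break
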
root@d4958d53cb7c:/workspace# pip list
Package                     Version            
--------------------------- -------------------
contourpy                   1.0.6              
cycler                      0.11.0             
fonttools                   4.38.0             
intel-extension-for-pytorch 1.10.200+gpu       
kiwisolver                  1.4.4              
matplotlib                  3.6.1              
numpy                       1.23.4             
packaging                   21.3               
Pillow                      9.3.0              
pip                         20.0.2             
pyparsing                   3.0.9              
python-dateutil             2.8.2              
setuptools                  45.2.0             
six                         1.16.0             
torch                       1.10.0a0+git3d5f2d4
torchvision                 0.11.3             
typing-extensions           4.4.0              
wheel                       0.34.2

Contents of ipex_f32_example.py (as you can see, it's basically the Float32 example from here):

import torch
import torchvision
############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = 'datasets/cifar10/'

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root=DATA,
        train=True,
        transform=transform,
        download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=128
)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss().to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
model.train()
#################################### code changes ################################
model = model.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
#################################### code changes ################################

for batch_idx, (data, target) in enumerate(train_loader):
    print("Begin 1 loop iteration")
    ########## code changes ##########
    data = data.to("xpu")
    print("Moved data onto XPU")
    target = target.to("xpu")
    print("Moved target onto XPU")
    ########## code changes ##########
    optimizer.zero_grad()
    print("About to apply model to data")
    output = model(data)
    print("Finished applying model to data")
    loss = criterion(output, target)
    print("About to execute loss.backward()")
    loss.backward()
    print("About to execute optimizer.step()")
    optimizer.step()
    print("Current batch id : %d" % (batch_idx))
    data = None
    target = None
torch.save({
     'model_state_dict': model.state_dict(),
     'optimizer_state_dict': optimizer.state_dict(),
     }, 'checkpoint.pth')

As noted in the command-line output above, the ipex_f32_example.py script froze for 90 minutes under tracing once it reached the for batch_idx, (data, target) in enumerate(train_loader): line; when I ran it without tracing, it froze at data = data.to("xpu") for over 8 hours before I had to kill the process. I have no idea whether this is a driver issue, a torchvision issue, or something else, but it is really frustrating, and I'd be more than happy to provide extra info about my system to help solve this freezing problem. Also note that tail -n35 ipex_f32_example_py_trace.txt displays the last 35 lines of the trace I ran on the script, showing exactly where execution was frozen.
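
Since the untraced run hung at data = data.to("xpu"), a reasonable first sanity check (my own sketch, not from the issue; it assumes the torch.xpu namespace that the IPEX GPU build exposes) is to time a single small host-to-device copy in isolation:

import time
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Assumed API: torch.xpu.* as exposed by the IPEX GPU build.
print("xpu available:", torch.xpu.is_available())
print("device count :", torch.xpu.device_count())

start = time.time()
x = torch.randn(4, 3, 224, 224)
y = x.to("xpu")              # the same kind of transfer the loop hangs on
torch.xpu.synchronize()      # wait until the copy has actually completed
print("round trip ok:", torch.allclose(x, y.to("cpu")))
print("elapsed: %.1fs" % (time.time() - start))

If even this copy never returns, the problem sits below PyTorch (SYCL runtime or GPU driver) rather than in torchvision or the training code.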

P.S. Since I already mentioned this problem in here before opening this separate issue, I saw the reply here to my initial comment, but I have no idea how to apply that person's suggestion to solving this issue 😕

Issue Analytics

  • State: open
  • Created a year ago
  • Comments:18 (8 by maintainers)

Top GitHub Comments

2 reactions
sanchitintel commented, Nov 7, 2022

Hi @tedliosu! Thanks again for the info! We’ll investigate this issue while enabling Intel Extension for PyTorch for iGPUs. iGPUs are currently unsupported.

2 reactions
sanchitintel commented, Nov 1, 2022

Awesome! Thanks, @tedliosu! Looks like your iGPU is indeed tgllp! 😃
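
A quick way to check which device the extension actually picked up, and therefore whether it is one of the currently unsupported iGPUs mentioned above, is a sketch along these lines (again assuming the torch.xpu namespace of the IPEX GPU build):

import torch
import intel_extension_for_pytorch as ipex

# Print the name of every XPU device IPEX can see; an entry naming the
# integrated GPU (e.g. the "tgllp" device above) would explain the hang.
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))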

Read more comments on GitHub >

Top Results From Across the Web

TensorFlow: Each iteration in training for-loop slower [duplicate]
each append a new tf.reduce_mean() node to the graph at each iteration of the while loop, which adds overhead. Try to create them...
Read more >
Training loop stops after the first epoch in PyTorch
I'm trying to train a seq2seq model using PyTorch using the Multi30K dataset from Dutch to English language. Here is my snippet of...
Read more >
How to Train a Progressive Growing GAN in Keras for ...
A single training iteration involves first selecting a half batch of real images from the dataset and generating a half batch of fake...
Read more >
NRSA Individual Postdoctoral Fellowships FAQs (F32)
Most likely, yes. Whether the environment offers opportunities for new training is one of the criteria that reviewers of fellowship applications evaluate. If ...
Read more >
Implementation of a deep learning library in Futhark
type t = f32 ... For example is a composition of nested map-reduce ... model accuracy, since they tend to get "stuck" in...
Read more >
