
F32 Example Training Gets Stuck after One Iteration of For Loop

See original GitHub issue
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H ./build.sh
[a bunch of output here]
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ IMAGE_NAME=intel-extension-for-pytorch:gpu
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ VIDEO=$(getent group video | sed -E 's,^video:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ RENDER=$(getent group render | sed -E 's,^render:[^:]*:([^:]*):.*$,\1,')
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ test -z "$RENDER" || RENDER_GROUP="--group-add ${RENDER}"
(base) tedliosu@victus-ted:~/Documents/all_git/intel-extension-for-pytorch/docker$ sudo -H docker run --rm -v /home/tedliosu/intel_pytorch_workspace:/workspace --group-add ${VIDEO} ${RENDER_GROUP} --device=/dev/dri --ipc=host -it $IMAGE_NAME bash
[sudo] password for tedliosu: 
groups: cannot find name for group ID 109
root@8e852a62c8b4:/# cd workspace/
root@d4958d53cb7c:/workspace# python3 -m trace -t ipex_f32_example.py 2>&1 | tee ipex_f32_example_py_trace.txt | grep ipex_f32_example
 --- modulename: ipex_f32_example, funcname: <module>
ipex_f32_example.py(1): import torch
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(2): import torchvision
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(4): import intel_extension_for_pytorch as ipex
<frozen importlib._bootstrap>(186): <frozen importlib._bootstrap>(187): <frozen importlib._bootstrap>(191): <frozen importlib._bootstrap>(192): <frozen importlib._bootstrap>(194): ipex_f32_example.py(7): LR = 0.001
ipex_f32_example.py(8): DOWNLOAD = True
ipex_f32_example.py(9): DATA = 'datasets/cifar10/'
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(12):     torchvision.transforms.Resize((224, 224)),
ipex_f32_example.py(13):     torchvision.transforms.ToTensor(),
ipex_f32_example.py(14):     torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
ipex_f32_example.py(11): transform = torchvision.transforms.Compose([
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(17):         root=DATA,
ipex_f32_example.py(18):         train=True,
ipex_f32_example.py(19):         transform=transform,
ipex_f32_example.py(20):         download=DOWNLOAD,
ipex_f32_example.py(16): train_dataset = torchvision.datasets.CIFAR10(
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(23):         dataset=train_dataset,
ipex_f32_example.py(24):         batch_size=128
ipex_f32_example.py(22): train_loader = torch.utils.data.DataLoader(
ipex_f32_example.py(27): model = torchvision.models.resnet50()
ipex_f32_example.py(28): criterion = torch.nn.CrossEntropyLoss().to("xpu")
ipex_f32_example.py(29): optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
ipex_f32_example.py(30): model.train()
ipex_f32_example.py(32): model = model.to("xpu")
ipex_f32_example.py(33): model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
ipex_f32_example.py(37):     print("Begin 1 loop iteration")
ipex_f32_example.py(39):     data = data.to("xpu")
ipex_f32_example.py(40):     print("Moved data onto XPU")
ipex_f32_example.py(41):     target = target.to("xpu")
ipex_f32_example.py(42):     print("Moved target onto XPU")
ipex_f32_example.py(44):     optimizer.zero_grad()
ipex_f32_example.py(45):     print("About to apply model to data")
ipex_f32_example.py(46):     output = model(data)
ipex_f32_example.py(47):     print("Finished applying model to data")
ipex_f32_example.py(48):     loss = criterion(output, target)
ipex_f32_example.py(49):     print("About to execute loss.backward()")
ipex_f32_example.py(50):     loss.backward()
ipex_f32_example.py(51):     print("About to execute optimizer.step()")
ipex_f32_example.py(52):     optimizer.step()
ipex_f32_example.py(53):     print("Current batch id : %d" % (batch_idx))
ipex_f32_example.py(54):     data = None
ipex_f32_example.py(55):     target = None
ipex_f32_example.py(36): for batch_idx, (data, target) in enumerate(train_loader):
[I killed the process after 90 minutes of being stuck here]
root@d4958d53cb7c:/workspace# tail -n35 ipex_f32_example_py_trace.txt
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
collate.py(81):         if not all(len(elem) == elem_size for elem in it):
 --- modulename: collate, funcname: <genexpr>
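
The repeated collate.py(81) lines at the end of the trace come from PyTorch's default collate function (torch/utils/data/_utils/collate.py), which suggests the DataLoader was still assembling the next batch when the trace output stopped. A minimal sketch like the following (my own diagnostic, not from the issue) iterates the same loader on CPU only, with no model and no XPU transfers, to check whether batch collation itself ever stalls:

import time
import torch
import torchvision

# Hypothetical diagnostic: same dataset, transform, and batch size as the
# script below, but with no model and no "xpu" transfers involved.
transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root='datasets/cifar10/', train=True, transform=transform, download=True)
train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=128)

start = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    # If these prints keep appearing, batching is healthy and the hang
    # lies elsewhere (e.g. in the host-to-device copy).
    print(batch_idx, tuple(data.shape), "%.1fs" % (time.time() - start))
    if batch_idx == 9:
        break
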
root@d4958d53cb7c:/workspace# pip list
Package                     Version            
--------------------------- -------------------
contourpy                   1.0.6              
cycler                      0.11.0             
fonttools                   4.38.0             
intel-extension-for-pytorch 1.10.200+gpu       
kiwisolver                  1.4.4              
matplotlib                  3.6.1              
numpy                       1.23.4             
packaging                   21.3               
Pillow                      9.3.0              
pip                         20.0.2             
pyparsing                   3.0.9              
python-dateutil             2.8.2              
setuptools                  45.2.0             
six                         1.16.0             
torch                       1.10.0a0+git3d5f2d4
torchvision                 0.11.3             
typing-extensions           4.4.0              
wheel                       0.34.2

Contents of ipex_f32_example.py (as you can see, it's basically the Float32 example from here):

import torch
import torchvision
############# code changes ###############
import intel_extension_for_pytorch as ipex
############# code changes ###############

LR = 0.001
DOWNLOAD = True
DATA = 'datasets/cifar10/'

transform = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = torchvision.datasets.CIFAR10(
        root=DATA,
        train=True,
        transform=transform,
        download=DOWNLOAD,
)
train_loader = torch.utils.data.DataLoader(
        dataset=train_dataset,
        batch_size=128
)

model = torchvision.models.resnet50()
criterion = torch.nn.CrossEntropyLoss().to("xpu")
optimizer = torch.optim.SGD(model.parameters(), lr = LR, momentum=0.9)
model.train()
#################################### code changes ################################
model = model.to("xpu")
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.float32)
#################################### code changes ################################

for batch_idx, (data, target) in enumerate(train_loader):
    print("Begin 1 loop iteration")
    ########## code changes ##########
    data = data.to("xpu")
    print("Moved data onto XPU")
    target = target.to("xpu")
    print("Moved target onto XPU")
    ########## code changes ##########
    optimizer.zero_grad()
    print("About to apply model to data")
    output = model(data)
    print("Finished applying model to data")
    loss = criterion(output, target)
    print("About to execute loss.backward()")
    loss.backward()
    print("About to execute optimizer.step()")
    optimizer.step()
    print("Current batch id : %d" % (batch_idx))
    data = None
    target = None
torch.save({
     'model_state_dict': model.state_dict(),
     'optimizer_state_dict': optimizer.state_dict(),
     }, 'checkpoint.pth')

As noted in the command-line output above, the ipex_f32_example.py script froze for 90 minutes under tracing once it reached the for batch_idx, (data, target) in enumerate(train_loader): line; when I ran it without tracing, it froze at data = data.to("xpu") for over 8 hours before I had to kill the process. I have no idea whether this is a driver issue, a torchvision issue, or something else, but it is really frustrating, and I'd be more than happy to provide extra info about my system to help solve this freezing problem. Also note that tail -n35 ipex_f32_example_py_trace.txt displays the last 35 lines of the trace I ran on the script, showing exactly where execution was frozen.
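
Since the untraced run hung at data = data.to("xpu"), a reasonable first sanity check (my own sketch, not from the issue; it assumes the torch.xpu namespace that the IPEX GPU build exposes) is to time a single small host-to-device copy in isolation:

import time
import torch
import intel_extension_for_pytorch as ipex  # registers the "xpu" device

# Assumed API: torch.xpu.* as exposed by the IPEX GPU build.
print("xpu available:", torch.xpu.is_available())
print("device count :", torch.xpu.device_count())

start = time.time()
x = torch.randn(4, 3, 224, 224)
y = x.to("xpu")              # the same kind of transfer the loop hangs on
torch.xpu.synchronize()      # wait until the copy has actually completed
print("round trip ok:", torch.allclose(x, y.to("cpu")))
print("elapsed: %.1fs" % (time.time() - start))

If even this copy never returns, the problem sits below PyTorch (SYCL runtime or GPU driver) rather than in torchvision or the training code.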

P.S. Since I already mentioned this problem in here before opening this separate issue, I saw the reply here to my initial comment, but I have no idea how to apply that person's suggestion to solving this issue 😕

Issue Analytics

  • State: open
  • Created a year ago
  • Comments:18 (8 by maintainers)

Top GitHub Comments

2 reactions
sanchitintel commented, Nov 7, 2022

Hi @tedliosu! Thanks again for the info! We’ll investigate this issue while enabling Intel Extension for PyTorch for iGPUs. iGPUs are currently unsupported.

2 reactions
sanchitintel commented, Nov 1, 2022

Awesome! Thanks, @tedliosu! Looks like your iGPU is indeed tgllp! 😃
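
A quick way to check which device the extension actually picked up, and therefore whether it is one of the currently unsupported iGPUs mentioned above, is a sketch along these lines (again assuming the torch.xpu namespace of the IPEX GPU build):

import torch
import intel_extension_for_pytorch as ipex

# Print the name of every XPU device IPEX can see; an entry naming the
# integrated GPU (e.g. the "tgllp" device above) would explain the hang.
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))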

Read more comments on GitHub >

Top Results From Across the Web

TensorFlow: Each iteration in training for-loop slower [duplicate]
each append a new tf.reduce_mean() node to the graph at each iteration of the while loop, which adds overhead. Try to create them...
Read more >
Training loop stops after the first epoch in PyTorch
I'm trying to train a seq2seq model using PyTorch using the Multi30K dataset from Dutch to English language. Here is my snippet of...
Read more >
How to Train a Progressive Growing GAN in Keras for ...
A single training iteration involves first selecting a half batch of real images from the dataset and generating a half batch of fake...
Read more >
NRSA Individual Postdoctoral Fellowships FAQs (F32)
Most likely, yes. Whether the environment offers opportunities for new training is one of the criteria that reviewers of fellowship applications evaluate. If ...
Read more >
Implementation of a deep learning library in Futhark
type t = f32 ... For example is a composition of nested map-reduce ... model accuracy, since they tend to get "stuck" in...
Read more >
