
random seed is wrong implementation

See original GitHub issue

I cloned the latest version of open-reid (latest commit a1df21b). First, I ran the example code:

python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50

The result is:

Mean AP: 15.5%
CMC Scores    allshots      cuhk03  market1501
  top-1           7.1%       12.2%        7.1%
  top-5          23.6%       35.6%       23.6%
  top-10         32.9%       47.3%       32.9%

Then I ran the same code again on the same machine:

python examples/softmax_loss.py -d viper -b 64 -j 2 -a resnet50 --logs-dir logs/softmax-loss/viper-resnet50

The result is:

Mean AP: 15.6%
CMC Scores    allshots      cuhk03  market1501
  top-1           7.9%       13.0%        7.9%
  top-5          20.9%       32.8%       20.9%
  top-10         30.9%       44.8%       30.9%

It’s weird that they are different. It seems that these two lines have no effect:

https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/examples/softmax_loss.py#L71-L72

In the DataLoader, train_transformer uses RandomSizedRectCrop and RandomHorizontalFlip:

https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/examples/softmax_loss.py#L36-L41

But RandomSizedRectCrop and RandomHorizontalFlip use the Python built-in random module rather than numpy.random:

https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/reid/utils/data/transforms.py#L19-L42


import random

from PIL import Image


class RandomHorizontalFlip(object):
    """Horizontally flip the given PIL.Image randomly with a probability of 0.5."""

    def __call__(self, img):
        """
        Args:
            img (PIL.Image): Image to be flipped.
        Returns:
            PIL.Image: Randomly flipped image.
        """
        # Draws from the built-in random module, whose state is
        # completely independent of numpy.random.
        if random.random() < 0.5:
            return img.transpose(Image.FLIP_LEFT_RIGHT)
        return img

(Note: the RandomHorizontalFlip source quoted above comes from torchvision.transforms, which reid/utils/data/transforms.py pulls in with a star import.)

So in examples/softmax_loss.py, I imported random and changed:

def main(args):
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

to:

def main(args):
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)

Then I ran the same example code twice; the results were still different. Next, in reid/utils/data/transforms.py, I changed:

https://github.com/Cysu/open-reid/blob/a1df21b00f9d3ecfce1329fef55af11f406c16a8/reid/utils/data/transforms.py#L26-L29

to:

for attempt in range(10):
    area = img.size[0] * img.size[1]
    target_area = random.uniform(0.64, 1.0) * area
    print(target_area)
    aspect_ratio = random.uniform(2, 3)

Then I ran the example code twice. The printed target_area values differ between the first run and the second, indicating that random.seed(args.seed) does not take effect here. So I rewrote reid/utils/data/transforms.py with numpy.random. The final reid/utils/data/transforms.py is:

from __future__ import absolute_import

import math

from PIL import Image
from torchvision.transforms import *
import numpy as np


class RandomHorizontalFlip(object):
    """Horizontally flip the given PIL.Image randomly with a probability of 0.5."""

    def __call__(self, img):
        """
        Args:
            img (PIL.Image): Image to be flipped.
        Returns:
            PIL.Image: Randomly flipped image.
        """
        if np.random.random() < 0.5:
            return img.transpose(Image.FLIP_LEFT_RIGHT)
        return img


class RectScale(object):
    def __init__(self, height, width, interpolation=Image.BILINEAR):
        self.height = height
        self.width = width
        self.interpolation = interpolation

    def __call__(self, img):
        w, h = img.size
        if h == self.height and w == self.width:
            return img
        return img.resize((self.width, self.height), self.interpolation)


class RandomSizedRectCrop(object):
    def __init__(self, height, width, interpolation=Image.BILINEAR):
        self.height = height
        self.width = width
        self.interpolation = interpolation

    def __call__(self, img):
        for attempt in range(10):
            area = img.size[0] * img.size[1]
            target_area = np.random.uniform(0.64, 1.0) * area
            print(target_area)
            aspect_ratio = np.random.uniform(2, 3)

            h = int(round(math.sqrt(target_area * aspect_ratio)))
            w = int(round(math.sqrt(target_area / aspect_ratio)))

            if w <= img.size[0] and h <= img.size[1]:
                x1 = np.random.randint(0, img.size[0] - w + 1)
                y1 = np.random.randint(0, img.size[1] - h + 1)

                img = img.crop((x1, y1, x1 + w, y1 + h))
                assert(img.size == (w, h))

                return img.resize((self.width, self.height), self.interpolation)

        # Fallback
        scale = RectScale(self.height, self.width,
                          interpolation=self.interpolation)
        return scale(img)

Then I ran the example code twice. This time target_area is identical between the first run and the second, but the final results (mAP, CMC) still differ. I’m wondering what’s wrong with the code. Could you check the code and answer my question?

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
Cysu commented, Sep 12, 2017

@zydou I mean some of the CUDA kernels used by cuDNN or the Torch C implementation could be non-deterministic. One reason could be that floating-point addition is not associative. You can try 0.7 + 0.2 + 0.1 == 0.7 + 0.1 + 0.2 in Python; it will print False. This implies that a reduce op running across multiple threads / processes is non-deterministic.

When the batch size is set to 1, I suspect there is no need to call the reduce op at all, which would explain why the results then come out the same.

0 reactions
zydou commented, Sep 12, 2017

@Cysu Thanks a lot!
