Enable reproducibility
🚀 Feature request
To enable consistent benchmarking using the transformers library, deterministic behaviour needs to be enforced. This could be a simple option in TrainingArguments, e.g., enforce_reproducibility=True. Currently seeds are set, but randomness still occurs in CUDA operations and in the dataloaders.
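As a rough sketch of how the proposed option might look from a user's perspective (the enforce_reproducibility argument is only the suggestion above and does not exist in the current TrainingArguments API):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",          # placeholder output directory
    seed=4242,                     # existing seed argument
    enforce_reproducibility=True,  # proposed flag, not part of the current API
)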
Motivation
I am the maintainer of a Scandinavian benchmarking library for language models, which uses transformers under the hood. The benchmarking results are always slightly different, however, and this could be resolved via PyTorch's determinism settings as described here. See below for the concrete changes.
Your contribution
To ensure reproducibility, the set_seed function in trainer_utils.py needs to include the following:
import torch
import os
# Enable PyTorch deterministic mode. This potentially requires either the environment
# variable 'CUDA_LAUNCH_BLOCKING' or 'CUBLAS_WORKSPACE_CONFIG' to be set,
# depending on the CUDA version, so we set them both here
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':16:8'
torch.use_deterministic_algorithms(True)
# Enable CUDNN deterministic mode
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
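For illustration, here is a minimal sketch of what set_seed could look like with these lines folded in. The enforce_reproducibility argument is an assumption for the sketch, not the current signature of transformers.trainer_utils.set_seed:

import os
import random

import numpy as np
import torch


def set_seed(seed: int, enforce_reproducibility: bool = False):
    # Existing behaviour: seed the Python, NumPy and PyTorch RNGs
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    if enforce_reproducibility:
        # Proposed additions: deterministic CUDA/cuDNN behaviour
        os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False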
Furthermore, to enable determinism in PyTorch DataLoaders, the arguments generator and worker_init_fn need to be set. The generator is already set in the transformers library here, so we only need to set the worker_init_fn, as follows:
def seed_worker(_):
    worker_seed = torch.initial_seed() % 2**32
    set_seed(worker_seed)

dataloader = DataLoader(..., worker_init_fn=seed_worker)
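Putting the two pieces together, a minimal sketch of a fully seeded DataLoader; the dataset, batch size and seed value below are placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import set_seed


def seed_worker(_):
    # Derive each worker's seed from the base seed carried by the generator
    worker_seed = torch.initial_seed() % 2**32
    set_seed(worker_seed)


generator = torch.Generator()
generator.manual_seed(4242)

dataset = TensorDataset(torch.arange(100).unsqueeze(1))  # placeholder dataset
dataloader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    generator=generator,
    worker_init_fn=seed_worker,
)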
Issue Analytics
- Created a year ago
- Comments: 6 (5 by maintainers)
I have not 😃
@hasansalimkanmaz I have not started working on it, no, so if you’ve got the time to look at it then go ahead 😊
Unless @sgugger has already started on it?