[HELP] SentencePiece is not compatible with DataLoader on the Windows platform

We added a test to cover the compatibility between SentencePiece and DataLoader. The test passes on Linux but fails on Windows. We need some experts to help debug.

self = <test.experimental.test_transforms_with_asset.TestTransformsWithAsset testMethod=test_sentencepiece_with_dataloader>

    def test_sentencepiece_with_dataloader(self):
        sp_model_path = download_from_url(PRETRAINED_SP_MODEL['text_bpe_25000'])
        spm_processor = sentencepiece_processor(sp_model_path)
        _path = os.path.join(self.project_root, '.data', 'text_bpe_25000.model')
        os.remove(_path)
        example_strings = ['the pretrained spm model names'] * 64
        ref_results = torch.tensor([[13, 1465, 12824, 304, 24935, 5771, 3776]] * 16, dtype=torch.long)
    
        def batch_func(data):
            return torch.tensor([spm_processor(text) for text in data], dtype=torch.long)
    
        dataloader = DataLoader(example_strings, batch_size=16, num_workers=2, collate_fn=batch_func)
>       for item in dataloader:

test\experimental\test_transforms_with_asset.py:185: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
env\lib\site-packages\torch\utils\data\dataloader.py:359: in __iter__
    return self._get_iterator()
env\lib\site-packages\torch\utils\data\dataloader.py:301: in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
env\lib\site-packages\torch\utils\data\dataloader.py:885: in __init__
    w.start()
env\lib\multiprocessing\process.py:105: in start
    self._popen = self._Popen(self)
env\lib\multiprocessing\context.py:223: in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
env\lib\multiprocessing\context.py:322: in _Popen
    return Popen(process_obj)
env\lib\multiprocessing\popen_spawn_win32.py:65: in __init__
    reduction.dump(process_obj, to_child)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <Process(Process-1, initial daemon)>, file = <_io.BufferedWriter name=11>
protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       AttributeError: Can't pickle local object 'TestTransformsWithAsset.test_sentencepiece_with_dataloader.<locals>.batch_func'

env\lib\multiprocessing\reduction.py:60: AttributeError
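
The traceback ends inside `reduction.dump`, so the failure is a pickling problem rather than anything specific to SentencePiece. As a minimal sketch, the same behavior can be reproduced with the standard library alone. The names `module_level_collate`, `make_nested_collate`, and `can_pickle` are hypothetical helpers for illustration and are not part of torchtext or the test suite.

```python
import pickle


def module_level_collate(batch):
    """Defined at module scope, so pickle can look it up by qualified name."""
    return [len(text) for text in batch]


def make_nested_collate():
    def nested_collate(batch):  # local object, like batch_func in the failing test
        return [len(text) for text in batch]

    return nested_collate


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        # The exception type and message differ between Python versions.
        return False


if __name__ == "__main__":
    print("module-level:", can_pickle(module_level_collate))
    print("nested:", can_pickle(make_nested_collate()))
```

The nested function fails to pickle on every platform. It only surfaces on Windows here because the worker processes are started through `popen_spawn_win32`, which pickles the process object together with `collate_fn`.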

cc @peterjc123 @maxluk @nbcsm @guyang3532 @gunandrose4u @smartcat2010 @mszhanyi

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 11 (11 by maintainers)

Top GitHub Comments

peterjc123 commented, Nov 9, 2020

Change it to something like this:

def batch_func(data):
    return torch.tensor([spm_processor(text) for text in data], dtype=torch.long)

    @unittest.skipIf(platform.system() == "Windows", "Test is known to fail on Windows.")
    def test_sentencepiece_with_dataloader(self):
        sp_model_path = download_from_url(PRETRAINED_SP_MODEL['text_bpe_25000'])
        spm_processor = sentencepiece_processor(sp_model_path)
        _path = os.path.join(self.project_root, '.data', 'text_bpe_25000.model')
        os.remove(_path)
        example_strings = ['the pretrained spm model names'] * 64
        ref_results = torch.tensor([[13, 1465, 12824, 304, 24935, 5771, 3776]] * 16, dtype=torch.long)

        dataloader = DataLoader(example_strings, batch_size=16, num_workers=2, collate_fn=batch_func)
        for item in dataloader:
            self.assertEqual(item, ref_results)

batch_func is a nested function in your PR and it won’t work on Windows.

peterjc123 commented, Nov 9, 2020

Nested functions cannot be pickled, and on Windows the DataLoader workers are started with spawn, which has to pickle collate_fn. Please move it to the global namespace.
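
One thing to watch when moving `batch_func` to module scope: it can no longer close over the local `spm_processor`, so the processor has to be passed in some other way. A minimal sketch of one option is to bind it with `functools.partial`. This assumes the processor object is itself picklable, which the thread does not confirm for `sentencepiece_processor`; `toy_processor` below is a hypothetical stand-in so the example runs without torch or a model file.

```python
import pickle
from functools import partial


def toy_processor(text):
    """Hypothetical stand-in for spm_processor: maps each token to its length."""
    return [len(token) for token in text.split()]


def batch_func(processor, data):
    """Module-level collate function; the processor is passed in explicitly."""
    return [processor(text) for text in data]


# Bind the processor up front. DataLoader would then call collate_fn(batch).
collate_fn = partial(batch_func, toy_processor)

# Round-trip through pickle, which is what spawn does when starting a worker.
restored = pickle.loads(pickle.dumps(collate_fn))
print(restored(["a bc", "def"]))
```

In the real test, the result of `partial(batch_func, spm_processor)` would be passed as `collate_fn=` to `DataLoader`. If the processor turns out not to be picklable, an alternative is to pass the model path instead and load the processor inside the worker.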
