
UnicodeDecodeError when using run_mlm_flax.py

See original GitHub issue

Hi, I want to train a BERT model from scratch on a Turkish text corpus. First I created a tokenizer and loaded the text data from my local drive, as shown below:

from tokenizers import BertWordPieceTokenizer
import glob
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
)
files = glob.glob('/content/drive/MyDrive/Scorpus.txt')
# train() modifies the tokenizer in place and returns None, so no assignment is needed
tokenizer.train(
    files,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##"
)
tokenizer.save_model("/content/bert")
from datasets import load_dataset

# load the dataset, explicitly requesting UTF-8 decoding
dataset = load_dataset(
    'text',
    data_files={'train': ['/content/drive/MyDrive/Scorpus.txt']},
    encoding='utf-8',
)

Then I run run_mlm_flax.py:

!python run_mlm_flax.py \
    --output_dir="/content/bert" \
    --model_type="bert" \
    --config_name="/content/bert" \
    --tokenizer_name="/content/bert" \
    --line_by_line=True \
    --dataset_name="text" \
    --dataset_config_name="default-b06526c46e9384b1" \
    --max_seq_length="512" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="128" \
    --learning_rate="3e-4" \
    --overwrite_output_dir \
    --num_train_epochs="16" \
    --adam_beta1="0.9" 

And I get this error:


[19:02:31] - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/content/bert', overwrite_output_dir=True, do_train=False, do_eval=False, per_device_train_batch_size=128, per_device_eval_batch_size=8, learning_rate=0.0003, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adafactor=False, num_train_epochs=16.0, warmup_steps=0, logging_steps=500, save_steps=500, eval_steps=None, seed=42, push_to_hub=False, hub_model_id=None, hub_token=None)
[19:02:31] - WARNING - datasets.builder - Using custom data configuration default-b06526c46e9384b1-d2418f61cbe4411a
Downloading and preparing dataset text/default-b06526c46e9384b1 to /root/.cache/huggingface/datasets/text/default-b06526c46e9384b1-d2418f61cbe4411a/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100% 1/1 [00:00<00:00, 5190.97it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 543.80it/s]
Traceback (most recent call last):
  File "run_mlm_flax.py", line 880, in <module>
    main()
  File "run_mlm_flax.py", line 430, in main
    use_auth_token=True if model_args.use_auth_token else None,
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 1751, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1275, in _prepare_split
    generator, unit=" tables", leave=False, disable=(not logging.is_progress_bar_enabled())
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/text/text.py", line 77, in _generate_tables
    batch = f.read(self.config.chunksize)
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Note: I am running this on Google Colab.
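For what it's worth, byte 0xff in position 0 almost always means the file begins with a UTF-16 byte-order mark (0xFF 0xFE), i.e. the corpus was saved as UTF-16 rather than UTF-8. A small sketch to classify a file's leading bytes (the detect_bom helper is hypothetical, not part of datasets or transformers):

```python
def detect_bom(head: bytes) -> str:
    """Classify the leading bytes of a file by its byte-order mark (BOM)."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return "unknown"

# 0xff as the very first byte matches the UTF-16-LE BOM:
print(detect_bom(b"\xff\xfe\x00M"))  # utf-16-le
```

To check the actual corpus, read its first bytes: `detect_bom(open('/content/drive/MyDrive/Scorpus.txt', 'rb').read(4))`.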

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Aug 3, 2022

Let me cc @albertvillanova, who might have seen this error before 😃

1 reaction
hazalturkmen commented, Aug 5, 2022

Thank you @albertvillanova ! It fixed the issue 👍
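The thread doesn't quote the actual fix, but since 0xff at position 0 points to a UTF-16 byte-order mark, the usual remedy is to re-save the corpus as UTF-8 before training. A minimal sketch, assuming the file really is UTF-16 (the to_utf8 helper is hypothetical):

```python
def to_utf8(src: str, dst: str, src_encoding: str = "utf-16") -> None:
    """Re-save a text file as UTF-8, line by line, so tools that assume
    UTF-8 (like the datasets 'text' loader) can read it."""
    with open(src, encoding=src_encoding) as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)
```

After converting, point both the tokenizer training and run_mlm_flax.py at the new UTF-8 file.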


Top Results From Across the Web

  • error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff ...
    Python tries to convert a byte array (a bytes object, which it assumes to be a UTF-8-encoded string) to a Unicode string (str)....
  • UnicodeDecodeError - Python Wiki
    The UnicodeDecodeError normally happens when decoding a str string from a certain coding. Since codings map only a limited number of str ...
  • Python UnicodeDecodeError utf-8 codec can t decode byte ...
    I was really stuck with this problem. Alternatively we can use "encoding='unicode_escape'" with the same effect. import pandas as pd data=pd.
  • UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d
    Traceback (most recent call last): File "Conditional.py", line 108, in module for line in file1: File "cp1252.py", line 23,...
  • 'charmap' codec can't decode byte 0x81 in position X ... - GitHub
    OS: Windows Pytext Version: Tried with 1.10.0 and 1.11.1, error stack on 1.11.1 I am ... manually with the cli command: $ jupytext...
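The first result above describes the failure mode in general terms; it can be reproduced in a few lines (a sketch, not tied to the issue's actual file):

```python
# Bytes produced by the UTF-16 codec start with a byte-order mark (b"\xff\xfe"
# on little-endian machines); decoding them as UTF-8 reproduces the traceback's
# "invalid start byte" error at position 0.
data = "deneme".encode("utf-16")

try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc.reason, "at position", exc.start)

# The same bytes decode fine with the matching codec:
assert data.decode("utf-16") == "deneme"
```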
