UnicodeDecodeError when using run_mlm_flax.py
See original GitHub issue

Hi, I want to build a BERT model from scratch using a Turkish text corpus. First I created a tokenizer and loaded the text data from my local drive, as shown below:
from tokenizers import BertWordPieceTokenizer
import glob

tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
)

files = glob.glob('/content/drive/MyDrive/Scorpus.txt')
tokenizer.train(
    files,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)
tokenizer.save_model("/content/bert")
from datasets import load_dataset

# load the dataset from the local text file
dataset = load_dataset(
    'text',
    data_files={'train': ['/content/drive/MyDrive/Scorpus.txt']},
    encoding='utf-8',
)
Then, I run run_mlm_flax.py:
!python run_mlm_flax.py \
--output_dir="/content/bert" \
--model_type="bert" \
--config_name="/content/bert" \
--tokenizer_name="/content/bert" \
--line_by_line=True \
--dataset_name="text" \
--dataset_config_name="default-b06526c46e9384b1" \
--max_seq_length="512" \
--weight_decay="0.01" \
--per_device_train_batch_size="128" \
--learning_rate="3e-4" \
--overwrite_output_dir \
--num_train_epochs="16" \
--adam_beta1="0.9"
And I get an error:
[19:02:31] - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/content/bert', overwrite_output_dir=True, do_train=False, do_eval=False, per_device_train_batch_size=128, per_device_eval_batch_size=8, learning_rate=0.0003, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adafactor=False, num_train_epochs=16.0, warmup_steps=0, logging_steps=500, save_steps=500, eval_steps=None, seed=42, push_to_hub=False, hub_model_id=None, hub_token=None)
[19:02:31] - WARNING - datasets.builder - Using custom data configuration default-b06526c46e9384b1-d2418f61cbe4411a
Downloading and preparing dataset text/default-b06526c46e9384b1 to /root/.cache/huggingface/datasets/text/default-b06526c46e9384b1-d2418f61cbe4411a/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100% 1/1 [00:00<00:00, 5190.97it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 543.80it/s]
Traceback (most recent call last):
  File "run_mlm_flax.py", line 880, in <module>
    main()
  File "run_mlm_flax.py", line 430, in main
    use_auth_token=True if model_args.use_auth_token else None,
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 1751, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1275, in _prepare_split
    generator, unit=" tables", leave=False, disable=(not logging.is_progress_bar_enabled())
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/text/text.py", line 77, in _generate_tables
    batch = f.read(self.config.chunksize)
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Note: I am using Google Colab.
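The first byte being 0xff is a strong hint: UTF-16 little-endian files begin with the byte-order mark FF FE, which is exactly what the utf-8 codec rejects at position 0. Before pointing the script at a corpus file, you can sniff its leading bytes with the standard library alone; this is a sketch, and `detect_bom` is an illustrative helper name, not part of datasets or tokenizers:

```python
def detect_bom(path):
    """Guess a text file's encoding from its leading byte-order mark."""
    with open(path, "rb") as f:               # read raw bytes, no decoding
        head = f.read(4)
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "utf-16"                       # BOM present; the utf-16 codec will consume it
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"                    # UTF-8 with BOM
    return "utf-8"                            # no BOM: assume plain UTF-8
```

If this returns "utf-16" for your corpus, the file was not saved as UTF-8 and the text loader will fail exactly as in the traceback above.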
Issue Analytics
- Created a year ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Let me cc @albertvillanova, who might have seen this error before 😃
Thank you @albertvillanova ! It fixed the issue 👍
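The thread does not spell out the actual fix, but since the decoder chokes on byte 0xff at position 0 (the start of a UTF-16 byte-order mark), the corpus file was almost certainly not saved as UTF-8. Under that assumption, one way to resolve the error is to re-encode the file once before training; the helper name and output path below are illustrative, not from the issue:

```python
def reencode_to_utf8(src, dst, src_encoding="utf-16"):
    """Read a text file in its original encoding and rewrite it as UTF-8."""
    with open(src, "r", encoding=src_encoding) as f:
        text = f.read()                       # the utf-16 codec consumes the BOM
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)

# e.g. reencode_to_utf8('/content/drive/MyDrive/Scorpus.txt',
#                       '/content/drive/MyDrive/Scorpus_utf8.txt')
```

Both the tokenizer training step and run_mlm_flax.py would then point at the converted file.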