BertWordPieceTokenizer cannot be pickled
🐛 Bug
Information
Model I am using (Bert, XLNet …): Bert
Language I am using the model on (English, Chinese …): English
The problem arises when using:
- my own modified scripts:
The task I am working on is:
- my own task or dataset:
To reproduce
import torch
import tokenizers
import pandas as pd
from torch.utils import data


class config:
    MAX_LEN = 128
    TRAIN_BATCH_SIZE = 64
    VALID_BATCH_SIZE = 16
    EPOCHS = 5
    BERT_PATH = "../input/bert-base-uncased/"
    MODEL_PATH = "model.bin"
    TRAINING_FILE = "../input/tweet-sentiment-extraction/train_folds.csv"
    TOKENIZER = tokenizers.BertWordPieceTokenizer(
        f"{BERT_PATH}/vocab.txt",
        lowercase=True
    )


def process_data(tweet, selected_text, sentiment, tokenizer, max_len):
    len_st = len(selected_text)
    idx0 = -1
    idx1 = -1
    for ind in (i for i, e in enumerate(tweet) if e == selected_text[0]):
        if tweet[ind: ind+len_st] == selected_text:
            idx0 = ind
            idx1 = ind + len_st - 1
            break

    char_targets = [0] * len(tweet)
    if idx0 != -1 and idx1 != -1:
        for ct in range(idx0, idx1 + 1):
            char_targets[ct] = 1

    tok_tweet = tokenizer.encode(tweet)
    input_ids_orig = tok_tweet.ids[1:-1]
    tweet_offsets = tok_tweet.offsets[1:-1]

    target_idx = []
    for j, (offset1, offset2) in enumerate(tweet_offsets):
        if sum(char_targets[offset1: offset2]) > 0:
            target_idx.append(j)

    targets_start = target_idx[0]
    targets_end = target_idx[-1]

    sentiment_id = {
        'positive': 3893,
        'negative': 4997,
        'neutral': 8699
    }

    input_ids = [101] + [sentiment_id[sentiment]] + [102] + input_ids_orig + [102]
    token_type_ids = [0, 0, 0] + [1] * (len(input_ids_orig) + 1)
    mask = [1] * len(token_type_ids)
    tweet_offsets = [(0, 0)] * 3 + tweet_offsets + [(0, 0)]
    targets_start += 3
    targets_end += 3

    padding_length = max_len - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        tweet_offsets = tweet_offsets + ([(0, 0)] * padding_length)

    return {
        'ids': input_ids,
        'mask': mask,
        'token_type_ids': token_type_ids,
        'targets_start': targets_start,
        'targets_end': targets_end,
        'orig_tweet': tweet,
        'orig_selected': selected_text,
        'sentiment': sentiment,
        'offsets': tweet_offsets
    }


class TweetDataset(data.Dataset):
    def __init__(self, tweet, sentiment, selected_text):
        self.tweet = tweet
        self.sentiment = sentiment
        self.selected_text = selected_text
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN

    def __len__(self):
        return len(self.tweet)

    def __getitem__(self, item):
        data = process_data(
            self.tweet[item],
            self.selected_text[item],
            self.sentiment[item],
            self.tokenizer,
            self.max_len
        )

        return {
            'ids': torch.tensor(data["ids"], dtype=torch.long),
            'mask': torch.tensor(data["mask"], dtype=torch.long),
            'token_type_ids': torch.tensor(data["token_type_ids"], dtype=torch.long),
            'targets_start': torch.tensor(data["targets_start"], dtype=torch.long),
            'targets_end': torch.tensor(data["targets_end"], dtype=torch.long),
            'orig_tweet': data["orig_tweet"],
            'orig_selected': data["orig_selected"],
            'sentiment': data["sentiment"],
            'offsets': torch.tensor(data["offsets"], dtype=torch.long)
        }


dfx = pd.read_csv(config.TRAINING_FILE)
fold = 4
df_train = dfx[dfx.kfold != fold].reset_index(drop=True)
df_valid = dfx[dfx.kfold == fold].reset_index(drop=True)

train_dataset = TweetDataset(
    tweet=dfx.text.values,
    sentiment=dfx.sentiment.values,
    selected_text=dfx.selected_text.values
)

train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=1
)

if __name__ == '__main__':
    a = enumerate(train_data_loader)
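
To narrow down which attribute of the dataset blocks pickling, a small helper along these lines may be useful. This is only a sketch and not part of the original script: find_unpicklable_attrs and FakeDataset are hypothetical names, and a threading.Lock stands in for the tokenizer so the snippet runs without the tokenizers package installed.

```python
import pickle
import threading


def find_unpicklable_attrs(obj):
    """Return the names of instance attributes that pickle.dumps rejects."""
    bad = []
    for name, value in vars(obj).items():
        try:
            pickle.dumps(value)
        except Exception:  # the error type varies with the offending object
            bad.append(name)
    return bad


class FakeDataset:
    """Hypothetical stand-in for TweetDataset (assumption, not the real class)."""

    def __init__(self):
        self.tweet = ["hello world"]
        self.max_len = 128
        # A lock cannot be pickled; it plays the tokenizer's role here.
        self.tokenizer = threading.Lock()


print(find_unpicklable_attrs(FakeDataset()))  # prints ['tokenizer']
```

Running the same helper on the real train_dataset should point at the tokenizer attribute if the tokenizer is indeed what fails to pickle.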
Expected behavior
enumerate(train_data_loader) should return an iterator over the batches instead of raising a pickling error.
Environment info
Output of transformers-cli env
- transformers version: 2.9.1
- Platform: Windows-10-10.0.18362-SP0
- Python version: 3.8.2
- PyTorch version (GPU?): 1.5.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: No
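
On Windows, DataLoader worker processes are started with the spawn method, so with num_workers=1 the dataset object is pickled and sent to the worker. Any attribute that cannot be pickled makes that step fail. One possible workaround, sketched below under assumptions, is to store only the vocab path on the dataset and build the tokenizer lazily inside whichever process needs it. LazyTokenizerDataset and _build_tokenizer are hypothetical names, and a threading.Lock again stands in for the tokenizer so the sketch runs with the standard library alone.

```python
import pickle
import threading


class LazyTokenizerDataset:
    """Hypothetical dataset that stores how to build the tokenizer, not the tokenizer."""

    def __init__(self, tweets, vocab_path):
        self.tweets = tweets
        self.vocab_path = vocab_path
        self._tokenizer = None  # built on first access, in the current process

    def _build_tokenizer(self):
        # Stand-in (assumption). In the script above this would be:
        # tokenizers.BertWordPieceTokenizer(self.vocab_path, lowercase=True)
        return threading.Lock()

    @property
    def tokenizer(self):
        if self._tokenizer is None:
            self._tokenizer = self._build_tokenizer()
        return self._tokenizer

    def __getstate__(self):
        # Drop the live tokenizer so pickling never touches it;
        # the worker rebuilds it on first access.
        state = self.__dict__.copy()
        state["_tokenizer"] = None
        return state

    def __len__(self):
        return len(self.tweets)


if __name__ == "__main__":
    ds = LazyTokenizerDataset(["hello world"], "vocab.txt")
    _ = ds.tokenizer                        # force construction in the parent
    clone = pickle.loads(pickle.dumps(ds))  # works despite the unpicklable member
    print(clone._tokenizer is None)         # tokenizer was not carried across
```

Setting num_workers=0 avoids the pickling step altogether, at the cost of loading data in the main process.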
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:7 (1 by maintainers)
Top GitHub Comments
And this probably should be moved to the tokenizer repo @sshleifer to confirm.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.