BertWordPieceTokenizer cannot be pickled

See original GitHub issue

🐛 Bug

Information

Model I am using (Bert, XLNet …): Bert

Language I am using the model on (English, Chinese …): English

The problem arises when using:

  • my own modified scripts

The task I am working on is:

  • my own task or dataset

To reproduce

import torch
import tokenizers
import pandas as pd
from torch.utils import data

class config:
    MAX_LEN = 128
    TRAIN_BATCH_SIZE = 64
    VALID_BATCH_SIZE = 16
    EPOCHS = 5
    BERT_PATH = "../input/bert-base-uncased/"
    MODEL_PATH = "model.bin"
    TRAINING_FILE = "../input/tweet-sentiment-extraction/train_folds.csv"
    TOKENIZER = tokenizers.BertWordPieceTokenizer(
        f"{BERT_PATH}/vocab.txt", 
        lowercase=True
    )
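
# NOTE: config.TOKENIZER wraps a Rust-backed tokenizer object and is later stored
# on every TweetDataset instance; with num_workers > 0 on Windows the DataLoader
# spawns worker processes and pickles the dataset, which appears to be what fails.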

def process_data(tweet, selected_text, sentiment, tokenizer, max_len):
    len_st = len(selected_text)
    idx0 = -1
    idx1 = -1
    for ind in (i for i, e in enumerate(tweet) if e == selected_text[0]):
        if tweet[ind: ind+len_st] == selected_text:
            idx0 = ind
            idx1 = ind + len_st - 1
            break

    char_targets = [0] * len(tweet)
    if idx0 != -1 and idx1 != -1:
        for ct in range(idx0, idx1 + 1):
            char_targets[ct] = 1
    
    tok_tweet = tokenizer.encode(tweet)
    input_ids_orig = tok_tweet.ids[1:-1]
    tweet_offsets = tok_tweet.offsets[1:-1]
    
    target_idx = []
    for j, (offset1, offset2) in enumerate(tweet_offsets):
        if sum(char_targets[offset1: offset2]) > 0:
            target_idx.append(j)
    
    targets_start = target_idx[0]
    targets_end = target_idx[-1]

    sentiment_id = {
        'positive': 3893,
        'negative': 4997,
        'neutral': 8699
    }
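    # The sentiment_id values above are assumed to be the bert-base-uncased vocab ids
    # for the literal tokens "positive", "negative" and "neutral"; 101 and 102 below
    # are the [CLS] and [SEP] token ids in that vocab.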
    
    input_ids = [101] + [sentiment_id[sentiment]] + [102] + input_ids_orig + [102]
    token_type_ids = [0, 0, 0] + [1] * (len(input_ids_orig) + 1)
    mask = [1] * len(token_type_ids)
    tweet_offsets = [(0, 0)] * 3 + tweet_offsets + [(0, 0)]
    targets_start += 3
    targets_end += 3

    padding_length = max_len - len(input_ids)
    if padding_length > 0:
        input_ids = input_ids + ([0] * padding_length)
        mask = mask + ([0] * padding_length)
        token_type_ids = token_type_ids + ([0] * padding_length)
        tweet_offsets = tweet_offsets + ([(0, 0)] * padding_length)
    
    return {
        'ids': input_ids,
        'mask': mask,
        'token_type_ids': token_type_ids,
        'targets_start': targets_start,
        'targets_end': targets_end,
        'orig_tweet': tweet,
        'orig_selected': selected_text,
        'sentiment': sentiment,
        'offsets': tweet_offsets
    }

class TweetDataset(data.Dataset):
    def __init__(self, tweet, sentiment, selected_text):
        self.tweet = tweet
        self.sentiment = sentiment
        self.selected_text = selected_text
        self.tokenizer = config.TOKENIZER
        self.max_len = config.MAX_LEN
    
    def __len__(self):
        return len(self.tweet)

    def __getitem__(self, item):
        data = process_data(
            self.tweet[item], 
            self.selected_text[item], 
            self.sentiment[item],
            self.tokenizer,
            self.max_len
        )

        return {
            'ids': torch.tensor(data["ids"], dtype=torch.long),
            'mask': torch.tensor(data["mask"], dtype=torch.long),
            'token_type_ids': torch.tensor(data["token_type_ids"], dtype=torch.long),
            'targets_start': torch.tensor(data["targets_start"], dtype=torch.long),
            'targets_end': torch.tensor(data["targets_end"], dtype=torch.long),
            'orig_tweet': data["orig_tweet"],
            'orig_selected': data["orig_selected"],
            'sentiment': data["sentiment"],
            'offsets': torch.tensor(data["offsets"], dtype=torch.long)
        }

dfx = pd.read_csv(config.TRAINING_FILE)
fold = 4
df_train = dfx[dfx.kfold != fold].reset_index(drop=True)
df_valid = dfx[dfx.kfold == fold].reset_index(drop=True)

train_dataset = TweetDataset(
    tweet=df_train.text.values,  # use the training-fold split computed above, not the full dfx
    sentiment=df_train.sentiment.values,
    selected_text=df_train.selected_text.values
)

train_data_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=config.TRAIN_BATCH_SIZE,
    num_workers=1
)

if __name__ == '__main__':
    a = enumerate(train_data_loader)
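
The DataLoader is not strictly required to trigger the failure: with the versions listed under "Environment info" below, pickling the tokenizer directly appears to fail on its own. A minimal sketch, reusing the vocab path from the script above:

import pickle
import tokenizers

tok = tokenizers.BertWordPieceTokenizer(
    "../input/bert-base-uncased/vocab.txt",  # same vocab file as in config above
    lowercase=True
)

try:
    pickle.dumps(tok)  # older tokenizers releases raise here: the Rust-backed object has no pickle support
except Exception as exc:
    print(f"Tokenizer is not picklable: {exc!r}")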

Expected behavior

enumerate should return an iterator over the batches of the DataLoader instead of raising a pickling error when the worker process is started (the tokenizer stored on the dataset cannot be pickled).
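
A common way to sidestep this kind of failure, sketched below assuming the rest of the script stays as written, is to build the tokenizer lazily inside each worker process instead of storing it on the dataset up front; setting num_workers=0 also avoids spawning worker processes altogether:

class TweetDataset(data.Dataset):
    def __init__(self, tweet, sentiment, selected_text):
        self.tweet = tweet
        self.sentiment = sentiment
        self.selected_text = selected_text
        self.max_len = config.MAX_LEN
        self._tokenizer = None  # nothing un-picklable is stored on the instance

    @property
    def tokenizer(self):
        # Each worker process constructs its own tokenizer on first use, so the
        # Rust-backed object never has to cross a process boundary via pickle.
        if self._tokenizer is None:
            self._tokenizer = tokenizers.BertWordPieceTokenizer(
                f"{config.BERT_PATH}/vocab.txt", lowercase=True
            )
        return self._tokenizer

    # __len__ and __getitem__ are unchanged; __getitem__ still passes
    # self.tokenizer (now the property above) to process_data.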

Environment info

Output of transformers-cli env

  • transformers version: 2.9.1
  • Platform: Windows-10-10.0.18362-SP0
  • Python version: 3.8.2
  • PyTorch version (GPU?): 1.5.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

1 reaction
Sriharsha-hatwar commented, May 21, 2020

And this probably should be moved to the tokenizers repo; @sshleifer to confirm.

0 reactions
stale[bot] commented, Jul 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
