
Tapas not working with tables exceeding token limit

See original GitHub issue

Environment info

  • transformers version: 4.3.0
  • Platform: MacOS
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7.1
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@LysandreJik @sgugger @NielsRogge

Information

Model I am using (Bert, XLNet …): TaPas

To reproduce

When executing the following code, using this table, I get an IndexError: index out of range in self.

from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering
import pandas as pd

# drop_rows_to_fit=True should truncate rows so the table fits the model's token limit
tokenizer = AutoTokenizer.from_pretrained("google/tapas-base-finetuned-wtq", drop_rows_to_fit=True)
model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

# TaPas expects the table as a pandas DataFrame of strings
df = pd.read_csv("table.tsv", sep="\t").astype(str)

queries = ["How big is Ardeen?"]
inputs = tokenizer(table=df, queries=queries, padding="max_length", truncation=True, return_tensors="pt")

# Raises IndexError: index out of range in self
outputs = model(**inputs)

I am not completely sure about the cause of the error, but I suspect that the column rank vectors are not generated correctly: torch.max(token_type_ids[:, :, 4]) returns 298 and torch.max(token_type_ids[:, :, 5]) returns 302, whereas the embedding layers for column rank and inverse column rank only allow a maximum value of 255.
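
For reference, a quick way to verify this is to compare the maximum rank IDs the tokenizer produces with the sizes of the corresponding type-vocabulary embeddings in the model config. The snippet below is a minimal sketch that reuses the inputs and model from the reproduction code above; it only reads values that the TaPas tokenizer and config already expose (token_type_ids dimensions 4 and 5, and config.type_vocab_sizes).

import torch

# token_type_ids has shape (batch, seq_len, 7); dimensions 4 and 5 hold
# the column rank and inverse column rank IDs produced by the tokenizer.
token_type_ids = inputs["token_type_ids"]
max_rank = int(torch.max(token_type_ids[:, :, 4]))
max_inv_rank = int(torch.max(token_type_ids[:, :, 5]))

# type_vocab_sizes defines how many IDs each embedding table accepts,
# e.g. [3, 256, 256, 2, 256, 256, 10] for this checkpoint.
vocab_sizes = model.config.type_vocab_sizes
print(max_rank, "allowed max:", vocab_sizes[4] - 1)      # 298 vs 255
print(max_inv_rank, "allowed max:", vocab_sizes[5] - 1)  # 302 vs 255

Any rank ID above 255 indexes past the end of the embedding table, which is exactly the IndexError (or the device-side assert on GPU) described here.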

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
yuhaozhang commented, Aug 7, 2021

@NielsRogge Thanks for the explanations above. Has there been any update on this issue? I have also run into it when running TaPas on the WTQ dataset, and it took me a lot of effort to get to the bottom of it and realize that it is caused by the column_rank IDs from oversized tables.

The painful part is that there is currently no guard or warning against feeding oversized tables into the tokenizer, and the issue only surfaces as a “CUDA error: device-side assert triggered” message when the TaPas forward pass is run.

I think there are several potential ways to solve this or make this less painful:

  1. Compute the column ranks after table truncation (as already suggested by another comment above). This makes a ton of sense because the table is only presented to the model after truncation in the tokenizer anyway, so there is no point in maintaining a non-contiguous column rank for large tables (with some ranks removed due to truncation). I understand that the original TF implementation might not handle this, but can this be added as a behavior in the Huggingface implementation?

  2. Add an option to re-map all out-of-range column ranks to the maximum rank value. This could be implemented in this tokenizer function: https://github.com/huggingface/transformers/blob/7fcee113c163a95d1b125ef35dc49a0a1aa13a50/src/transformers/models/tapas/tokenization_tapas.py#L1487 This is less ideal than option 1, but it would ensure the model does not crash with an index-out-of-range error (a workaround sketch is included after this list).

  3. The easiest fix would be to add a warning or exception in the tokenizer that reminds users about this, or have the tokenizer return a None value in the output, or a special boolean flag such as table_oversized. This does not solve the underlying problem, but it would make the issue much easier to catch.
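
As a stop-gap illustration of option 2, the rank IDs can be clamped on the tokenizer output before calling the model. This is only a workaround sketch, not part of the library: clamp_rank_ids is a hypothetical helper, the threshold is read from config.type_vocab_sizes, and clamping collapses all rows beyond the limit onto the same rank value.

import torch

def clamp_rank_ids(inputs, config):
    # Hypothetical workaround: clip the column rank and inverse column rank
    # IDs (dimensions 4 and 5 of token_type_ids) to the largest index the
    # corresponding embedding tables accept, so the forward pass cannot
    # raise an index-out-of-range error.
    token_type_ids = inputs["token_type_ids"].clone()
    for dim in (4, 5):
        max_id = config.type_vocab_sizes[dim] - 1
        token_type_ids[:, :, dim] = torch.clamp(token_type_ids[:, :, dim], max=max_id)
    inputs["token_type_ids"] = token_type_ids
    return inputs

inputs = clamp_rank_ids(inputs, model.config)
outputs = model(**inputs)  # no longer crashes, at the cost of distorted ranks for oversized tables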

Looking forward to updates on this issue.

1 reaction
NielsRogge commented, Feb 10, 2021

So the author replied:

IIRC, then we compute them before pruning the table. That was by design so that those ranks would match the original numeric rank (pre-pruning). It’s true that the rank could thus exceed the vocab size. We could add some trimming to prevent that.

So this is something that could be added in the future (together with the prune_columns option). I put it on my to-do list for now.

