Tapas not working with tables exceeding token limit
Environment info
- `transformers` version: 4.3.0
- Platform: MacOS
- Python version: 3.7
- PyTorch version (GPU?): 1.7.1
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
@LysandreJik @sgugger @NielsRogge
Information
Model I am using (Bert, XLNet …): TaPas
To reproduce
When executing the following code, using this table, I get an `IndexError: index out of range in self`.
```python
from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("google/tapas-base-finetuned-wtq", drop_rows_to_fit=True)
model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

df = pd.read_csv("table.tsv", sep="\t").astype(str)
queries = ["How big is Ardeen?"]
inputs = tokenizer(table=df, queries=queries, padding="max_length", truncation=True, return_tensors="pt")
outputs = model(**inputs)
```
I am not completely sure about the cause of the error, but I suspect that the column rank vectors are not correctly generated: `torch.max(token_type_ids[:, :, 4])` returns 298 and `torch.max(token_type_ids[:, :, 5])` returns 302, while the embedding layers for column rank and inverse column rank only allow a maximum value of 255.
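The mismatch described above can be detected before the forward pass. Below is a minimal, hypothetical guard (not part of the `transformers` API) that validates each `token_type_ids` dimension against the embedding table sizes; the vocab sizes follow the default `TapasConfig` (`type_vocab_sizes = [3, 256, 256, 2, 256, 256, 10]`), where dimensions 4 and 5 are column rank and inverse column rank. Plain nested lists stand in for the tokenizer's tensor so the sketch runs without `torch`.

```python
# Default TapasConfig embedding table sizes for the 7 token-type dimensions.
TYPE_VOCAB_SIZES = [3, 256, 256, 2, 256, 256, 10]

def check_token_type_ids(token_type_ids, vocab_sizes=TYPE_VOCAB_SIZES):
    """Return a list of (dimension, max_id, vocab_size) violations.

    `token_type_ids` is a nested list shaped (batch, seq_len, 7),
    mirroring the tensor returned by the TAPAS tokenizer.
    """
    violations = []
    for dim, size in enumerate(vocab_sizes):
        max_id = max(ids[dim] for batch in token_type_ids for ids in batch)
        if max_id >= size:  # valid IDs are 0 .. size - 1
            violations.append((dim, max_id, size))
    return violations

# Mimic the values reported in the issue: ranks up to 298/302 in dims 4 and 5.
fake_ids = [[[0, 0, 0, 0, 298, 302, 0], [0, 1, 1, 0, 10, 10, 0]]]
print(check_token_type_ids(fake_ids))  # [(4, 298, 256), (5, 302, 256)]
```

Running such a check right after tokenization would surface the problem as a clear message instead of an `IndexError` (CPU) or a device-side assert (GPU).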
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@NielsRogge Thanks for the explanations above. Has there been any update on this issue? I have also run into it when running TaPas on the WTQ dataset, and it took me a lot of effort to get to the bottom of it and realize that it is caused by the `column_rank` IDs from oversized tables.

The painful part is that there is currently no guard or warning against feeding oversized tables into the tokenizer, and the issue only surfaces as a "CUDA error: device-side assert triggered" message when the TaPas forward pass is run.

I think there are several potential ways to solve this or make it less painful:

1. Compute the column rank after the table truncation (as already suggested by another comment above). This makes a ton of sense, because the model only sees the table after truncation in the tokenizer anyway, so there is no point in maintaining a non-contiguous column rank for large tables (with some ranks removed due to truncation). I understand that the original TF implementation might not handle this, but can this be added as a behavior in the Hugging Face implementation?
2. Add an option to re-map all the large column ranks to the max rank value. This could be implemented in this tokenizer function: https://github.com/huggingface/transformers/blob/7fcee113c163a95d1b125ef35dc49a0a1aa13a50/src/transformers/models/tapas/tokenization_tapas.py#L1487 This is less ideal than option 1, but would make sure the model does not crash with an index-out-of-range error.
3. The easiest fix would be to add a warning/exception in the tokenizer that reminds users about this. Or let the tokenizer return a `None` value in the output, or a special boolean variable such as `table_oversized`. This does not solve anything, but would make this issue much easier to catch.

Looking forward to some updates on this issue.
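Option 2 from the comment above can be sketched as a small post-processing step on the tokenizer output: clamp the column-rank and inverse-column-rank dimensions of `token_type_ids` so no ID exceeds what the embedding tables can represent. The dimension indices (4 and 5) and the 256-entry vocab size follow the default `TapasConfig`; the helper name is hypothetical, not part of the `transformers` API.

```python
import torch

RANK_DIMS = (4, 5)       # column rank, inverse column rank
RANK_VOCAB_SIZE = 256    # default TapasConfig type_vocab_sizes entry

def clamp_rank_ids(token_type_ids: torch.Tensor) -> torch.Tensor:
    """Clamp rank IDs to the max representable value (vocab_size - 1)."""
    clamped = token_type_ids.clone()
    for dim in RANK_DIMS:
        clamped[:, :, dim] = clamped[:, :, dim].clamp(max=RANK_VOCAB_SIZE - 1)
    return clamped

# Example: the out-of-range values from the issue (298 and 302) get
# mapped to 255, so the embedding lookup no longer fails.
ids = torch.zeros(1, 2, 7, dtype=torch.long)
ids[0, 0, 4], ids[0, 0, 5] = 298, 302
safe = clamp_rank_ids(ids)
print(safe[0, 0, 4].item(), safe[0, 0, 5].item())  # 255 255
```

As the comment notes, this loses the distinction between very high ranks, but it trades a hard crash for a graceful degradation.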
So the author replied:

> So this is something that could be added in the future (together with the `prune_columns` option). I put it on my to-do list for now.
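For reference, option 1 (recomputing ranks after truncation) can be illustrated with plain pandas, independent of the tokenizer internals. This is a hypothetical sketch, not the `transformers` implementation: rows are dropped first, then each column is ranked densely so the rank IDs stay contiguous and bounded by the number of surviving rows.

```python
import pandas as pd

def rank_after_truncation(df: pd.DataFrame, max_rows: int) -> pd.DataFrame:
    """Truncate to `max_rows` rows, then rank each column's values densely."""
    truncated = df.head(max_rows)
    # rank(method="dense") yields contiguous ranks 1..k with no gaps, so
    # the maximum rank ID is bounded by the number of kept rows.
    return truncated.rank(method="dense").astype(int)

df = pd.DataFrame({"population": [500, 1200, 900, 700, 300]})
ranks = rank_after_truncation(df, max_rows=3)
print(ranks["population"].tolist())  # [1, 3, 2]
```

Ranking before truncation, by contrast, can leave gaps and maxima far above the row budget, which is exactly how the out-of-range IDs in this issue arise.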