Tapas not working with tables exceeding token limit
Environment info
- `transformers` version: 4.3.0
- Platform: MacOS
- Python version: 3.7
- PyTorch version (GPU?): 1.7.1
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
@LysandreJik @sgugger @NielsRogge
Information
Model I am using (Bert, XLNet …): TaPas
To reproduce
When executing the following code, using this table, I get an `IndexError: index out of range in self`.
```python
from transformers import AutoTokenizer, AutoModelForTableQuestionAnswering
import pandas as pd

tokenizer = AutoTokenizer.from_pretrained("google/tapas-base-finetuned-wtq", drop_rows_to_fit=True)
model = AutoModelForTableQuestionAnswering.from_pretrained("google/tapas-base-finetuned-wtq")

df = pd.read_csv("table.tsv", sep="\t").astype(str)
queries = ["How big is Ardeen?"]
inputs = tokenizer(table=df, queries=queries, padding="max_length", truncation=True, return_tensors="pt")
outputs = model(**inputs)
```
I am not completely sure about the cause of the error, but I suspect that the column rank vectors are not correctly generated: `torch.max(token_type_ids[:, :, 4])` returns 298 and `torch.max(token_type_ids[:, :, 5])` returns 302, while the embedding layers for column rank and inverse column rank only allow a maximum value of 255.
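The mismatch described above can be detected before the forward pass. Below is a minimal, hypothetical guard (not part of the `transformers` API) that validates each `token_type_ids` dimension against the embedding table sizes; the vocab sizes follow the default `TapasConfig` (`type_vocab_sizes = [3, 256, 256, 2, 256, 256, 10]`), where dimensions 4 and 5 are column rank and inverse column rank. Plain nested lists stand in for the tokenizer's tensor so the sketch runs without `torch`.

```python
# Default TapasConfig embedding table sizes for the 7 token-type dimensions.
TYPE_VOCAB_SIZES = [3, 256, 256, 2, 256, 256, 10]

def check_token_type_ids(token_type_ids, vocab_sizes=TYPE_VOCAB_SIZES):
    """Return a list of (dimension, max_id, vocab_size) violations.

    `token_type_ids` is a nested list shaped (batch, seq_len, 7),
    mirroring the tensor returned by the TAPAS tokenizer.
    """
    violations = []
    for dim, size in enumerate(vocab_sizes):
        max_id = max(ids[dim] for batch in token_type_ids for ids in batch)
        if max_id >= size:  # valid IDs are 0 .. size - 1
            violations.append((dim, max_id, size))
    return violations

# Mimic the values reported in the issue: ranks up to 298/302 in dims 4 and 5.
fake_ids = [[[0, 0, 0, 0, 298, 302, 0], [0, 1, 1, 0, 10, 10, 0]]]
print(check_token_type_ids(fake_ids))  # [(4, 298, 256), (5, 302, 256)]
```

Running such a check right after tokenization would surface the problem as a clear message instead of an `IndexError` (CPU) or a device-side assert (GPU).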
Issue Analytics
- State:
- Created: 3 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
@NielsRogge Thanks for the explanations above. Has there been any update on this issue? I have also run into it when running TaPas on the WTQ dataset, and it took me a lot of effort to get to the bottom of it and realize that it is caused by the `column_rank` IDs from oversized tables.

The painful part is that there is currently no guard or warning against feeding oversized tables into the tokenizer, and the issue only surfaces as a "CUDA error: device-side assert triggered" message when the TaPas forward pass is run.

I think there are several potential ways to solve this or make it less painful:

1. Compute the column rank after the table truncation (as already suggested by another comment above). This makes a ton of sense, because the model only sees the table after truncation in the tokenizer anyway, so there is no point in maintaining a non-contiguous column rank for large tables (with some ranks removed due to truncation). I understand that the original TF implementation might not handle this, but can this be added as a behavior in the Hugging Face implementation?
2. Add an option to re-map all the large column ranks to the max rank value. This could be implemented in this tokenizer function: https://github.com/huggingface/transformers/blob/7fcee113c163a95d1b125ef35dc49a0a1aa13a50/src/transformers/models/tapas/tokenization_tapas.py#L1487 This is less ideal than option 1, but would make sure the model does not crash with an index-out-of-range error.
3. The easiest fix would be to add a warning/exception in the tokenizer that reminds users about this. Or let the tokenizer return a `None` value in the output, or a special boolean variable such as `table_oversized`. This does not solve anything, but would make this issue much easier to catch.

Looking forward to some updates on this issue.
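Option 2 from the comment above can be sketched as a small post-processing step on the tokenizer output: clamp the column-rank and inverse-column-rank dimensions of `token_type_ids` so no ID exceeds what the embedding tables can represent. The dimension indices (4 and 5) and the 256-entry vocab size follow the default `TapasConfig`; the helper name is hypothetical, not part of the `transformers` API.

```python
import torch

RANK_DIMS = (4, 5)       # column rank, inverse column rank
RANK_VOCAB_SIZE = 256    # default TapasConfig type_vocab_sizes entry

def clamp_rank_ids(token_type_ids: torch.Tensor) -> torch.Tensor:
    """Clamp rank IDs to the max representable value (vocab_size - 1)."""
    clamped = token_type_ids.clone()
    for dim in RANK_DIMS:
        clamped[:, :, dim] = clamped[:, :, dim].clamp(max=RANK_VOCAB_SIZE - 1)
    return clamped

# Example: the out-of-range values from the issue (298 and 302) get
# mapped to 255, so the embedding lookup no longer fails.
ids = torch.zeros(1, 2, 7, dtype=torch.long)
ids[0, 0, 4], ids[0, 0, 5] = 298, 302
safe = clamp_rank_ids(ids)
print(safe[0, 0, 4].item(), safe[0, 0, 5].item())  # 255 255
```

As the comment notes, this loses the distinction between very high ranks, but it trades a hard crash for a graceful degradation.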
So the author replied:

> So this is something that could be added in the future (together with the `prune_columns` option). I put it on my to-do list for now.
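For reference, option 1 (recomputing ranks after truncation) can be illustrated with plain pandas, independent of the tokenizer internals. This is a hypothetical sketch, not the `transformers` implementation: rows are dropped first, then each column is ranked densely so the rank IDs stay contiguous and bounded by the number of surviving rows.

```python
import pandas as pd

def rank_after_truncation(df: pd.DataFrame, max_rows: int) -> pd.DataFrame:
    """Truncate to `max_rows` rows, then rank each column's values densely."""
    truncated = df.head(max_rows)
    # rank(method="dense") yields contiguous ranks 1..k with no gaps, so
    # the maximum rank ID is bounded by the number of kept rows.
    return truncated.rank(method="dense").astype(int)

df = pd.DataFrame({"population": [500, 1200, 900, 700, 300]})
ranks = rank_after_truncation(df, max_rows=3)
print(ranks["population"].tolist())  # [1, 3, 2]
```

Ranking before truncation, by contrast, can leave gaps and maxima far above the row budget, which is exactly how the out-of-range IDs in this issue arise.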