Logic bug in arrow_writer?

https://github.com/huggingface/datasets/blob/88a902d6474fae8d793542d57a4f3b0d187f3c5b/src/datasets/arrow_writer.py#L475-L488

I hit an error and traced it to batch_examples being {}. I wonder whether the check should be changed as follows:

-        if batch_examples and len(next(iter(batch_examples.values()))) == 0:
+        if not batch_examples or len(next(iter(batch_examples.values()))) == 0:
             return
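To make the difference concrete, here is a minimal sketch that evaluates both conditions on three kinds of batch. The helper names `original_check` and `proposed_check` are hypothetical and exist only for this illustration; they wrap the two conditions from the diff and are not part of the library.

```python
def original_check(batch_examples):
    # Condition from the linked snippet: skip only when the batch has at
    # least one column and that first column is empty.
    return bool(batch_examples) and len(next(iter(batch_examples.values()))) == 0


def proposed_check(batch_examples):
    # Condition proposed in this issue: also skip when the dict has no keys.
    return not batch_examples or len(next(iter(batch_examples.values()))) == 0


# Compare the two conditions: True means "return early, write nothing".
for batch in ({}, {"a": []}, {"a": [1, 2]}):
    print(batch, original_check(batch), proposed_check(batch))
```

The two conditions differ only for {}: the original lets it through to the writer, while the proposed version would return early.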

@lhoestq

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 10 (10 by maintainers)

Top GitHub Comments

1 reaction
lhoestq commented, Jun 16, 2022

> But wouldn’t it be nice if the code can ignore it, like it ignores {"a": []}?

I think it would make things confusing because it doesn’t follow our definition of a batch: “the columns of a batch = the keys of the dict”. It would probably break certain behaviors as well. For example, if you remove all the columns of a dataset (using .remove_columns(...) or .map(..., remove_columns=...)), the writer has to write 0 columns, and currently the only way to tell the writer to do so with write_batch is to pass {}.

> The error says something like arrays and schema doesn’t have the same length. And it’s not very clear I passed a {}.

Yes, the message can be improved; it’s definitely not clear. Maybe we can add a line right before the call to pa.Table.from_arrays to make sure the keys of the batch match the field names of the schema.
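As a rough illustration of that kind of check, here is a minimal sketch. The function name `check_batch_columns` and the error wording are assumptions made for this example, not the check that was actually added to the library; `schema_names` stands in for the field names of the writer's schema (for a pyarrow schema, that would be `schema.names`).

```python
def check_batch_columns(batch_examples, schema_names):
    """Raise a readable error if the batch keys differ from the schema field names.

    Hypothetical helper: a sketch of a check that could run before building
    the table, not the library's actual code.
    """
    batch_columns = list(batch_examples)
    if sorted(batch_columns) != sorted(schema_names):
        raise ValueError(
            f"Batch columns {batch_columns} do not match schema fields {list(schema_names)}. "
            "Note that an empty dict {} is a batch with zero columns."
        )


# A batch with the expected column passes, even when that column is empty.
check_batch_columns({"text": []}, ["text"])

# An empty dict is reported explicitly instead of failing later on a length mismatch.
try:
    check_batch_columns({}, ["text"])
except ValueError as err:
    print(err)
```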

1 reaction
cccntu commented, Jun 16, 2022

Thanks, I added an if/print check and found that the chunking function passed to .map() does return an empty examples dict.
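One way to avoid that situation, sketched under assumptions: the chunking function below always returns its output column, so a batch that yields nothing comes back as {"chunks": []} (which the writer ignores, per the discussion above) rather than {} (a batch with zero columns). The function name, the "text" and "chunks" column names, and the fixed-size slicing are all hypothetical; the real chunking function from the issue is not shown in the thread.

```python
def chunk_examples(batch, chunk_size=4):
    """Split each string in batch["text"] into fixed-size chunks.

    Hypothetical batched function of the kind passed to
    .map(..., batched=True, remove_columns=...).
    """
    chunks = []
    for text in batch["text"]:
        # Empty strings contribute no chunks because the range is empty.
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    # Always return the output column, even when it is empty, so the result
    # is {"chunks": []} and never {}.
    return {"chunks": chunks}


print(chunk_examples({"text": ["abcdefgh", ""]}))
print(chunk_examples({"text": []}))
```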
