Logic bug in arrow_writer?

https://github.com/huggingface/datasets/blob/88a902d6474fae8d793542d57a4f3b0d187f3c5b/src/datasets/arrow_writer.py#L475-L488

I got an error, and I found it’s caused by batch_examples being {}. I wonder if the code should be changed as follows:

-        if batch_examples and len(next(iter(batch_examples.values()))) == 0:
+        if not batch_examples or len(next(iter(batch_examples.values()))) == 0:
             return

@lhoestq
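To make the difference between the two conditions concrete, here is a minimal pure-Python sketch. The function names skips_current and skips_proposed are hypothetical and exist only for this illustration; each wraps one of the two conditions from the diff above and reports whether the early return would be taken.

```python
def skips_current(batch_examples):
    # Condition from the "-" line of the diff: an empty dict is falsy,
    # so {} does NOT trigger the early return.
    return bool(batch_examples and len(next(iter(batch_examples.values()))) == 0)


def skips_proposed(batch_examples):
    # Condition from the "+" line of the diff: {} triggers the early return,
    # as does a batch whose first column is empty.
    return bool(not batch_examples or len(next(iter(batch_examples.values()))) == 0)


for batch in ({}, {"a": []}, {"a": [1, 2]}):
    print(batch, "current:", skips_current(batch), "proposed:", skips_proposed(batch))
```

The two conditions only disagree on {}: the current one lets it through to the writer, while the proposed one would skip it.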

Issue Analytics

Created a year ago
Comments: 10 (10 by maintainers)

Top GitHub Comments

lhoestq commented, Jun 16, 2022

> But wouldn’t it be nice if the code can ignore it, like it ignores {"a": []}?

I think it would make things confusing, because it doesn’t follow our definition of a batch: “the columns of a batch = the keys of the dict”. It would probably break certain behaviors as well. For example, if you remove all the columns of a dataset (using .remove_columns(...) or .map(..., remove_columns=...)), the writer has to write 0 columns, and currently the only way to tell the writer to do so using write_batch is to pass {}.

> The error says something like arrays and schema doesn’t have the same length. And it’s not very clear I passed a {}.

Yeah, the message can indeed be improved; it’s definitely not clear. Maybe we can add a line right before the call to pa.Table.from_arrays to make sure the keys of the batch match the field names of the schema.
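The kind of check suggested here could look roughly like the sketch below. This is not the library's code: check_batch_matches_schema is a hypothetical helper, and the schema is represented as a plain list of field names so the example runs without pyarrow (with a real pyarrow schema, the names would presumably come from schema.names).

```python
def check_batch_matches_schema(batch_examples, schema_names):
    """Hypothetical helper: fail early with a readable message when the
    keys of the batch differ from the field names of the schema."""
    batch_keys = list(batch_examples)
    missing = [name for name in schema_names if name not in batch_examples]
    unexpected = [key for key in batch_keys if key not in schema_names]
    if missing or unexpected:
        raise ValueError(
            f"Batch columns {batch_keys} do not match schema fields "
            f"{list(schema_names)} (missing: {missing}, unexpected: {unexpected})"
        )


# A matching batch passes silently, and so does a zero-column batch
# checked against a zero-column schema.
check_batch_matches_schema({"a": [1, 2]}, ["a"])
check_batch_matches_schema({}, [])

# An empty dict checked against a one-column schema is reported clearly.
try:
    check_batch_matches_schema({}, ["a"])
except ValueError as err:
    print(err)
```

With a check like this, passing {} where the schema expects columns would name the missing fields instead of reporting a length mismatch, while {} would remain valid for a dataset whose columns have all been removed.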

cccntu commented, Jun 16, 2022

Thanks. I added an if/print check and found that the chunking function passed to .map() does return an empty examples dict.
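One way to avoid returning {} from a batched function is to always return the output column, even when it ends up empty. The sketch below is an assumption about what such a chunking function might look like; chunk_texts, the "text" and "chunk" column names, and chunk_size are all hypothetical and not taken from the issue.

```python
def chunk_texts(batch, chunk_size=4):
    """Hypothetical batched function: split each text into fixed-size chunks."""
    chunks = []
    for text in batch["text"]:
        chunks.extend(text[i:i + chunk_size] for i in range(0, len(text), chunk_size))
    # Always return the output column, even when no chunks were produced,
    # so the result is {"chunk": []} rather than {}.
    return {"chunk": chunks}


print(chunk_texts({"text": ["abcdefghij"]}))
print(chunk_texts({"text": [""]}))
```

According to the discussion above, a batch such as {"chunk": []} is skipped by the writer, whereas {} is interpreted as a batch with zero columns. A function like this would typically be passed as .map(chunk_texts, batched=True, remove_columns=...).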
