BUG: compute=False not working correctly in new ORC writer

When using the new ORC writer from https://github.com/dask/dask/pull/7756 with compute=False and then computing the result afterwards, only the first partition gets written:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import pathlib

df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])
ddf = dd.from_pandas(df, npartitions=4)

In [5]: ddf.to_orc("test_orc_dataset")
Out[5]: (None, None, None, None)

In [9]: list(pathlib.Path("test_orc_dataset/").glob("*"))
Out[9]: 
[PosixPath('test_orc_dataset/part.0.orc'),
 PosixPath('test_orc_dataset/part.3.orc'),
 PosixPath('test_orc_dataset/part.2.orc'),
 PosixPath('test_orc_dataset/part.1.orc')]

In [11]: dataset = ddf.to_orc("test_orc_dataset_delayed", compute=False)

In [12]: dataset.compute()

In [13]: list(pathlib.Path("test_orc_dataset_delayed/").glob("*"))
Out[13]: [PosixPath('test_orc_dataset_delayed/part.0.orc')]

With the Parquet writer, it’s working correctly.

cc @rjzamora
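
To make the discrepancy above easier to check in a script or test, here is a minimal sketch that lists which partition files are missing from an output directory. It only assumes the part.<i>.orc naming shown in the listings above; the helper name missing_partitions is hypothetical and is not part of dask.

```python
import pathlib
import re
import tempfile


def missing_partitions(directory, npartitions, suffix=".orc"):
    """Return partition indices with no part.<i><suffix> file in directory.

    Hypothetical helper, not a dask API. It relies only on the
    part.<i>.orc file naming seen in the report above.
    """
    pattern = re.compile(r"part\.(\d+)" + re.escape(suffix) + r"$")
    found = set()
    for path in pathlib.Path(directory).glob("part.*" + suffix):
        match = pattern.match(path.name)
        if match:
            found.add(int(match.group(1)))
    return [i for i in range(npartitions) if i not in found]


# Simulate the reported outcome: only part.0.orc exists for 4 partitions.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "part.0.orc").touch()
    print(missing_partitions(tmp, 4))  # [1, 2, 3]
```

Against the directories in the report, one would expect missing_partitions("test_orc_dataset", ddf.npartitions) to return an empty list and the call on "test_orc_dataset_delayed" to return [1, 2, 3].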

Top GitHub Comments

jorisvandenbossche commented, Aug 12, 2021

> Just tried to reproduce this and turns out I was on pyarrow 3 which failed at ddf.to_orc:

Yes, ORC write support was only added more recently, so maybe dask could check the pyarrow version and give a nicer error message.
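
As an illustration of the kind of check suggested here, a rough sketch follows. It is not the check dask actually ships. The minimum version (4, 0, 0) is an assumption based on pyarrow 3 failing in the comment above, and the names parse_version and check_orc_write_support are hypothetical.

```python
def parse_version(version):
    """Turn a string such as '4.0.1' or '5.0.0.dev0' into a tuple of ints.

    Only the leading digits of the first three dot-separated pieces are
    used; anything after them (for example 'dev0' or 'rc1') is ignored.
    """
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


# Assumption: ORC writing needs pyarrow 4.0.0 or newer.
MIN_PYARROW_FOR_ORC_WRITE = (4, 0, 0)


def check_orc_write_support(pyarrow_version):
    """Raise a descriptive error if the given pyarrow version is too old."""
    if parse_version(pyarrow_version) < MIN_PYARROW_FOR_ORC_WRITE:
        required = ".".join(str(n) for n in MIN_PYARROW_FOR_ORC_WRITE)
        raise ImportError(
            f"Writing ORC requires pyarrow>={required}, "
            f"but version {pyarrow_version} is installed."
        )


# Typical call site (requires pyarrow to be installed):
#     import pyarrow
#     check_orc_write_support(pyarrow.__version__)
```

Running the check before any partition is written would surface the version problem as one clear message rather than a failure deep inside the write path.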

jorisvandenbossche commented, Aug 13, 2021

> Yes - Sorry for all the churn here. I have a rough version of what I want to do in #8004 -

No need to be sorry! I am just genuinely interested to see how it could be organized / if I want to adapt my implementation in https://github.com/geopandas/dask-geopandas/pull/91 as well. Will try to take a look at #8004 one of the coming days.
