BUG: compute=False not working correctly in new ORC writer

When the new ORC writer from https://github.com/dask/dask/pull/7756 is used with compute=False and the result is computed afterwards, only the first partition gets written:

import numpy as np
import pandas as pd
import dask.dataframe as dd
import pathlib

df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])
ddf = dd.from_pandas(df, npartitions=4)

In [5]: ddf.to_orc("test_orc_dataset")
Out[5]: (None, None, None, None)

In [9]: list(pathlib.Path("test_orc_dataset/").glob("*"))
Out[9]: 
[PosixPath('test_orc_dataset/part.0.orc'),
 PosixPath('test_orc_dataset/part.3.orc'),
 PosixPath('test_orc_dataset/part.2.orc'),
 PosixPath('test_orc_dataset/part.1.orc')]

In [11]: dataset = ddf.to_orc("test_orc_dataset_delayed", compute=False)

In [12]: dataset.compute()

In [13]: list(pathlib.Path("test_orc_dataset_delayed/").glob("*"))
Out[13]: [PosixPath('test_orc_dataset_delayed/part.0.orc')]

With the Parquet writer, it’s working correctly.
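
One way to confirm the symptom after computing the delayed result is to compare the files on disk against the expected partition count. The sketch below is a stdlib-only check, not part of dask: it assumes the `part.<i>.orc` file naming shown in the directory listings above, and `missing_orc_parts` is a hypothetical helper name introduced for illustration.

```python
import pathlib


def missing_orc_parts(directory, npartitions):
    """Return the partition indices that have no part.<i>.orc file in `directory`.

    Assumes the part.<i>.orc naming seen in the directory listings above.
    """
    root = pathlib.Path(directory)
    return [
        i
        for i in range(npartitions)
        if not (root / f"part.{i}.orc").exists()
    ]
```

For the session above, `missing_orc_parts("test_orc_dataset_delayed", ddf.npartitions)` should list partitions 1, 2 and 3 as missing while the bug is present, and return an empty list once every partition is written.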

cc @rjzamora

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
jorisvandenbossche commented, Aug 12, 2021

> Just tried to reproduce this and turns out I was on pyarrow 3 which failed at ddf.to_orc:

Yes, ORC write support was only added more recently, so maybe dask could check for the version and give a nicer error message.
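
A version check of the kind suggested here might look roughly like the sketch below. It is not dask's actual implementation: the helper names are hypothetical, and the minimum version of 4.0.0 is an assumption (the thread only establishes that pyarrow 3 failed). The function takes the installed version as a string, for example `pyarrow.__version__`, so the sketch itself does not need pyarrow to run.

```python
import re

# Assumed minimum pyarrow release with ORC write support; the thread only
# shows that pyarrow 3 failed, so treat this value as a placeholder.
MIN_PYARROW_FOR_ORC_WRITE = (4, 0, 0)


def parse_release(version):
    """Return the leading numeric release of a version string as a 3-tuple."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))[:3]
    # Pad short versions such as "4.0" so tuple comparison behaves as expected.
    return parts + (0,) * (3 - len(parts))


def check_orc_write_support(installed_version, minimum=MIN_PYARROW_FOR_ORC_WRITE):
    """Raise ImportError with a readable message if pyarrow is too old."""
    if parse_release(installed_version) < minimum:
        needed = ".".join(str(p) for p in minimum)
        raise ImportError(
            f"Writing ORC requires pyarrow >= {needed}, "
            f"but version {installed_version} is installed."
        )
```

Calling `check_orc_write_support(pyarrow.__version__)` before writing would surface the version problem up front instead of failing inside the writer.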

0 reactions
jorisvandenbossche commented, Aug 13, 2021

> Yes - Sorry for all the churn here. I have a rough version of what I want to do in #8004 -

No need to be sorry! I am just genuinely interested to see how it could be organized / if I want to adapt my implementation in https://github.com/geopandas/dask-geopandas/pull/91 as well. Will try to take a look at #8004 one of the coming days.
