BUG: compute=False not working correctly in new ORC writer
See original GitHub issue.
Using the new ORC writer from https://github.com/dask/dask/pull/7756 with compute=False and then computing the result afterwards, only the first partition gets written:
import numpy as np
import pandas as pd
import dask.dataframe as dd
import pathlib

df = pd.DataFrame(np.random.randn(100, 4), columns=['a', 'b', 'c', 'd'])
ddf = dd.from_pandas(df, npartitions=4)
In [5]: ddf.to_orc("test_orc_dataset")
Out[5]: (None, None, None, None)
In [9]: list(pathlib.Path("test_orc_dataset/").glob("*"))
Out[9]:
[PosixPath('test_orc_dataset/part.0.orc'),
PosixPath('test_orc_dataset/part.3.orc'),
PosixPath('test_orc_dataset/part.2.orc'),
PosixPath('test_orc_dataset/part.1.orc')]
In [11]: dataset = ddf.to_orc("test_orc_dataset_delayed", compute=False)
In [12]: dataset.compute()
In [13]: list(pathlib.Path("test_orc_dataset_delayed/").glob("*"))
Out[13]: [PosixPath('test_orc_dataset_delayed/part.0.orc')]
The Parquet writer handles compute=False correctly.
cc @rjzamora
Issue Analytics
- Created: 2 years ago
- Comments: 7 (7 by maintainers)

Yes, ORC write support was only added more recently, so maybe dask could check for the version and give a nicer error message.
No need to be sorry! I am genuinely interested to see how it could be organized, and whether I want to adapt my implementation in https://github.com/geopandas/dask-geopandas/pull/91 as well. I will try to take a look at #8004 in the coming days.