DISCUSS: What would an ORC reader/writer API look like?
cc @mrocklin for dask.dataframe visibility
I’m one of the developers of https://github.com/rapidsai/cudf and we’re working on adding GPU-accelerated file readers/writers to our library. It seems most of the standard formats are covered quite nicely in the pandas API, but ORC isn’t. Before we go off and define our own API, I wanted to open a discussion about what that API should look like so we can stay consistent with the pandas and pandas-like community.
At the top level, I imagine it would look almost identical to the Parquet API, something like the following:
def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an ORC object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : str
        File path.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    columns : list, default None
        If not None, only these columns will be read from the file.
    **kwargs
        Additional arguments passed to the engine.

    Returns
    -------
    DataFrame
    """
    ...
def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary ORC format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or root directory path. Will be used as the root
        directory path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        ORC library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no compression.
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file output.
        If ``False``, they will not be written to the file. If ``None``,
        the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset.
        Columns are partitioned in the order they are given.
    **kwargs
        Additional arguments passed to the ORC library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
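For reference, here is a minimal sketch of how the proposed API might look in use, assuming read_orc and to_orc land in pandas roughly as drafted above; the file name and column names are placeholders:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Write with the proposed DataFrame method; 'pyarrow' would be the default engine.
df.to_orc("example.orc", engine="pyarrow", index=False)

# Read back, projecting a subset of columns.
out = pd.read_orc("example.orc", columns=["a"])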
From a user perspective I think it might be better to have explicit `read_parquet` and `read_orc` functions. Though of course on the implementation side hopefully there is some reuse as Arrow’s ORC reader becomes more consistent with its Parquet reader.
+1 to everything that @xhochy said
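As a point of reference, reading ORC through Arrow already works today; a quick sketch, with the file name as a placeholder:

import pyarrow.orc as orc

# Open an existing ORC file, project columns at the Arrow level,
# then hand the result to pandas.
orc_file = orc.ORCFile("example.orc")
table = orc_file.read(columns=["a"])
df = table.to_pandas()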
@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and bloom filters. Many people are using it with Presto due to the huge amount of work they invested in streamlining ORC. Also, in our tests ORC massively outperformed Parquet for our use case (20%+ speed increases).
We are absolutely committed to ORC as a format, simply because of the amount of data we manage on a tiny budget and because ORC has the features required to let us do this within that budget.
With support from Spark, cuDF, and, recently, BigQuery, I think this should be bumped up the roadmap!
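To illustrate that last point, ORC already has first-class entry points in those ecosystems; a brief sketch, with paths as placeholders:

# cuDF already exposes a GPU-accelerated ORC reader.
import cudf
gdf = cudf.read_orc("example.orc")

# Spark has built-in ORC read/write support.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.read.orc("example.orc")
sdf.write.mode("overwrite").orc("example_out_orc")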