DISCUSS: What would an ORC reader/writer API look like?

cc @mrocklin for dask.dataframe visibility

I’m one of the developers of https://github.com/rapidsai/cudf and we’re working on adding GPU-accelerated file readers/writers to our library. Most of the standard formats are covered quite nicely in the pandas API, but ORC isn’t. Before we go off and define our own API, I wanted to open a discussion about what that API should look like, so we stay consistent with the pandas and pandas-like community.

At the top level, I imagine it would look almost identical to the Parquet API, something like the following:

def read_orc(path, engine='auto', columns=None, **kwargs):
    """
    Load an orc object from the file path, returning a DataFrame.

    Parameters
    ----------
    path : string
        File path
    columns : list, default=None
        If not None, only these columns will be read from the file.
    engine : {'auto', 'pyarrow'}, default 'auto'
        Orc library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    kwargs are passed to the engine

    Returns
    -------
    DataFrame
    """
    ...


def to_orc(self, fname, engine='auto', compression='snappy', index=None,
           partition_cols=None, **kwargs):
    """
    Write a DataFrame to the binary orc format.

    This function writes the dataframe as an `ORC file
    <https://orc.apache.org/>`_. You can choose different ORC
    backends, and have the option of compression. See
    :ref:`the user guide <io.orc>` for more details.

    Parameters
    ----------
    fname : str
        File path or Root Directory path. Will be used as Root Directory
        path while writing a partitioned dataset.
    engine : {'auto', 'pyarrow'}, default 'auto'
        Orc library to use. If 'auto', then the option
        ``io.orc.engine`` is used. The default ``io.orc.engine``
        behavior is to use 'pyarrow'.
    compression : {'snappy', 'gzip', 'brotli', None}, default 'snappy'
        Name of the compression to use. Use ``None`` for no compression.
    index : bool, default None
        If ``True``, include the dataframe's index(es) in the file output.
        If ``False``, they will not be written to the file. If ``None``,
        the behavior depends on the chosen engine.
    partition_cols : list, optional, default None
        Column names by which to partition the dataset
        Columns are partitioned in the order they are given
    **kwargs
        Additional arguments passed to the orc library. See
        :ref:`pandas io <io.orc>` for more details.
    """
    ...
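
For concreteness, here is a minimal sketch of how a pyarrow-backed engine could satisfy the proposed `read_orc` signature. It assumes a pyarrow build that ships the ORC bindings (`pyarrow.orc`); the helper name `_read_orc_pyarrow` and the usage line are purely illustrative, not an existing pandas API.

def _read_orc_pyarrow(path, columns=None, **kwargs):
    """Illustrative pyarrow engine for the proposed read_orc."""
    from pyarrow import orc

    orc_file = orc.ORCFile(path)
    # ORCFile.read takes an optional column subset and returns a pyarrow.Table;
    # any remaining keyword arguments are forwarded to the engine.
    table = orc_file.read(columns=columns, **kwargs)
    return table.to_pandas()

# Hypothetical usage once wired up as a top-level pd.read_orc:
# df = pd.read_orc('example.orc', columns=['a', 'b'])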

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 13 (9 by maintainers)

Top GitHub Comments

mrocklin commented, Feb 8, 2019 (4 reactions)

From a user perspective I think that it might be better to have explicit `read_parquet` and `read_orc` functions. Though of course on the implementation side hopefully there is some reuse as Arrow’s ORC reader becomes more consistent with its Parquet reader.

+1 to everything that @xhochy said
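
A rough sketch of that split could look like the following: explicit top-level readers that delegate to a shared Arrow-backed helper. The top-level and helper names here are illustrative only, and the ORC path assumes a pyarrow build that includes the ORC extension.

def _arrow_table_to_frame(table):
    # Shared post-processing once either format has produced a pyarrow.Table
    return table.to_pandas()

def read_parquet(path, columns=None, **kwargs):
    import pyarrow.parquet as pq
    return _arrow_table_to_frame(pq.read_table(path, columns=columns, **kwargs))

def read_orc(path, columns=None, **kwargs):
    from pyarrow import orc
    return _arrow_table_to_frame(orc.ORCFile(path).read(columns=columns, **kwargs))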

voycey commented, Dec 10, 2019 (2 reactions)

@mrocklin ORC has different use cases than Parquet, especially with its powerful predicate pushdown, block-level indexes, and bloom filters. Many people are using it with Presto because of the huge amount of work that went into streamlining ORC there. Also, in our tests ORC massively outperformed Parquet for our use case (20%+ speed increases).

We are absolutely committed to ORC as a format simply due to the amount of data we manage on a tiny budget and ORC having the features required to allow us to do this within that budget.

With support from Spark, cuDF, and BigQuery recently added, I think this should be bumped up the roadmap!
