[BUG] Issue reading hive partitioned dataset with NativeExecutionEngine


I have a pandas DataFrame with a column DAY holding the day of the month (e.g. values from 1 to 31 for December):

from datetime import datetime
import pandas as pd

df = pd.DataFrame({'IBES': ['AAPL', 'AAPL', 'IBM', 'IBM'],
                   'EST_MEAN': [12.2, 10.0, 13.1, 13.5],
                   'EST_MEDIAN': [12.2, 12.0, 13.1, 13.2],
                   'BDATE': [datetime(2022, 1, 6), datetime(2022, 1, 7),
                             datetime(2022, 1, 6), datetime(2022, 1, 7)],
                   'DAY': [6, 7, 6, 7]
                  })

I save this DataFrame as Parquet, hive-partitioned by DAY:

%%fsql

SELECT * FROM df
SAVE PREPARTITION BY DAY OVERWRITE PARQUET output_path

The result folder has a format similar to this:

! tree output_path
output_path
├── DAY=6
│   └── 02b4a05c12fa4791aca2931e47659ecc.parquet
└── DAY=7
    └── bd17a05a5bd948cc824e4730fd03b473.parquet
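To illustrate what this layout implies for a reader, here is a minimal sketch using only the Python standard library. The helper name parse_hive_partitions is hypothetical and is not part of Fugue, triad, pandas or pyarrow. It shows that the DAY value lives in the folder name as a plain string, so any engine loading the dataset has to rebuild the column and choose a type for it on its own.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path):
    """Return {column: raw string value} for each `key=value` folder in path.

    Hypothetical helper for illustration only; real engines have their own logic.
    """
    partitions = {}
    # Only the directories are inspected, not the file name itself.
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


example = "output_path/DAY=6/02b4a05c12fa4791aca2931e47659ecc.parquet"
partitions = parse_hive_partitions(example)
print(partitions)  # the value is the string '6', not the integer 6
```

Because the value comes back as text, two engines can legitimately disagree on the resulting column type, which may be relevant to the difference in behavior described in this issue.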

When I read the dataset using the Spark execution engine, there is no problem:

%%fsql spark

df_int = LOAD PARQUET output_path
SELECT * FROM df_int
PRINT df_int

But the same code fails with the native execution engine (plain %%fsql):

The above exception was the direct cause of the following exception:

FugueDataFrameInitError                   Traceback (most recent call last)
/tmp/ipykernel_7378/4230927618.py in <module>
----> 1 get_ipython().run_cell_magic('fsql', '', "\ndf_int = LOAD PARQUET output_path\nSELECT * FROM df_int\nPRINT df_int\n")

~/.conda/envs/fugue/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2404             with self.builtin_trap:
   2405                 args = (magic_arg_s, cell)
-> 2406                 result = fn(*args, **kwargs)
   2407             return result
   2408 

~/.conda/envs/fugue/lib/python3.8/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

~/.conda/envs/fugue/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

~/dev/fugue/fugue_notebook/env.py in fsql(self, line, cell, local_ns)
     88         except FugueSQLSyntaxError as ex:
     89             raise FugueSQLSyntaxError(str(ex)).with_traceback(None) from None
---> 90         dag.run(self.get_engine(line, {} if local_ns is None else local_ns))
     91         for k, v in dag.yields.items():
     92             if isinstance(v, YieldedDataFrame):

~/dev/fugue/fugue/workflow/workflow.py in run(self, *args, **kwargs)
   1516                 if ctb is None:  # pragma: no cover
   1517                     raise
-> 1518                 raise ex.with_traceback(ctb)
   1519             self._computed = True
   1520         return DataFrames(

~/dev/fugue/fugue/dataframe/pandas_dataframe.py in __init__(self, df, schema, metadata, pandas_df_wrapper)
     77             self._native = pdf
     78         except Exception as e:
---> 79             raise FugueDataFrameInitError from e
     80 
     81     @property

FugueDataFrameInitError:

I also observed that the load works fine when you specify the list of columns to read and that list does not include the partition column:

%%fsql

df_int = LOAD PARQUET output_path COLUMNS IBES,EST_MEDIAN,EST_MEAN,BDATE
SELECT * FROM df_int
PRINT df_int
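The workaround above drops DAY from the result. If the value is still needed, one possible approach is to recover it from the folder names afterwards. The sketch below is an assumption-laden illustration using only the standard library: partition_values_by_file is a hypothetical helper, and the demo builds an empty stand-in directory tree instead of real Parquet files.

```python
import os
import tempfile


def partition_values_by_file(root, column):
    """Map each .parquet file under root to the integer value of `column`
    taken from its parent folder name (hypothetical helper, not a Fugue API)."""
    prefix = column + "="
    mapping = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        folder = os.path.basename(dirpath)
        if not folder.startswith(prefix):
            continue
        # Assumes the partition values are integers, as DAY is in this issue.
        value = int(folder[len(prefix):])
        for name in filenames:
            if name.endswith(".parquet"):
                mapping[os.path.join(dirpath, name)] = value
    return mapping


# Demo on an empty stand-in layout that mimics output_path.
with tempfile.TemporaryDirectory() as root:
    for day in (6, 7):
        os.makedirs(os.path.join(root, f"DAY={day}"))
        open(os.path.join(root, f"DAY={day}", "part.parquet"), "w").close()
    found = partition_values_by_file(root, "DAY")
    days = sorted(found.values())

print(days)  # [6, 7]
```

With such a mapping, each file could be loaded separately and DAY added back as an ordinary integer column, sidestepping the partition-column inference entirely.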

Environment:

  • Backend: pandas
  • Backend version: 1.3.5
  • Python version: 3.8.12
  • OS: linux

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

goodwanghan commented, Jan 10, 2022 (1 reaction)

We need to add this in triad first, and then in Fugue.

goodwanghan commented, Apr 12, 2022 (0 reactions)

Sorry, let me reopen
