[BUG] Issue reading hive partitioned dataset with NativeExecutionEngine
I have a pandas dataframe with a column DAY representing the day of the month (e.g. values from 1 to 31 for December):
from datetime import datetime
import pandas as pd
df = pd.DataFrame({'IBES': ['AAPL', 'AAPL', 'IBM', 'IBM'],
                   'EST_MEAN': [12.2, 10.0, 13.1, 13.5],
                   'EST_MEDIAN': [12.2, 12.0, 13.1, 13.2],
                   'BDATE': [datetime(2022, 1, 6), datetime(2022, 1, 7),
                             datetime(2022, 1, 6), datetime(2022, 1, 7)],
                   'DAY': [6, 7, 6, 7]
                   })
I save this dataframe as a hive-partitioned dataset, partitioned by DAY:
%%fsql
SELECT * FROM df
SAVE PREPARTITION BY DAY OVERWRITE PARQUET output_path
The result folder has a format similar to this:
! tree output_path
output_path
├── DAY=6
│   └── 02b4a05c12fa4791aca2931e47659ecc.parquet
└── DAY=7
    └── bd17a05a5bd948cc824e4730fd03b473.parquet
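In this layout the DAY values live in the directory names. The path only supplies them as text, so a reader has to recover them from there. A minimal sketch of that recovery, using only the standard library (the helper name `parse_hive_partitions` is hypothetical and not part of Fugue, triad, or pandas):

```python
from pathlib import PurePosixPath


def parse_hive_partitions(path: str) -> dict:
    """Return {column: value} for every key=value directory in the path."""
    parts = {}
    # Only directories carry partition info, so skip the final file name.
    for segment in PurePosixPath(path).parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Values come back as strings; a reader must decide on a type.
            parts[key] = value
    return parts


example = "output_path/DAY=6/02b4a05c12fa4791aca2931e47659ecc.parquet"
print(parse_hive_partitions(example))  # {'DAY': '6'}
```

Because the path holds the value only as text, each engine chooses how to type the partition column on load. That is one place where engines could behave differently, though this report does not establish it as the cause of the failure.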
When I read the dataset using the Spark execution engine, there is no problem:
%%fsql spark
df_int = LOAD PARQUET output_path
SELECT * FROM df_int
PRINT df_int
But the same code fails when using the native execution engine:
The above exception was the direct cause of the following exception:
FugueDataFrameInitError Traceback (most recent call last)
/tmp/ipykernel_7378/4230927618.py in <module>
----> 1 get_ipython().run_cell_magic('fsql', '', "\ndf_int = LOAD PARQUET output_path\nSELECT * FROM df_int\nPRINT df_int\n")
~/.conda/envs/fugue/lib/python3.8/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
2404 with self.builtin_trap:
2405 args = (magic_arg_s, cell)
-> 2406 result = fn(*args, **kwargs)
2407 return result
2408
~/.conda/envs/fugue/lib/python3.8/site-packages/decorator.py in fun(*args, **kw)
230 if not kwsyntax:
231 args, kw = fix(args, kw, sig)
--> 232 return caller(func, *(extras + args), **kw)
233 fun.__name__ = func.__name__
234 fun.__doc__ = func.__doc__
~/.conda/envs/fugue/lib/python3.8/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
185 # but it's overkill for just that one bit of state.
186 def magic_deco(arg):
--> 187 call = lambda f, *a, **k: f(*a, **k)
188
189 if callable(arg):
~/dev/fugue/fugue_notebook/env.py in fsql(self, line, cell, local_ns)
88 except FugueSQLSyntaxError as ex:
89 raise FugueSQLSyntaxError(str(ex)).with_traceback(None) from None
---> 90 dag.run(self.get_engine(line, {} if local_ns is None else local_ns))
91 for k, v in dag.yields.items():
92 if isinstance(v, YieldedDataFrame):
~/dev/fugue/fugue/workflow/workflow.py in run(self, *args, **kwargs)
1516 if ctb is None: # pragma: no cover
1517 raise
-> 1518 raise ex.with_traceback(ctb)
1519 self._computed = True
1520 return DataFrames(
~/dev/fugue/fugue/dataframe/pandas_dataframe.py in __init__(self, df, schema, metadata, pandas_df_wrapper)
77 self._native = pdf
78 except Exception as e:
---> 79 raise FugueDataFrameInitError from e
80
81 @property
FugueDataFrameInitError:
I also observed that when you specify the list of columns you want to read, and that list does not include the partition column, it works fine:
%%fsql
df_int = LOAD PARQUET output_path COLUMNS IBES,EST_MEDIAN,EST_MEAN,BDATE
SELECT * FROM df_int
PRINT df_int
Environment:
- Backend: pandas
- Backend version: 1.3.5
- Python version: 3.8.12
- OS: linux
Issue Analytics
- State:
- Created 2 years ago
- Comments:9 (4 by maintainers)
Top GitHub Comments
We need to add this from triad, and then on Fugue.
Sorry, let me reopen