ExcelDataSet not working after pandas 1.4.0 release (python 3.8)
Description
On a fresh project, after installing the extra dependency kedro[pandas.ExcelDataSet], the user receives a runtime error when trying to load an xlsx file dataset via the data catalog. The cause seems to be an incompatibility between the new pandas v1.4.0 and xlrd~=1.0, the latter being enforced by kedro 0.17.6.
Context
The error occurred while I was following the exact steps of the spaceflights tutorial. In order to load the shuttles dataset (stored as an xlsx file), I was instructed to install the extra dependency kedro[pandas.ExcelDataSet]. After installing the dependency and trying to load the dataset via the catalog, I received the following error: Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).
After some debugging, I discovered that pandas released version 1.4.0 two days ago, and it bumped the minimum version of the optional dependency xlrd to 2.0.1 (source, source2). As kedro enforces xlrd~=1.0 via the extra dependency kedro[pandas.ExcelDataSet] (source), pandas finds that the installed version of xlrd is 1.2.0 and raises the error before importing the package (source).
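The conflict described above can be sketched in a few lines. This is a simplified illustration, not the pandas or pip implementation: the helper names parse_version, satisfies_compatible_release and check_minimum are hypothetical, and only plain numeric versions such as 1.2.0 are handled.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.2.0' into (1, 2, 0). Only plain numeric versions are handled."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def satisfies_compatible_release(installed: str, spec: str) -> bool:
    """Check a '~=' specifier: '~=1.0' means at least 1.0 and still in the 1.x series."""
    wanted = parse_version(spec)
    have = parse_version(installed)
    # Every component of the specifier except the last one is pinned.
    prefix = wanted[:-1]
    return have >= wanted and have[: len(prefix)] == prefix


def check_minimum(name: str, installed: str, minimum: str) -> None:
    """Raise an error shaped like the one in this report if the version is too old."""
    if parse_version(installed) < parse_version(minimum):
        raise ImportError(
            f"Pandas requires version '{minimum}' or newer of '{name}' "
            f"(version '{installed}' currently installed)."
        )


if __name__ == "__main__":
    # kedro's extra allows xlrd 1.2.0, but rules out the 2.x series.
    print(satisfies_compatible_release("1.2.0", "1.0"))
    print(satisfies_compatible_release("2.0.1", "1.0"))
    try:
        check_minimum("xlrd", "1.2.0", "2.0.1")
    except ImportError as exc:
        print(exc)
```

Under these assumptions no xlrd version can satisfy both constraints: anything accepted by xlrd~=1.0 is below 2.0.1, and anything at or above 2.0.1 falls outside the 1.x series.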
I repeated the same steps using python 3.7, but because pandas 1.4.0 only supports python 3.8+ (source), the installed version of pandas was 1.3.5. When I tried to load the dataset, I received no errors (although I got a FutureWarning saying that xlrd will not support xlsx files in versions >= 2.0).
Steps to Reproduce
- Create a virtualenv with python 3.8 and activate it
- Run pip install kedro==0.17.6
- Run kedro new
- cd to project dir
- Run kedro install
- Replace the line kedro==0.17.6 with kedro[pandas.ExcelDataSet]==0.17.6 in src/requirements.in
- Run kedro build-reqs && kedro install
- Add file shuttles.xlsx (from spaceflights tutorial) to folder data/01_raw/
- Add the dataset information to catalog.yml:
    shuttles:
      type: pandas.ExcelDataSet
      filepath: data/01_raw/shuttles.xlsx
- Run kedro ipython
- Run python code: shuttles = catalog.load("shuttles")
Expected Result
The xlsx file should have been loaded into memory as a pandas DataFrame.
Actual Result
I received the following error:
DataSetError: Failed while loading data from data set ExcelDataSet(filepath=/.../data/01_raw/shuttles.xlsx, load_args={'engine': xlrd}, protocol=file, save_args={'index': False}, writer_args={'engine': xlsxwriter}).
Pandas requires version '2.0.1' or newer of 'xlrd' (version '1.2.0' currently installed).
Your Environment
- Kedro version used (pip show kedro or kedro -V): 0.17.6
- Python version used (python -V): 3.8.12
- Operating system and version: WSL: Ubuntu 20.04
- Pandas version: 1.4.0
- xlrd version: 1.2.0
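For anyone checking whether their environment matches the one listed above, the installed versions can be read with the standard library alone. This is a minimal sketch, assuming Python 3.8+ (where importlib.metadata is available). The helper names installed_version and report are hypothetical, and openpyxl is included only because it comes up as an alternative engine in this thread.

```python
from importlib import metadata
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(
    packages: Sequence[str] = ("kedro", "pandas", "xlrd", "openpyxl"),
) -> Dict[str, Optional[str]]:
    """Collect the versions of the packages relevant to this issue."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

Seeing pandas at 1.4.0 or newer together with xlrd below 2.0.1 would indicate the combination reported here.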
Hi @amaralbf we’re working on a patch fix for this since one of our upstream dependencies has started causing this.
Quick hack solution for you to get you working:
pip install openpyxl # which will be the default engine in 0.18.0
then in your catalog entry:
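The catalog entry itself did not survive in this copy of the thread. As a rough sketch only, and assuming the workaround is to override the engine through load_args (the error message above shows load_args={'engine': xlrd} as the default), the entry could look like the dictionary below, written in Python rather than YAML. The DEFAULT_LOAD_ARGS constant and the effective_load_args helper are hypothetical and only illustrate the assumed merge of defaults with user-supplied arguments; they are not kedro's code.

```python
from typing import Any, Dict

# Assumed default, taken from the load_args shown in the error message above.
DEFAULT_LOAD_ARGS: Dict[str, Any] = {"engine": "xlrd"}

# Hypothetical reconstruction of the catalog entry with the engine overridden.
# In catalog.yml this would correspond to adding
#   load_args:
#     engine: openpyxl
# under the shuttles entry.
catalog_config: Dict[str, Dict[str, Any]] = {
    "shuttles": {
        "type": "pandas.ExcelDataSet",
        "filepath": "data/01_raw/shuttles.xlsx",
        "load_args": {"engine": "openpyxl"},
    }
}


def effective_load_args(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Merge the assumed defaults with the entry's own load_args (entry wins)."""
    merged = dict(DEFAULT_LOAD_ARGS)
    merged.update(entry.get("load_args", {}))
    return merged


if __name__ == "__main__":
    print(effective_load_args(catalog_config["shuttles"]))
```

With the override in place, pandas would be asked to read the file with openpyxl, so the xlrd version check is not expected to run.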
Hi!
Sorry for the late reply. I struggled to find the exact cause, but I think the issue was that my package had “kedro[pandas.SQLDataSet]” in setup.cfg, while the requirements.txt of the non-kedro project into which I imported my package had “kedro[pandas.ExcelDataSet]”, causing the first to be ignored. At least that is what I have figured out so far. Now I simply list SQLAlchemy in setup.cfg in my python package and everything works 😃
Thanks for the help