Pandas.AppendableExcelDataSet doesn't give expected data
Description
pandas.AppendableExcelDataSet doesn't ensure the same data on reload when a sheet with the same name already exists.
Context
Assume a pipeline uses an AppendableExcelDataSet dataset to store preprocessed data in a single sheet, and then merges these in a different node downstream. If a sheet with the same name already existed before the current run (e.g. preprocessed_companies), the new data is saved to the sheet preprocessed_companies1 instead. If the new data is different, which is the case when running regularly, the pipeline keeps using the old dataset, i.e. the first one that was written.
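To make the failure mode concrete, here is a toy model of it. This is only a sketch: a plain dict stands in for the workbook, and add_sheet is a hypothetical helper that mimics the "append a number on name clash" behaviour described above. It is not Kedro's or openpyxl's actual code.

```python
def add_sheet(workbook, name, rows):
    """Store rows under name; on a name clash, use a numbered variant instead."""
    title = name
    suffix = 0
    while title in workbook:
        suffix += 1
        title = f"{name}{suffix}"
    workbook[title] = rows
    return title


# The sheet already exists from a previous run.
workbook = {"preprocessed_companies": ["old row"]}

# The current run saves new data under the same requested name.
written_to = add_sheet(workbook, "preprocessed_companies", ["new row"])

# Loading by the requested name still returns the data from the previous run.
loaded = workbook["preprocessed_companies"]
print(written_to, loaded)
```

The save lands in a renamed sheet while the load keeps reading the original one, so downstream nodes never see the new data.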
Steps to Reproduce
Sample script to reproduce the issue:
from kedro.extras.datasets.pandas import AppendableExcelDataSet
from kedro.extras.datasets.pandas import ExcelDataSet
import pandas as pd
data_1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
regular_ds = ExcelDataSet(
    filepath="/tmp/test.xlsx", save_args={"sheet_name": "my_sheet"}
)
regular_ds.save(data_1)
data_2 = pd.DataFrame({"col1": [7, 8], "col2": [5, 7]})
appendable_ds = AppendableExcelDataSet(
    filepath="/tmp/test.xlsx",
    save_args={"sheet_name": "my_sheet"},
    load_args={"sheet_name": "my_sheet"},
)
appendable_ds.save(data_2)
reloaded = appendable_ds.load()
assert data_2.equals(reloaded)
Expected Result
The assert is expected to pass, or the save should raise a warning/error stating that the sheet already exists.
Actual Result
line 18, in <module>
assert data_2.equals(reloaded)
AssertionError
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (pip show kedro or kedro -V): kedro, version 0.17.5
- Python version used (python -V): Python 3.7.9
- Operating system and version: MacOS BigSur
Issue Analytics
- Created 2 years ago
- Comments:5 (5 by maintainers)
Thanks a lot, lorenabalan. Upgrading the pandas version did the trick.
Hi @avan-sh, I've tried reproducing using the script you provided, but I get an error at save time saying that the sheet already exists. What version of pandas are you running? I see that in 1.3.0 they introduced if_sheet_exists in ExcelWriter, which handles the problem you're encountering.
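For anyone landing here later, a minimal sketch of the if_sheet_exists option mentioned above, using plain pandas outside of Kedro. It assumes pandas 1.3.0 or newer with openpyxl installed; the helper name save_replacing_sheet and the temporary file are made up for illustration.

```python
import os
import tempfile

import pandas as pd


def save_replacing_sheet(df, filepath, sheet_name):
    """Append to an existing workbook, replacing sheet_name if it is already there."""
    with pd.ExcelWriter(
        filepath, engine="openpyxl", mode="a", if_sheet_exists="replace"
    ) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)


data_1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
data_2 = pd.DataFrame({"col1": [7, 8], "col2": [5, 7]})

with tempfile.TemporaryDirectory() as tmp_dir:
    filepath = os.path.join(tmp_dir, "test.xlsx")

    # First run: create the workbook with the sheet.
    data_1.to_excel(filepath, sheet_name="my_sheet", index=False)

    # Later run: write different data to the same sheet name.
    save_replacing_sheet(data_2, filepath, "my_sheet")

    reloaded = pd.read_excel(filepath, sheet_name="my_sheet")

print(data_2.equals(reloaded))
```

With if_sheet_exists="error", which is the default in append mode, the second write raises instead of silently writing to a renamed sheet. That matches the error at save time described above.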