Pandas.AppendableExcelDataSet doesn't give expected data
Description
pandas.AppendableExcelDataSet doesn’t return the data that was just saved when a sheet with the same name already exists.
Context
Assume a pipeline uses AppendableExcelDataSet to store each preprocessed dataset in its own sheet, and then merges these sheets in a downstream node. If a sheet with the same name already exists before the current run, e.g. preprocessed_companies, the new data is saved to a new sheet, preprocessed_companies1. If the new data differs from the old, which is the case when the pipeline runs regularly, the pipeline keeps loading the old sheet, i.e. the first dataset.
Steps to Reproduce
Sample script to reproduce the issue:
from kedro.extras.datasets.pandas import AppendableExcelDataSet
from kedro.extras.datasets.pandas import ExcelDataSet
import pandas as pd
data_1 = pd.DataFrame({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
regular_ds = ExcelDataSet(
    filepath="/tmp/test.xlsx", save_args={"sheet_name": "my_sheet"}
)
regular_ds.save(data_1)
data_2 = pd.DataFrame({"col1": [7, 8], "col2": [5, 7]})
appendable_ds = AppendableExcelDataSet(
    filepath="/tmp/test.xlsx",
    save_args={"sheet_name": "my_sheet"},
    load_args={"sheet_name": "my_sheet"},
)
appendable_ds.save(data_2)
reloaded = appendable_ds.load()
assert data_2.equals(reloaded)
Expected Result
The assert is expected to pass, or the save should raise a warning/error stating that the sheet already exists.
Actual Result
line 18, in <module>
assert data_2.equals(reloaded)
AssertionError
Your Environment
Include as many relevant details about the environment in which you experienced the bug:
- Kedro version used (pip show kedro or kedro -V): kedro, version 0.17.5
- Python version used (python -V): Python 3.7.9
- Operating system and version: MacOS BigSur
Comments: 5 (5 by maintainers), created 2 years ago
Thanks a lot lorenabalan, upgrading the pandas version did the trick.
Hi @avan-sh, I’ve tried reproducing using the script you provided, but I get an error at save time, that the sheet already exists. What version of pandas are you running? I see that in 1.3.0 they introduced if_sheet_exists in ExcelWriter, which handles the problem you’re encountering.
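For anyone landing here, the following is a minimal sketch of the if_sheet_exists behaviour mentioned above, using plain pandas rather than Kedro. It assumes pandas >= 1.3.0 with openpyxl installed. The helper names save_sheet and load_sheet are hypothetical and are not part of pandas or Kedro.

```python
import os

import pandas as pd


def save_sheet(df, filepath, sheet_name, if_sheet_exists="replace"):
    """Hypothetical helper: write df to sheet_name, appending if the workbook exists."""
    if os.path.exists(filepath):
        # Append mode needs the openpyxl engine; if_sheet_exists needs pandas >= 1.3.0.
        writer_kwargs = {"mode": "a", "if_sheet_exists": if_sheet_exists}
    else:
        # if_sheet_exists is only accepted in append mode, so omit it for a new file.
        writer_kwargs = {"mode": "w"}
    with pd.ExcelWriter(filepath, engine="openpyxl", **writer_kwargs) as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)


def load_sheet(filepath, sheet_name):
    """Hypothetical helper: read one sheet back as a DataFrame."""
    return pd.read_excel(filepath, sheet_name=sheet_name, engine="openpyxl")
```

With "replace", saving a second frame to my_sheet and loading my_sheet should return the second frame, which is what the assert in the reproduction script expects. With "error", the second save should raise a ValueError instead, and with "new" the data goes to a renamed sheet, similar to the behaviour reported in this issue.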