BUG: hash_pandas_object ignores column name values
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import hashlib
import pandas as pd
df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})
df_renamed = df.rename(columns={'s': 'ss'})
hash_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
hash_df_renamed = hashlib.sha256(pd.util.hash_pandas_object(df_renamed, index=True).values).hexdigest()
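# Fails with AssertionError: both digests are identical because column names do not contribute to the hash
assert hash_df != hash_df_renamed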
Issue Description
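When hashing a DataFrame with pd.util.hash_pandas_object, the column names are ignored, so two DataFrames that differ only in their column names produce identical hashes.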
Expected Behavior
I expected DataFrames with different column names to hash differently. If ignoring column names is desired, I’d expect a default-off flag for ignoring column names in the hash calculation.
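In the meantime, one possible workaround is to feed the column labels into the digest alongside the per-row hashes. The following is only a minimal sketch: hash_frame_with_columns is a hypothetical helper name, not part of pandas, and it covers only column labels, values, and (optionally) the index, not dtypes, flags, or other metadata.

```python
import hashlib

import pandas as pd


def hash_frame_with_columns(df: pd.DataFrame, index: bool = True) -> str:
    """Hypothetical helper: SHA-256 over the column labels and the per-row hashes."""
    digest = hashlib.sha256()
    # hash_pandas_object also accepts an Index, so the column labels
    # can be hashed with the same machinery as the data.
    column_hashes = pd.util.hash_pandas_object(df.columns, index=False)
    digest.update(column_hashes.values.tobytes())
    # Per-row hashes of the values (and of the index when index=True),
    # as in the reproducible example.
    row_hashes = pd.util.hash_pandas_object(df, index=index)
    digest.update(row_hashes.values.tobytes())
    return digest.hexdigest()


df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})
df_renamed = df.rename(columns={'s': 'ss'})

# With the column labels mixed in, the renamed frame no longer collides.
assert hash_frame_with_columns(df) != hash_frame_with_columns(df_renamed)
```

Hashing the labels first and the rows second keeps the helper a thin wrapper around the existing function, so the row hashes themselves are unchanged.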
Installed Versions
INSTALLED VERSIONS
commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.144+
Version : #1 SMP Tue Dec 7 09:58:10 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.21.5
pytz : 2018.9
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.28
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader: 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.32
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2
Issue Analytics
- Created a year ago
- Comments: 10 (6 by maintainers)
Top GitHub Comments
It does seem to be difficult to get an accurate hash value that encompasses all the components of a DataFrame (values, index, columns, flags, metadata, maybe others?). I’m +1 with supporting this.
@Gabriel-ROBIN: You can use the BytesIO object directly.