BUG: hash_pandas_object ignores column name values

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import hashlib

import pandas as pd

df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})

# Same data, but column 's' is renamed to 'ss'.
df_renamed = df.rename(columns={'s': 'ss'})

hash_df = hashlib.sha256(pd.util.hash_pandas_object(df, index=True).values).hexdigest()
hash_df_renamed = hashlib.sha256(pd.util.hash_pandas_object(df_renamed, index=True).values).hexdigest()

# Fails with AssertionError: both digests are identical despite the rename.
assert hash_df != hash_df_renamed

Issue Description

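When hashing a DataFrame, pd.util.hash_pandas_object ignores the column names, so two DataFrames that differ only in their column labels produce identical hashes.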

Expected Behavior

I expected DataFrames with different column names to hash differently. If ignoring column names is the desired behavior, I’d expect it to be controlled by a flag that is off by default.
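Until such a flag exists, one possible workaround is to fold the column labels into the digest by hand. The sketch below is not from the issue thread: `hash_frame_with_columns` is a hypothetical helper name, and the way the two hash arrays are concatenated is an assumption made for illustration, not a pandas API.

```python
import hashlib

import pandas as pd


def hash_frame_with_columns(df: pd.DataFrame) -> str:
    """Return a SHA-256 hex digest covering row hashes and column labels.

    Hypothetical helper for illustration; not part of pandas.
    """
    digest = hashlib.sha256()
    # Per-row uint64 hashes of the values and the index.
    # Column labels are NOT part of these hashes.
    row_hashes = pd.util.hash_pandas_object(df, index=True)
    digest.update(row_hashes.to_numpy().tobytes())
    # Hash the column labels separately and feed them into the same digest.
    column_hashes = pd.util.hash_pandas_object(df.columns)
    digest.update(column_hashes.to_numpy().tobytes())
    return digest.hexdigest()


df = pd.DataFrame({'s': ['a', 'b', 'c', 'd'], 'd': ['b', 'c', 'd', 'e'], 'i': [0, 2, 4, 6]})
df_renamed = df.rename(columns={'s': 'ss'})

# With the column labels included, the rename changes the digest.
assert hash_frame_with_columns(df) != hash_frame_with_columns(df_renamed)
```

This covers only values, index, and column labels. Other parts of a DataFrame, such as `attrs` or flags, still do not affect the result.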

Installed Versions

INSTALLED VERSIONS

commit : 66e3805b8cabe977f40c05259cc3fcf7ead5687d
python : 3.7.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.144+
Version : #1 SMP Tue Dec 7 09:58:10 PST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.3.5
numpy : 1.21.5
pytz : 2018.9
dateutil : 2.8.2
pip : 21.1.3
setuptools : 57.4.0
Cython : 0.29.28
pytest : 3.6.4
hypothesis : None
sphinx : 1.8.6
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.2.6
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.7.6.1 (dt dec pq3 ext lo64)
jinja2 : 2.11.3
IPython : 5.5.0
pandas_datareader : 0.9.0
bs4 : 4.6.3
bottleneck : 1.3.4
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.2.2
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : 0.13.3
pyarrow : 6.0.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.4.32
tables : 3.7.0
tabulate : 0.8.9
xarray : 0.18.2
xlrd : 1.1.0
xlwt : 1.3.0
numba : 0.51.2

Issue Analytics

  • State: open
  • Created a year ago
  • Comments: 10 (6 by maintainers)

Top GitHub Comments

rhshadrach commented, Apr 11, 2022

It does seem to be difficult to get an accurate hash value that encompasses all the components of a DataFrame (values, index, columns, flags, metadata, maybe others?). I’m +1 with supporting this.
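To make the point above concrete, the following sketch (not from the thread) compares the row hashes of a small frame against variants that differ only in column labels, `attrs` metadata, or flags. The frame contents and variable names are illustrative assumptions. Each comparison should report that the row hashes are identical, because `hash_pandas_object` only looks at the values and, optionally, the index.

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Variants that differ from `base` only in columns, metadata, or flags.
renamed = base.rename(columns={"a": "z"})
with_attrs = base.copy()
with_attrs.attrs["source"] = "example"  # arbitrary illustrative metadata
no_duplicates = base.set_flags(allows_duplicate_labels=False)

base_hashes = pd.util.hash_pandas_object(base, index=True).to_numpy()

results = {}
for name, variant in [
    ("columns", renamed),
    ("attrs", with_attrs),
    ("flags", no_duplicates),
]:
    variant_hashes = pd.util.hash_pandas_object(variant, index=True).to_numpy()
    # True means this component had no influence on the row hashes.
    results[name] = bool((base_hashes == variant_hashes).all())
    print(f"{name} changed, row hashes identical: {results[name]}")
```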

rhshadrach commented, Apr 11, 2022

@Gabriel-ROBIN: You can use the BytesIO object directly.

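from io import BytesIO

import pandas as pd

df = pd.DataFrame({"a": [1, 1, 1], "b": ["x", "y", "z"], "c": [1, 2, 3]})
buffer = BytesIO()
df.to_parquet(buffer)  # serialize to Parquet in memory
# Write the in-memory bytes to disk and read them back to check the round trip.
with open("temp.parquet", mode="wb") as f:
    f.write(buffer.getvalue())
print(pd.read_parquet("temp.parquet"))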