Potential incompatibility with Pandas 1.4.0
Describe the bug
Pandas version 1.4.0 was released a few days ago and some tests started failing. I was able to reproduce the problem with a minimal example that fails with Pandas 1.4.0 and works with Pandas 1.3.5.
To Reproduce
import pandas as pd
import pandas_profiling
data = {"col1": [1, 2], "col2": [3, 4]}
dataframe = pd.DataFrame(data=data)
profile = pandas_profiling.ProfileReport(dataframe, minimal=False)
profile.to_html()
When running with Pandas 1.4.0, I get the following traceback:
Traceback (most recent call last):
File "/tmp/bug.py", line 8, in <module>
profile.to_html()
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 368, in to_html
return self.html
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 185, in html
self._html = self._render_html()
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 287, in _render_html
report = self.report
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 179, in report
self._report = get_report_structure(self.config, self.description_set)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 161, in description_set
self._description_set = describe_df(
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/describe.py", line 71, in describe
series_description = get_series_descriptions(
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 92, in pandas_get_series_descriptions
for i, (column, description) in enumerate(
File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 870, in next
raise value
File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 72, in multiprocess_1d
return column, describe_1d(config, series, summarizer, typeset)
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 50, in pandas_describe_1d
return summarizer.summarize(config, series, dtype=vtype)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summarizer.py", line 37, in summarize
_, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 62, in handle
return op(*args)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
return f(*res)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 17, in func2
res = g(*x)
File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
return func(*args, **kwargs)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 65, in inner
return fn(config, series, summary)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 82, in inner
return fn(config, series, summary)
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 205, in pandas_describe_categorical_1d
summary.update(length_summary_vc(value_counts))
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 162, in length_summary_vc
"median_length": weighted_median(
File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/utils_pandas.py", line 13, in weighted_median
w_median = (data[weights == np.max(weights)])[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
If I change minimal from False to True, the script passes.
Version information:
Failing environment
Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.4.0 | 3.1.0
Full pip list:
Package Version
--------------------- ---------
attrs 21.4.0
certifi 2021.10.8
charset-normalizer 2.0.10
cycler 0.11.0
fonttools 4.28.5
htmlmin 0.1.12
idna 3.3
ImageHash 4.2.1
Jinja2 3.0.3
joblib 1.0.1
kiwisolver 1.3.2
MarkupSafe 2.0.1
matplotlib 3.5.1
missingno 0.5.0
multimethod 1.6
networkx 2.6.3
numpy 1.22.1
packaging 21.3
pandas 1.4.0
pandas-profiling 3.1.0
phik 0.12.0
Pillow 9.0.0
pip 21.3.1
pydantic 1.9.0
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
requests 2.27.1
scipy 1.7.3
seaborn 0.11.2
setuptools 60.0.5
six 1.16.0
tangled-up-in-unicode 0.1.0
tqdm 4.62.3
typing_extensions 4.0.1
urllib3 1.26.8
visions 0.7.4
wheel 0.37.1
Working environment
Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.3.5 | 3.1.0
Full pip list:
Package Version
--------------------- ---------
attrs 21.4.0
certifi 2021.10.8
charset-normalizer 2.0.10
cycler 0.11.0
fonttools 4.28.5
htmlmin 0.1.12
idna 3.3
ImageHash 4.2.1
Jinja2 3.0.3
joblib 1.0.1
kiwisolver 1.3.2
MarkupSafe 2.0.1
matplotlib 3.5.1
missingno 0.5.0
multimethod 1.6
networkx 2.6.3
numpy 1.22.1
packaging 21.3
pandas 1.3.5
pandas-profiling 3.1.0
phik 0.12.0
Pillow 9.0.0
pip 21.3.1
pydantic 1.9.0
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2021.3
PyWavelets 1.2.0
PyYAML 6.0
requests 2.27.1
scipy 1.7.3
seaborn 0.11.2
setuptools 60.0.5
six 1.16.0
tangled-up-in-unicode 0.1.0
tqdm 4.62.3
typing_extensions 4.0.1
urllib3 1.26.8
visions 0.7.4
wheel 0.37.1
Let me know if I can provide more details and thank you for your good work!
I investigated a bit and I think I identified a behavior change on the following line: https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L175
With pandas 1.3.5, length.index is Index([1, 1], dtype='object', name='col1'), while with pandas 1.4.0 it is Index([1, 1], dtype='Int64', name='col1').
Later in the code, at https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/utils_pandas.py#L13, weights == np.max(weights) is an instance of <class 'numpy.ndarray'> with pandas 1.3.5, while with pandas 1.4.0 it is now an instance of <class 'pandas.core.arrays.boolean.BooleanArray'>.
Taking a look at the release notes, it might be related to https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays.
I am not sure what the correct fix is, or whether you plan to support both versions of pandas. Let me know if I can provide more help.
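To make the type difference described above concrete, here is a minimal sketch that does not use pandas-profiling at all. It assumes that the nullable "Int64" dtype is what triggers the change, and it uses pd.array rather than an Index so that it behaves the same across pandas versions; the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Plain NumPy integers: the comparison yields a NumPy boolean array.
numpy_weights = np.array([1, 2])
numpy_mask = numpy_weights == 2

# Nullable "Int64" integers: the comparison yields a pandas BooleanArray.
nullable_weights = pd.array([1, 2], dtype="Int64")
nullable_mask = nullable_weights == 2

print(type(numpy_mask).__name__)
print(type(nullable_mask).__name__)

# With no missing values, the mask converts cleanly to a NumPy boolean array.
converted_mask = np.asarray(nullable_mask, dtype=bool)
print(converted_mask.dtype)
```

If the mask is converted like this before it is used to index a NumPy array, the indexing no longer depends on how NumPy interprets a BooleanArray.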
This is a weird one… A simple fix is to modify line 13 identified by @Lothiraldan to use np.where (as is done in the else branch below).
I haven’t investigated too deeply, but with the above change the tests appear to pass on pandas 1.4.0+. For some reason, the result of the == comparison in this case isn’t a boolean ndarray.
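The exact patch is not reproduced in this thread. As a rough sketch of what an np.where-based version of the failing branch could look like: the helper name pick_heaviest is hypothetical, and the np.asarray coercion is an extra assumption added here for safety, not something the comment above mentions.

```python
import numpy as np
import pandas as pd


def pick_heaviest(data, weights):
    """Return the data value that carries the largest weight.

    Hypothetical helper mirroring only the failing branch of weighted_median.
    """
    # Coerce possible pandas extension arrays to plain NumPy arrays first
    # (an assumption of this sketch, not part of the described fix).
    data = np.asarray(data)
    weights = np.asarray(weights)
    # np.where returns integer positions, so the indexing below never
    # receives a boolean mask of an unexpected type.
    positions = np.where(weights == np.max(weights))[0]
    return data[positions][0]


# Works for plain NumPy input ...
numpy_result = pick_heaviest(np.array([5, 7]), np.array([1, 4]))
# ... and for nullable "Int64" extension arrays.
nullable_result = pick_heaviest(
    pd.array([1, 3], dtype="Int64"), pd.array([2, 1], dtype="Int64")
)
print(numpy_result, nullable_result)
```

Indexing with integer positions rather than with the comparison result sidesteps the question of which array type the == operator returns.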