Potential incompatibility with Pandas 1.4.0

Describe the bug

Pandas version 1.4.0 was released a few days ago and some tests started failing. I was able to reproduce the failure with a minimal example that fails with Pandas 1.4.0 and works with Pandas 1.3.5.

To Reproduce

import pandas as pd
import pandas_profiling

data = {"col1": [1, 2], "col2": [3, 4]}
dataframe = pd.DataFrame(data=data)

profile = pandas_profiling.ProfileReport(dataframe, minimal=False)
profile.to_html()

When running with Pandas 1.4.0, I get the following traceback:

Traceback (most recent call last):
  File "/tmp/bug.py", line 8, in <module>
    profile.to_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 368, in to_html
    return self.html
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 185, in html
    self._html = self._render_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 287, in _render_html
    report = self.report
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 161, in description_set
    self._description_set = describe_df(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/describe.py", line 71, in describe
    series_description = get_series_descriptions(
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 92, in pandas_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 72, in multiprocess_1d
    return column, describe_1d(config, series, summarizer, typeset)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 50, in pandas_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summarizer.py", line 37, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 62, in handle
    return op(*args)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 17, in func2
    res = g(*x)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 65, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 82, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 205, in pandas_describe_categorical_1d
    summary.update(length_summary_vc(value_counts))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 162, in length_summary_vc
    "median_length": weighted_median(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/utils_pandas.py", line 13, in weighted_median
    w_median = (data[weights == np.max(weights)])[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If I change minimal from False to True, the script passes.

Version information:

Failing environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.4.0 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.4.0
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Working environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.3.5 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.3.5
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Let me know if I can provide more details and thank you for your good work!

Top GitHub Comments

Lothiraldan commented, Jan 25, 2022 (14 reactions)

I investigated a bit and I think I identified a behavior change on the following line: https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L175

With pandas 1.3.5, length.index is Index([1, 1], dtype='object', name='col1') while with pandas 1.4.0, it is Index([1, 1], dtype='Int64', name='col1').

Later in the code, at https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/utils_pandas.py#L13, weights == np.max(weights) is an instance of <class 'numpy.ndarray'> with pandas 1.3.5, while with pandas 1.4.0 it is now an instance of <class 'pandas.core.arrays.boolean.BooleanArray'>.

Looking at the release notes, it might be related to https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays.

I am not sure what the correct fix is or whether you plan to support both versions of pandas. Let me know if I can provide more help.
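The type change described in this comment can be seen in isolation. The following is a minimal sketch, not the pandas-profiling code path: it assumes a small nullable Int64 array as a stand-in for the values in the issue, and the names weights, mask, plain_mask and data are illustrative only.

```python
import numpy as np
import pandas as pd

# Nullable integer data, standing in for the Int64-backed values from the issue.
weights = pd.array([1, 2], dtype="Int64")

# Comparing a nullable array yields a pandas BooleanArray, not a NumPy ndarray.
mask = weights == 2
print(type(mask).__name__)  # BooleanArray

# Converting explicitly gives a plain boolean ndarray that NumPy can index with.
# (This conversion assumes the mask contains no missing values.)
plain_mask = np.asarray(mask, dtype=bool)
data = np.array([10, 20])
print(data[plain_mask])
```

Whether the intermediate values in pandas-profiling end up as nullable arrays depends on the pandas version, as the comment above reports.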

ieaves commented, Feb 19, 2022 (8 reactions)

This is a weird one… A simple fix is to modify line 13 identified by @Lothiraldan to use np.where (as is done in the else condition below) i.e.

w_median = (data[np.where(weights == np.max(weights))])[0]

I haven’t investigated much beyond the fix, but with the above change the tests appear to pass on pandas 1.4.0+. For some reason the result of the == comparison in this case isn’t a boolean ndarray.
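As a rough illustration of the np.where approach suggested above, here is a small standalone sketch. The helper name value_at_max_weight is hypothetical and this is not the library's actual weighted_median; it only covers the step of picking the value whose weight is largest, and it assumes the inputs can be coerced to plain NumPy arrays.

```python
import numpy as np

def value_at_max_weight(data, weights):
    """Hypothetical helper: return the first value whose weight is maximal."""
    # Coerce both inputs to plain NumPy arrays so the comparison below
    # produces an ordinary boolean ndarray.
    data = np.asarray(data)
    weights = np.asarray(weights, dtype=float)

    mask = weights == np.max(weights)
    # np.where turns the boolean mask into integer positions,
    # which NumPy always accepts as an index.
    positions = np.where(mask)[0]
    return data[positions][0]

print(value_at_max_weight([3, 5, 7], [1, 4, 2]))  # 5
```

Indexing by integer positions sidesteps the question of which boolean container the comparison returned, which is presumably why the one-line change works across pandas versions.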
