Potential incompatibility with Pandas 1.4.0

Describe the bug

Pandas version 1.4.0 was released a few days ago and some tests started failing. I was able to reproduce the failure with a minimal example, which fails with Pandas 1.4.0 and works with Pandas 1.3.5.

To Reproduce

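import pandas as pd
import pandas_profiling

# Two small integer columns are enough to trigger the failure.
data = {"col1": [1, 2], "col2": [3, 4]}
dataframe = pd.DataFrame(data=data)

# minimal=False enables the full set of summaries.
profile = pandas_profiling.ProfileReport(dataframe, minimal=False)
profile.to_html()  # raises IndexError with Pandas 1.4.0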

When running with Pandas 1.4.0, I get the following traceback:

Traceback (most recent call last):
  File "/tmp/bug.py", line 8, in <module>
    profile.to_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 368, in to_html
    return self.html
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 185, in html
    self._html = self._render_html()
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 287, in _render_html
    report = self.report
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 179, in report
    self._report = get_report_structure(self.config, self.description_set)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/profile_report.py", line 161, in description_set
    self._description_set = describe_df(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/describe.py", line 71, in describe
    series_description = get_series_descriptions(
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 92, in pandas_get_series_descriptions
    for i, (column, description) in enumerate(
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
  File "/home/lothiraldan/.pyenv/versions/3.9.1/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 72, in multiprocess_1d
    return column, describe_1d(config, series, summarizer, typeset)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/summary_pandas.py", line 50, in pandas_describe_1d
    return summarizer.summarize(config, series, dtype=vtype)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summarizer.py", line 37, in summarize
    _, _, summary = self.handle(str(dtype), config, series, {"type": str(dtype)})
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 62, in handle
    return op(*args)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 21, in func2
    return f(*res)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/handler.py", line 17, in func2
    res = g(*x)
  File "/vemv/lib/python3.9/site-packages/multimethod/__init__.py", line 303, in __call__
    return func(*args, **kwargs)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 65, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/summary_algorithms.py", line 82, in inner
    return fn(config, series, summary)
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 205, in pandas_describe_categorical_1d
    summary.update(length_summary_vc(value_counts))
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/describe_categorical_pandas.py", line 162, in length_summary_vc
    "median_length": weighted_median(
  File "/vemv/lib/python3.9/site-packages/pandas_profiling/model/pandas/utils_pandas.py", line 13, in weighted_median
    w_median = (data[weights == np.max(weights)])[0]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If I change minimal from False to True, the script passes.

Version information:

Failing environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.4.0 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.4.0
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Working environment

Python version: Python 3.9.1
Pip version: pip 21.3.1
Pandas and pandas-profiling versions: 1.3.5 | 3.1.0
Full pip list:

Package               Version
--------------------- ---------
attrs                 21.4.0
certifi               2021.10.8
charset-normalizer    2.0.10
cycler                0.11.0
fonttools             4.28.5
htmlmin               0.1.12
idna                  3.3
ImageHash             4.2.1
Jinja2                3.0.3
joblib                1.0.1
kiwisolver            1.3.2
MarkupSafe            2.0.1
matplotlib            3.5.1
missingno             0.5.0
multimethod           1.6
networkx              2.6.3
numpy                 1.22.1
packaging             21.3
pandas                1.3.5
pandas-profiling      3.1.0
phik                  0.12.0
Pillow                9.0.0
pip                   21.3.1
pydantic              1.9.0
pyparsing             3.0.7
python-dateutil       2.8.2
pytz                  2021.3
PyWavelets            1.2.0
PyYAML                6.0
requests              2.27.1
scipy                 1.7.3
seaborn               0.11.2
setuptools            60.0.5
six                   1.16.0
tangled-up-in-unicode 0.1.0
tqdm                  4.62.3
typing_extensions     4.0.1
urllib3               1.26.8
visions               0.7.4
wheel                 0.37.1

Let me know if I can provide more details and thank you for your good work!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 25
  • Comments: 15 (1 by maintainers)

Top GitHub Comments

Lothiraldan commented, Jan 25, 2022 (14 reactions)

I investigated a bit and I think I identified a behavior change on the following line: https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/describe_categorical_pandas.py#L175

With pandas 1.3.5, length.index is Index([1, 1], dtype='object', name='col1') while with pandas 1.4.0, it is Index([1, 1], dtype='Int64', name='col1').

Later in the code, at https://github.com/pandas-profiling/pandas-profiling/blob/eac60a0b4e9a278a0ca44d8c712a599bdb41ec71/src/pandas_profiling/model/pandas/utils_pandas.py#L13, weights == np.max(weights) is an instance of <class 'numpy.ndarray'> with pandas 1.3.5, while with pandas 1.4.0 it is now an instance of <class 'pandas.core.arrays.boolean.BooleanArray'>.

Looking at the release notes, it might be related to https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.4.0.html#index-can-hold-arbitrary-extensionarrays.

I am not sure what the correct fix is, or whether you plan to support both versions of pandas. Let me know if I can provide more help.
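To make the type difference described above easier to see, here is a minimal sketch. It is not taken from pandas-profiling: the variable names and toy values are hypothetical, and a nullable Int64 array built with pd.array is assumed to stand in for the values that come out of the Int64 index. It shows that the same comparison yields a NumPy array in one case and a pandas BooleanArray in the other, and that the mask can be coerced to a plain boolean ndarray before indexing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the values discussed above.
data = np.array([1, 1])
weights_numpy = np.array([1, 1])                    # plain NumPy integers
weights_nullable = pd.array([1, 1], dtype="Int64")  # pandas nullable integers

# The same comparison produces different container types.
mask_numpy = weights_numpy == weights_numpy.max()
mask_nullable = weights_nullable == weights_nullable.max()
print(type(mask_numpy).__name__)
print(type(mask_nullable).__name__)

# Coerce the mask to a plain boolean ndarray before indexing.
# This assumes the mask contains no missing values.
safe_mask = np.asarray(mask_nullable, dtype=bool)
selected = data[safe_mask]
print(selected)
```

Whether indexing a NumPy array directly with a BooleanArray fails may depend on the installed pandas and NumPy versions, so the explicit coercion is only one possible way to make the behavior uniform.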

ieaves commented, Feb 19, 2022 (8 reactions)

This is a weird one… A simple fix is to modify line 13, identified by @Lothiraldan, to use np.where (as is done in the else condition below), i.e.

w_median = (data[np.where(weights == np.max(weights))])[0]

With the above change, the tests appear to pass on pandas 1.4.0+. I haven’t investigated much beyond the fix, but for some reason the result of the == comparison in this case isn’t a boolean ndarray.
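For anyone who wants to try the np.where approach in isolation, here is a small self-contained sketch. The helper name first_value_at_max_weight and the sample values are made up for illustration, and the function only mirrors the single line quoted above, not the rest of weighted_median.

```python
import numpy as np
import pandas as pd


def first_value_at_max_weight(data, weights):
    """Return the first element of data whose weight equals the largest weight.

    Hypothetical helper: np.where turns the comparison result into integer
    positions, so the indexing step no longer depends on the mask's type.
    """
    positions = np.where(weights == np.max(weights))
    return data[positions][0]


values = np.array([3, 5, 7])

# Plain NumPy weights and pandas nullable Int64 weights give the same result.
from_numpy = first_value_at_max_weight(values, np.array([1, 4, 2]))
from_nullable = first_value_at_max_weight(values, pd.array([1, 4, 2], dtype="Int64"))
print(from_numpy, from_nullable)
```

This assumes the weights contain no missing values; a mask with pd.NA entries would need separate handling.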
