bin problem in histograms

Describe the bug

I have a data table of 79 observations and 50 variables. I generate the profiling report regularly and used pandas-profiling 2.1 for quite a long time. Some variables take only the discrete values (0, 1, 2, 3, 4, 5). The histogram looked like this in version 2.1:

[image: histogram in version 2.1]

The same report in version 2.8 looks like this:

[image: histogram in version 2.8]

In version 2.8, 10 bins are created, which does not suit a variable with only six discrete values. I tried to set the number of bins manually in the YAML config file, but without success. I am not sure whether this is a bug or a misunderstanding of the configuration and parameters on my part. Please help me solve this problem. Thanks a lot!
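
To illustrate the symptom outside the profiler, here is a minimal sketch that uses NumPy directly. It is not pandas-profiling's own rendering code. The sample counts are made up (79 observations taking the values 0 to 5) and the variable names are hypothetical. Ten equal-width bins over the range 0 to 5 are each 0.5 wide, so four of them can never receive an integer value. Six unit-width bins centred on the integers give one bar per value.

```python
import numpy as np

# Hypothetical sample: 79 observations with the discrete values 0..5.
values = np.repeat([0, 1, 2, 3, 4, 5], [10, 15, 20, 14, 12, 8])

# Ten equal-width bins over [0, 5]: edges fall at 0.0, 0.5, 1.0, ...
counts_10, edges_10 = np.histogram(values, bins=10)

# Six unit-width bins centred on each integer: edges -0.5, 0.5, ..., 5.5.
edges_6 = np.arange(-0.5, 6.5, 1.0)
counts_6, _ = np.histogram(values, bins=edges_6)

print(counts_10)  # [10  0 15  0 20  0 14  0 12  8]
print(counts_6)   # [10 15 20 14 12  8]
```

The gaps in the first result are the empty bars seen in a 10-bin plot of this kind of column.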

To Reproduce

Data:

 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
31  OK75_num           79 non-null     int64

The column contains only the values (0, 1, 2, 3, 4, 5).

Code:

from pandas_profiling import ProfileReport

profile = ProfileReport(df, config_file='profiler_settings.yml')
profile.to_file("profiler_report.html")

Config file (profiler_settings.yml):

pool_size: 7
plot:
    histogram:
        bins: 6
        bayesian_blocks_bins: no
    image_format: svg
    dpi: 800
    scatter_threshold: 1000
    correlation:
        cmap: RdBu
        bad: '#000000'
    missing:
        cmap: RdBu
        force_labels: yes
title: "Report title"
progress_bar: yes
variables:
    descriptions: {}
vars:
    num:
        quantiles:
        - 0.05
        - 0.25
        - 0.5
        - 0.75
        - 0.95
        skewness_threshold: 20
        low_categorical_threshold: 5
        chi_squared_threshold: 0
    cat:
        length: yes
        unicode: no
        cardinality_threshold: 50
        n_obs: 6
        chi_squared_threshold: 0
        coerce_str_to_date: no
    bool:
        n_obs: 3
    file:
        active: no
    image:
        active: no
        exif: yes
        hash: yes
sort: None
missing_diagrams:
    bar: no
    matrix: no
    heatmap: no
    dendrogram: no
correlations:
    pearson:
        calculate: no
        warn_high_correlations: yes
        threshold: 0.9
    spearman:
        calculate: no
        warn_high_correlations: no
    kendall:
        calculate: no
        warn_high_correlations: no
    phi_k:
        calculate: no
        warn_high_correlations: no
    cramers:
        calculate: no
        warn_high_correlations: yes
        threshold: 0.9
    recoded:
        calculate: no
        warn_high_correlations: no
        threshold: 0.0
interactions:
    targets: []
    continuous: yes
categorical_maximum_correlation_distinct: 6
n_obs_unique: 5
n_extreme_obs: 5
n_freq_table_max: 6
#n_freq_table_max: 10
memory_deep: no
duplicates:
    head: 10
samples:
    head: 5
    tail: 5
reject_variables: no
notebook:
    iframe:
        height: 800px
        width: 100%
        attribute: srcdoc
html:
    minify_html: yes
    use_local_assets: yes
    inline: yes
    navbar_show: yes
    file_name: None
    style:
        theme: None
        #theme: "flatly"
        logo: ''
        primary_color: '#337ab7'
        #primary_color: "#2c3e50"
        full_width: yes

Version information:

  • Python version: Python 3.7.3 64 bit (anaconda3).
  • Environment: VS Code and local Jupyter Notebook
  • pip: output of pip freeze:
Click to expand Version information: alabaster==0.7.12 altgraph==0.16.1 anaconda-client==1.7.2 anaconda-navigator==1.9.7 anaconda-project==0.8.2 appdirs==1.4.3 asn1crypto==0.24.0 astroid==2.2.5 astropy==4.0.1.post1 atomicwrites==1.3.0 attrs==19.3.0 Babel==2.6.0 backcall==0.1.0 backports.os==0.1.1 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.7.1 bitarray==0.8.3 bkcharts==0.2 bleach==3.1.0 bokeh==1.0.4 boto==2.49.0 Bottleneck==1.2.1 cached-property==1.5.1 certifi==2019.3.9 cffi==1.12.2 chardet==3.0.4 Click==7.0 cloudpickle==0.8.0 clyent==1.2.2 colorama==0.4.1 conda==4.6.11 conda-build==3.17.8 conda-verify==3.1.1 confuse==1.0.0 contextlib2==0.5.5 cryptography==2.6.1 cycler==0.10.0 Cython==0.29.6 cytoolz==0.9.0.1 dask==2.5.2 decorator==4.4.0 defusedxml==0.5.0 distributed==2.5.2 docutils==0.14 entrypoints==0.3 et-xmlfile==1.0.1 fastcache==1.0.2 filelock==3.0.10 Flask==1.0.2 funcsigs==1.0.2 future==0.17.1 gevent==1.4.0 glob2==0.6 gmpy2==2.0.8 greenlet==0.4.15 h5py==2.9.0 heapdict==1.0.0 html5lib==1.0.1 htmlmin==0.1.12 idna==2.8 ImageHash==4.1.0 imageio==2.5.0 imagesize==1.1.0 importlib-metadata==0.0.0 ipykernel==5.1.0 ipython==7.4.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 isort==4.3.16 itsdangerous==1.1.0 jdcal==1.4 jedi==0.13.3 jeepney==0.4 Jinja2==2.11.2 joblib==0.15.1 jsonschema==3.0.1 jupyter==1.0.0 jupyter-client==5.2.4 jupyter-console==6.0.0 jupyter-core==4.4.0 jupyterlab==0.35.4 jupyterlab-server==0.2.0 keyring==18.0.0 kiwisolver==1.0.1 lazy-object-proxy==1.3.1 libarchive-c==2.8 lief==0.9.0 llvmlite==0.28.0 locket==0.2.0 lxml==4.3.2 MarkupSafe==1.1.1 matplotlib==3.2.1 mccabe==0.6.1 missingno==0.4.2 mistune==0.8.4 mkl-fft==1.0.10 mkl-random==1.0.2 modin==0.6.1 more-itertools==6.0.0 mpmath==1.1.0 msgpack==0.6.1 multipledispatch==0.6.0 navigator-updater==0.2.1 nbconvert==5.4.1 nbformat==4.4.0 networkx==2.4 nltk==3.4 nose==1.3.7 notebook==5.7.8 numba==0.43.1 numexpr==2.6.9 numpy==1.16.2 numpydoc==0.8.0 olefile==0.46 openpyxl==2.6.1 
packaging==19.0 pandas==1.0.3 pandas-profiling==2.8.0 pandocfilters==1.4.2 parso==0.3.4 partd==0.3.10 path.py==11.5.0 pathlib2==2.3.3 patsy==0.5.1 pep8==1.7.1 pexpect==4.6.0 phik==0.9.12 pickleshare==0.7.5 Pillow==5.4.1 pkginfo==1.5.0.1 pluggy==0.9.0 ply==3.11 prometheus-client==0.6.0 prompt-toolkit==2.0.9 protobuf==3.10.0 psutil==5.6.1 ptyprocess==0.6.0 py==1.8.0 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycurl==7.43.0.2 pyflakes==2.1.1 Pygments==2.3.1 PyInstaller==3.5 pylint==2.3.1 pyodbc==4.0.26 pyOpenSSL==19.0.0 pyparsing==2.3.1 pyrsistent==0.14.11 pyserial==3.4 PySimpleGUI==4.1.0 PySocks==1.6.8 pytest==4.3.1 pytest-arraydiff==0.3 pytest-astropy==0.5.0 pytest-doctestplus==0.3.0 pytest-openfiles==0.3.2 pytest-pylint==0.14.0 pytest-remotedata==0.3.1 python-dateutil==2.8.0 pytz==2018.9 PyWavelets==1.0.2 PyYAML==5.1 pyzmq==18.0.0 QtAwesome==0.5.7 qtconsole==4.4.3 QtPy==1.7.0 ray==0.7.3 redis==3.3.8 requests==2.23.0 requests-toolbelt==0.9.1 retrying==1.3.3 rope==0.12.0 ruamel-yaml==0.15.46 scikit-image==0.14.2 scikit-learn==0.20.3 scipy==1.4.1 seaborn==0.9.0 SecretStorage==3.1.1 Send2Trash==1.5.0 simplegeneric==0.8.1 singledispatch==3.4.0.3 six==1.12.0 snowballstemmer==1.2.1 sortedcollections==1.1.2 sortedcontainers==2.1.0 soupsieve==1.8 Sphinx==1.8.5 sphinxcontrib-websupport==1.1.0 spyder==3.3.3 spyder-kernels==0.4.2 SQLAlchemy==1.3.1 statsmodels==0.9.0 sympy==1.3 tables==3.5.1 tangled-up-in-unicode==0.0.6 tblib==1.3.2 terminado==0.8.1 testpath==0.4.2 toolz==0.9.0 tornado==6.0.2 tqdm==4.46.0 traitlets==4.3.2 typed-ast==1.4.0 unicodecsv==0.14.1 urllib3==1.24.1 virtualenv==16.7.9 visions==0.4.4 wcwidth==0.1.7 webencodings==0.5.1 Werkzeug==0.14.1 widgetsnbextension==3.5.1 wrapt==1.11.1 wurlitzer==1.0.2 xlrd==1.2.0 XlsxWriter==1.1.5 xlwt==1.3.0 zeep==3.4.0 zict==0.1.4 zipp==0.3.3

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
sbrugman commented, May 22, 2020

Thanks for reporting this @pvojnisek. As @loopyme points out, the bin size used to be (unintentionally) hard-coded in render_real.py. The next release will pre-compute histograms earlier in the process anyway, which, in addition to allowing more efficient parallelization, will include a fix for this problem.
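
As a rough illustration of pre-computing a histogram with a configurable bin count rather than a hard-coded one, here is a minimal sketch. It assumes one plausible rule, capping the bin count at the number of distinct values. This is not confirmed to be what pandas-profiling does, and the helper name `precompute_histogram` and its `max_bins` parameter are hypothetical.

```python
import numpy as np

def precompute_histogram(values, max_bins=50):
    """Hypothetical helper: never use more bins than there are distinct values."""
    values = np.asarray(values)
    n_distinct = np.unique(values).size
    bins = min(max_bins, n_distinct)
    # np.histogram returns (counts, bin_edges); edges has bins + 1 entries.
    return np.histogram(values, bins=bins)

# Ten made-up observations with six distinct values and a configured cap of 10.
counts, edges = precompute_histogram([0, 1, 1, 2, 3, 3, 3, 4, 5, 5], max_bins=10)
print(counts)      # [1 2 1 3 1 2]
print(len(edges))  # 7
```

With this rule a column holding only 0 to 5 gets six bins even when the configured maximum is larger, so no empty bars appear between the values.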

0 reactions
sbrugman commented, Jul 15, 2020

The v2.9.0rc1 release is out, and should resolve this issue. Until this version is fully released, you can install it via pip in the following way:

pip install --pre -U pandas-profiling

It would be very helpful to know if the release candidate adequately solves the issue.
