bin problem in histograms
Describe the bug
I have a data table of 79 observations and 50 variables. I generate the profiling report regularly. I have used pandas-profiling 2.1 for quite a long time.
There are some variables with the discrete values of (0, 1, 2, 3, 4, 5). The histogram looked like this in version 2.1:
The same report in version 2.8 looks like this:
Version 2.8 creates 10 bins, which is a poor fit for six discrete values. I tried to set the number of bins manually in the YAML config file, but without success. I am not sure whether this is a bug or whether I have misunderstood the configuration parameters. Please help me solve this problem. Thanks a lot!
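A minimal sketch of why ten equal-width bins suit this kind of column poorly. It uses plain NumPy, not pandas-profiling's own plotting code, and the per-value counts are made up for illustration (only the total of 79 and the value range 0 to 5 come from the report):

```python
import numpy as np

# Made-up counts for illustration: 79 observations of the values 0..5
values = np.repeat([0, 1, 2, 3, 4, 5], [10, 20, 15, 14, 12, 8])

# Ten equal-width bins over [0, 5] are 0.5 wide, so four of them
# (0.5-1.0, 1.5-2.0, 2.5-3.0, 3.5-4.0) contain no integer and stay empty,
# while 4 and 5 land in adjacent bins.
counts_10, edges_10 = np.histogram(values, bins=10)

# One bin per distinct value: edges at -0.5, 0.5, ..., 5.5
edges_6 = np.arange(-0.5, 6.0, 1.0)
counts_6, _ = np.histogram(values, bins=edges_6)

print(counts_10.tolist())  # [10, 0, 20, 0, 15, 0, 14, 0, 12, 8]
print(counts_6.tolist())   # [10, 20, 15, 14, 12, 8]
```

With six bins every bar corresponds to exactly one value, which is the result the `bins: 6` setting in the config was meant to produce.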
To Reproduce
We would need to reproduce your scenario before being able to resolve it.
Data:
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 31  OK75_num  79 non-null     int64
The column contains only the values 0, 1, 2, 3, 4 and 5.
Code:
from pandas_profiling import ProfileReport

profile = ProfileReport(df, config_file='profiler_settings.yml')
profile.to_file("profiler_report.html")
profiler_settings.yml:
pool_size: 7
plot:
  histogram:
    bins: 6
    bayesian_blocks_bins: no
  image_format: svg
  dpi: 800
  scatter_threshold: 1000
  correlation:
    cmap: RdBu
    bad: '#000000'
  missing:
    cmap: RdBu
    force_labels: yes
title: "Report title"
progress_bar: yes
variables:
  descriptions: {}
vars:
  num:
    quantiles:
      - 0.05
      - 0.25
      - 0.5
      - 0.75
      - 0.95
    skewness_threshold: 20
    low_categorical_threshold: 5
    chi_squared_threshold: 0
  cat:
    length: yes
    unicode: no
    cardinality_threshold: 50
    n_obs: 6
    chi_squared_threshold: 0
    coerce_str_to_date: no
  bool:
    n_obs: 3
  file:
    active: no
  image:
    active: no
    exif: yes
    hash: yes
sort: None
missing_diagrams:
  bar: no
  matrix: no
  heatmap: no
  dendrogram: no
correlations:
  pearson:
    calculate: no
    warn_high_correlations: yes
    threshold: 0.9
  spearman:
    calculate: no
    warn_high_correlations: no
  kendall:
    calculate: no
    warn_high_correlations: no
  phi_k:
    calculate: no
    warn_high_correlations: no
  cramers:
    calculate: no
    warn_high_correlations: yes
    threshold: 0.9
  recoded:
    calculate: no
    warn_high_correlations: no
    threshold: 0.0
interactions:
  targets: []
  continuous: yes
categorical_maximum_correlation_distinct: 6
n_obs_unique: 5
n_extreme_obs: 5
n_freq_table_max: 6
#n_freq_table_max: 10
memory_deep: no
duplicates:
  head: 10
samples:
  head: 5
  tail: 5
reject_variables: no
notebook:
  iframe:
    height: 800px
    width: 100%
    attribute: srcdoc
html:
  minify_html: yes
  use_local_assets: yes
  inline: yes
  navbar_show: yes
  file_name: None
  style:
    theme: None
    #theme: "flatly"
    logo: ''
    primary_color: '#337ab7'
    #primary_color: "#2c3e50"
    full_width: yes
Version information:
- Python version: Python 3.7.3 64 bit (anaconda3).
- Environment: vs code and local Jupyter Notebook
- pip: If you are using pip, run pip freeze in your environment and report the results. The list of packages can be rather long, so you can use the snippet below to collapse the output.
Click to expand Version information: alabaster==0.7.12 altgraph==0.16.1 anaconda-client==1.7.2 anaconda-navigator==1.9.7 anaconda-project==0.8.2 appdirs==1.4.3 asn1crypto==0.24.0 astroid==2.2.5 astropy==4.0.1.post1 atomicwrites==1.3.0 attrs==19.3.0 Babel==2.6.0 backcall==0.1.0 backports.os==0.1.1 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.7.1 bitarray==0.8.3 bkcharts==0.2 bleach==3.1.0 bokeh==1.0.4 boto==2.49.0 Bottleneck==1.2.1 cached-property==1.5.1 certifi==2019.3.9 cffi==1.12.2 chardet==3.0.4 Click==7.0 cloudpickle==0.8.0 clyent==1.2.2 colorama==0.4.1 conda==4.6.11 conda-build==3.17.8 conda-verify==3.1.1 confuse==1.0.0 contextlib2==0.5.5 cryptography==2.6.1 cycler==0.10.0 Cython==0.29.6 cytoolz==0.9.0.1 dask==2.5.2 decorator==4.4.0 defusedxml==0.5.0 distributed==2.5.2 docutils==0.14 entrypoints==0.3 et-xmlfile==1.0.1 fastcache==1.0.2 filelock==3.0.10 Flask==1.0.2 funcsigs==1.0.2 future==0.17.1 gevent==1.4.0 glob2==0.6 gmpy2==2.0.8 greenlet==0.4.15 h5py==2.9.0 heapdict==1.0.0 html5lib==1.0.1 htmlmin==0.1.12 idna==2.8 ImageHash==4.1.0 imageio==2.5.0 imagesize==1.1.0 importlib-metadata==0.0.0 ipykernel==5.1.0 ipython==7.4.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isodate==0.6.0 isort==4.3.16 itsdangerous==1.1.0 jdcal==1.4 jedi==0.13.3 jeepney==0.4 Jinja2==2.11.2 joblib==0.15.1 jsonschema==3.0.1 jupyter==1.0.0 jupyter-client==5.2.4 jupyter-console==6.0.0 jupyter-core==4.4.0 jupyterlab==0.35.4 jupyterlab-server==0.2.0 keyring==18.0.0 kiwisolver==1.0.1 lazy-object-proxy==1.3.1 libarchive-c==2.8 lief==0.9.0 llvmlite==0.28.0 locket==0.2.0 lxml==4.3.2 MarkupSafe==1.1.1 matplotlib==3.2.1 mccabe==0.6.1 missingno==0.4.2 mistune==0.8.4 mkl-fft==1.0.10 mkl-random==1.0.2 modin==0.6.1 more-itertools==6.0.0 mpmath==1.1.0 msgpack==0.6.1 multipledispatch==0.6.0 navigator-updater==0.2.1 nbconvert==5.4.1 nbformat==4.4.0 networkx==2.4 nltk==3.4 nose==1.3.7 notebook==5.7.8 numba==0.43.1 numexpr==2.6.9 numpy==1.16.2 numpydoc==0.8.0 olefile==0.46 openpyxl==2.6.1 
packaging==19.0 pandas==1.0.3 pandas-profiling==2.8.0 pandocfilters==1.4.2 parso==0.3.4 partd==0.3.10 path.py==11.5.0 pathlib2==2.3.3 patsy==0.5.1 pep8==1.7.1 pexpect==4.6.0 phik==0.9.12 pickleshare==0.7.5 Pillow==5.4.1 pkginfo==1.5.0.1 pluggy==0.9.0 ply==3.11 prometheus-client==0.6.0 prompt-toolkit==2.0.9 protobuf==3.10.0 psutil==5.6.1 ptyprocess==0.6.0 py==1.8.0 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycurl==7.43.0.2 pyflakes==2.1.1 Pygments==2.3.1 PyInstaller==3.5 pylint==2.3.1 pyodbc==4.0.26 pyOpenSSL==19.0.0 pyparsing==2.3.1 pyrsistent==0.14.11 pyserial==3.4 PySimpleGUI==4.1.0 PySocks==1.6.8 pytest==4.3.1 pytest-arraydiff==0.3 pytest-astropy==0.5.0 pytest-doctestplus==0.3.0 pytest-openfiles==0.3.2 pytest-pylint==0.14.0 pytest-remotedata==0.3.1 python-dateutil==2.8.0 pytz==2018.9 PyWavelets==1.0.2 PyYAML==5.1 pyzmq==18.0.0 QtAwesome==0.5.7 qtconsole==4.4.3 QtPy==1.7.0 ray==0.7.3 redis==3.3.8 requests==2.23.0 requests-toolbelt==0.9.1 retrying==1.3.3 rope==0.12.0 ruamel-yaml==0.15.46 scikit-image==0.14.2 scikit-learn==0.20.3 scipy==1.4.1 seaborn==0.9.0 SecretStorage==3.1.1 Send2Trash==1.5.0 simplegeneric==0.8.1 singledispatch==3.4.0.3 six==1.12.0 snowballstemmer==1.2.1 sortedcollections==1.1.2 sortedcontainers==2.1.0 soupsieve==1.8 Sphinx==1.8.5 sphinxcontrib-websupport==1.1.0 spyder==3.3.3 spyder-kernels==0.4.2 SQLAlchemy==1.3.1 statsmodels==0.9.0 sympy==1.3 tables==3.5.1 tangled-up-in-unicode==0.0.6 tblib==1.3.2 terminado==0.8.1 testpath==0.4.2 toolz==0.9.0 tornado==6.0.2 tqdm==4.46.0 traitlets==4.3.2 typed-ast==1.4.0 unicodecsv==0.14.1 urllib3==1.24.1 virtualenv==16.7.9 visions==0.4.4 wcwidth==0.1.7 webencodings==0.5.1 Werkzeug==0.14.1 widgetsnbextension==3.5.1 wrapt==1.11.1 wurlitzer==1.0.2 xlrd==1.2.0 XlsxWriter==1.1.5 xlwt==1.3.0 zeep==3.4.0 zict==0.1.4 zipp==0.3.3
Thanks for reporting this @pvojnisek. As @loopyme points out, the bin size used to be (unintentionally) hard-coded in render_real.py. The next release will pre-compute histograms earlier in the process anyway, which, in addition to allowing more efficient parallelization, will include a fix for this problem.

The v2.9.0rc1 release is out and should resolve this issue. Until this version is fully released, you can install it via pip in the following way:
It would be very helpful to know if the release candidate adequately solves the issue.