Pandas-Profiling not considering the type of the column which is changed manually.

Describe the bug

Pandas is used to read the data.
One of the columns is misread as a number (int) instead of a string/category (str).
That column is manually cast with the astype function [astype('str') / astype('category')].
Pandas-Profiling is used to generate a report.
The report ignores the column's new type and still treats the column as a number.

To Reproduce

GitHub repo: https://github.com/mohith7548/Pandas-Profiling-issue-recreation

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame({
    'Dummy': ['X', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'X'],
    'Contract': [9940243658, 9940243537, 9940243103, 9940242844, 9940242844, 9940242840, 9940242774, 9940242774]
})

df['Contract_category'] = df['Contract'].astype('category')
df['Contract_str'] = df['Contract'].astype('str')

df.dtypes
# output:
# Dummy                  object
# Contract                int64
# Contract_category    category
# Contract_str           object
# dtype: object

df.profile_report()  # or ProfileReport(df)
# When generating the report, Pandas-Profiling treats Contract_category and Contract_str as int
# even though they were explicitly cast to category and str above.

Version information

python version: 3.6
environment: jupyter-notebook
pandas-profiling==2.10.0

Additional context

As a workaround, many have suggested prefixing the number with a character after casting it from int to str, so that Pandas-Profiling treats it as a string/category. But this adds unnecessary complexity.
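
The workaround described above can be sketched as follows. This is only an illustration: the prefix "C" and the column name Contract_prefixed are arbitrary choices that do not come from the issue.

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": [9940243658, 9940243537, 9940243103],
})

# Cast to str, then prepend a non-numeric character (the "C" is an arbitrary,
# hypothetical choice) so the values can no longer be parsed as numbers.
df["Contract_prefixed"] = "C" + df["Contract"].astype(str)

print(df["Contract_prefixed"].tolist())
```

The cost is that every downstream consumer of the column has to know about the prefix and strip it again, which is the extra complexity the reporter objects to.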

Issue Analytics

State:
Created 3 years ago
Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions

ieaves commented, May 21, 2021

@reedv the infer_dtypes flag attempts to infer the best data type for each column in your dataframe before computing summaries. For example, if you had a sequence of values ['1', '2', '3'], then with infer_dtypes on, PP (Pandas-Profiling) will provide a summary for integers rather than strings.
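
To make the idea of inference concrete, here is a rough sketch using plain pandas. It assumes nothing about pandas-profiling internals: pd.to_numeric merely stands in for "find the best type", and the variable names are made up for the example.

```python
import pandas as pd

# Digits stored as text, like the ['1', '2', '3'] example above.
digits_as_text = pd.Series(["1", "2", "3"])

# pd.to_numeric stands in for "infer the best type" here; it is not the
# mechanism pandas-profiling itself uses.
inferred = pd.to_numeric(digits_as_text)

print(type(digits_as_text.iloc[0]))  # the stored values are Python str
print(inferred.dtype)                # the parsed values have an integer dtype
```

Because every value parses cleanly as an integer, a tool that infers types has good reason to summarize the column as numeric, whatever dtype the column was given.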

If you’ve already massaged your data and have things the way you want there’s no harm in calling it with infer_dtypes off. Under the hood, PP is using a customizable type system called visions and is moving towards fully customizable ProfileReports.
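
A minimal sketch of calling the report with inference turned off, assuming the installed pandas-profiling release accepts the infer_dtypes keyword discussed in this thread. The helper names build_frame and make_report are hypothetical and exist only for this example.

```python
import pandas as pd


def build_frame() -> pd.DataFrame:
    """Recreate a small version of the frame from the issue, with manual casts."""
    df = pd.DataFrame({"Contract": [9940243658, 9940243537, 9940243103]})
    df["Contract_category"] = df["Contract"].astype("category")
    df["Contract_str"] = df["Contract"].astype("str")
    return df


def make_report(df: pd.DataFrame):
    # Imported lazily so build_frame() can be used without the package installed.
    from pandas_profiling import ProfileReport

    # Assumption: this release supports infer_dtypes. With it off, the report
    # should rely on the dtypes already set on the frame instead of re-inferring.
    return ProfileReport(df, infer_dtypes=False)


df = build_frame()
print(df.dtypes)
```

In a notebook, make_report(df) would then be called in place of df.profile_report().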

1 reaction

mohith7548 commented, Jan 30, 2021

Thanks for the help @ieaves. I made a pull request that lets the developer choose between inferring and simply detecting dtypes, through a new keyword argument infer_dtypes in ProfileReport.
The new changes are here:
https://github.com/mohith7548/pandas-profiling/blob/4dd35dfbf7c1eb005a90e87275efdeec6968931b/src/pandas_profiling/model/summary.py#L46
