Pandas-Profiling not considering the type of the column which is changed manually.
Describe the bug
- Pandas is used to read data
- One of the columns was misread as a number (int) instead of a string/category (str)
- That column is manually typecast with the astype function [astype('str') / astype('category')]
- Pandas-Profiling is used to generate a report
- The new type of the column is not taken into account; the report still treats that column as a number.
To Reproduce
GitHub repo: https://github.com/mohith7548/Pandas-Profiling-issue-recreation
import numpy as np
import pandas as pd
import pandas_profiling
df = pd.DataFrame({
'Dummy': ['X', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'X'],
'Contract': [9940243658, 9940243537, 9940243103, 9940242844, 9940242844, 9940242840, 9940242774, 9940242774]
})
df['Contract_category'] = df['Contract'].astype('category')
df['Contract_str'] = df['Contract'].astype('str')
df.dtypes
# output
Dummy object
Contract int64
Contract_category category
Contract_str object
dtype: object
df.profile_report() # ProfileReport(df)
# When generating the report, Pandas-Profiling treats Contract_category & Contract_str as int
# even though they were explicitly cast to category and str, respectively, above.
Version information:
python version: 3.6
environment: jupyter-notebook
pandas-profiling==2.10.0
Additional context
As a workaround, many suggested prefixing a character to the number after casting from int to str, so that Pandas-Profiling treats it as a string/category. But this method seems to add unnecessary complexity.
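For reference, the prefix workaround described above might look roughly like this. It is only a sketch: the prefix "C" and the shortened sample data are arbitrary choices made for illustration, and the numeric-parse check at the end is just a quick way to confirm the values no longer look like numbers, not what Pandas-Profiling runs internally.

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": [9940243658, 9940243537, 9940243103],
})

# Hypothetical prefix; any non-numeric character would serve the same purpose.
PREFIX = "C"
df["Contract_str"] = PREFIX + df["Contract"].astype(str)

# After prefixing, none of the values can be parsed as a number any more.
is_numeric = bool(
    pd.to_numeric(df["Contract_str"], errors="coerce").notna().any()
)
print(df["Contract_str"].tolist())
print(is_numeric)
```

The cost is that every downstream consumer of the column has to know about the prefix and strip it again, which is the extra complexity mentioned above.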
Issue Analytics
- State:
- Created 3 years ago
- Comments:8 (3 by maintainers)
@reedv the infer_dtypes flag attempts to infer the best data type for each column in your dataframe before computing summaries. For example, if you had a sequence of values ['1', '2', '3'], then with infer_dtypes enabled PP will provide a summary for integers rather than strings. If you've already massaged your data and have things the way you want, there's no harm in calling it with infer_dtypes off. Under the hood, PP uses a customizable type system called visions and is moving towards fully customizable ProfileReports.
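A rough sketch of what that looks like in practice, assuming a pandas-profiling release that accepts the infer_dtypes keyword discussed in this thread (the 2.10.0 version from the report may not). The to_numeric check only illustrates why a string column of digits can be read as integers; it is not the visions inference logic. build_report is a hypothetical helper name, and the sample data is shortened.

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": [9940243658, 9940243537, 9940243103],
})
df["Contract_str"] = df["Contract"].astype(str)

# Illustration only: the string column parses cleanly as numbers, which is the
# kind of evidence a type-inference step can use to summarize it as integers.
# This is NOT pandas-profiling's actual inference code.
parsed = pd.to_numeric(df["Contract_str"])
print(parsed.dtype)


def build_report(frame):
    """Build a report that keeps the dtypes already set on the frame.

    Assumes a pandas-profiling release that supports infer_dtypes.
    """
    from pandas_profiling import ProfileReport  # imported lazily on purpose

    return ProfileReport(frame, infer_dtypes=False)
```

Calling build_report(df) should then summarize Contract_str using the dtype you assigned rather than a re-inferred one, provided the installed version supports the keyword.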
Thanks for the help @ieaves. I made a pull request that lets the developer choose whether dtypes are inferred/detected, through a new kwarg param infer_dtypes in ProfileReport. New changes here:
https://github.com/mohith7548/pandas-profiling/blob/4dd35dfbf7c1eb005a90e87275efdeec6968931b/src/pandas_profiling/model/summary.py#L46