Pandas-Profiling not considering the type of the column which is changed manually.

Describe the bug

  • Pandas is used to read the data.
  • One of the columns is misread as a number (int) instead of a string/category (str).
  • That column is manually cast with astype [astype('str') / astype('category')].
  • Pandas-Profiling is used to generate the report.
  • The report ignores the new type and still treats the column as a number.

To Reproduce

GitHub repo: https://github.com/mohith7548/Pandas-Profiling-issue-recreation

import numpy as np
import pandas as pd
import pandas_profiling

df = pd.DataFrame({
    'Dummy': ['X', 'Y', 'X', 'X', 'Y', 'Y', 'Y', 'X'],
    'Contract': [9940243658, 9940243537, 9940243103, 9940242844, 9940242844, 9940242840, 9940242774, 9940242774]
})

df['Contract_category'] = df['Contract'].astype('category')
df['Contract_str'] = df['Contract'].astype('str')

df.dtypes
# output
Dummy                  object
Contract                int64
Contract_category    category
Contract_str           object
dtype: object

df.profile_report()  # equivalent to ProfileReport(df)
# When generating the report, Pandas-Profiling treats Contract_category and Contract_str as int
# even though they were explicitly cast to category and str respectively above.

Version information

  • python version: 3.6
  • environment: jupyter-notebook
  • pandas-profiling==2.10.0

Additional context

As a workaround, several people suggested prefixing a character to the number after casting it from int to str, so that Pandas-Profiling treats the column as a string/category. This adds unnecessary complexity, though.
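For reference, the prefix workaround described above might look like the following minimal sketch. The "C" prefix and the Contract_prefixed column name are arbitrary choices for illustration and do not come from the original report.

```python
import pandas as pd

df = pd.DataFrame({"Contract": [9940243658, 9940243537, 9940243103]})

# Cast to string, then prepend a letter so the values can no longer be
# parsed back into numbers. The "C" prefix is a hypothetical choice.
df["Contract_prefixed"] = "C" + df["Contract"].astype(str)

print(df["Contract_prefixed"].tolist())
```

The cost is that every downstream consumer of the column has to know about the prefix and strip it again, which is the extra complexity the report complains about.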

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

2 reactions
ieaves commented, May 21, 2021

@reedv The infer_dtypes flag attempts to infer the best data type for each column in your DataFrame before computing summaries. For example, given a sequence of values ['1', '2', '3'], with infer_dtypes enabled PP will produce a summary for integers rather than strings.

If you've already massaged your data and have things the way you want, there's no harm in calling it with infer_dtypes off. Under the hood, PP uses a customizable type system called visions and is moving towards fully customizable ProfileReports.
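Turning inference off could be sketched as below. This assumes a pandas-profiling release that accepts the infer_dtypes keyword discussed in this thread; the 2.10.0 version from the report may not. The build_report helper is a hypothetical wrapper introduced only for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Contract": [9940243658, 9940243537, 9940243103, 9940242844],
})

# Explicit casts that the report is expected to respect.
df["Contract_category"] = df["Contract"].astype("category")
df["Contract_str"] = df["Contract"].astype(str)


def build_report(frame):
    """Build a profile report with type inference switched off.

    Assumption: the installed pandas-profiling release accepts the
    infer_dtypes keyword; older releases may reject it.
    """
    # Imported lazily so the DataFrame setup above runs without the library.
    from pandas_profiling import ProfileReport

    return ProfileReport(frame, infer_dtypes=False)


# Usage (requires pandas-profiling to be installed):
# report = build_report(df)
# report.to_file("report.html")
```

With inference off, the report should rely on the dtypes already set on the DataFrame, so the category and str casts above are not reinterpreted as integers.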

1 reaction
mohith7548 commented, Jan 30, 2021

Thanks for the help, @ieaves. I opened a pull request that lets the developer choose whether dtypes are inferred or detected, via a new infer_dtypes keyword argument on ProfileReport.
The changes are here:
https://github.com/mohith7548/pandas-profiling/blob/4dd35dfbf7c1eb005a90e87275efdeec6968931b/src/pandas_profiling/model/summary.py#L46
