ENH: Universal visualization for Data-Science Dataset
For the past few years, I have been learning Data-Science. I have gone through loads of datasets. I have recently developed a Repository with 25 examples for data exploration, understanding, and machine learning. We have noticed that there is not really a universal way of showing the content of a dataset. Currently, you can use:
- df.head(), which shows the first few rows. This gives some good insight into the dataset, but it only shows a slice of it, and it takes some brain juice and good focus to interpret correctly.
- df.info(), which shows you the column name, Non-Null Count and Dtype. This is very useful metadata, but it doesn't give you any insight into the contents. Also, labelling strings as Dtype "object" is correct, but not very helpful.
- df.describe(), which gives very helpful information about continuous values, but does not help with strings and categorical values.
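For reference, here is a minimal sketch of what these three calls cover. The toy DataFrame is made up for illustration (its column names and values are assumptions, not the dataset discussed in this issue).

```python
import io

import pandas as pd

# Hypothetical toy data, only to illustrate the three calls.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

head = df.head(2)        # first rows only: a slice of the data

buf = io.StringIO()
df.info(buf=buf)         # column names, non-null counts and dtypes
info_text = buf.getvalue()

summary = df.describe()  # by default, statistics for numeric columns only

print(head)
print(info_text)
print(summary)
```

Note that the string column "gender" does not appear in the describe() output at all, which is the gap described above.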
When I search for new datasets to learn Data-Science, I want to understand what the data is about and what I may be able to do with it with just one visualisation.
Additional Benefit
When exploring a new dataset, the first steps are always the same: I want to know what features I am working with and what data types they have. For categorical features, I want a list of the categories (df["feature"].unique()). For continuous values, I want to know the range (min, max) and maybe the mean.
The Solution: A universal visualisation.
Wouldn't it be amazing if pandas could print a table with just one function call that describes all this information in a compact, easy-to-understand format? It would automatically detect categorical and continuous values and provide the most important information, so that you can quickly understand what the data is about.
This table could become the default visualisation for efficiently describing a dataset. It can be printed out directly in Markdown table format, so that it can be copied straight into the documentation.
API breaking implications
This would be a new function:
Describe alternatives you’ve considered
The current alternative is that everyone uses a custom format to visualize their dataset.
You can use the functions listed above to get a feel for the dataset. For categorical values, you have to do a df["feature"].unique() or df["feature"].value_counts() for every single feature.
Now your information is scattered all over your Python notebook, and you are constantly scrolling around between the different cells.
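To make the manual workflow concrete, here is a minimal sketch of the per-feature loop it usually turns into. The toy DataFrame and the `overview` dictionary are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "tenure": [1, 34, 2, 45],
})

overview = {}
for col in df.columns:
    values = df[col].dropna()
    if pd.api.types.is_numeric_dtype(values):
        # Continuous feature: range and mean.
        overview[col] = (values.min(), values.max(), values.mean())
    else:
        # Categorical feature: each category with its frequency.
        overview[col] = values.value_counts().to_dict()

for col, info in overview.items():
    print(col, info)
```

Even in this compact form, the result is a loose collection of per-column outputs rather than one table.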
Sample Implementation
For our ML-Repository (mentioned above) I have implemented a function that partially fulfils these requirements. I would propose an output like this.
The contents of a feature are described with universal symbols and mathematical notation. The format should be universal and not depend on a specific language. Words like "example: " or "Values from -10 to +110.4" are not used, to make the table generally interpretable for people speaking all kinds of languages.
| Feature | Data Type |
|---------|-----------|
| customerID | str { "5789-LDFXO", ... } |
| gender | str {"Female", "Male"} |
| SeniorCitizen | int64 |
| Partner | str {"Yes", "No"} |
| Dependents | str {"No", "Yes"} |
| tenure | int64 |
| PhoneService | str {"No", "Yes"} |
| MultipleLines | str {"No phone service", "No", "Yes"} |
| InternetService | str {"DSL", "Fiber optic", "No"} |
| OnlineSecurity | str {"No", "Yes", "No internet service"} |
| OnlineBackup | str {"Yes", "No", "No internet service"} |
| DeviceProtection | str {"No", "Yes", "No internet service"} |
| TechSupport | str {"No", "Yes", "No internet service"} |
| StreamingTV | str {"No", "Yes", "No internet service"} |
| StreamingMovies | str {"No", "Yes", "No internet service"} |
| Contract | str {"Month-to-month", "One year", "Two year"} |
| PaperlessBilling | str {"Yes", "No"} |
| PaymentMethod | str {"Electronic check", "Mailed check", "Bank transfer (automatic)", "Credit card (automatic)"} |
| MonthlyCharges | float64 [ 18.25; 118.75 ] |
| TotalCharges | str { "659.35", ... } |
| Churn | str {"No", "Yes"} |
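import numpy as np


def feature_description(data):
    # Pad the first column to the width of the longest column name.
    longestColumnName = max(len(str(col)) for col in data.columns)
    print(f"| {'Feature'.ljust(longestColumnName)}| Data Type |")
    print(f"|{'-' * (longestColumnName + 1)}|-----------|")
    for col in data.columns:
        description = ''
        col_dropna = data[col].dropna()
        # Pick one random non-null value to decide how to describe the column.
        example = col_dropna.sample(1).values[0]
        if isinstance(example, str):
            description = 'str'.ljust(8)
            if len(col_dropna.unique()) < 10:
                # Few distinct strings: list them all in set notation.
                description += '{'
                description += ', '.join([f'"{name}"' for name in col_dropna.unique()])
                description += '}'
            else:
                # Many distinct strings: show a single example.
                description += '{ "' + example + '", ... }'
        elif isinstance(example, np.int32) and len(col_dropna.unique()) < 10:
            # Only int32 is handled here; int64 columns fall through to the
            # generic branch below, as in the sample output above.
            description += 'int32 {'
            description += ', '.join([f'{name}' for name in sorted(col_dropna.unique())])
            description += '}'
        elif isinstance(example, np.float64):
            # Continuous values: show the closed interval [min; max].
            description += f"{'float64'.ljust(8)}[ {col_dropna.min()}; {col_dropna.max()} ]"
        else:
            # Anything else: fall back to the dtype, or to the Python type.
            try:
                description = example.dtype
            except AttributeError:
                description = type(example)
        print("| " + str(col).ljust(longestColumnName) + f'| {description} |')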
Proposed Parameters
- categorical_limit=10: maximum categories, to be displayed in the categorical notation.
- max_displayed_chars_in_string=30: maximum number of characters displayed in the example and the categorical notation, before being shortened with …
- show_NAN_count=False: show the Non-Null Count, as in df.info()
- markdown_format=True: Display the Table in the Markdown format
- extended=False: Show more information like value_counts for categorical and standard deviation for continuous values
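As a rough sketch of how some of these parameters could behave, the hypothetical helper below (`describe_features` is an illustrative name, not an existing pandas function) implements only `categorical_limit`, `max_displayed_chars_in_string` and `show_NAN_count`, and always returns Markdown. It assumes that `categorical_limit` is inclusive and that the first value seen serves as the example for long-tailed string columns.

```python
import pandas as pd
from pandas.api import types as ptypes


def describe_features(df, categorical_limit=10,
                      max_displayed_chars_in_string=30,
                      show_NAN_count=False):
    """Return a Markdown table with one row per column (hypothetical helper)."""

    def shorten(value):
        # Truncate long strings and mark the cut with an ellipsis.
        text = str(value)
        if len(text) > max_displayed_chars_in_string:
            text = text[:max_displayed_chars_in_string] + "…"
        return f'"{text}"'

    header = ["Feature", "Data Type"]
    if show_NAN_count:
        header.append("Non-Null Count")

    rows = []
    for col in df.columns:
        values = df[col].dropna()
        uniques = list(values.unique())
        few = len(uniques) <= categorical_limit  # assumption: limit is inclusive

        if values.empty:
            description = str(df[col].dtype)
        elif ptypes.is_float_dtype(values):
            description = f"{values.dtype} [ {values.min()}; {values.max()} ]"
        elif ptypes.is_integer_dtype(values):
            if few:
                items = ", ".join(str(v) for v in sorted(uniques))
                description = f"{values.dtype} " + "{" + items + "}"
            else:
                description = f"{values.dtype} [ {values.min()}; {values.max()} ]"
        elif all(isinstance(v, str) for v in uniques):
            if few:
                items = ", ".join(shorten(v) for v in uniques)
                description = "str {" + items + "}"
            else:
                description = "str { " + shorten(uniques[0]) + ", ... }"
        else:
            description = str(values.dtype)

        row = [str(col), description]
        if show_NAN_count:
            row.append(str(len(values)))
        rows.append(row)

    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# Hypothetical toy data for a quick try-out.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [18.25, 118.75, 56.95],
})
print(describe_features(toy, show_NAN_count=True))
```

Values containing a pipe character would break the Markdown table in this sketch. A real implementation would need to escape them, and `markdown_format` and `extended` are left out entirely.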
Issue Analytics
- Created a year ago
- Comments:5 (4 by maintainers)
Yes, very clear description of the request!
Thank you for the hint. What I proposed is a bit different, as I want a more compact description. But I totally understand your point. It was a pleasure open sourcing with you. Since this is my first Feature Request on such a big project, I gotta ask: was this a well-written Issue? Thanks.