ENH: Universal visualization for Data-Science Dataset

Universal visualisation for Data-Science Dataset

Over the past years, I have been learning data science and have gone through loads of datasets. I recently developed a repository with 25 examples for data exploration, understanding, and machine learning. I have noticed that there is not really a universal way of showing the content of a dataset. Currently, you can use:

df.head()
df.info()
df.describe()

df.head() shows the first few rows. This gives some good insight into the dataset, but it only shows a slice of it, and it takes some brain juice and good focus to interpret correctly. df.info() shows the column name, Non-Null Count and Dtype of each column. This is very useful metadata, but it doesn't give you any insight into the contents. Also, labelling strings as Dtype "object" is correct, but not very helpful. df.describe() gives very helpful information about continuous values, but does not help with strings and categorical values.
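To make the three calls concrete, here is a minimal sketch on a small made-up DataFrame. The values are invented for illustration and the column names are only assumptions. It shows that, by default, df.describe() keeps only the numeric columns.

```python
import io

import pandas as pd

# Toy data invented for illustration; not the dataset from this issue.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

head = df.head(2)              # only the first two rows

buffer = io.StringIO()
df.info(buf=buffer)            # column names, non-null counts and dtypes
info_text = buffer.getvalue()

described = df.describe()      # numeric columns only by default

print(head)
print(info_text)
print(described)
```

Passing include="all" to df.describe() also summarises string columns (count, unique, top, freq), but it still does not list the categories themselves.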

When I search for new datasets to learn Data-Science, I want to understand what the data is about and what I may be able to do with it with just one visualisation.

Additional Benefit

When exploring a new dataset, the first steps are always the same: I want to know what features I am working with and what data types they have. For categorical features, I want a list of the categories, df["feature"].unique(). For continuous values, I want to know the range (min, max) and maybe the mean.

The Solution: A universal visualisation.

Wouldn't it be amazing if pandas could print a table, with just one function call, that describes all this information in a compact, easy-to-understand format? It would automatically detect categorical and continuous values and provide the most important information, so that you quickly understand what the data is about.

This table could become the default visualisation for efficiently describing a dataset. It could be printed out in markdown table format, so that it can be copied directly into documentation.

API breaking implications

This would be a new function: pandas.DataFrame.feature_description()

Describe alternatives you’ve considered

The current alternative is that everyone uses a custom format to visualize their dataset.
You can use the functions listed above to get a feel for the dataset. For categorical values, you have to call df["feature"].unique() or df["feature"].value_counts() for every single feature.
Your information is then scattered all over your Python notebook, and you are constantly scrolling back and forth between the different cells.
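As a rough sketch of that manual workflow, the loop below collects per-feature information with existing pandas calls. The toy DataFrame and the `summary` dictionary are assumptions made up for this example, not part of the proposal.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy data invented for illustration.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

summary = {}
for col in df.columns:
    if is_numeric_dtype(df[col]):
        # Continuous feature: keep the range as a (min, max) tuple.
        summary[col] = (df[col].min(), df[col].max())
    else:
        # Categorical feature: count how often each category occurs.
        summary[col] = df[col].value_counts().to_dict()

for col, info in summary.items():
    print(col, info)
```

Each feature needs its own call, and the result has no common layout, which is the gap the proposed table is meant to close.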

Sample Implementation

For our ML repository (mentioned above), I have implemented a function that partially fulfils these requirements. I would propose an output like this.
The contents of a feature are described with universal symbols and mathematical notation. The format should be universal and not depend on a specific language. Words like "example: " or "Values from -10 to +110.4" are not used, so that the table can be interpreted by people speaking all kinds of languages.

| Feature         | Data Type |
|-----------------|-----------|
| customerID      |  str     { "5789-LDFXO", ... }   |
| gender          |  str     {"Female", "Male"}   |
| SeniorCitizen   |  int64   |
| Partner         |  str     {"Yes", "No"}   |
| Dependents      |  str     {"No", "Yes"}   |
| tenure          |  int64   |
| PhoneService    |  str     {"No", "Yes"}   |
| MultipleLines   |  str     {"No phone service", "No", "Yes"}   |
| InternetService |  str     {"DSL", "Fiber optic", "No"}   |
| OnlineSecurity  |  str     {"No", "Yes", "No internet service"}   |
| OnlineBackup    |  str     {"Yes", "No", "No internet service"}   |
| DeviceProtection|  str     {"No", "Yes", "No internet service"}   |
| TechSupport     |  str     {"No", "Yes", "No internet service"}   |
| StreamingTV     |  str     {"No", "Yes", "No internet service"}   |
| StreamingMovies |  str     {"No", "Yes", "No internet service"}   |
| Contract        |  str     {"Month-to-month", "One year", "Two year"}   |
| PaperlessBilling|  str     {"Yes", "No"}   |
| PaymentMethod   |  str     {"Electronic check", "Mailed check", "Bank transfer (automatic)", "Credit card (automatic)"}   |
| MonthlyCharges  |  float64 [ 18.25; 118.75 ]   |
| TotalCharges    |  str     { "659.35", ... }   |
| Churn           |  str     {"No", "Yes"}   |

Implementation:

import numpy as np

def feature_description(data):
    # Pad the first column to the width of the longest column name.
    longest_column_name = max(len(str(col)) for col in data.columns)
    print(f"| {'Feature'.ljust(longest_column_name)}| Data Type |")
    print(f"|{'-' * (longest_column_name + 1)}|-----------|")
    for col in data.columns:
        col_dropna = data[col].dropna()
        if col_dropna.empty:
            # All values are missing, so there is nothing to sample; show the dtype only.
            description = data[col].dtype
        else:
            # A randomly sampled non-null value decides how the column is described.
            example = col_dropna.sample(1).values[0]
            unique_values = col_dropna.unique()
            if isinstance(example, str):
                description = 'str'.ljust(8)
                if len(unique_values) < 10:
                    # Few distinct strings: list them all as a set.
                    description += '{' + ', '.join(f'"{name}"' for name in unique_values) + '}'
                else:
                    # Many distinct strings: show the sampled value only.
                    description += '{ "' + example + '", ... }'
            elif isinstance(example, np.int32) and len(unique_values) < 10:
                description = 'int32 {' + ', '.join(str(name) for name in sorted(unique_values)) + '}'
            elif isinstance(example, np.float64):
                # Continuous values: show the closed interval [min; max].
                description = f"{'float64'.ljust(8)}[ {col_dropna.min()}; {col_dropna.max()} ]"
            else:
                # Fall back to the NumPy dtype, or to the Python type for plain objects.
                description = getattr(example, 'dtype', type(example))
        print("| " + str(col).ljust(longest_column_name) + f'|  {description}   |')

# df is the DataFrame to describe.
feature_description(df)

Proposed Parameters

categorical_limit=10: Maximum number of categories to be displayed in the categorical notation.
max_displayed_chars_in_string=30: Maximum number of characters displayed in the example and the categorical notation, before being shortened with …
show_NAN_count=False: Show the Non-Null Count, as in df.info()
markdown_format=True: Display the table in markdown format
extended=False: Show more information, like value_counts for categorical values and standard deviation for continuous values
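A minimal sketch of how some of these parameters might behave is shown below. `describe_features` and `shorten` are hypothetical helper names, not pandas API. Only categorical_limit, max_displayed_chars_in_string and show_NAN_count are covered; markdown_format and extended are left out. Unlike the sample implementation above, this sketch assumes that the first category, rather than a random sample, serves as the example when there are too many categories.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def shorten(text, limit):
    """Cut text to at most `limit` characters and mark the cut with an ellipsis."""
    return text if len(text) <= limit else text[:limit] + "…"


def describe_features(data, categorical_limit=10,
                      max_displayed_chars_in_string=30, show_NAN_count=False):
    """Return a markdown table describing each column (hypothetical helper)."""
    header = ["Feature", "Data Type", "Values"]
    if show_NAN_count:
        header.append("Non-Null Count")
    rows = []
    for col in data.columns:
        values = data[col].dropna()
        unique_values = values.unique()
        if values.empty:
            content = ""
        elif is_numeric_dtype(values):
            # Continuous notation: closed interval from min to max.
            content = f"[ {values.min()}; {values.max()} ]"
        else:
            shown = [f'"{shorten(str(v), max_displayed_chars_in_string)}"'
                     for v in unique_values[:categorical_limit]]
            if len(unique_values) > categorical_limit:
                # Too many categories: show the first one as an example only.
                shown = [shown[0], "..."]
            content = "{" + ", ".join(shown) + "}"
        row = [str(col), str(data[col].dtype), content]
        if show_NAN_count:
            row.append(str(len(values)))
        rows.append(row)
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join("---" for _ in header) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


# Toy data invented for illustration.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", None],
    "MonthlyCharges": [18.25, 118.75, 50.0, 70.0],
})
table = describe_features(df, show_NAN_count=True)
print(table)
```

Returning the table as a string instead of printing it keeps the helper easy to test and lets the caller decide where the markdown ends up.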

Issue Analytics

State:
Created a year ago
Comments:5 (4 by maintainers)

Top GitHub Comments

mroeschke commented, Jul 9, 2022

Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue?

Yes, very clear description of the request!

Dustin-dusTir commented, Jul 9, 2022

Thank you for the hint about pandas-profiling. What I proposed is a bit different, as I want a more compact description. But I totally understand your point. It was a pleasure open sourcing with you.

Since this is my first feature request on such a big project, I gotta ask: was this a well-written issue? Thanks.
