ENH: Universal visualization for Data-Science Dataset


Universal visualisation for Data-Science Dataset

For the past few years, I have been learning data science and have worked through many datasets. I recently built a repository with 25 examples of data exploration, understanding, and machine learning. Along the way, I noticed that there is no universal way of showing the contents of a dataset. Currently, you can use:

  • df.head()
  • df.info()
  • df.describe()

df.head() shows the first few rows. This gives some good insight into the dataset, but it is only a slice of it, and it takes focus to interpret correctly. df.info() shows each column's name, Non-Null Count and Dtype. This is very useful metadata, but it gives no insight into the contents. Labelling strings as Dtype “object” is correct, but not very helpful. df.describe() gives very helpful information about continuous values, but by default it skips strings and categorical values.
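To make the comparison concrete, here is a minimal sketch of what the three calls report. The toy DataFrame is hypothetical (it is not the author's data) and only loosely follows the churn table shown later in this issue.

```python
import io

import pandas as pd

# Hypothetical toy data, not taken from the issue.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Male", None],
    "tenure": [1, 34, 2, 45],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

head = df.head(2)                # first rows only: a slice, not a summary

buf = io.StringIO()
df.info(buf=buf)                 # column names, non-null counts and dtypes
info_text = buf.getvalue()

numeric_summary = df.describe()  # by default, only the numeric columns

print(head)
print(info_text)
print(numeric_summary)
```

None of the three outputs lists the distinct values of the string column, which is the gap the request is about.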

When I search for new datasets to learn data science with, I want a single visualisation that tells me what the data is about and what I might be able to do with it.

Additional Benefit

When exploring a new dataset, the first steps are always the same: I want to know which features I am working with and what data types they have. For categorical features, I want a list of the categories, as returned by df["feature"].unique(). For continuous values, I want to know the range (min, max) and maybe the mean.
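Those first steps might look like the following with existing pandas calls. This is a sketch only; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30],
})

dtypes = df.dtypes                   # which features, and which data types

categories = df["Contract"].unique() # distinct categories, in order of appearance

charges = df["MonthlyCharges"]
value_range = (charges.min(), charges.max())  # range of a continuous feature
mean = charges.mean()

print(dtypes)
print(list(categories))
print(value_range, mean)
```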

The Solution: A universal visualisation.

Wouldn't it be amazing if pandas could print a table, with just one function call, that presents all this information in a compact, easy-to-understand format? It would automatically detect categorical and continuous values and show the most important information, so that you can quickly understand what the data is about.

This table could become the default visualisation for efficiently describing a dataset. It can be printed directly in Markdown table format, so that it can be copied straight into documentation.

API breaking implications

This would be a new function: pandas.DataFrame.feature_description()

Describe alternatives you’ve considered

The current alternative is that everyone uses a custom format to visualize their dataset.
You can use the functions listed above to get a feel for the dataset. For categorical values, you have to call df["feature"].unique() or df["feature"].value_counts() for every single feature.
Now your information is scattered all over your Python notebook, and you are constantly scrolling between cells.
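As an illustration of that per-feature work, the loop below calls value_counts() once for every non-numeric column. The data is hypothetical, and select_dtypes(exclude="number") is just one way to pick out the candidate categorical columns.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame({
    "Partner": ["Yes", "No", "No", "Yes", "No"],
    "InternetService": ["DSL", "Fiber optic", "DSL", "No", "DSL"],
    "tenure": [1, 34, 2, 45, 2],
})

counts = {}
for col in df.select_dtypes(exclude="number").columns:
    # One separate result (and, in a notebook, often one cell) per feature.
    counts[col] = df[col].value_counts()
    print(counts[col], end="\n\n")
```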

Sample Implementation

For our ML repository (mentioned above), I have implemented a function that partially fulfils these requirements. I would propose an output like this.
The contents of a feature are described with universal symbols and mathematical notation. The format should be universal and not depend on a specific language. Words like "example: " or “Values from -10 to +110.4” are not used, so that the table can be interpreted by people speaking any language.

| Feature         | Data Type |
|-----------------|-----------|
| customerID      |  str     { "5789-LDFXO", ... }   |
| gender          |  str     {"Female", "Male"}   |
| SeniorCitizen   |  int64   |
| Partner         |  str     {"Yes", "No"}   |
| Dependents      |  str     {"No", "Yes"}   |
| tenure          |  int64   |
| PhoneService    |  str     {"No", "Yes"}   |
| MultipleLines   |  str     {"No phone service", "No", "Yes"}   |
| InternetService |  str     {"DSL", "Fiber optic", "No"}   |
| OnlineSecurity  |  str     {"No", "Yes", "No internet service"}   |
| OnlineBackup    |  str     {"Yes", "No", "No internet service"}   |
| DeviceProtection|  str     {"No", "Yes", "No internet service"}   |
| TechSupport     |  str     {"No", "Yes", "No internet service"}   |
| StreamingTV     |  str     {"No", "Yes", "No internet service"}   |
| StreamingMovies |  str     {"No", "Yes", "No internet service"}   |
| Contract        |  str     {"Month-to-month", "One year", "Two year"}   |
| PaperlessBilling|  str     {"Yes", "No"}   |
| PaymentMethod   |  str     {"Electronic check", "Mailed check", "Bank transfer (automatic)", "Credit card (automatic)"}   |
| MonthlyCharges  |  float64 [ 18.25; 118.75 ]   |
| TotalCharges    |  str     { "659.35", ... }   |
| Churn           |  str     {"No", "Yes"}   |

Implementation:

import numpy as np

def feature_description(data):
    """Print a Markdown table describing the type and contents of each column."""
    # Width of the first table column, taken from the longest column name.
    width = max([len(str(col)) for col in data.columns] + [len('Feature')])
    print(f"| {'Feature'.ljust(width)}| Data Type |")
    print(f"|{'-' * (width + 1)}|-----------|")
    for col in data.columns:
        col_dropna = data[col].dropna()
        # A random non-null value serves as the example; None if the column is empty.
        example = col_dropna.sample(1).values[0] if len(col_dropna) else None
        uniques = col_dropna.unique()
        if isinstance(example, str):
            description = 'str'.ljust(8)
            if len(uniques) < 10:
                # Few distinct strings: list them all in set notation.
                description += '{' + ', '.join(f'"{name}"' for name in uniques) + '}'
            else:
                # Too many to list: show a single example.
                description += '{ "' + example + '", ... }'
        elif isinstance(example, np.int32) and len(uniques) < 10:
            description = 'int32 {' + ', '.join(str(name) for name in sorted(uniques)) + '}'
        elif isinstance(example, np.float64):
            # Continuous values: show the closed range [min; max].
            description = f"{'float64'.ljust(8)}[ {col_dropna.min()}; {col_dropna.max()} ]"
        else:
            # Anything else: report the value's dtype, or the column dtype if it has none.
            description = getattr(example, 'dtype', data[col].dtype)
        print(f"| {str(col).ljust(width)}|  {description}   |")

feature_description(df)  # df is the DataFrame to describe

Proposed Parameters


  • categorical_limit=10: maximum number of categories to be displayed in the categorical notation.
  • max_displayed_chars_in_string=30: maximum number of characters displayed in the example and the categorical notation before being shortened with …
  • show_NAN_count=False: show the Non-Null Count, as in df.info()
  • markdown_format=True: display the table in Markdown format
  • extended=False: show more information, such as value_counts for categorical values and the standard deviation for continuous values

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

1 reaction
mroeschke commented, Jul 9, 2022

Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue?

Yes, very clear description of the request!

0 reactions
Dustin-dusTir commented, Jul 9, 2022

Thank you for the hint to pandas-profiling. What I proposed is a bit different, as I want a more compact description. But I totally understand your point. It was a pleasure open sourcing with you.

Since this is my first Feature Request on such a big project, I gotta ask. Was this a well-written Issue? Thanks.
