
Throwing a warning for Pandas SparseArrays

See original GitHub issue

Description

Echoing OP’s sentiments from this reddit thread because it’s something I’ve had to learn the hard way as well.

Right now, sklearn silently densifies pandas SparseArrays without telling the user. IMO a warning should be raised at the very least.

Steps/Code to Reproduce

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Build a 10000x4 frame that is almost entirely zeros, then store each
# column as a SparseArray (spelled pd.arrays.SparseArray in later pandas)
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = 0
for col in df.columns:
    df[col] = pd.SparseArray(df[col], fill_value=0)

# fit() densifies the sparse columns internally, with no warning
l = LinearRegression()
l.fit(df[df.columns[0:2]], df[df.columns[3]])

Using guppy to analyze memory usage, it’s clear that sklearn is inflating this matrix behind the scenes.
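Until such a warning exists, the densification can be sidestepped by handing sklearn a scipy sparse matrix directly. A minimal sketch of that workaround (my own illustration, not from the issue; it assumes pandas >= 0.25, where the `DataFrame.sparse.to_coo()` accessor is available):

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LinearRegression

# Same mostly-zero frame as above, stored with a sparse dtype
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = 0
df = df.astype(pd.SparseDtype(float, fill_value=0))

# Convert the sparse columns to a scipy COO matrix, then to CSR,
# the format sklearn estimators handle most efficiently
X = sparse.csr_matrix(df[df.columns[0:2]].sparse.to_coo())
y = df[df.columns[3]].to_numpy()  # the target can stay dense

reg = LinearRegression()
reg.fit(X, y)  # sklearn keeps X sparse instead of densifying it
```

With a sparse `X`, `LinearRegression` solves via a sparse least-squares routine, so memory stays proportional to the number of nonzeros rather than the full 10000x2 shape.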

Expected Results

sklearn warns the user when inflating a sparse array

Actual Results

sklearn does not warn the user when inflating a sparse array

Versions

System:
    python: 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
executable: c:\python37\python.exe
   machine: Windows-10-10.0.18362-SP0

Python deps:
       pip: 19.2.1
setuptools: 40.8.0
   sklearn: 0.21.3
     numpy: 1.17.0
     scipy: 1.3.1
    Cython: None
    pandas: 0.25.0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
rushabh-v commented, Jan 1, 2020

Pandas has deprecated SparseDataFrame and suggests creating a normal DataFrame with SparseArray columns instead. I was thinking of detecting sparse input by using hasattr on attributes unique to SparseDataFrame, but since the inputs are now DataFrames with sparse columns, they expose the same attributes as a regular DataFrame. Is there another way to differentiate the two?

0 reactions
jorisvandenbossche commented, Jan 6, 2020

See https://github.com/pandas-dev/pandas/issues/26706 for a discussion on how to check for sparse columns in a dataframe, i.e. something like df.dtypes.apply(pd.api.types.is_sparse).any()/all() (any/all depending on whether you want to check for at least one column or for all columns being sparse)
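The check described above can be sketched as follows (a small illustration of my own; `pd.api.types.is_sparse` works as written in the comment, but `isinstance(dtype, pd.SparseDtype)` is an equivalent spelling that avoids its deprecation in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1], fill_value=0),
    "b": [1.0, 2.0, 3.0],  # ordinary dense column
})

# One boolean per column: is its dtype sparse?
is_sparse_col = df.dtypes.apply(lambda dt: isinstance(dt, pd.SparseDtype))

# .any() detects at least one sparse column; .all() requires every
# column to be sparse
has_any_sparse = bool(is_sparse_col.any())
all_sparse = bool(is_sparse_col.all())
```

For the frame above, `has_any_sparse` is True while `all_sparse` is False, which is exactly the distinction a would-be sklearn warning would need.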


