
Throwing a warning for Pandas SparseArrays

See original GitHub issue

Description

Echoing OP’s sentiments from this reddit thread because it’s something I’ve had to learn the hard way as well.

Right now, sklearn silently densifies pandas SparseArrays without telling the user. IMO a warning should be raised at the very least.

Steps/Code to Reproduce

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

# Build a 10000x4 frame that is almost entirely zeros, then store each
# column as a SparseArray (spelled pd.arrays.SparseArray in later pandas)
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = 0
for col in df.columns:
    df[col] = pd.SparseArray(df[col], fill_value=0)

# fit() densifies the sparse columns internally, with no warning
l = LinearRegression()
l.fit(df[df.columns[0:2]], df[df.columns[3]])

Using guppy to analyze memory usage, it’s clear that sklearn is inflating this matrix behind the scenes.
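Until such a warning exists, the densification can be sidestepped by handing sklearn a scipy sparse matrix directly. A minimal sketch of that workaround (my own illustration, not from the issue; it assumes pandas >= 0.25, where the `DataFrame.sparse.to_coo()` accessor is available):

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.linear_model import LinearRegression

# Same mostly-zero frame as above, stored with a sparse dtype
df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = 0
df = df.astype(pd.SparseDtype(float, fill_value=0))

# Convert the sparse columns to a scipy COO matrix, then to CSR,
# the format sklearn estimators handle most efficiently
X = sparse.csr_matrix(df[df.columns[0:2]].sparse.to_coo())
y = df[df.columns[3]].to_numpy()  # the target can stay dense

reg = LinearRegression()
reg.fit(X, y)  # sklearn keeps X sparse instead of densifying it
```

With a sparse `X`, `LinearRegression` solves via a sparse least-squares routine, so memory stays proportional to the number of nonzeros rather than the full 10000x2 shape.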

Expected Results

sklearn warns the user when inflating a sparse array

Actual Results

sklearn does not warn the user when inflating a sparse array

Versions

System:
    python: 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)]
executable: c:\python37\python.exe
   machine: Windows-10-10.0.18362-SP0

Python deps:
       pip: 19.2.1
setuptools: 40.8.0
   sklearn: 0.21.3
     numpy: 1.17.0
     scipy: 1.3.1
    Cython: None
    pandas: 0.25.0

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

1 reaction
rushabh-v commented, Jan 1, 2020

Pandas has deprecated SparseDataFrame and suggests creating a normal DataFrame with SparseArray columns instead. I was thinking of detecting sparse input by using hasattr on attributes unique to SparseDataFrame, but since the inputs are now DataFrames with sparse columns, they expose the same attributes as a regular DataFrame. Is there another way to differentiate the two?

0 reactions
jorisvandenbossche commented, Jan 6, 2020

See https://github.com/pandas-dev/pandas/issues/26706 for a discussion on how to check for sparse columns in a dataframe, i.e. something like df.dtypes.apply(pd.api.types.is_sparse).any()/all() (any/all depending on whether you want to check for at least one column or for all columns being sparse)
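The check described above can be sketched as follows (a small illustration of my own; `pd.api.types.is_sparse` works as written in the comment, but `isinstance(dtype, pd.SparseDtype)` is an equivalent spelling that avoids its deprecation in newer pandas):

```python
import pandas as pd

df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0, 0, 1], fill_value=0),
    "b": [1.0, 2.0, 3.0],  # ordinary dense column
})

# One boolean per column: is its dtype sparse?
is_sparse_col = df.dtypes.apply(lambda dt: isinstance(dt, pd.SparseDtype))

# .any() detects at least one sparse column; .all() requires every
# column to be sparse
has_any_sparse = bool(is_sparse_col.any())
all_sparse = bool(is_sparse_col.all())
```

For the frame above, `has_any_sparse` is True while `all_sparse` is False, which is exactly the distinction a would-be sklearn warning would need.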


