
Core dump when writing large matrix


What happened: Writing a large sparse matrix to disk crashes fastparquet with a core dump.

Minimal Complete Verifiable Example:

from random import choices
from string import ascii_lowercase
import numpy as np
import pandas as pd
import scipy.sparse as sps

# create a random sparse matrix: <1% explicit zeroes/ones, >99% empty
mat = sps.random(426406, 501, 314e-5)
mat.data = mat.data > 0.5

# convert to a DataFrame, setting the fill value to NaN
cols = list(map(str, range(mat.shape[1])))
df_synth = pd.DataFrame.sparse.from_spmatrix(mat, columns=cols) \
    .astype(pd.SparseDtype(float, np.nan)) \
    .sparse.to_dense()

# unclear whether this dense string column is necessary, but it was present in our data
df_synth['names'] = pd.Series(
    [''.join(choices(ascii_lowercase, k=58)) for _ in range(df_synth.shape[0])],
    dtype=pd.StringDtype(),
)

# core dump here
df_synth.to_parquet('/tmp/test-synth.parquet', engine='fastparquet')

Environment:

  • Dask version: No idea, but fastparquet version is 0.8.0
  • Python version: 3.8.12
  • Operating System: Debian GNU/Linux 10 (buster)
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Mar 21, 2022

Aha, OK, so good I didn’t release it, then. The buffer sizing heuristic clearly needs some work. I suspect it would probably work for you if you write with stats=False and/or explicitly try to reduce the column name lengths. If you manage to write, do ensure that you can also read the data, since a very mild case of the effect you are seeing might result in metadata corruption.
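
A minimal sketch of that workaround, assuming the pandas fastparquet engine forwards extra keyword arguments such as stats through to fastparquet.write, and that your fastparquet version supports the stats keyword:

# skip column statistics (min/max), which is where the buffer sizing goes wrong
df_synth.to_parquet('/tmp/test-synth.parquet', engine='fastparquet', stats=False)

# per the advice above, also verify that the file reads back cleanly
roundtrip = pd.read_parquet('/tmp/test-synth.parquet', engine='fastparquet')
assert roundtrip.shape == df_synth.shape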

0 reactions
martindurant commented, Nov 24, 2022

It may be that fastparquet should generally avoid including the min/max of bytes types above a certain length, or not make stats of bytes types at all to mitigate this problem.
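
Until such a change lands, a rough sketch of the same mitigation from the user side, assuming your fastparquet version's write accepts a list of column names for its stats argument: compute statistics only for the numeric columns and skip the long string column.

import fastparquet

# compute statistics only for the numeric columns; skip the 58-character strings
# (assumes stats accepts a list of column names)
numeric_cols = [c for c in df_synth.columns if c != 'names']
fastparquet.write('/tmp/test-synth.parquet', df_synth, stats=numeric_cols)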


Top Results From Across the Web

  • Segmentation fault (core dump) when creating a large 2d ...: Looks like an error initializing arr1. You're using m2c as the column count; you probably meant m1c.
  • Segmentation fault with large matrices in qp #126 - GitHub: I'm getting Segmentation fault (core dumped) errors when passing somewhat large matrices to solvers.qp(). The P matrix has shape (34638, 34638), ...
  • File interface for a "big.matrix" - R-Project.org: a big.matrix object is returned by read.big.matrix, while write.big.matrix creates an output file (a path can be part of filename).
  • Core dump with magma_dsyevdx_m for large matrix: I'm getting a core dump for matrices larger than about 92,000. When I run my code for a matrix of size 92k, everything...
  • (Core dumped) ¿A too large array? - C Board: I'm writing a code which implements the cell-to-cell mapping method. Finally I know why my compiler is telling me there is a segmentation...
