
Core dump when writing large matrix


What happened: Writing a large sparse matrix to disk crashes fastparquet with a core dump.

Minimal Complete Verifiable Example:

from random import choices
from string import ascii_lowercase
import numpy as np
import pandas as pd
import scipy.sparse as sps

# create a random sparse matrix: <1% explicit zeroes/ones, >99% empty
mat = sps.random(426406, 501, 314e-5)
mat.data = mat.data > 0.5

# convert to a DataFrame, setting the fill value to NaN
cols = list(map(str, range(mat.shape[1])))
df_synth = pd.DataFrame.sparse.from_spmatrix(mat, columns=cols) \
    .astype(pd.SparseDtype(float, np.nan)) \
    .sparse.to_dense()

# unclear whether this dense string column is necessary, but it was present in our data
df_synth['names'] = pd.Series(
    [''.join(choices(ascii_lowercase, k=58)) for _ in range(df_synth.shape[0])],
    dtype=pd.StringDtype(),
)

# core dump here
df_synth.to_parquet('/tmp/test-synth.parquet', engine='fastparquet')

Environment:

  • Dask version: No idea, but fastparquet version is 0.8.0
  • Python version: 3.8.12
  • Operating System: Debian GNU/Linux 10 (buster)
  • Install method (conda, pip, source): pip

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
martindurant commented, Mar 21, 2022

Aha, OK, so good I didn’t release it, then. The buffer sizing heuristic clearly needs some work. I suspect it would probably work for you if you write with stats=False and/or explicitly try to reduce the column name lengths. If you manage to write, do ensure that you can also read the data, since a very mild case of the effect you are seeing might result in metadata corruption.
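
A minimal sketch of that workaround, assuming the pandas fastparquet engine forwards extra keyword arguments such as stats through to fastparquet.write, and that your fastparquet version supports the stats keyword:

# skip column statistics (min/max), which is where the buffer sizing goes wrong
df_synth.to_parquet('/tmp/test-synth.parquet', engine='fastparquet', stats=False)

# per the advice above, also verify that the file reads back cleanly
roundtrip = pd.read_parquet('/tmp/test-synth.parquet', engine='fastparquet')
assert roundtrip.shape == df_synth.shape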

0 reactions
martindurant commented, Nov 24, 2022

It may be that fastparquet should generally avoid including the min/max of bytes types above a certain length, or not make stats of bytes types at all to mitigate this problem.
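
Until such a change lands, a rough sketch of the same mitigation from the user side, assuming your fastparquet version's write accepts a list of column names for its stats argument: compute statistics only for the numeric columns and skip the long string column.

import fastparquet

# compute statistics only for the numeric columns; skip the 58-character strings
# (assumes stats accepts a list of column names)
numeric_cols = [c for c in df_synth.columns if c != 'names']
fastparquet.write('/tmp/test-synth.parquet', df_synth, stats=numeric_cols)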


Top Results From Across the Web

  • Segmentation fault (core dump) when creating a large 2d ...: Looks like an error initializing arr1. You're using m2c as the column count; you probably meant m1c.
  • Segmentation fault with large matrices in qp #126 - GitHub: I'm getting Segmentation fault (core dumped) errors when passing somewhat large matrices to solvers.qp(). The P matrix has shape (34638, 34638), ...
  • File interface for a "big.matrix" - R-Project.org: a big.matrix object is returned by read.big.matrix, while write.big.matrix creates an output file (a path can be part of filename).
  • Core dump with magma_dsyevdx_m for large matrix: I'm getting a core dump for matrices larger than about 92,000. When I run my code for a matrix of size 92k, everything...
  • (Core dumped) ¿A too large array? - C Board: I'm writing a code which implements the cell-to-cell mapping method. Finally I know why my compiler is telling me there is a segmentation...
