Add a How-To guide on storing and loading array data

Right now, the only information on storing NumPy array data on disk is the bare-bones function listing at https://numpy.org/devdocs/reference/routines.io.html, which covers only the functions available in NumPy itself.

It would be very useful to have a How-To guide (see NEP 44 for what a How-To guide is) on data storage covering:

  • the options: NumPy built-in functionality (mainly the .npy/.npz format), Zarr, HDF5 & co, Bloscpack, pickling (anything else?)
  • short summary of the storage model (e.g. Zarr chunked compressed n-D arrays, optionally in groups; HDF5 filesystem-like in one file)
  • performance (I/O speed, size on disk)
  • portability
  • dependencies
  • maturity
  • recommendations for what to use when

My impression is that we should use .npy for the really simple cases, and direct people to Zarr for pretty much anything else (related: https://github.com/zarr-developers/community/issues/28).
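For the "really simple cases" mentioned above, a minimal sketch of the built-in .npy/.npz round trip might look like the following. The array contents and file names are made up for illustration; only `np.save`, `np.savez_compressed` and `np.load` come from NumPy itself.

```python
import tempfile
from pathlib import Path

import numpy as np

# Illustrative data; the variable and file names are arbitrary.
temperatures = np.linspace(15.0, 25.0, num=12).reshape(3, 4)
counts = np.arange(5)

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # One array per .npy file; dtype and shape travel in the file header.
    np.save(tmp / "temperatures.npy", temperatures)
    loaded = np.load(tmp / "temperatures.npy")

    # Several named arrays in a single compressed .npz archive.
    np.savez_compressed(tmp / "bundle.npz", temperatures=temperatures, counts=counts)
    with np.load(tmp / "bundle.npz") as bundle:
        bundle_names = sorted(bundle.files)
        loaded_counts = bundle["counts"]

print(loaded.dtype, loaded.shape)
print(bundle_names)
```

Anything beyond this (chunking, partial reads, groups of arrays) is where the other options in the list would come in.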

Data source(s) to use:

  • preferably use real-world data (we plan to produce a datasets package that we can rely on for the NumPy docs).
  • should use a regular dtype (e.g. float64), but can also show a structured dtype and an object dtype (solutions may be different for those).
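As a rough illustration of why the solutions may differ per dtype, the sketch below round-trips a float64 array, a structured array and an object array through the .npy format in memory. The `roundtrip` helper and the sample values are hypothetical; the point is that object arrays rely on pickle, which `np.load` refuses unless `allow_pickle=True` is passed.

```python
import io

import numpy as np

regular = np.array([1.5, 2.5, 3.5], dtype=np.float64)
structured = np.array(
    [("a", 1.0), ("b", 2.0)],
    dtype=[("label", "U1"), ("value", "f8")],
)
objects = np.empty(2, dtype=object)
objects[0] = {"k": 1}
objects[1] = [1, 2, 3]


def roundtrip(array, allow_pickle=False):
    """Hypothetical helper: save to an in-memory .npy buffer and load it back."""
    buffer = io.BytesIO()
    np.save(buffer, array)
    buffer.seek(0)
    return np.load(buffer, allow_pickle=allow_pickle)


regular_back = roundtrip(regular)
structured_back = roundtrip(structured)

# Object arrays are pickled on save, so loading needs an explicit opt-in.
try:
    roundtrip(objects)
    object_load_refused = False
except ValueError:
    object_load_refused = True

objects_back = roundtrip(objects, allow_pickle=True)
print(structured_back.dtype, object_load_refused)
```

Only load pickled data from sources you trust, since unpickling can execute arbitrary code.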

TBD: how much example code should it have? And if we do add example code with Zarr and perhaps also PyTables, should we run the code in CI so we’re sure that it keeps working?

My tentative answer: yes, let’s add example code, and yes, let’s run CI for any code we add. TBD: can we do this sensibly with Sphinx, or is this a good time to start with notebooks?

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 2
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

1 reaction
melissawm commented, Jul 19, 2020

Hello, all!

I have a draft for this How-To at the numpy-tutorials repo. I’d appreciate it if you could comment and give me any feedback there. The reason I’m not adding this as part of the main NumPy documentation is the dependencies and the time required to build the document; I think it is much more suitable to build it separately from the main documentation.

If you have any considerations, please comment here or over at the numpy-tutorials PR so we can move this forward. Thanks!

1 reaction
Irio commented, Mar 20, 2020

In this how-to guide, I’d like to read about:

  • common problems that I might run into and how to tackle them, e.g., pickling an object and later changing the source code of the class that generated it.
  • when to use NumPy and when to reach for other libraries such as pandas, e.g., when the numbers are in an .xls file
  • when to consider using a database for storage (not necessarily tell how to do it, but give a direction of what to google for finding details or link to a secondary how-to/tutorial)
  • the possibility of losing information between saving & loading from a file
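The last point can be made concrete with a small sketch comparing a text round trip (`np.savetxt`/`np.loadtxt`) with a binary .npy round trip. The data and the `fmt="%.3f"` choice are assumptions picked to make the loss visible; other formats and settings will behave differently.

```python
import tempfile
from pathlib import Path

import numpy as np

ids = np.array([1, 2, 3], dtype=np.int64)
measurements = np.array([0.1234567891, 2.0 / 3.0])

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)

    # Text files do not record the dtype: integers come back as float64.
    np.savetxt(tmp / "ids.txt", ids)
    ids_from_text = np.loadtxt(tmp / "ids.txt")

    # A short format string silently truncates the values.
    np.savetxt(tmp / "measurements.txt", measurements, fmt="%.3f")
    measurements_from_text = np.loadtxt(tmp / "measurements.txt")

    # The .npy format stores dtype and raw bytes, so the round trip is exact.
    np.save(tmp / "ids.npy", ids)
    np.save(tmp / "measurements.npy", measurements)
    ids_from_npy = np.load(tmp / "ids.npy")
    measurements_from_npy = np.load(tmp / "measurements.npy")

print(ids.dtype, "->", ids_from_text.dtype, "(text) vs", ids_from_npy.dtype, "(.npy)")
print(np.abs(measurements - measurements_from_text).max())
```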

On formats, possibly: Apache Parquet, Protocol Buffers.

On storage model: it can be really helpful to use images here. One could create an image showing a simplified model of how NumPy maintains the data in memory and, from that, how each format stores the information. It would help with building a better mental model of what happens behind the scenes and with understanding the trade-offs.
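Until such images exist, a few lines of code can hint at the in-memory model: a flat buffer of bytes plus metadata (dtype, shape, strides) describing how to walk it. The sketch below is only an illustration with an arbitrary small array; it does not describe how any particular file format lays out its data.

```python
import numpy as np

a = np.arange(6, dtype=np.int32).reshape(2, 3)

# Metadata: how many bytes to step along each axis of the flat buffer.
row_major_strides = a.strides

# A transpose is a view on the same buffer with the strides swapped.
a_t = a.T
transposed_strides = a_t.strides
same_buffer = np.shares_memory(a, a_t)

# The raw bytes alone are not enough: dtype and shape are needed to rebuild.
raw = a.tobytes()
rebuilt = np.frombuffer(raw, dtype=np.int32).reshape(a.shape)

print(row_major_strides, transposed_strides, same_buffer, len(raw))
```

Any storage format has to record this metadata somewhere alongside the bytes, which is one of the trade-offs a diagram could make visible.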

Finally: depending on the title/keywords used, I bet people wanting to save this data in S3 or Cloud Storage may get to this how-to.
