Consider using numpy loadtxt under the hood for fast ASCII reading
Description

Interesting progress from numpy, which now has a C-based CSV parser built in as `loadtxt`. See the copied numpy announcement below.
I have not looked at it, but I wonder if it is worth investigating as a replacement for our custom fast ASCII C reader. Maybe the answer is a simple “no”. The obvious benefit is greatly reducing the maintenance burden of this difficult code. I suspect the numpy version will have better speed and memory performance as well.
Downsides:
- Some features built into the astropy fast reader might not work out of the box. E.g. the FastCsv reader supports missing elements at the end of a row as masked values.
- A non-trivial amount of work to fix something that is not really broken. But it might be a clean and well-defined GSoC project.
- Not clear if the very careful handling of Fortran formats and other details from @dhomeier made it to the numpy parser.
Numpy announce

https://github.com/numpy/numpy/pull/20580 is now merged. This moves `np.loadtxt` to C, mainly making it much faster. There are also some other improvements and changes though:

- It now supports `quotechar='"'` to support Excel-dialect CSV.
- Parsing of some numbers is stricter (e.g. removed support for `_` or hex float parsing by default).
- `max_rows` now actually counts rows and not lines. A warning is given if this makes a difference (blank lines).
- Some exceptions will change; parsing failures now (almost) always give an informative `ValueError`.
- `converters=callable` is now valid to provide a single converter for all columns.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
For clarification, the default is `fast_reader=True`, which uses the compiled reader where possible, but with the standard `strtod` converters. Only explicitly setting it to `{'use_fast_converter': True}` (I am not a fan of that “dictionary_of_extra_args” syntax either) switches to the `xstrtod`-optimised parser. I found the fast pandas and astropy versions to perform better in particular on the full range of `float64` – factors of 2-3 over `npreadtext` on random values from -1e300 to 1e300 even with 8-12 digits precision, which seems a not so uncommon case. Is it worth the loss of precision? I share your thoughts in https://github.com/numpy/numpy/pull/20580#issuecomment-993678618 that one should better think about a really performant format if those kinds of optimisations become relevant, but the demand is obviously out there. But it’s probably not worth the effort to add that as an extra feature to the numpy reader.

Ah, so the slowness of the one shipped with Python hits for exponents. That does indeed seem like something users may be interested in.