Consider using numpy loadtxt under the hood for fast ASCII reading

Description

Interesting progress from NumPy, which now has a C-based CSV parser built in as loadtxt. See the copied NumPy announcement below.

I have not looked at it, but I wonder whether it is worth investigating as a replacement for our custom fast C ASCII reader. Maybe the answer is a simple “no”. The obvious benefit is greatly reducing maintenance of this difficult code. I suspect the NumPy version will have better speed and memory performance as well.

Downsides:

  • There are some things built into the astropy fast reader that might not work out of the box. E.g. the FastCsv reader supports missing elements at the end as masked values.
  • A non-trivial amount of work to fix something that is not really broken. But it might be a clean and well-defined GSoC project.
  • Not clear if the very careful handling of Fortran formats and other details from @dhomeier made it to the numpy parser.
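
To make the first downside concrete, here is a minimal sketch of how np.loadtxt treats a row with missing trailing elements. It only exercises NumPy; the astropy behaviour is as described in the list above and is not reproduced here. The helper read_ragged_as_masked is hypothetical and shows just one possible workaround, not anything either library provides.

```python
import io

import numpy as np

# The second row is missing its last element.
ragged = "1,2,3\n4,5\n"

# np.loadtxt requires a constant number of columns and raises ValueError.
try:
    np.loadtxt(io.StringIO(ragged), delimiter=",")
    ragged_error = None
except ValueError as exc:
    ragged_error = exc


def read_ragged_as_masked(text, ncols, delimiter=","):
    """Hypothetical helper: pad short rows with 'nan', then mask the padding.

    Caveat: genuine NaN values in the input end up masked as well.
    """
    padded = []
    for line in text.splitlines():
        fields = line.split(delimiter)
        fields += ["nan"] * (ncols - len(fields))
        padded.append(delimiter.join(fields))
    data = np.loadtxt(io.StringIO("\n".join(padded)), delimiter=delimiter)
    return np.ma.masked_invalid(data)


masked = read_ragged_as_masked(ragged, ncols=3)
print(type(ragged_error).__name__)
print(masked)
```

Pre-padding the text costs an extra pass over the input, so this is only meant to show the gap in behaviour, not to suggest a performant replacement.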

Cc: @dhomeier @hamogu

NumPy announcement

https://github.com/numpy/numpy/pull/20580

is now merged. This moves np.loadtxt to C, mainly making it much faster. There are also some other improvements and changes, though:

  • It now supports quotechar='"' to support Excel dialect CSV.
  • Parsing some numbers is stricter (e.g. removed support for _ or hex float parsing by default).
  • max_rows now actually counts rows and not lines. A warning is given if this makes a difference (blank lines).
  • Some exceptions will change; parsing failures now (almost) always give an informative ValueError.
  • converters=callable is now valid to provide a single converter for all columns.
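
The changes listed above can be tried out with a short sketch like the following. It assumes a NumPy release that includes the C-based loadtxt (1.23 or later); the sample data and the decimal_comma converter are made up for illustration. encoding=None is passed so that the converter receives str rather than bytes on releases where the default is still encoding="bytes".

```python
import io
import warnings

import numpy as np

# quotechar: quoted fields may contain the delimiter (Excel-style CSV).
quoted = io.StringIO('"1,5",2\n"3,5",4\n"5,5",6\n')


def decimal_comma(field):
    # Illustrative converter: treat ',' inside a field as the decimal mark.
    return float(field.replace(",", "."))


# converters=callable applies one converter to every column.
arr = np.loadtxt(quoted, delimiter=",", quotechar='"',
                 converters=decimal_comma, encoding=None)

# max_rows counts data rows, so the blank line is skipped; NumPy may warn
# that this differs from the older line-counting behaviour.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    first_two = np.loadtxt(io.StringIO("1 2\n\n3 4\n5 6\n"), max_rows=2)

# Stricter number parsing: underscores in numbers are rejected by default.
try:
    np.loadtxt(io.StringIO("1_000\n"))
    strict_error = None
except ValueError as exc:
    strict_error = exc

print(arr)
print(first_two)
print(strict_error)
```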

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

dhomeier commented, Feb 9, 2022

That “fast” float parsing irritates me a bit personally, I have to admit. It seems to mainly make things much faster if you have 12 or more decimal digits. Is that gap between 12-15 digits important enough? And if I have 15-17 digits (full precision) do I really want to lose that by default?
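
For illustration, a minimal sketch of why 15 to 17 digits counts as full precision for float64: 17 significant digits always round-trip a double, while 15 digits may not. It uses only the Python standard library, and the value 0.1 + 0.2 is just a convenient example.

```python
# A float64 whose shortest exact representation needs 17 significant digits.
x = 0.1 + 0.2

s15 = format(x, ".15g")  # rounded to 15 significant digits
s17 = format(x, ".17g")  # rounded to 17 significant digits

roundtrip_15 = float(s15) == x
roundtrip_17 = float(s17) == x

print(s15, roundtrip_15)
print(s17, roundtrip_17)
```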

For clarification, the default is fast_reader=True to use the compiled reader where possible, but with the standard strtod converters. Only explicitly setting it to {'use_fast_converter': True} (I am not a fan of that “dictionary_of_extra_args” syntax either) switches to the xstrtod-optimised parser. I found the fast pandas and astropy versions to perform better in particular on the full range of float64 – factors of 2-3 over npreadtext on random values from -1e300 to 1e300 even with 8-12 digits precision, which seems a not so uncommon case. Is it worth the loss of precision? I share your thoughts in https://github.com/numpy/numpy/pull/20580#issuecomment-993678618 that one should better think about a really performant format if those kinds of optimisations become relevant, but the demand is obviously out there. But it’s probably not worth the effort to add that as an extra feature to the numpy reader.

seberg commented, Feb 9, 2022

Ah, so the slowness of the parser shipped with Python shows up for values with exponents. That does indeed seem like something users may be interested in.
