Consider using numpy loadtxt under the hood for fast ASCII reading
Description

Interesting progress from numpy, which now has a C-based CSV parser built in as `loadtxt`. See the copied numpy announcement below.
I have not looked at it, but I wonder if it is worth investigating as a replacement for our custom fast ASCII C reader. Maybe the answer is a simple “no”. The obvious benefit is greatly reducing the maintenance burden of this difficult code. I suspect the numpy version will have better speed and memory performance as well.
Downsides:
- Some features built into the astropy fast reader might not work out of the box. E.g. the FastCsv reader supports missing elements at the end of a row as masked values.
- A non-trivial amount of work to fix something that is not really broken. But it might be a clean and well-defined GSoC project.
- Not clear if the very careful handling of Fortran formats and other details from @dhomeier made it to the numpy parser.
Numpy announce

https://github.com/numpy/numpy/pull/20580 is now merged. This moves `np.loadtxt` to C, mainly making it much faster. There are also some other improvements and changes though:

- It now supports `quotechar='"'` to support Excel-dialect CSV.
- Parsing of some numbers is stricter (e.g. removed support for `_` or hex float parsing by default).
- `max_rows` now actually counts rows and not lines. A warning is given if this makes a difference (blank lines).
- Some exceptions will change; parsing failures now (almost) always give an informative `ValueError`.
- `converters=callable` is now valid to provide a single converter for all columns.
Issue Analytics
- State:
- Created 2 years ago
- Comments: 6 (4 by maintainers)
Top GitHub Comments
For clarification, the default is `fast_reader=True`, which uses the compiled reader where possible, but with the standard `strtod` converters. Only explicitly setting it to `{'use_fast_converter': True}` (I am not a fan of that “dictionary_of_extra_args” syntax either) switches to the `xstrtod`-optimised parser. I found the fast pandas and astropy versions to perform better in particular on the full range of `float64` – factors of 2-3 over `npreadtext` on random values from -1e300 to 1e300 even with 8-12 digits precision, which seems a not so uncommon case. Is it worth the loss of precision? I share your thoughts in https://github.com/numpy/numpy/pull/20580#issuecomment-993678618 that one should better think about a really performant format if those kinds of optimisations become relevant, but the demand is obviously out there. But it’s probably not worth the effort to add that as an extra feature to the numpy reader.

Ah, so the slowness of the one shipped with Python hits for exponents. That does indeed seem like something users may be interested in.