Why is looping through HDUs at position 2 in FITS files much slower than looping through HDUs at position 1?
Description
I am looping through the HDUs of the astronomical FITS files I am analyzing. When I read the HDU at position 1 (hdu=1) of each file with Table.read from astropy.table, the time is reasonably short. But if I do the same for the HDU at position 2 (hdu=2), it takes about 10 times longer, even though the tables at position 2 are smaller (~27 kB) than the tables at position 1 (~130 kB). Further, tables at position 1 have shape 2833 rows × 8 columns, while tables at position 2 have shape 1 row × 126 columns. The time matters to me because I eventually need to loop through tens of thousands of files, so the total run time is on the order of hours.
Example file can be this one: https://data.sdss.org/sas/dr16/eboss/spectro/redux/v5_13_0/spectra/lite/3699/spec-3699-55517-0420.fits
Expected behavior
Looping should be the same or maybe faster for hdu=2.
Actual behavior
Looping through hdu=2 is much slower (~10 times).
Steps to Reproduce
To reproduce, you can run:
from astropy.table import Table
for file_name in fits_files_list:
    table1 = Table.read(file_name, hdu=1)
# fits_files_list is a list containing the file paths to the FITS files. To reproduce my code, you can download an example from here:
# https://data.sdss.org/sas/dr16/eboss/spectro/redux/v5_13_0/spectra/lite/3699/spec-3699-55517-0420.fits
# This is the sort of file I am working with now; they all have the same structure.
Then change hdu=1 to hdu=2, keep track of the execution time, and note that the second run is ~10 times slower.
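One way to keep track of the execution time is sketched below. The helper names (time_reads, make_table_reader, compare_hdus) are made up for this illustration and are not part of astropy; only Table.read(file_name, hdu=...) comes from the snippet above.

```python
import time
from typing import Callable, Dict, Iterable, List


def time_reads(read_one: Callable[[str], object], paths: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling read_one on every path."""
    start = time.perf_counter()
    for path in paths:
        read_one(path)
    return time.perf_counter() - start


def make_table_reader(hdu: int) -> Callable[[str], object]:
    """Build a reader that loads one HDU with astropy.table.Table.read."""
    # Imported lazily so that time_reads itself only needs the standard library.
    from astropy.table import Table

    def read_one(path: str) -> object:
        return Table.read(path, hdu=hdu)

    return read_one


def compare_hdus(paths: List[str], hdus=(1, 2)) -> Dict[int, float]:
    """Time the same list of files once per HDU index (hypothetical helper)."""
    return {hdu: time_reads(make_table_reader(hdu), paths) for hdu in hdus}
```

Calling compare_hdus(fits_files_list) should then return the elapsed seconds for hdu=1 and hdu=2 side by side. The first pass may also pay for reading the files from disk, so repeating the measurement gives a fairer comparison.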
System Details
The details of my system/environment are:
Linux-5.11.0-40-generic-x86_64-with-glibc2.10
Python 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
Numpy 1.19.1
astropy 3.2.3
I have no erfa or scipy installed.
How can I speed up the looping through hdu=2? Why is it 10 times slower? Is there any workaround? Thanks.
Issue Analytics
- Created 2 years ago
- Comments:14 (11 by maintainers)
Top GitHub Comments
Another interesting thing I found while examining the performance here is that lookup of header keyword values is actually being largely dominated by, of all things, the astropy config system. There are various config settings that are checked when parsing headers, some of which will get checked for every single keyword.
On some level this is necessary, since astropy configs are supposed to be reloadable at runtime. But I wonder whether, for parsing headers, it really makes sense to allow config changes between reading individual values from the same header.
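The trade-off raised here can be illustrated with a small, self-contained sketch. This is not astropy's code: ReloadableConfig, the strip_whitespace setting, and both parse functions are hypothetical stand-ins. They only contrast consulting a runtime-reloadable setting once per keyword with reading it once per header.

```python
class ReloadableConfig:
    """Hypothetical stand-in for a config system whose values can change at runtime."""

    def __init__(self, **values):
        self._values = dict(values)
        self.lookups = 0  # counts how often a setting is consulted

    def get(self, name):
        self.lookups += 1
        return self._values[name]


def parse_checking_each_keyword(cards, config):
    """Consult the config once per keyword."""
    parsed = {}
    for key, value in cards:
        if config.get("strip_whitespace"):
            value = value.strip()
        parsed[key] = value
    return parsed


def parse_with_snapshot(cards, config):
    """Consult the config once per header, then reuse the value for every keyword."""
    strip = config.get("strip_whitespace")
    parsed = {}
    for key, value in cards:
        parsed[key] = value.strip() if strip else value
    return parsed
```

Both functions return the same result, but the snapshot variant performs one lookup per header instead of one per keyword. The cost is that a config change made partway through parsing a header would only take effect for the next header.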
@saimn What I found in investigating this is that using io.fits is no faster. Converting to Table incurs some overhead, but not as much.
What I found was that in _TableLikeHDU._get_data, a ColDefs for the table gets constructed twice. The first time is when it calls self.columns, where it constructs the ColDefs from the header. It then uses this to determine the appropriate dtype, as well as some other processing of the data format.
Then on this line it views the array as a FITS_rec (a class that is pretty vestigial at this point, but will take some effort to get rid of). This results in constructing a new ColDefs object, but this time from the dtype instead of the header.
This redundancy has significant overhead for a case like this, and should be done away with, though I’m not sure how yet.
Both of the methods used here for constructing ColDefs (from the header, and then from the dtype) have opportunities for improvement as well.
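To check on your own files where the time goes, a profiler from the standard library can be wrapped around a single read. The sketch below uses cProfile and pstats. profile_call is a hypothetical helper name, and the commented usage assumes a local copy of the example file from the issue.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `top` entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


# Assumed usage (requires astropy and a downloaded example file):
# from astropy.table import Table
# _, report = profile_call(Table.read, "spec-3699-55517-0420.fits", hdu=2)
# print(report)
```

If the analysis above holds for your astropy version, entries related to ColDefs construction and header keyword lookup should rank high in the report for hdu=2.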