Why is looping through HDUs at pos=2 in FITS files much slower than looping through HDUs at pos=1?

See original GitHub issue

Description

I am looping through the HDUs of the astronomical FITS files I am analyzing. When I read the HDU at position 1 (hdu=1) of each file with astropy.table.Table.read, the run time is reasonably short. When I do the same for the HDU at position 2 (hdu=2), it takes about 10 times longer, even though the tables at position 2 are smaller (~27 kB) than the tables at position 1 (~130 kB). The tables at position 1 have 2833 rows by 8 columns, and the tables at position 2 have 1 row by 126 columns. The time matters to me because I eventually need to loop through tens of thousands of files, so the total run time is on the order of hours.

Example file can be this one: https://data.sdss.org/sas/dr16/eboss/spectro/redux/v5_13_0/spectra/lite/3699/spec-3699-55517-0420.fits

Expected behavior

Looping through hdu=2 should take about the same time as hdu=1, or perhaps less.

Actual behavior

Looping through hdu=2 is much slower (~10 times).

Steps to Reproduce

To reproduce, run:

from astropy.table import Table

# fits_files_list is a list containing the paths to the FITS files.
# To reproduce my code, you can download an example file from:
# https://data.sdss.org/sas/dr16/eboss/spectro/redux/v5_13_0/spectra/lite/3699/spec-3699-55517-0420.fits
# All the files I am working with have this same structure.

for file_name in fits_files_list:
    table1 = Table.read(file_name, hdu=1)

Then change hdu=1 to hdu=2, time both runs, and note that the second run is about 10 times slower.
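One way to take the two measurements is with a small timing helper like the sketch below. The names time_reads and read_hdu are illustrative and not part of astropy; only the Table.read(file_name, hdu=...) call comes from the snippet above.

```python
import time


def time_reads(paths, read_one):
    """Return the wall-clock seconds spent calling read_one(path) on every path."""
    start = time.perf_counter()
    for path in paths:
        read_one(path)
    return time.perf_counter() - start


def read_hdu(hdu):
    """Build a reader for one HDU position (hypothetical helper; requires astropy)."""
    # Imported lazily so that the timer itself has no third-party dependency.
    from astropy.table import Table

    def read_one(path):
        return Table.read(path, hdu=hdu)

    return read_one
```

Calling time_reads(fits_files_list, read_hdu(1)) and then time_reads(fits_files_list, read_hdu(2)) gives the two totals to compare.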

System Details

The details of my system/environment are:

Linux-5.11.0-40-generic-x86_64-with-glibc2.10
Python 3.8.5 (default, Sep 4 2020, 07:30:14) [GCC 7.3.0]
Numpy 1.19.1
astropy 3.2.3

I have no erfa or scipy installed.

How can I speed up the looping through hdu=2? Why is it 10 times slower? Are there any workarounds? Thanks.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 14 (11 by maintainers)

Top GitHub Comments

1 reaction
embray commented, Nov 19, 2021

Another interesting thing I found while examining the performance here is that lookup of header keyword values is actually being largely dominated by, of all things, the astropy config system. There are various config settings that are checked when parsing headers, some of which will get checked for every single keyword.

On some level this is necessary, since astropy configs are supposed to be reloadable at runtime. But I wonder whether, when parsing headers, it really makes sense to allow config changes between reading individual values from the same header.
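To see for yourself where the time goes, a profiler run is one option. The sketch below uses only the standard library's cProfile and pstats; profile_call is a hypothetical helper name, and this is not necessarily how the measurement above was made.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile; return (result, report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Show only the `top` entries with the largest cumulative time.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, result, report = profile_call(Table.read, file_name, hdu=2) followed by print(report) lists the most expensive calls; entries from astropy's configuration code near the top would be consistent with the observation above.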

1 reaction
embray commented, Nov 19, 2021

@saimn What I found in investigating this is that using io.fits is no faster. Converting to Table incurs some overhead but not as much.

What I found was that in _TableLikeHDU._get_data, a ColDefs for the table gets constructed twice. The first time is when it calls self.columns where it constructs the ColDefs from the header. It then uses this to determine the appropriate dtype, as well as some other processing of the data format.

Then on this line it views the array as a FITS_rec (a class that is pretty vestigial at this point, but will take some effort to get rid of). This results in constructing a new ColDefs object, but this time from the dtype instead of the header.

This redundancy has significant overhead for a case like this, and should be done away with, though I’m not sure how yet.

Both of the methods used here for constructing ColDefs (from the header, and then from the dtype) have opportunities for improvement as well.
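The redundancy described above can be pictured with a toy model. This is purely illustrative: ToyColDefs, load_redundant and load_once are hypothetical names and do not reflect astropy's actual code. The point is only that the work is done per column, so building the definitions twice costs more for a table with many columns, whatever its row count.

```python
class ToyColDefs:
    """Stand-in for a per-column definition object; not astropy's ColDefs."""

    constructions = 0  # counts how many times definitions were built

    def __init__(self, column_names):
        ToyColDefs.constructions += 1
        # Pretend each column needs its own parsing step.
        self.columns = [name.strip().upper() for name in column_names]


def load_redundant(column_names):
    """Build the definitions twice, mirroring the header-then-dtype pattern."""
    from_header = ToyColDefs(column_names)
    from_dtype = ToyColDefs(from_header.columns)
    return from_dtype


def load_once(column_names):
    """Build the definitions once and reuse them."""
    return ToyColDefs(column_names)
```

In this toy, load_redundant does twice the per-column work of load_once. That may be part of why a 126-column table with one row reads more slowly than an 8-column table with thousands of rows.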
