Multi-dimensional arrays in variable-length array columns of a FITS binary table cause file corruption

See original GitHub issue

Description

io.fits may create corrupted files when writing a BinTableHDU to a file, if that table contains a variable-length array (VLA) column with arrays that have two (or more) dimensions. No warnings or errors are raised while writing, yet the resulting file may be unreadable to io.fits.

Expected behavior

Being able to write any n-dimensional arrays to a VLA column, writing that to a file and then successfully reading the column (round-trip).

Actual behavior

The resulting file is partially or even completely corrupted.

Steps to Reproduce

  1. Create a two-dimensional NumPy array and place it in a NumPy array with dtype=object
  2. Create a VLA column with that array
  3. Create a BinTableHDU from that column and write it to a file
  4. Read the file back

import numpy as np
from astropy.io import fits

# Build the object array from step 1 and write it as a VLA column
array = np.array([np.ones((8, 50))], dtype=object)
col = fits.Column(name='test', format='PD()', array=array)
fits.BinTableHDU.from_columns([col]).writeto('bug.fits', overwrite=True)

# Reading the file back triggers the error
with fits.open('bug.fits') as hdus:
    print(hdus)

Produces the following error:

WARNING: non-ASCII characters are present in the FITS file header and have been replaced by "?" characters [astropy.io.fits.util]
WARNING: Header block contains null bytes instead of spaces for padding, and is not FITS-compliant. Nulls may be replaced with spaces upon writing. [astropy.io.fits.header]
Traceback (most recent call last):
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 417, in _readfrom_internal
    header_str, header = _BasicHeader.fromfile(data)
  File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 2075, in fromfile
    header_str, cards = parse_header(fileobj)
  File "astropy\io\fits\_utils.pyx", line 38, in astropy.io.fits._utils.parse_header
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 1: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  (...)
  File "[path]/bugtest.py", line 9, in <module>
    print(hdus)
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 258, in __repr__
    self.readall()
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 795, in readall
    while self._read_next_hdu():
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 1200, in _read_next_hdu
    hdu = _BaseHDU.readfrom(fileobj, **kwargs)
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 332, in readfrom
    hdu = cls._readfrom_internal(fileobj, checksum=checksum,
  File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 424, in _readfrom_internal
    header = Header.fromfile(data,
  File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 523, in fromfile
    return cls._from_blocks(block_iter, is_binary, sep, endcard,
  File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 610, in _from_blocks
    raise OSError('Header missing END card.')
OSError: Header missing END card.


Playing around with it a bit more, I could produce some other weird behaviors.

a = np.ones((5, 2))
b = np.full((10,), 5)
x = [a, b]

array = np.empty(len(x), dtype=object)
array[:] = x

col = fits.Column(name='test', format='PD()', array=array)
fits.BinTableHDU.from_columns([col]).writeto('bug.fits', overwrite=True)

with fits.open('bug.fits') as hdus:
    print(hdus[1].data['test'])

Outputs the following:

[array([1., 1., 1., 1., 1.])
 array([1., 1., 1., 1., 1., 5., 5., 5., 5., 5.])]

while the expected result would be:

[array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]]), array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])]

So it seems that everything that doesn’t fit in the first dimension is going out of bounds and writing over the next array. This explains why it can also heavily corrupt the file.
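To make that guess concrete, here is a toy model in plain Python. It does not use astropy and is not taken from its source code. It only assumes that each array descriptor records the first dimension of its entry while every element of the flattened entry still lands in the heap. The names `write_heap` and `read_heap` are hypothetical. Under that assumption the model reproduces the output shown above:

```python
# Toy model of a VLA heap using plain lists (no astropy involved).
# Assumption: a descriptor stores only the FIRST dimension of its entry,
# but the heap receives all elements of the flattened entry.

def write_heap(entries):
    """entries: list of (first_dim_length, flat_elements) pairs."""
    heap, descriptors, offset = [], [], 0
    for first_dim, flat in entries:
        descriptors.append((first_dim, offset))  # (length, offset), in elements
        offset += first_dim                      # offset advances by the short length
        heap.extend(flat)                        # yet every element is written
    return heap, descriptors

def read_heap(heap, descriptors):
    """Slice the heap according to each (length, offset) descriptor."""
    return [heap[off:off + n] for n, off in descriptors]

a_flat = [1.0] * 10   # stands in for np.ones((5, 2)) flattened; len(a) == 5
b_flat = [5.0] * 10   # stands in for np.full((10,), 5)

heap, descriptors = write_heap([(5, a_flat), (10, b_flat)])
result = read_heap(heap, descriptors)
print(result)
# first entry: five 1.0 values; second entry: five 1.0 values then five 5.0 values
```

In this model the heap holds 20 elements while the descriptors only account for 15, which would also be consistent with the rest of the file ending up misaligned.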


Reading the FITS standard, I get the impression that multi-dimensional VLAs should be possible, so this seems like an unexpected behavior. At the very least, if multi-dimensional VLAs aren’t meant to be supported, io.fits should be throwing errors. Right now it’s simply failing silently.
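Until io.fits raises an error itself, a user-side check before building the column can at least avoid the silent failure. The following is a minimal sketch using only NumPy; `find_multidim_entries` and `require_1d_entries` are hypothetical helper names, not part of astropy:

```python
import numpy as np

def find_multidim_entries(column_array):
    """Return the row indices whose entry has more than one dimension."""
    return [i for i, entry in enumerate(column_array) if np.ndim(entry) > 1]

def require_1d_entries(column_array):
    """Raise ValueError if any entry is multi-dimensional."""
    bad = find_multidim_entries(column_array)
    if bad:
        raise ValueError(
            f"multi-dimensional VLA entries at rows {bad}; "
            "refusing to write them to a VLA column"
        )

# Same data as the second example above, assigned element by element
a = np.ones((5, 2))
b = np.full((10,), 5)
array = np.empty(2, dtype=object)
array[0] = a
array[1] = b

print(find_multidim_entries(array))
```

Calling `require_1d_entries(array)` before `fits.Column(...)` would then stop the script instead of producing a corrupted file.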

System Details

Windows-10-10.0.19044-SP0
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]
Numpy 1.22.2
pyerfa 2.0.0.1
astropy 5.0.1
Scipy 1.7.1

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (6 by maintainers)

Top GitHub Comments

1 reaction
kYwzor commented, Mar 22, 2022

I’ve been thinking about this one for a long while, so I decided to put my thoughts into text in (hopefully) an organized manner. This will be very long so sorry in advance for the wall of text.


What the standard actually says

It’s clear to me that, if we strictly follow the current FITS Standard, it’s impossible to support columns that contain arrays of variable dimensions. However, the Standard still explicitly allows the usage of TDIMn keywords for VLA columns. While this feature is defined in an extremely confusing manner, after reading the Standard (yet again) I now believe it actually satisfactorily specifies how multi-dimensional VLAs must be handled. I’m pretty confident that the interaction between VLA columns and TDIMn can be boiled down to 4 rules:

  • Entries in the same VLA column must be interpreted as having the same dimensions.
    • Reasoning: This is unavoidable given that the standard only allows defining one TDIM per column and it does not define any way of storing shape information either on the heap area or array descriptor.
  • Entries cannot have fewer elements than the size (that is, the product of the dimensions) implied by TDIM.
    • Reasoning: The standard mentions that “The size [implied by TDIM] must be (…), in the case of columns that have a ’P’ or ’Q’ TFORMn data type, less than or equal to the array length specified in the variable-length array descriptor”. Since we have one “array descriptor” for each entry in a VLA column, this means we have to check TDIM against the length defined in every single row, in order to ensure it’s valid.
  • Entries may have more elements than the product of the defined dimensions, in which case we essentially ignore the extra elements.
    • Reasoning: The standard is very clear in saying that “If the number of elements in the array implied by the TDIMn is fewer than the allocated size of the array in the FITS file, then the unused trailing elements should be interpreted as containing undefined fill values.”
  • The 3 rules above don’t apply to entries that have no elements (length zero); those entries should just be interpreted as empty arrays.
    • Reasoning: In the standard it’s specified that “In the special case where the variable-length array descriptor has a size of zero, then the TDIMn keyword is not applicable”. Well, if the TDIMn keyword is “not applicable”, then we have to interpret that specific entry as we would if the keyword didn’t exist… which is to just take it as an empty array.
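The four rules can be written down as a small classification of one entry against the column's TDIMn. This is only a sketch of my reading of the Standard in plain Python; `classify_entry` and its return labels are hypothetical names:

```python
from math import prod

def classify_entry(length, tdim):
    """Classify a VLA entry of `length` elements against the column's TDIMn.

    `tdim` is the tuple of dimensions from the TDIMn keyword, e.g. (2, 4).
    """
    size = prod(tdim)          # number of elements implied by TDIMn
    if length == 0:
        return "empty"         # rule 4: TDIMn is not applicable
    if length < size:
        return "invalid"       # rule 2: too few elements to fill the shape
    if length > size:
        return "trailing"      # rule 3: extra elements are undefined fill
    return "exact"             # entry fills the shape exactly

print(classify_entry(8, (2, 4)))
print(classify_entry(5, (2, 4)))
print(classify_entry(8, (2, 2)))
print(classify_entry(0, (2, 4)))
```

Rule 1 is implicit here: the same `tdim` is passed for every entry of a column.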

So, in the first few readings of the Standard, the idea of using TDIM on VLAs felt pointless because it seemed like it would force you to have arrays of fixed length, which would defeat the entire purpose of having variable-length arrays. However, with these simplified “rules” in mind it seems clear to me that there’s actually at least one scenario where using VLAs with TDIM may be preferred to just using a fixed-length array with TDIM: VLAs allow empty entries, which enable significant file size reductions in cases where we’re dealing with huge matrices. I have a feeling this is essentially the one use-case envisioned by the Standard. (I can also imagine a second use-case, where we intentionally create arrays longer than the size of the matrix defined by TDIM, and where these “extra elements” can be used to store some relevant extra information… but this use-case seems very far-fetched and likely against what the standard intends.)

So with this in mind, let’s look at a few examples of columns and their entries, and discuss if they are “legal” according to the Standard, and how they should be interpreted. Let’s assume that TFORMn = '1PJ(8)' for all of these columns.

Column   TDIMn               Row 1                       Row 2
A        TDIM1 = '(1,1)'     [1]                         [1]
B        TDIM2 = '(2,2)'     [1, 2, 3, 4, 5, 6, 7, 8]    [1, 2, 3, 4, 5]
C        TDIM3 = '(2,4)'     [1, 2, 3, 4, 5, 6, 7, 8]    [1, 2, 3, 4, 5]
D        TDIM4 = '(2,4)'     [1, 2, 3, 4, 5, 6, 7, 8]    [ ]

Column A was inspired by #7810 and it is legal. Each entry should be interpreted as a 2D matrix which only has one value… that’s a bit weird but completely fine by the Standard. In Python, it should look something like this:

>>> t.data['A']
[array([[1]]), array([[1]])]

Column B is legal, but both entries have a few extra elements that will be ignored. The expected result is two 2x2 matrices, which in Python would look like:

>>> t.data['B']
[array([[1, 2],
       [3, 4]]), array([[1, 2],
       [3, 4]])]

Column C is illegal, because there are entries that do not have enough elements to fill the matrix defined by TDIM (in other words, the second row has length 5 while the matrix size is 2*4=8). There’s no reasonable way to interpret this column other than by ignoring TDIM.

Since empty entries don’t need to respect TDIM, Column D is also legal and the result in Python would be:

>>> t.data['D']
[array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]]), array([], dtype=int32)]
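For reference, the interpretation used in these examples can be sketched with NumPy alone. `apply_tdim` is a hypothetical helper, not an astropy function. It assumes the TDIMn axes are reversed to obtain the NumPy shape, which is what the expected outputs above imply, for instance '(2,4)' becoming a 4x2 array:

```python
import numpy as np

def apply_tdim(entry, tdim):
    """Interpret one 1-D VLA entry according to the column's TDIMn tuple."""
    entry = np.asarray(entry)
    if entry.size == 0:
        return entry                      # empty entries ignore TDIMn
    size = int(np.prod(tdim))
    if entry.size < size:
        raise ValueError("entry has fewer elements than TDIMn implies")
    # Drop trailing fill elements, then reverse the axes (assumed ordering)
    return entry[:size].reshape(tuple(tdim)[::-1])

print(apply_tdim([1], (1, 1)))                          # column A
print(apply_tdim([1, 2, 3, 4, 5, 6, 7, 8], (2, 2)))     # column B, row 1
print(apply_tdim([1, 2, 3, 4, 5, 6, 7, 8], (2, 4)))     # column D, row 1
print(apply_tdim([], (2, 4)))                           # column D, row 2
```

The second row of column C would raise `ValueError` here, since 5 elements cannot fill a shape of 8.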

How I think Astropy should handle this

Currently, io.fits doesn’t handle TDIMn for VLAs at all, resulting in a crash in basically any scenario. Regardless of whether you think this feature is useful or not, it seems there’s already code in the wild using this type of pattern (see issue #7810), so there would definitely be some direct benefit in implementing this. On top of that, as far as I can tell this is one of the last few hurdles for achieving full VLA support in Astropy, which would be a great thing in itself.

Keeping with the “tolerant with input and strict with output” philosophy, I think the behavior a user would expect for the example columns is something like this:

  • Reading: Columns A and D are correctly read without any issues. Column B is correctly read, but a warning is issued informing the user that some arrays were larger than the size defined by TDIMn, and thus the trailing elements were ignored. Column C is read as a one-dimensional array, and the user is warned that TDIMn was ignored because it was invalid.
  • Writing: Columns A and D are written without any issues. The trailing elements of column B are not written to the file (or maybe a Column object can’t even be created with such an array), and the user is informed of that. Column C can never be written as it is illegal.
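As a rough illustration of the reading side of this proposal, here is a sketch using NumPy and the standard `warnings` module. `read_vla_column` is a hypothetical function, the warning texts are made up, and the axis reversal is the same assumption as in the expected outputs for columns B and D:

```python
import warnings
import numpy as np

def read_vla_column(entries, tdim):
    """Tolerant read: reshape entries by TDIMn, warning instead of failing."""
    arrays = [np.asarray(e) for e in entries]
    size = int(np.prod(tdim))
    shape = tuple(tdim)[::-1]             # assumed axis reversal
    # Column C case: a non-empty entry is too short, so TDIMn is ignored
    if any(0 < a.size < size for a in arrays):
        warnings.warn("TDIMn is invalid for this column and was ignored")
        return arrays
    # Column B case: trailing elements are dropped with a warning
    if any(a.size > size for a in arrays):
        warnings.warn("some entries exceed the TDIMn size; trailing elements ignored")
    return [a if a.size == 0 else a[:size].reshape(shape) for a in arrays]

column_d = read_vla_column([[1, 2, 3, 4, 5, 6, 7, 8], []], (2, 4))
print([a.shape for a in column_d])
```

The writing side would be the strict counterpart: the same two conditions would raise an error (or refuse to create the column) rather than warn.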


How other tools/libraries handle this

While #7810 has a file which contains columns similar to column A, I unfortunately don’t have example files for any of the other columns, since I wouldn’t be able to create them with Astropy. If someone could create something like that (or has any other example files), it would be immensely useful for testing. Regardless, for now I’ve tested only that file on a few libraries/tools.

Running P190mm-PAFBE-FEBEPAR.fits.zip through fitsverify returns no errors or warnings. The file is also correctly opened by the fv FITS Viewer, and exploring the binary table allows us to see that USEFEED, BESECTS and FEEDTYPE are all correctly interpreted as 2D images that contain a single pixel. Finally, opening the file with fitsio results in:

[...]/venv/lib/python3.10/site-packages/fitsio/hdu/table.py:1157: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  dtype = numpy.dtype(descr)
Traceback (most recent call last):
  File "/usr/lib/python3.10/code.py", line 90, in runcode
    exec(code, self.locals)
  File "<input>", line 1, in <module>
  File "[...]/venv/lib/python3.10/site-packages/fitsio/hdu/table.py", line 714, in read
    data = self._read_all(
  File "[...]/venv/lib/python3.10/site-packages/fitsio/hdu/table.py", line 764, in _read_all
    array = self._read_rec_with_var(colnums, rows, dtype,
  File "[...]/venv/lib/python3.10/site-packages/fitsio/hdu/table.py", line 1388, in _read_rec_with_var
    array[name][irow][0:ncopy] = item[:]
TypeError: 'numpy.int32' object does not support item assignment

so evidently this feature is not supported by fitsio either. I haven’t tested CFITSIO directly, so I don’t know whether it supports any of this.


I would really like to implement this but, having had a look at the source code, I doubt I’d be able to. This is a fairly large change that is very tricky to get right, so it seems to me you have to be extremely familiar with the current code to really understand all the pitfalls (which I am not). So @saimn, if you know anyone who might want to have a look at this, please point them here!

0 reactions
kYwzor commented, Mar 4, 2022

This seems to be related to #7810.
