Multi-dimensional arrays in variable-length array columns of a FITS binary table cause file corruption
Description
io.fits may create corrupted files when writing a BinTableHDU to a file if that table contains a variable-length array (VLA) column whose arrays have two or more dimensions. No warnings or errors are raised while writing, yet the resulting file may be unreadable by io.fits.
Expected behavior
Being able to write any n-dimensional arrays to a VLA column, writing that to a file and then successfully reading the column (round-trip).
Actual behavior
The resulting file is partially or even completely corrupted.
Steps to Reproduce
- Create a two-dimensional numpy array and place it in a numpy array with dtype=object
- Create a VLA column with that array
- Create a BinTableHDU from that column and write it to a file
- Read the file back
import numpy as np
from astropy.io import fits

array = np.array([np.ones((8, 50))], dtype=object)
col = fits.Column(name='test', format='PD()', array=array)
fits.BinTableHDU.from_columns([col]).writeto('bug.fits', overwrite=True)
with fits.open('bug.fits') as hdus:
    print(hdus)
Produces the following error:
WARNING: non-ASCII characters are present in the FITS file header and have been replaced by "?" characters [astropy.io.fits.util]
WARNING: Header block contains null bytes instead of spaces for padding, and is not FITS-compliant. Nulls may be replaced with spaces upon writing. [astropy.io.fits.header]
Traceback (most recent call last):
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 417, in _readfrom_internal
header_str, header = _BasicHeader.fromfile(data)
File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 2075, in fromfile
header_str, cards = parse_header(fileobj)
File "astropy\io\fits\_utils.pyx", line 38, in astropy.io.fits._utils.parse_header
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 1: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
(...)
File "[path]/bugtest.py", line 9, in <module>
print(hdus)
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 258, in __repr__
self.readall()
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 795, in readall
while self._read_next_hdu():
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\hdulist.py", line 1200, in _read_next_hdu
hdu = _BaseHDU.readfrom(fileobj, **kwargs)
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 332, in readfrom
hdu = cls._readfrom_internal(fileobj, checksum=checksum,
File "[path]\venv\lib\site-packages\astropy\io\fits\hdu\base.py", line 424, in _readfrom_internal
header = Header.fromfile(data,
File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 523, in fromfile
return cls._from_blocks(block_iter, is_binary, sep, endcard,
File "[path]\venv\lib\site-packages\astropy\io\fits\header.py", line 610, in _from_blocks
raise OSError('Header missing END card.')
OSError: Header missing END card.
Playing around with it a bit more, I could produce some other weird behaviors.
a = np.ones((5, 2))
b = np.full((10,), 5)
x = [a, b]
array = np.empty(len(x), dtype=object)
array[:] = x
col = fits.Column(name='test', format='PD()', array=array)
fits.BinTableHDU.from_columns([col]).writeto('bug.fits', overwrite=True)
with fits.open('bug.fits') as hdus:
    print(hdus[1].data['test'])
Outputs the following:
[array([1., 1., 1., 1., 1.])
array([1., 1., 1., 1., 1., 5., 5., 5., 5., 5.])]
while the expected result would be:
[array([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]), array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])]
So it seems that everything that doesn’t fit in the first dimension is going out of bounds and writing over the next array. This explains why it can also heavily corrupt the file.
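To make that hypothesis concrete, here is a toy model of the suspected mechanism. It is not astropy's actual writer: the function name `simulate_roundtrip` is made up, and the key assumption (each descriptor records only the length of the first axis, while every element still lands in the heap) is inferred from the output above rather than taken from the io.fits source.

```python
import numpy as np


def simulate_roundtrip(entries):
    """Toy model of a VLA heap whose descriptors count only the first axis."""
    # The heap receives every element of every entry, flattened in order.
    heap = np.concatenate(
        [np.asarray(e, dtype=np.float64).ravel() for e in entries]
    )
    rows, offset = [], 0
    for entry in entries:
        count = len(entry)  # assumption: only the first axis is counted
        rows.append(heap[offset:offset + count])
        offset += count  # for 2D entries the next descriptor starts too early
    return rows


a = np.ones((5, 2))
b = np.full((10,), 5)
rows = simulate_roundtrip([a, b])
print(rows[0])
print(rows[1])
```

Under this assumption the second row starts inside the data of the first one, which gives the same values as the output reported above: five ones for the first row, and five ones followed by five fives for the second.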
Reading the FITS standard, I get the impression that multi-dimensional VLAs should be possible, so this seems like unexpected behavior. At the very least, if multi-dimensional VLAs aren’t meant to be supported, io.fits should be throwing errors. Right now it’s simply failing silently.
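As a sketch of what "throwing errors" could look like on the user side, the check below rejects multi-dimensional entries before a column is ever built. `check_vla_entries` is a hypothetical helper written for this example, not part of io.fits.

```python
import numpy as np


def check_vla_entries(entries):
    """Raise ValueError if any would-be VLA entry has more than one dimension."""
    for row, entry in enumerate(entries):
        if np.ndim(entry) > 1:
            raise ValueError(
                f"row {row}: VLA entries must be one-dimensional, "
                f"got shape {np.shape(entry)}"
            )


a = np.ones((5, 2))
b = np.full((10,), 5)

message = None
try:
    check_vla_entries([a, b])
except ValueError as exc:
    message = str(exc)
print(message)
```

Calling such a check before `fits.Column(...)` would turn the silent corruption into an explicit failure. Flattening each entry with `ravel()` first is another possible stopgap, at the cost of losing the shape information.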
System Details
Windows-10-10.0.19044-SP0
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)]
Numpy 1.22.2
pyerfa 2.0.0.1
astropy 5.0.1
Scipy 1.7.1
Top GitHub Comments
I’ve been thinking about this one for a long while, so I decided to put my thoughts into text in (hopefully) an organized manner. This will be very long so sorry in advance for the wall of text.
What the standard actually says
It’s clear to me that, if we strictly follow the current FITS Standard, it’s impossible to support columns that contain arrays of variable dimensions. However, the Standard still explicitly allows the usage of TDIMn keywords for VLA columns. While this feature is defined in an extremely confusing manner, after reading the Standard (yet again) I now believe it actually satisfactorily specifies how multi-dimensional VLAs must be handled. I’m pretty confident that the interaction between VLA columns and TDIMn can be boiled down to 4 rules:
1. All entries in a VLA column share the same dimensions. The Standard allows only one TDIM per column and it does not define any way of storing shape information either on the heap area or array descriptor.
2. An entry cannot have fewer elements than the size implied by TDIM. The Standard says that “the size [implied by TDIM] must be (…), in the case of columns that have a 'P' or 'Q' TFORMn data type, less than or equal to the array length specified in the variable-length array descriptor”. Since we have one “array descriptor” for each entry in a VLA column, this means we have to check TDIM against the length defined in every single row, in order to ensure it’s valid.
3. An entry may have more elements than the size implied by TDIM, in which case the extra elements are ignored. The Standard says that “if the number of elements in the array implied by the TDIMn is fewer than the allocated size of the array in the FITS file, then the unused trailing elements should be interpreted as containing undefined fill values.”
4. Entries with no elements are exempt from the rules above. For an array descriptor of length zero, the Standard says that “the TDIMn keyword is not applicable”. Well, if the TDIMn keyword is “not applicable”, then we have to interpret that specific entry as we would if the keyword didn’t exist… which is to just take it as an empty array.
So, in the first few readings of the Standard, the idea of using TDIM on VLAs felt pointless because it seemed like it would force you to have arrays of fixed length, which would defeat the entire purpose of having variable-length arrays. However, with these simplified “rules” in mind it seems clear to me that there’s actually at least one scenario where using VLAs with TDIM may be preferred to just using a fixed-length array with TDIM: VLAs allow empty entries, which enable significant file size reductions in cases where we’re dealing with huge matrices. I have a feeling this is essentially the one use-case envisioned by the Standard. (I can also imagine a second use-case, where we intentionally create arrays longer than the size of the matrix defined by TDIM, and where these “extra elements” can be used to store some relevant extra information… but this use-case seems very far-fetched and likely against what the standard intends.)
So with this in mind, let’s look at a few examples of columns and their entries, and discuss if they are “legal” according to the Standard, and how they should be interpreted. Let’s assume that TFORMn = '1PJ(8)' for all of these columns.
- Column A (TDIM1 = '(1,1)')
- Column B (TDIM2 = '(2,2)')
- Column C (TDIM3 = '(2,4)')
- Column D (TDIM4 = '(2,4)')
Column A was inspired by #7810 and it is legal. Each entry should be interpreted as a 2D matrix which only has one value… that’s a bit weird but completely fine by the Standard. In Python, it should look something like this:
Column B is legal, but both entries have a few extra elements that will be ignored. The expected result is two 2x2 matrices, which in Python would look like:
Column C is illegal, because there are entries that do not have enough elements to fill the matrix defined by TDIM (in other words, the second row has length 5 while the matrix size is 2*4=8). There’s no reasonable way to interpret this column other than by ignoring TDIM.
Since empty entries don’t need to respect TDIM, Column D is also legal and the result in Python would be:
How I think Astropy should handle this
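As a starting point, here is a rough sketch of how the rules could be applied to columns shaped like A to D. The helpers `parse_tdim`, `interpret_entry` and `interpret_column` are hypothetical names, not io.fits functions, and the row values are invented purely for illustration. The axis order is reversed when reshaping because FITS lists the fastest-varying axis first, while numpy's default order lists it last.

```python
import math

import numpy as np


def parse_tdim(tdim):
    """Parse a TDIMn string such as '(2,4)' into a tuple of ints."""
    return tuple(int(part) for part in tdim.strip().strip("()").split(","))


def interpret_entry(flat, tdim):
    """Return (status, array) for a single VLA entry."""
    flat = np.asarray(flat)
    dims = parse_tdim(tdim)
    size = math.prod(dims)
    if flat.size == 0:
        return "empty", flat  # empty entries are exempt from TDIM
    if flat.size < size:
        return "invalid", flat  # too few elements to fill the matrix
    status = "ok" if flat.size == size else "truncated"
    # Trailing elements beyond `size` are treated as fill and dropped.
    return status, flat[:size].reshape(dims[::-1])


def interpret_column(entries, tdim):
    """Apply TDIM to every entry, or ignore it if any entry is too short."""
    results = [interpret_entry(entry, tdim) for entry in entries]
    if any(status == "invalid" for status, _ in results):
        return [np.asarray(entry) for entry in entries]
    return [array for _, array in results]


# Hypothetical rows, chosen only to exercise each case.
columns = {
    "A": ("(1,1)", [[7], [3]]),
    "B": ("(2,2)", [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6, 7]]),
    "C": ("(2,4)", [list(range(8)), list(range(5))]),
    "D": ("(2,4)", [list(range(8)), []]),
}
shapes = {
    name: [array.shape for array in interpret_column(rows, tdim)]
    for name, (tdim, rows) in columns.items()
}
print(shapes)
```

A real implementation would also need to report the "truncated" and "invalid" cases to the user, for example through warnings, rather than silently returning arrays.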
Currently, io.fits doesn’t handle TDIMn for VLAs at all, resulting in a crash in basically any scenario. Regardless of whether you think this feature is useful or not, it seems there’s already code in the wild using this type of pattern (see issue #7810), so there would definitely be some direct benefit in implementing this. On top of that, as far as I can tell this is one of the last few hurdles for achieving full VLA support in Astropy, which would be a great thing in itself.
Keeping with the “tolerant with input and strict with output” philosophy, I think the behavior a user would expect for the example columns is something like this.
- Reading: Column A and D are correctly read without any issues. Column B is correctly read, but a warning is thrown informing the user that some arrays were larger than the size defined by TDIMn, and thus the trailing elements were ignored. Column C is read as a one-dimensional array, and the user is warned that TDIMn was ignored because it was invalid.
- Writing: Column A and D are written without any issues. The trailing elements of column B are not written to the file (or maybe a Column object can’t even be created with such an array), and the user is informed of that. Column C can never be written as it is illegal.
How other tools/libraries handle this
While #7810 has a file which contains columns similar to column A, I unfortunately don’t have example files for any of the other columns, since I wouldn’t be able to create them with Astropy. If someone could create something like that (or has any other example files), it would be immensely useful for testing. Regardless, for now I’ve tested only that file on a few libraries/tools.
Running P190mm-PAFBE-FEBEPAR.fits.zip through fitsverify returns no errors or warnings. The file is also correctly opened by the fv FITS Viewer, and exploring the binary table allows us to see that USEFEED, BESECTS and FEEDTYPE are all correctly interpreted as 2D images that contain a single pixel. Finally, opening the file with fitsio results in:
so evidently this feature is also not supported by fitsio. I haven’t tested using CFITSIO directly so I am not aware if it supports any of this or not.
I would really like to implement this but, having had a look at the source code, I doubt I’d be able to. This is a fairly large change that is very tricky to get right, so it seems to me you have to be extremely familiar with the current code to really understand all the pitfalls (which I am not). So @saimn, if you know anyone who might want to have a look at this, please point them here!
This seems to be related to #7810.