Add column to fits file without loading fits file into memory?
In #6649 I brought up an issue with adding columns using the somewhat misleading `add_col()` function. I received very helpful assistance, with the recommendation that I use
```python
fits.BinTableHDU.from_columns(f[1].columns + column)
```

(which as an aside must actually be `f[1].columns.columns`, as `f[1].columns` returns a `ColDefs` object)

or

```python
fits.BinTableHDU.from_columns(f[1].columns.add_col(column))
```
These functions seem to work only by loading the columns from `f[1]` into memory first, and then constructing a whole new HDU table there.
I happen to be running astropy on a large enough dataset such that loading all of the columns into memory is actually not practical. When I run this code, my memory spikes as the columns are loaded and the program crashes:
```
  File "/scr/depot0/csh4/py/codes/ver1/corrset/builder.py", line 217, in add_column
    hdu[1] = fits.BinTableHDU.from_columns(hdu[1].columns.add_col(col))
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/hdu/table.py", line 126, in from_columns
    data = FITS_rec.from_columns(coldefs, nrows=nrows, fill=fill)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/fitsrec.py", line 330, in from_columns
    data = np.recarray(nrows, dtype=columns.dtype, buf=raw_data).view(cls)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/fitsrec.py", line 244, in __array_finalize__
    self._coldefs = ColDefs(self)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/column.py", line 1189, in __init__
    self._init_from_array(input)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/column.py", line 1249, in _init_from_array
    dim=dim)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/column.py", line 583, in __init__
    array = self._convert_to_valid_data_type(array)
  File "/scr/depot0/csh4/anaconda3/lib/python3.6/site-packages/astropy/io/fits/column.py", line 1100, in _convert_to_valid_data_type
    return np.where(array == 0, ord('F'), ord('T'))
MemoryError
```
Is there a way to modify an HDU table in place, without having to reload the whole thing? Thanks.
P.S. I can make a minimum working example on request, but it would vary from system to system. My current code works for a smaller data set, so I'm not sure an example is necessary. Perhaps a more useful diagnostic would be a timestamped plot of the RAM usage over time, which I can make if someone wants.
Other info: conda 4.3.25, astropy 2.0.2, Python 3.6.1, NumPy 1.13.1, Spyder 3.2.1
My OS is Springdale Linux release 6.9 (Pisa), GNOME 2.28.2, kernel Linux 2.6.32-696.10.1.el6.x86_64, with 16 GB RAM.
Small data set that worked just fine: 1.6 GB. Large data set that had problems: 3.7 GB.
Created 6 years ago · Comments: 10 (7 by maintainers)
👋 Cassandra
I haven’t tested this, but one thought is to create a memory-mapped file with the right column structure but full of temporary / empty values, and then fill it with the values from the file you actually want. Traveling today but will try to come up with a minimal demo of what I mean. (I’ll be back in Peyton on Friday if you want to chat about this)
I’m going to close this issue as per my previous message, but if you feel that this issue should stay open, then feel free to re-open and remove the Close? label.
If this is the first time I am commenting on this issue, or if you believe I closed this issue incorrectly, please report this here