ECSV array/object-valued column support
The ECSV format is, as far as I know, defined by APE6. That lists the permitted datatypes as bool, int8, int16, int32, int64, uint8, uint16, uint32, uint64, float16, float32, float64, float128, complex64, complex128, complex256, and string, and it says on the topic of array-valued columns (which it calls “Multidimensional columns”):
Multidimensional columns are not supported in version 0.9 of the ECSV format.
None of the available text data formats supports multidimensional columns with more than one element per row. Although in many cases having such data would indicate using a binary storage format, there is utility in supporting this for cases where the column shape is “reasonable”, perhaps with no more than about 10 elements.
(Parenthetically: VOTable, listed as one of the comparison formats earlier in the document, does support array-valued columns, though it’s not really a text format.)
However, if you ask astropy.table to write a table with array-like columns (lists, tuples, numpy arrays) to ECSV format, it will output ECSV tables using the undocumented datatype object. Such columns produce string values when read back in. See the following code:
from astropy.table import Table
# Create table column data
column_float = [10., 20., 30.]
column_string = ['xx', 'yy', 'zz']
column_list = [[2.0, 3.0, 4.0], [5.0], [8.5, 1.1]]
# Create Astropy table and get info
table = Table([column_float, column_string, column_list],
              names=('num_col', 'txt_col', 'list_col'))
# Write ecsv table with multiple elements per cell
table.write('tsimp.ecsv', format='ascii.ecsv', overwrite=True, delimiter=',')
# Read it back in
tr = Table.read('tsimp.ecsv', format='ascii.ecsv')
print([type(c) for c in tr[0]])
print(tr)
which writes the following output:
[<class 'numpy.float64'>, <class 'numpy.str_'>, <class 'str'>]
num_col txt_col     list_col
------- ------- ---------------
   10.0      xx [2.0, 3.0, 4.0]
   20.0      yy           [5.0]
   30.0      zz      [8.5, 1.1]
and produces the following ECSV file:
# %ECSV 0.9
# ---
# datatype:
# - {name: num_col, datatype: float64}
# - {name: txt_col, datatype: string}
# - {name: list_col, datatype: object}
# delimiter: ','
# schema: astropy-2.0
num_col,txt_col,list_col
10.0,xx,"[2.0, 3.0, 4.0]"
20.0,yy,[5.0]
30.0,zz,"[8.5, 1.1]"
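To make the mismatch with APE6 concrete, here is a minimal sketch that scans an ECSV header for datatypes outside the APE6 list quoted above. The helper name undocumented_datatypes and the regular expression are hypothetical, written only for this illustration; the regex handles just the simple one-line column entries seen in this file, not general YAML.

```python
import re

# Datatypes permitted by APE6 for ECSV 0.9, as listed in this issue.
APE6_DATATYPES = {
    "bool", "int8", "int16", "int32", "int64",
    "uint8", "uint16", "uint32", "uint64",
    "float16", "float32", "float64", "float128",
    "complex64", "complex128", "complex256", "string",
}

# The file written by astropy in the example above.
ECSV_TEXT = """\
# %ECSV 0.9
# ---
# datatype:
# - {name: num_col, datatype: float64}
# - {name: txt_col, datatype: string}
# - {name: list_col, datatype: object}
# delimiter: ','
# schema: astropy-2.0
num_col,txt_col,list_col
10.0,xx,"[2.0, 3.0, 4.0]"
20.0,yy,[5.0]
30.0,zz,"[8.5, 1.1]"
"""

# Assumption: each column entry is a one-line flow mapping with only
# name and datatype keys, as in the file above. This is not a YAML parser.
COLUMN_ENTRY = re.compile(r"^#\s*-\s*\{name:\s*(\w+),\s*datatype:\s*(\w+)\}\s*$")


def undocumented_datatypes(ecsv_text):
    """Return {column name: datatype} for datatypes not listed in APE6."""
    found = {}
    for line in ecsv_text.splitlines():
        match = COLUMN_ENTRY.match(line)
        if match and match.group(2) not in APE6_DATATYPES:
            found[match.group(1)] = match.group(2)
    return found


print(undocumented_datatypes(ECSV_TEXT))
# prints {'list_col': 'object'}
```

A real reader would parse the header with a proper YAML library; the point here is only that list_col is the single column whose declared datatype falls outside the documented set.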
Is this intended behaviour? At present my (Java STIL/TOPCAT) ECSV reader fails to read such tables because they have an unknown datatype object. I’d say that’s an ECSV output bug given the documentation at APE6.
However, as it happens this got noticed during some discussion within DPAC (the Gaia data processing consortium) about whether array values could be stored in ECSV tables, which is an enhancement requested by some DPAC members for use with both Astropy and TOPCAT. The suggestion was that the existing behaviour could be extended with some additional metadata (new datatype or array flag) to provide array-valued column support in ECSV.
On a related topic, I see #11155, which is pursuing a different approach to serializing array values into ECSV. That may not be suitable for the Gaia use case, since (a) it only supports fixed-size array values and (b) from the metadata in the example file it seems to be intended as an astropy-specific convention.
So: fix object datatype output? Extend it to support array-valued columns? Or should I just patch my ECSV parser to cope with undocumented datatypes?
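For the last option, a reader-side workaround might look roughly like the sketch below: read the object column as plain strings and then try to interpret each cell as a Python literal. This is only an illustration in Python, not what STIL/TOPCAT or astropy does. It assumes the cell text is a Python-style list literal, as in the example file; nothing in APE6 guarantees that, and the helper name read_list_column is hypothetical.

```python
import ast
import csv
import io

# Data section of the example file (header comments omitted).
DATA_ROWS = '''\
num_col,txt_col,list_col
10.0,xx,"[2.0, 3.0, 4.0]"
20.0,yy,[5.0]
30.0,zz,"[8.5, 1.1]"
'''


def read_list_column(data_text, column, delimiter=","):
    """Parse each cell of `column` as a Python literal (here, a list of floats).

    Assumption: cells hold list literals like "[2.0, 3.0, 4.0]".
    ast.literal_eval raises ValueError or SyntaxError on anything else.
    """
    reader = csv.DictReader(io.StringIO(data_text), delimiter=delimiter)
    return [ast.literal_eval(row[column]) for row in reader]


lists = read_list_column(DATA_ROWS, "list_col")
print(lists)
# prints [[2.0, 3.0, 4.0], [5.0], [8.5, 1.1]]
```

ast.literal_eval is used rather than eval so that cell contents are never executed as code. Whether relying on this string form is wise is exactly the open question, since the format of object cells is undocumented.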
Top GitHub Comments
@nstarman - Thanks! Right now my plan is to get to this for the 4.3 release but I’ll let you know by this week if that seems unrealistic and I might need some help. If you are looking for a useful challenge, another set of eyes on #11127 would be great. 😄
See #11569.