
Cannot write simple dataframe to disk in thrift 0.11.0

Is there something simple I’m missing here? I’m just trying to do the most basic thing in the example:

import numpy as np
import pandas as pd
from fastparquet import write

df = pd.DataFrame(np.zeros((1000, 1000)), columns=[str(i) for i in range(1000)])
write('outfile2.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5b2fbc3e1a9e> in <module>()
      1 write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
----> 2       compression='GZIP', file_scheme='hive')
      3

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    831                 with open_with(partname, 'wb') as f2:
    832                     rg = make_part_file(f2, data[start:end], fmd.schema,
--> 833                                         compression=compression, fmd=fmd)
    834                 for chunk in rg.columns:
    835                     chunk.file_path = part

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_part_file(f, data, schema, compression, fmd)
    604     with f as f:
    605         f.write(MARKER)
--> 606         rg = make_row_group(f, data, schema, compression=compression)
    607         if fmd is None:
    608             fmd = parquet_thrift.FileMetaData(num_rows=len(data),

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_row_group(f, data, schema, compression)
    592                 comp = compression
    593             chunk = write_column(f, data[column.name], column,
--> 594                                  compression=comp)
    595             rg.columns.append(chunk)
    596     rg.total_byte_size = sum([c.meta_data.total_uncompressed_size for c in

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write_column(f, data, selement, compression)
    532                                    data_page_header=dph, crc=None)
    533
--> 534     write_thrift(f, ph)
    535     f.write(bdata)
    536

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/thrift_structures.py in write_thrift(fobj, thrift)
     47     pout = TCompactProtocol(fobj)
     48     try:
---> 49         thrift.write(pout)
     50         fail = False
     51     except TProtocolException as e:

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py in write(self, oprot)
   1027     def write(self, oprot):
   1028         if oprot._fast_encode is not None and self.thrift_spec is not None:
-> 1029             oprot.trans.write(oprot._fast_encode(self, (self.__class__, self.thrift_spec)))
   1030             return
   1031         oprot.writeStructBegin('PageHeader')

TypeError: expecting list of size 2 for struct args

I get the same error on my local Mac and on a remote EC2 Ubuntu 16.04 instance.
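
To confirm that both environments resolved the same thrift release, a quick check is to print the installed versions of the two packages involved. This is only a minimal sketch, assuming both packages were installed with standard metadata so that pkg_resources can find them:

import pkg_resources

# The issue title points at thrift 0.11.0; 0.10.x is reported to work, so the
# thrift version that fastparquet imports is the first thing to verify.
for pkg in ('thrift', 'fastparquet'):
    print(pkg, pkg_resources.get_distribution(pkg).version)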

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
bschreck commented on Jan 19, 2018

0.11.0

That does seem to be the issue. Installing thrift 0.10.0 fixes it. Maybe update your requirements to force 0.10.0 exactly?
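
For anyone hitting the same traceback, the workaround amounts to pinning thrift at 0.10.0 (for pip, a requirements entry of thrift==0.10.0, or pip install "thrift==0.10.0"). A small defensive guard can also make the failure obvious instead of surfacing as the opaque TypeError above; this is only an illustrative sketch, and the guard is not part of fastparquet:

import pkg_resources

import numpy as np
import pandas as pd
from fastparquet import write

# fastparquet with thrift 0.11.0 fails with "expecting list of size 2 for
# struct args" (see the traceback above); thrift 0.10.0 is reported to work,
# so fail fast with a clear message before attempting to write.
thrift_version = pkg_resources.get_distribution('thrift').version
if not thrift_version.startswith('0.10.'):
    raise RuntimeError(
        'thrift %s detected; pin thrift==0.10.0 before calling fastparquet.write'
        % thrift_version)

df = pd.DataFrame(np.zeros((1000, 1000)), columns=[str(i) for i in range(1000)])
write('outfile2.parq', df, compression='GZIP', file_scheme='hive')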
