
Cannot write simple dataframe to disk in thrift 0.11.0

Is there something simple I’m missing here? I’m just trying to do the most basic thing in the example:

import numpy as np
import pandas as pd
from fastparquet import write

df = pd.DataFrame(np.zeros((1000, 1000)), columns=[str(i) for i in range(1000)])
write('outfile2.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-5b2fbc3e1a9e> in <module>()
      1 write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
----> 2       compression='GZIP', file_scheme='hive')
      3

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write(filename, data, row_group_offsets, compression, file_scheme, open_with, mkdirs, has_nulls, write_index, partition_on, fixed_text, append, object_encoding, times)
    831                 with open_with(partname, 'wb') as f2:
    832                     rg = make_part_file(f2, data[start:end], fmd.schema,
--> 833                                         compression=compression, fmd=fmd)
    834                 for chunk in rg.columns:
    835                     chunk.file_path = part

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_part_file(f, data, schema, compression, fmd)
    604     with f as f:
    605         f.write(MARKER)
--> 606         rg = make_row_group(f, data, schema, compression=compression)
    607         if fmd is None:
    608             fmd = parquet_thrift.FileMetaData(num_rows=len(data),

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in make_row_group(f, data, schema, compression)
    592                 comp = compression
    593             chunk = write_column(f, data[column.name], column,
--> 594                                  compression=comp)
    595             rg.columns.append(chunk)
    596     rg.total_byte_size = sum([c.meta_data.total_uncompressed_size for c in

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/writer.py in write_column(f, data, selement, compression)
    532                                    data_page_header=dph, crc=None)
    533
--> 534     write_thrift(f, ph)
    535     f.write(bdata)
    536

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/thrift_structures.py in write_thrift(fobj, thrift)
     47     pout = TCompactProtocol(fobj)
     48     try:
---> 49         thrift.write(pout)
     50         fail = False
     51     except TProtocolException as e:

~/miniconda3/envs/py3default/lib/python3.6/site-packages/fastparquet/parquet_thrift/parquet/ttypes.py in write(self, oprot)
   1027     def write(self, oprot):
   1028         if oprot._fast_encode is not None and self.thrift_spec is not None:
-> 1029             oprot.trans.write(oprot._fast_encode(self, (self.__class__, self.thrift_spec)))
   1030             return
   1031         oprot.writeStructBegin('PageHeader')

TypeError: expecting list of size 2 for struct args

I get the same error on my local Mac and on a remote EC2 Ubuntu 16.04 instance.
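
To confirm that both environments resolved the same thrift release, a quick check is to print the installed versions of the two packages involved. This is only a minimal sketch, assuming both packages were installed with standard metadata so that pkg_resources can find them:

import pkg_resources

# The issue title points at thrift 0.11.0; 0.10.x is reported to work, so the
# thrift version that fastparquet imports is the first thing to verify.
for pkg in ('thrift', 'fastparquet'):
    print(pkg, pkg_resources.get_distribution(pkg).version)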

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 12 (7 by maintainers)

Top GitHub Comments

1 reaction
bschreck commented on Jan 19, 2018

0.11.0

That does seem to be the issue. Installing thrift 0.10.0 fixes it. Maybe update your requirements to force 0.10.0 exactly?
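
For anyone hitting the same traceback, the workaround amounts to pinning thrift at 0.10.0 (for pip, a requirements entry of thrift==0.10.0, or pip install "thrift==0.10.0"). A small defensive guard can also make the failure obvious instead of surfacing as the opaque TypeError above; this is only an illustrative sketch, and the guard is not part of fastparquet:

import pkg_resources

import numpy as np
import pandas as pd
from fastparquet import write

# fastparquet with thrift 0.11.0 fails with "expecting list of size 2 for
# struct args" (see the traceback above); thrift 0.10.0 is reported to work,
# so fail fast with a clear message before attempting to write.
thrift_version = pkg_resources.get_distribution('thrift').version
if not thrift_version.startswith('0.10.'):
    raise RuntimeError(
        'thrift %s detected; pin thrift==0.10.0 before calling fastparquet.write'
        % thrift_version)

df = pd.DataFrame(np.zeros((1000, 1000)), columns=[str(i) for i in range(1000)])
write('outfile2.parq', df, compression='GZIP', file_scheme='hive')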
