
Appending a parquet file from Python to S3


Here is my snippet in spark-shell

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now I am trying to do the same thing in Python with fastparquet.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('****/20180101.parq', data, compression='GZIP', open_with=myopen)

First, I tried to save with snappy compression, write('****/20180101.snappy.parquet', data, compression='SNAPPY', open_with=myopen), but got this error:

Compression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']

Then I tried GZIP and it worked, but I am not sure how I can append or create partitions here. Here is a related issue I created in pandas: https://github.com/pandas-dev/pandas/issues/20638

Thanks.
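For what it's worth, fastparquet's write function does take partitioning and appending options directly. The sketch below is only an illustration, with a made-up local dataset path and columns, and it assumes SNAPPY only shows up as an option once the optional python-snappy package is installed:

import pandas as pd
from fastparquet import write

# Placeholder DataFrame, purely for illustration.
data = pd.DataFrame({'date': ['2018-01-01', '2018-01-02'], 'value': [1, 2]})

# file_scheme='hive' writes one date=.../part file per partition value;
# append=True adds new row groups to an existing dataset instead of recreating it.
write('example_dataset', data,
      file_scheme='hive',
      partition_on=['date'],
      compression='GZIP',  # 'SNAPPY' should become available once python-snappy is installed
      append=False)        # pass append=True on later calls to add to the same dataset

Writing the same layout to S3 additionally needs the open_with callback, plus the no-op directory-maker discussed in the comments below.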

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Jeeva-Ganesan commented, Apr 18, 2018

OK, let me explain. I have this folder structure in S3: s3://bucketname/user/data/. And this is the code I use to write my partitions into it.

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open
write('bucketname/user/data/', dataframe, file_scheme='hive', partition_on=['date'], open_with=myopen)

I am running this in a Jupyter notebook; when I run it, everything works fine and the S3 path looks like this:

bucketname/user/data/date=2018-01-01/part-o.parquet.

However, on my local machine this folder structure gets created automatically, bucketname/user/data/date=2018-01-01/, but there is no parquet file in it. I wonder if it is creating a local copy before moving the file to S3.

0 reactions
martindurant commented, Apr 19, 2018

OK, understood. No, the files are not first created locally and copied.

As documented, you should supply not only the function to open files but also the function to make directories. In the case of S3 there is no such concept as directories, so the function you need to provide should not actually do anything, but you still must provide it to avoid the default, which makes local directories.
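A minimal sketch of what that looks like, assuming the directory-making callback is the mkdirs keyword of fastparquet's write and reusing the placeholder path and dataframe from the comment above:

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()

# S3 has no real directories, so hand fastparquet a directory-maker that does
# nothing; otherwise the default one silently creates empty local directories.
write('bucketname/user/data/', dataframe,
      file_scheme='hive',
      partition_on=['date'],
      open_with=s3.open,
      mkdirs=lambda path: None)  # no-op: nothing to create on S3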
