
Appending a parquet file from Python to S3


Here is my snippet in spark-shell

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now I am trying to do the same thing in Python with fastparquet.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('****/20180101.parq', data, compression='GZIP', open_with=myopen)

First, I tried to save with snappy compression, write('****/20180101.snappy.parquet', data, compression='SNAPPY', open_with=myopen), but got this error:

Compression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']

Then I tried GZIP and it worked, but I am not sure how I can append or create partitions here. Here is a related issue I created in pandas: https://github.com/pandas-dev/pandas/issues/20638

Thanks.
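For what it's worth, fastparquet's write function does take partitioning and appending options directly. The sketch below is only an illustration, with a made-up local dataset path and columns, and it assumes SNAPPY only shows up as an option once the optional python-snappy package is installed:

import pandas as pd
from fastparquet import write

# Placeholder DataFrame, purely for illustration.
data = pd.DataFrame({'date': ['2018-01-01', '2018-01-02'], 'value': [1, 2]})

# file_scheme='hive' writes one date=.../part file per partition value;
# append=True adds new row groups to an existing dataset instead of recreating it.
write('example_dataset', data,
      file_scheme='hive',
      partition_on=['date'],
      compression='GZIP',  # 'SNAPPY' should become available once python-snappy is installed
      append=False)        # pass append=True on later calls to add to the same dataset

Writing the same layout to S3 additionally needs the open_with callback, plus the no-op directory-maker discussed in the comments below.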

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
Jeeva-Ganesan commented, Apr 18, 2018

OK, let me explain. I have this folder structure in S3: s3://bucketname/user/data/. And this is the code I use to write my partitions into it.

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open
write('bucketname/user/data/', dataframe, file_scheme='hive', partition_on=['date'], open_with=myopen)

I am running this in a Jupyter notebook; when I run it, everything works fine and the S3 path looks like this:

bucketname/user/data/date=2018-01-01/part-o.parquet.

However, on my local machine this folder structure gets created automatically, bucketname/user/data/date=2018-01-01/, but there is no parquet file in it. I wonder if it is creating a local copy before moving the file to S3.

0 reactions
martindurant commented, Apr 19, 2018

OK, understood. No, the files are not first created locally and copied.

As documented, you should supply not only the function to open files but also the function to make directories. In the case of S3 there is no such concept as directories, so the function you need to provide should not actually do anything, but you still must provide it to avoid the default, which makes local directories.
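A minimal sketch of what that looks like, assuming the directory-making callback is the mkdirs keyword of fastparquet's write and reusing the placeholder path and dataframe from the comment above:

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()

# S3 has no real directories, so hand fastparquet a directory-maker that does
# nothing; otherwise the default one silently creates empty local directories.
write('bucketname/user/data/', dataframe,
      file_scheme='hive',
      partition_on=['date'],
      open_with=s3.open,
      mkdirs=lambda path: None)  # no-op: nothing to create on S3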
