Moving a file in an atomic, single-operation way?

I’m using s3fs and fastparquet to write Parquet files to S3. I’ve configured Presto to read from S3 using a Hive external table.

The problem is that Presto reads the file while fastparquet is still writing it, so the query fails with an invalid Parquet file error. To overcome this, I write to a temporary path first. Say I’m supposed to write to:

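import fastparquet
import s3fs

fs = s3fs.S3FileSystem()
opener = fs.open  # assumed definition; df is an existing pandas DataFrame
filename = 'bucket_name/account_type/yr=2017/mn=10/dt=8/19/de86d8ed-7447-420f-9f25-799412e377adparquet.json'
# write to a temporary path first, then move the finished file into place
tmp_file = filename.replace('account_type', 'tmp-account_type')
fastparquet.write(tmp_file, df, open_with=opener)
fs.mv(tmp_file, filename)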

But even in this case, it looks like Presto occasionally (rarely) reads an incomplete file. How is this possible? How can we make the move atomic/isolated with s3fs?
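The write-to-a-temporary-path-then-move pattern described in the question can be sketched as a small helper. This is only an illustration, not an s3fs feature: `publish_via_temp`, `write_fn` and `tmp_dir` are hypothetical names, and `fs` is assumed to behave like an `s3fs.S3FileSystem` (`exists`, `mv`, `rm`). It does not make the move atomic; it only keeps partially written data out of the final location and cleans up if the write fails.

```python
import uuid


def publish_via_temp(fs, final_path, write_fn, tmp_dir="tmp-staging"):
    """Write a complete file to a staging key, then move it to final_path.

    Assumptions: fs behaves like s3fs.S3FileSystem (exists, mv, rm);
    write_fn and tmp_dir are hypothetical names used for illustration.
    """
    bucket, _, key = final_path.partition("/")
    # Stage outside the directory tree that the reader scans.
    tmp_path = f"{bucket}/{tmp_dir}/{uuid.uuid4().hex}/{key}"
    try:
        write_fn(tmp_path)           # the caller writes the whole file here
        fs.mv(tmp_path, final_path)  # on S3: server-side copy, then delete
    except Exception:
        # Do not leave a half-written staging object behind.
        if fs.exists(tmp_path):
            fs.rm(tmp_path)
        raise
    return final_path
```

With fastparquet, the callback could be something like `lambda p: fastparquet.write(p, df, open_with=fs.open)`.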

Top GitHub Comments

martindurant commented, Oct 11, 2017

Yes, we are calling the server’s copy object command, not downloading and rewriting the data - that would be very expensive!
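Since S3 has no rename operation, a move amounts to a server-side copy followed by a delete, which is two requests and not one. A minimal sketch of making those steps explicit, with a size check in between, might look like this. `move_with_check` is a hypothetical helper, not part of s3fs, and `fs` is assumed to behave like an `s3fs.S3FileSystem` (`info`, `copy`, `rm`).

```python
def move_with_check(fs, src, dst):
    """Move src to dst as three explicit steps: copy, verify, delete.

    Assumption: fs behaves like s3fs.S3FileSystem (info, copy, rm).
    """
    expected = fs.info(src)["size"]
    fs.copy(src, dst)               # server-side copy, no download
    actual = fs.info(dst)["size"]
    if actual != expected:
        fs.rm(dst)                  # discard the bad copy, keep the source
        raise IOError(f"size mismatch for {dst}: {actual} != {expected}")
    fs.rm(src)                      # only now remove the source
    return dst
```

Between the copy and the delete both keys exist, so a reader listing both locations during that window would see the file twice.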

martindurant commented, Oct 14, 2017

Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Amazon claims never to return partial or corrupted data: you get either the old version or the new one. That could be enough to break Presto if the versions are not compatible. Another failure mode would be one action fetching the file list and the next downloading the data, when the size of the new file differs from the old one. If you write your own code, you can check the generation of a key to make sure it hasn’t changed, download a specific generation (old data may still be available), or match each file of a batch by timestamp. I cannot, however, give any advice on how you might implement any of this for Presto, sorry.
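For a reader you write yourself, the "check that the key hasn’t changed" idea could be sketched roughly as follows. Treating the size, `ETag` and `LastModified` fields of `fs.info()` as the change marker is an assumption of this sketch, as are the helper names. `fs` is assumed to behave like an `s3fs.S3FileSystem`; s3fs caches directory listings, so a real version may need to bypass that cache to see fresh metadata.

```python
def _fingerprint(info):
    # ETag and LastModified are assumed to be present in the info dict;
    # .get() keeps the sketch working when they are not.
    return (info.get("size"), info.get("ETag"), info.get("LastModified"))


def read_if_unchanged(fs, path, retries=3):
    """Read path, retrying if the object changed while it was being read."""
    for _ in range(retries):
        before = fs.info(path)
        with fs.open(path, "rb") as f:
            data = f.read()
        after = fs.info(path)
        same_object = _fingerprint(before) == _fingerprint(after)
        if same_object and len(data) == before["size"]:
            return data
    raise RuntimeError(f"{path} kept changing across {retries} attempts")
```

This only helps code you control; it says nothing about how Presto itself lists and reads files.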