Moving file atomic/single operation way?

I’m using s3fs and fastparquet to write parquet files to S3. I’ve configured Presto to read from S3 using a Hive external table.

The problem is that Presto can read a file while fastparquet is still writing it, and it then fails with an invalid parquet file error. To overcome this, I write to a temporary path first and move the file afterwards. Say I’m supposed to write to:

filename = 'bucket_name/account_type/yr=2017/mn=10/dt=8/19/de86d8ed-7447-420f-9f25-799412e377adparquet.json'
# write to a temporary path first
tmp_file = filename.replace('account_type', 'tmp-account_type')
fastparquet.write(tmp_file, df, open_with=opener)
# then move the finished file to its final location
fs.mv(tmp_file, filename)

But even in this case, it looks like Presto occasionally reads an incomplete file. How is this possible? How can we make this atomic/isolated with s3fs?

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 14 (6 by maintainers)

Top GitHub Comments

martindurant commented, Oct 11, 2017

Yes, we are calling the server’s copy object command, not downloading and rewriting the data - that would be very expensive!
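
The move described in this comment can be pictured as a server-side copy followed by a delete. The sketch below is only an illustration: it assumes a boto3-style client passed in by the caller, and `move_object` is a hypothetical helper, not part of the s3fs API. Because the copy and the delete are two separate requests, the move as a whole is not a single atomic operation.

```python
def move_object(client, bucket, src_key, dst_key):
    """Move an object within one bucket: server-side copy, then delete.

    `client` is assumed to expose boto3-style `copy_object` and
    `delete_object` methods. Hypothetical helper, not s3fs API.
    """
    # The server copies the bytes itself; nothing is downloaded here.
    client.copy_object(
        Bucket=bucket,
        Key=dst_key,
        CopySource={"Bucket": bucket, "Key": src_key},
    )
    # Remove the source only after the copy request has returned.
    client.delete_object(Bucket=bucket, Key=src_key)
```

Between the two calls both keys exist, so a reader listing the bucket at that moment may see the temporary and the final object side by side.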

martindurant commented, Oct 14, 2017

Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel

Amazon claims never to return partial or corrupted data: you get either the old version or the new one. That alone could be enough to break Presto if the two versions are not compatible. Another failure mode is that one request fetches the file list and the next downloads the data, but by then the file has been replaced and its size differs from the old one. If you write your own code, you can check the generation of a key to make sure it hasn’t changed, download a specific generation (old data may still be available), or be sure to match each file of a batch by timestamp. I cannot, however, give any advice on how you might implement any of this for Presto, sorry.
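
For a reader you write yourself, the "check that the key hasn't changed" idea could look roughly like the sketch below. S3 has no field literally called a generation; the sketch assumes the object's ETag (or a VersionId on a versioned bucket) plays that role. It also assumes a boto3-style client whose `get_object` accepts an `IfMatch` argument, and `read_if_unchanged` is a hypothetical helper name.

```python
def read_if_unchanged(client, bucket, key):
    """Download an object only if it still has the ETag seen a moment earlier.

    `client` is assumed to be a boto3-style S3 client. Hypothetical helper.
    """
    # First request: record the identity of the object we intend to read.
    head = client.head_object(Bucket=bucket, Key=key)
    etag = head["ETag"]
    # Second request: conditional on that ETag, so the server refuses to
    # serve the body if the object was replaced between the two calls.
    resp = client.get_object(Bucket=bucket, Key=key, IfMatch=etag)
    return resp["Body"].read()
```

With boto3, a failed precondition should surface as an exception from `get_object`, which the caller can catch to retry or skip the file. This does nothing for Presto itself, which issues its own requests.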
