Moving file atomic/single operation way?
I’m using s3fs and fastparquet to write parquet files to S3. I’ve configured Presto to read from S3 using a Hive external table.
The problem is that Presto reads a file while fastparquet is still writing it, so the query fails with an invalid parquet file error. To overcome this, I write to a temporary path first and then move the file into place. For example, say I’m supposed to write to:
filename = 'bucket_name/account_type/yr=2017/mn=10/dt=8/19/de86d8ed-7447-420f-9f25-799412e377adparquet.json'
# let's write to a temp file first
tmp_file = filename.replace('account_type', 'tmp-account_type')
fastparquet.write(tmp_file, df, open_with=opener)
# then move it to the final path once the write has finished
fs.mv(tmp_file, filename)
But even in this case, it looks like Presto occasionally reads an incomplete file. How is this possible? How can we make this atomic/isolated with s3fs?
Issue Analytics
- Created 6 years ago
- Comments:14 (6 by maintainers)
Top GitHub Comments
Yes, we are calling the server’s copy object command, not downloading and rewriting the data - that would be very expensive!
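To make the two-step nature of such a move concrete, here is a minimal sketch. It assumes a move is carried out as a server-side copy followed by a delete of the source, since S3 has no rename primitive. `FakeObjectStore` and `move` are hypothetical names used only for illustration and are not part of s3fs.

```python
class FakeObjectStore:
    """Hypothetical in-memory stand-in for a bucket; not the s3fs API."""

    def __init__(self):
        self.objects = {}

    def put(self, key, data):
        self.objects[key] = bytes(data)

    def copy(self, src, dst):
        # Server-side copy: the destination appears in one step,
        # already holding the full contents of the source.
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        del self.objects[key]


def move(store, src, dst):
    """Move as copy + delete; return the key listing seen after each step."""
    snapshots = []
    store.copy(src, dst)
    snapshots.append(sorted(store.objects))  # both keys exist here
    store.delete(src)
    snapshots.append(sorted(store.objects))  # only the destination remains
    return snapshots


store = FakeObjectStore()
store.put("tmp/part.parquet", b"PAR1...PAR1")
steps = move(store, "tmp/part.parquet", "final/part.parquet")
print(steps)
```

In this model the move as a whole is not a single operation, but the destination key never holds a partially written object: it is either absent or complete.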
Here is the reference: http://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel
Amazon claims never to give partial or corrupted data: you either get the old version or the new. That could be enough to break Presto if the versions are not compatible. Another failure mode would be one action getting the file list and the next downloading the data, with the new file’s size differing from the old one’s. If you write your own code, you can check the generation of a key to make sure it hasn’t changed, download a specific generation (old data may still be available), or make sure to match each file of a batch by timestamp. I cannot, however, give any advice on how you might implement any of this for Presto, sorry.
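For readers writing their own client code, the "make sure it hasn’t changed" check could be sketched roughly as follows. This is only an illustration under assumptions: `read_stable` is a hypothetical helper, and the `ETag` and `size` metadata keys stand in for whatever change marker the store exposes.

```python
from typing import Any, Callable, Dict


def read_stable(path: str,
                get_info: Callable[[str], Dict[str, Any]],
                read_bytes: Callable[[str], bytes],
                retries: int = 3) -> bytes:
    """Read `path`, retrying if the object changed while it was being read.

    `get_info` and `read_bytes` are supplied by the caller; the "ETag" and
    "size" keys are assumptions about what the metadata dict contains.
    """
    for _ in range(retries):
        before = get_info(path)
        data = read_bytes(path)
        after = get_info(path)
        unchanged = (
            before.get("ETag") == after.get("ETag")
            and before.get("size") == after.get("size") == len(data)
        )
        if unchanged:
            return data
    raise RuntimeError(f"{path} kept changing after {retries} attempts")
```

With s3fs, `fs.info` and `fs.cat` could plausibly be passed as the two callables, provided the metadata dict they return exposes these keys. This only helps code you control; it does not change how Presto reads the files.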