Master process crashed: Failed to run journal checkpoint thread, crashing.
Alluxio Version: 1.8.1
Describe the bug
java.lang.RuntimeException: alluxio.exception.FileDoesNotExistException: inodeId 10113188364287 does not exist; too many retries
at alluxio.master.file.DefaultFileSystemMaster.completeFileFromEntry(DefaultFileSystemMaster.java:1294)
at alluxio.master.file.DefaultFileSystemMaster.processJournalEntry(DefaultFileSystemMaster.java:478)
at alluxio.master.journal.ufs.UfsJournalCheckpointThread.runInternal(UfsJournalCheckpointThread.java:146)
at alluxio.master.journal.ufs.UfsJournalCheckpointThread.run(UfsJournalCheckpointThread.java:123)
Caused by: alluxio.exception.FileDoesNotExistException: inodeId 10113188364287 does not exist; too many retries
at alluxio.master.file.meta.InodeTree.lockFullInodePath(InodeTree.java:393)
at alluxio.master.file.DefaultFileSystemMaster.completeFileFromEntry(DefaultFileSystemMaster.java:1290)
... 3 more
To Reproduce
Create a directory with the same name concurrently.
Additional context
sequence_number: 5222362
inode_file {
id: 10112383057919
parent_id: 9
name: "application_1537944329020_1020_1.inprogress"
persistence_state: "PERSISTED"
pinned: false
creation_time_ms: 1553121254632
last_modification_time_ms: 1553121254632
block_size_bytes: 536870912
length: 0
completed: false
cacheable: true
ttl: -1
owner: "xxx"
group: "xxx"
mode: 420
ttlAction: DELETE
ufs_fingerprint: ""
}
--------------------------------------------------------------------------------
sequence_number: 5222363
set_attribute {
id: 10112383057919
op_time_ms: 1553121255025
ttl: -1
permission: 420
ttlAction: DELETE
}
sequence_number: 5222866
inode_file {
id: 10113188364287
parent_id: 9
name: "application_1537944329020_1020_1.inprogress"
persistence_state: "PERSISTED"
pinned: false
creation_time_ms: 1553121288532
last_modification_time_ms: 1553121288532
block_size_bytes: 33554432
length: 0
completed: false
cacheable: true
ttl: -1
owner: "xxx"
group: "xxx"
mode: 438
ttlAction: DELETE
ufs_fingerprint: ""
}
--------------------------------------------------------------------------------
sequence_number: 5222867
complete_file {
block_ids: 10113171587072
id: 10113188364287
length: 103889
op_time_ms: 1553121283557
ufs_fingerprint: "TYPE:FILE UFS:xxx OWNER:omm GROUP:omm MODE:438 CONTENT_HASH:(len_103889,_modtime_1553121283557) "
}
sequence_number: 5222904
rename {
id: 10112383057919
dst_path: "/spark/sparkJobHistory/application_1537944329020_1020_1"
op_time_ms: 1553121288251
}
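One way to read the dump above is to compare journal order with the timestamps the entries carry. The sketch below is only an illustration, not Alluxio code: the class `JournalOrderCheck` and its helpers are hypothetical, and it assumes that `creation_time_ms` (for `inode_file`) and `op_time_ms` (for the other entries) are a rough proxy for when each operation ran. It flags entries whose timestamp is earlier than that of an entry with a smaller sequence number.

```java
import java.util.ArrayList;
import java.util.List;

public class JournalOrderCheck {
    /** One journal entry reduced to the fields needed for the comparison (hypothetical type). */
    public static final class Entry {
        public final long sequenceNumber;
        public final String type;
        public final long inodeId;
        public final long timeMs;

        public Entry(long sequenceNumber, String type, long inodeId, long timeMs) {
            this.sequenceNumber = sequenceNumber;
            this.type = type;
            this.inodeId = inodeId;
            this.timeMs = timeMs;
        }
    }

    /**
     * Values copied from the dump above, in journal order. timeMs is
     * creation_time_ms for inode_file entries and op_time_ms for the others.
     */
    public static List<Entry> sampleEntries() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry(5222362L, "inode_file", 10112383057919L, 1553121254632L));
        entries.add(new Entry(5222363L, "set_attribute", 10112383057919L, 1553121255025L));
        entries.add(new Entry(5222866L, "inode_file", 10113188364287L, 1553121288532L));
        entries.add(new Entry(5222867L, "complete_file", 10113188364287L, 1553121283557L));
        entries.add(new Entry(5222904L, "rename", 10112383057919L, 1553121288251L));
        return entries;
    }

    /**
     * Expects entries sorted by sequence number. Returns every entry whose
     * timestamp is earlier than the latest timestamp seen before it.
     */
    public static List<Entry> findInversions(List<Entry> entries) {
        List<Entry> inversions = new ArrayList<>();
        long maxTimeSoFar = Long.MIN_VALUE;
        for (Entry e : entries) {
            if (e.timeMs < maxTimeSoFar) {
                inversions.add(e);
            } else {
                maxTimeSoFar = e.timeMs;
            }
        }
        return inversions;
    }

    public static void main(String[] args) {
        for (Entry e : findInversions(sampleEntries())) {
            System.out.println("sequence " + e.sequenceNumber + " (" + e.type + ", inode "
                + e.inodeId + ") carries an earlier timestamp than a preceding entry");
        }
    }
}
```

On these values the check flags the `complete_file` and `rename` entries. The `rename` of inode 10112383057919 is stamped 281 ms before the creation of inode 10113188364287 under the same name, yet it is journaled later. The `complete_file` timestamp matches the modtime in its `ufs_fingerprint`, so it may not reflect when the operation ran; the timestamps are a hint, not proof of execution order.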
Issue Analytics
- State:
- Created 4 years ago
- Comments: 12 (12 by maintainers)
Top GitHub Comments
This issue can be reproduced by having 50 threads continually issue one of these four commands: createFile(src), completeFile(src), rename(src, dst), delete(dst).
The FileDoesNotExistException: inodeId 10113188364287 does not exist issue happens under one particular journal sequence, and the FileDoesNotExistException: Path "a" does not exist issue happens under another. The journal sequence may be different from the actual execution sequence. There may be some locking issues in our rename process and we are working to solve the issue.
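The reproduction described in this comment can be sketched as a small stress harness. This is a minimal sketch under stated assumptions, not the maintainers' test: `FsOps` is a hypothetical stand-in interface rather than the Alluxio client API, `InMemoryFs` is a trivial stub so the code runs without a cluster, and each worker issues a bounded number of commands instead of running forever. To exercise a real master, `FsOps` would need an implementation backed by an actual client.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ConcurrentOpsStress {
    /** Hypothetical stand-in for a file system client; NOT the Alluxio API. */
    public interface FsOps {
        void createFile(String path) throws Exception;
        void completeFile(String path) throws Exception;
        void rename(String src, String dst) throws Exception;
        void delete(String path) throws Exception;
    }

    /** Trivial in-memory stub so the sketch runs on its own. */
    public static final class InMemoryFs implements FsOps {
        private final Set<String> paths = ConcurrentHashMap.newKeySet();

        public void createFile(String path) throws Exception {
            if (!paths.add(path)) throw new Exception("already exists: " + path);
        }
        public void completeFile(String path) throws Exception {
            if (!paths.contains(path)) throw new Exception("does not exist: " + path);
        }
        public void rename(String src, String dst) throws Exception {
            if (!paths.remove(src)) throw new Exception("does not exist: " + src);
            paths.add(dst);
        }
        public void delete(String path) throws Exception {
            if (!paths.remove(path)) throw new Exception("does not exist: " + path);
        }
    }

    /**
     * Starts `threads` workers; each issues `opsPerThread` commands chosen at
     * random from the four in the comment. Failures caused by races are
     * expected and only counted. Returns {succeeded, failed}.
     */
    public static long[] run(FsOps fs, int threads, int opsPerThread, String src, String dst)
            throws InterruptedException {
        AtomicLong ok = new AtomicLong();
        AtomicLong failed = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < opsPerThread; i++) {
                    try {
                        switch (ThreadLocalRandom.current().nextInt(4)) {
                            case 0: fs.createFile(src); break;
                            case 1: fs.completeFile(src); break;
                            case 2: fs.rename(src, dst); break;
                            default: fs.delete(dst); break;
                        }
                        ok.incrementAndGet();
                    } catch (Exception e) {
                        failed.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
            pool.shutdownNow();
            throw new IllegalStateException("workers did not finish in time");
        }
        return new long[] {ok.get(), failed.get()};
    }

    public static void main(String[] args) throws InterruptedException {
        long[] result = run(new InMemoryFs(), 50, 1000, "/src", "/dst");
        System.out.println("succeeded=" + result[0] + " failed=" + result[1]);
    }
}
```

The stub only shows the shape of the workload. The crash in this issue appears when the master later replays its journal, so a real reproduction would also need to restart the master or let the checkpoint thread run after the workload.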
@calvinjia This issue is fixed in #8768. Please help close this issue.