Master process crashed: Failed to run journal checkpoint thread, crashing.

Alluxio Version: 1.8.1

Describe the bug

java.lang.RuntimeException: alluxio.exception.FileDoesNotExistException: inodeId 10113188364287 does not exist; too many retries
	at alluxio.master.file.DefaultFileSystemMaster.completeFileFromEntry(DefaultFileSystemMaster.java:1294)
	at alluxio.master.file.DefaultFileSystemMaster.processJournalEntry(DefaultFileSystemMaster.java:478)
	at alluxio.master.journal.ufs.UfsJournalCheckpointThread.runInternal(UfsJournalCheckpointThread.java:146)
	at alluxio.master.journal.ufs.UfsJournalCheckpointThread.run(UfsJournalCheckpointThread.java:123)
Caused by: alluxio.exception.FileDoesNotExistException: inodeId 10113188364287 does not exist; too many retries
	at alluxio.master.file.meta.InodeTree.lockFullInodePath(InodeTree.java:393)
	at alluxio.master.file.DefaultFileSystemMaster.completeFileFromEntry(DefaultFileSystemMaster.java:1290)
	... 3 more

To Reproduce

Create a directory with the same name concurrently.

Additional context


sequence_number: 5222362
inode_file {
  id: 10112383057919
  parent_id: 9
  name: "application_1537944329020_1020_1.inprogress"
  persistence_state: "PERSISTED"
  pinned: false
  creation_time_ms: 1553121254632
  last_modification_time_ms: 1553121254632
  block_size_bytes: 536870912
  length: 0
  completed: false
  cacheable: true
  ttl: -1
  owner: "xxx"
  group: "xxx"
  mode: 420
  ttlAction: DELETE
  ufs_fingerprint: ""
}
--------------------------------------------------------------------------------
sequence_number: 5222363
set_attribute {
  id: 10112383057919
  op_time_ms: 1553121255025
  ttl: -1
  permission: 420
  ttlAction: DELETE
}


sequence_number: 5222866
inode_file {
  id: 10113188364287
  parent_id: 9
  name: "application_1537944329020_1020_1.inprogress"
  persistence_state: "PERSISTED"
  pinned: false
  creation_time_ms: 1553121288532
  last_modification_time_ms: 1553121288532
  block_size_bytes: 33554432
  length: 0
  completed: false
  cacheable: true
  ttl: -1
  owner: "xxx"
  group: "xxx"
  mode: 438
  ttlAction: DELETE
  ufs_fingerprint: ""
}
--------------------------------------------------------------------------------
sequence_number: 5222867
complete_file {
  block_ids: 10113171587072
  id: 10113188364287
  length: 103889
  op_time_ms: 1553121283557
  ufs_fingerprint: "TYPE:FILE UFS:xxx OWNER:omm GROUP:omm MODE:438 CONTENT_HASH:(len_103889,_modtime_1553121283557) "
}



sequence_number: 5222904
rename {
  id: 10112383057919
  dst_path: "/spark/sparkJobHistory/application_1537944329020_1020_1"
  op_time_ms: 1553121288251
}

Issue Analytics

Created 4 years ago
Comments: 12 (12 by maintainers)

Top GitHub Comments

LuQQiu commented, Apr 10, 2019

This issue can be reproduced by having 50 threads continually issue one of four commands: createFile(src), completeFile(src), rename(src, dst), delete(dst).
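The shape of that reproduction can be sketched as a small stress harness. This is only an illustration: `StubClient`, `worker`, and `run_stress` are hypothetical names, and the stub is an in-memory, fully locked stand-in rather than an Alluxio client, so it will not trigger the bug by itself. To reproduce against a real cluster, the four stub methods would need to be replaced with calls to an actual client.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class StubClient:
    """Hypothetical in-memory stand-in for a file system client."""

    def __init__(self):
        self._lock = threading.Lock()
        self._files = {}  # path -> completed flag

    def create_file(self, src):
        with self._lock:
            if src in self._files:
                raise FileExistsError(src)
            self._files[src] = False

    def complete_file(self, src):
        with self._lock:
            if src not in self._files:
                raise FileNotFoundError(src)
            self._files[src] = True

    def rename(self, src, dst):
        with self._lock:
            if src not in self._files:
                raise FileNotFoundError(src)
            if dst in self._files:
                raise FileExistsError(dst)
            self._files[dst] = self._files.pop(src)

    def delete(self, dst):
        with self._lock:
            if dst not in self._files:
                raise FileNotFoundError(dst)
            del self._files[dst]


def worker(client, src, dst, iterations, seed):
    """Issue one randomly chosen command per iteration; return the attempt count."""
    rng = random.Random(seed)
    ops = [
        lambda: client.create_file(src),
        lambda: client.complete_file(src),
        lambda: client.rename(src, dst),
        lambda: client.delete(dst),
    ]
    attempted = 0
    for _ in range(iterations):
        try:
            rng.choice(ops)()
        except OSError:
            pass  # collisions between threads are expected in this workload
        attempted += 1
    return attempted


def run_stress(client, threads=50, iterations=100, src="/a", dst="/b"):
    """Run `threads` workers against the same src/dst pair; return total attempts."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [
            pool.submit(worker, client, src, dst, iterations, seed)
            for seed in range(threads)
        ]
        return sum(f.result() for f in futures)
```

All workers deliberately share a single `src` and `dst`, since the contention on one path is what the report describes.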

The error "FileDoesNotExistException: inodeId 10113188364287 does not exist" occurs with the following journal sequence:

id1:create(a) -> id1:complete(a) -> id2:create(a) -> id2:complete(a) -> id1:rename(a -> b)

The error "FileDoesNotExistException: Path "a" does not exist" occurs with the following journal sequence:

id1:create(a) -> id1:complete(a) -> id2:create(a) -> id1:rename(a -> b) -> id2:complete(a)

The journal sequence may differ from the actual execution sequence. There may be locking issues in our rename process, and we are working on a fix.
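To see why a journal order that differs from the execution order can break replay, here is a toy model of the first sequence. It is a sketch under stated assumptions, not Alluxio's replay logic: `ToyInodeTree` is a hypothetical class, `EXECUTION_ORDER` is one guessed live ordering, and the model assumes that a replayed create for a path that is still occupied does not register the new inode. It does not attempt to reproduce the second error message.

```python
class ReplayError(Exception):
    pass


class ToyInodeTree:
    """Toy stand-in for a master's inode tree; not Alluxio's InodeTree."""

    def __init__(self):
        self.by_id = {}    # inode id -> path
        self.by_path = {}  # path -> inode id

    def create(self, inode_id, path):
        # Assumption: a create for a still-occupied path is not registered.
        if path in self.by_path:
            return
        self.by_id[inode_id] = path
        self.by_path[path] = inode_id

    def complete(self, inode_id):
        if inode_id not in self.by_id:
            raise ReplayError(f"inodeId {inode_id} does not exist")

    def rename(self, inode_id, dst):
        if inode_id not in self.by_id:
            raise ReplayError(f"inodeId {inode_id} does not exist")
        src = self.by_id[inode_id]
        del self.by_path[src]
        self.by_id[inode_id] = dst
        self.by_path[dst] = inode_id


# Order written to the journal, as in the first sequence above.
JOURNAL_ORDER = [
    ("create", 1, "a"), ("complete", 1),
    ("create", 2, "a"), ("complete", 2),
    ("rename", 1, "b"),
]

# Hypothetical order in which the operations actually took effect.
EXECUTION_ORDER = [
    ("create", 1, "a"), ("complete", 1),
    ("rename", 1, "b"),
    ("create", 2, "a"), ("complete", 2),
]


def try_replay(entries):
    """Apply entries in order; return "ok" or the first replay error message."""
    tree = ToyInodeTree()
    try:
        for op, *args in entries:
            getattr(tree, op)(*args)
    except ReplayError as err:
        return str(err)
    return "ok"
```

In this model, `try_replay(EXECUTION_ORDER)` succeeds, while `try_replay(JOURNAL_ORDER)` fails at `complete` for id 2: path "a" is still held by id 1 when the second create is replayed, because the rename that freed it was journaled later.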

LuQQiu commented, Apr 17, 2019

@calvinjia This issue is fixed in #8768. Please help close this issue.
