Optimize the du -s command


Is your feature request related to a problem? Please describe.
It's hard to run alluxio fs du -sh / when there is a large number of files under /.

For instance, I have about 3.8 million files in Aliyun OSS, which I've mounted on Alluxio. When I run alluxio fs du -sh /, I get an OOM error.

I tried raising the JVM heap size by setting the environment variable ALLUXIO_USER_JAVA_OPTS to -Xmx8G, but I still get the same OOM error.

bash-4.4# alluxio fs du -sh /
File Size     In Alluxio       Path
SLF4J: Failed toString() invocation on an object of type [java.util.ArrayList]
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:124)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:448)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at java.util.AbstractCollection.toString(AbstractCollection.java:462)
	at org.slf4j.helpers.MessageFormatter.safeObjectAppend(MessageFormatter.java:304)
	at org.slf4j.helpers.MessageFormatter.deeplyAppendParameter(MessageFormatter.java:276)
	at org.slf4j.helpers.MessageFormatter.arrayFormat(MessageFormatter.java:230)
	at org.slf4j.impl.Log4jLoggerAdapter.warn(Log4jLoggerAdapter.java:463)
	at alluxio.AbstractClient.retryRPC(AbstractClient.java:372)
	at alluxio.client.file.RetryHandlingFileSystemMasterClient.listStatus(RetryHandlingFileSystemMasterClient.java:228)
	at alluxio.client.file.BaseFileSystem.lambda$listStatus$9(BaseFileSystem.java:274)
	at alluxio.client.file.BaseFileSystem$$Lambda$71/825249556.call(Unknown Source)
	at alluxio.client.file.BaseFileSystem.rpc(BaseFileSystem.java:531)
	at alluxio.client.file.BaseFileSystem.listStatus(BaseFileSystem.java:270)
	at alluxio.cli.fs.command.DuCommand.runPlainPath(DuCommand.java:94)
	at alluxio.cli.fs.command.AbstractFileSystemCommand.runWildCardCmd(AbstractFileSystemCommand.java:92)
	at alluxio.cli.fs.command.DuCommand.run(DuCommand.java:207)
	at alluxio.cli.AbstractShell.run(AbstractShell.java:137)
	at alluxio.cli.fs.FileSystemShell.main(FileSystemShell.java:66)
84.29GB       0B (0%)          /

Here is my jmap -histo result:

bash-4.4# jps
2417 FileSystemShell
257 AlluxioJobMaster
258 AlluxioMaster
2504 Jps


bash-4.4# jmap -histo 2417 | head -20

 num     #instances         #bytes  class name
----------------------------------------------
   1:      34261544     3187371480  [C
   2:      34261508      822276192  java.lang.String
   3:       3804849      639214632  alluxio.wire.FileInfo
   4:      11414717      547906416  java.util.HashMap
   5:      15219885      365277240  java.util.ArrayList
   6:       3805045      304420728  [Ljava.util.HashMap$Node;
   7:       7612891      198845528  [Ljava.lang.Object;
   8:       7610004      182640096  java.lang.Long
   9:       3807207      121830624  java.util.HashMap$Node
  10:       3804850      121755200  alluxio.security.authorization.AccessControlList
  11:       3804847      121755104  alluxio.wire.BlockInfo
  12:       3804847      121755104  alluxio.wire.FileBlockInfo
  13:       3804874       60877984  java.util.HashSet
  14:       3804849       60877584  alluxio.client.file.URIStatus
  15:          2149       34411984  [I
  16:        537922        8606752  java.util.HashMap$KeySet
  17:          4144         465776  java.lang.Class

It looks like all the FileInfo instances are kept in the JVM heap and are not reclaimed.
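To put the histogram in perspective, here is a small sketch that adds up the #bytes column for the 17 classes shown and divides by the number of alluxio.wire.FileInfo instances. The class name HistoTotals and its helper methods are made up for illustration; the figures are copied from the jmap output above. By this count the listed classes hold about 6.9 GB, or roughly 1.8 KB of heap per listed file, which may explain why even -Xmx8G leaves little headroom.

```java
public class HistoTotals {
    // "#bytes" column of the jmap -histo output above, rows 1 to 17.
    static final long[] BYTES = {
        3_187_371_480L, 822_276_192L, 639_214_632L, 547_906_416L,
        365_277_240L, 304_420_728L, 198_845_528L, 182_640_096L,
        121_830_624L, 121_755_200L, 121_755_104L, 121_755_104L,
        60_877_984L, 60_877_584L, 34_411_984L, 8_606_752L, 465_776L
    };

    // "#instances" of alluxio.wire.FileInfo (row 3), one per listed file.
    static final long FILE_INFO_INSTANCES = 3_804_849L;

    /** Sum of the bytes held by the classes listed in the histogram. */
    public static long totalBytes() {
        long sum = 0;
        for (long b : BYTES) {
            sum += b;
        }
        return sum;
    }

    /** Approximate heap cost per listed file (integer division). */
    public static long bytesPerFileInfo() {
        return totalBytes() / FILE_INFO_INSTANCES;
    }

    public static void main(String[] args) {
        System.out.println("total bytes in listed classes: " + totalBytes());
        System.out.println("approx. bytes per FileInfo:    " + bytesPerFileInfo());
    }
}
```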

Describe the solution you'd like
Can we change the behavior of alluxio fs du -s <path> so that the summing is done on the Alluxio master side instead of the client side? Then the client would not need to fetch all the FileInfo instances.

Describe alternatives you've considered
Alternatively, the client could sum file sizes incrementally instead of waiting until all the FileInfo instances have been instantiated, so that FileInfos that have already been summed can be reclaimed by the next GC.
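A minimal sketch of the incremental idea, under stated assumptions: Entry and listing() are hypothetical stand-ins, not Alluxio classes or APIs, and the sketch assumes the listing can be consumed one entry at a time rather than arriving as a single fully materialized list (as the listStatus call in the stack trace above appears to return). It is not a description of how the issue was actually resolved.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class StreamingDu {
    /** Hypothetical stand-in for per-file metadata; not an Alluxio class. */
    public static final class Entry {
        final String path;
        final long length;

        Entry(String path, long length) {
            this.path = path;
            this.length = length;
        }
    }

    /** Simulated listing that creates entries lazily, one per call to next(). */
    public static Iterator<Entry> listing(final int count, final long sizeEach) {
        return new Iterator<Entry>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < count;
            }

            @Override
            public Entry next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return new Entry("/file-" + (i++), sizeEach);
            }
        };
    }

    /** Keeps only a running total; each entry is unreachable once its length is added. */
    public static long sumLengths(Iterator<Entry> entries) {
        long total = 0;
        while (entries.hasNext()) {
            total += entries.next().length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Same file count as in the report, with an arbitrary 1 KiB per file.
        System.out.println(sumLengths(listing(3_804_849, 1024L)));
    }
}
```

Because only the running total stays reachable, the client's live set would stay roughly constant regardless of how many files are listed, provided the transport also delivers entries in batches.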

Urgency
Urgent. A UFS with a large number of small files may be common in our scenario.

Additional context
I've found a related issue: #12088

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

apc999 commented, Oct 26, 2020

@TrafalgarZZZ I will take a look! thanks for the report

apc999 commented, Nov 3, 2020

Resolved by #12423
