
[Bug] plots beyond ~4400 = harvester 100.0 load, cache_hit: false, plots check hangs before challenges


What happened?

Noted that for the last few releases, chia_harvester has been pegging a thread continuously while farming.

Info:

  • System has >20k plots direct attached. Single harvester.
  • plot_refresh_callback completes in 15 seconds and proof checks are typically 0.4-1 sec.
  • Aside from chia_harvester constantly pegging its thread, all else appears to function normally.

Elaboration:

  • Reinstalled chia_blockchain from scratch, only importing keys and the mainnet/wallet DBs. No change.
  • Experimented with varying numbers of plots: below ~4400 plots, chia_harvester no longer pegs a thread (load dropped to 0.0). Adding 200 plots back made the load jump back to 100.0 indefinitely.
  • Experimented with various harvester config settings (num_threads, parallel_reads, batch_size). No change. (A config sketch illustrating these settings follows this list.)
  • Noted that upon startup, with >4400 plots, the found_plot messages from the harvester transition from cache_hit: True to cache_hit: False.
  • Also noted that running chia plots check on any of the drives/plots with cache_hit: False hangs that check indefinitely before it issues a single challenge.
  • Rewards are tracking my total plot count (not 4400), so while cache_hit: False causes high harvester CPU usage and prevents checking those plots, they are still farming successfully.
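
For reference, the settings above live in the harvester section of chia's config.yaml. The snippet below is a hedged sketch of that section for the 1.5.x era; the key names (num_threads, parallel_read, plots_refresh_parameter and its sub-keys) and the values shown are from memory and may differ between releases, so treat it as illustrative rather than authoritative.

    harvester:
      # threads used for proof/quality lookups
      num_threads: 30
      # read plot files in parallel rather than sequentially
      parallel_read: True
      # controls the periodic plot refresh that feeds chia.plotting.cache
      plots_refresh_parameter:
        interval_seconds: 120
        retry_invalid_seconds: 1200
        batch_size: 300
        batch_sleep_milliseconds: 1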

Possible causes:

  • This feels like high plot counts not playing nicely with plot_refresh / chia.plotting.cache: one of the harvester threads pegs indefinitely while attempting to cache some portion of plots beyond some maximum, and perhaps that same thread then fails to respond to a plots check of those same plots. (A minimal sketch of this suspected failure mode follows.)
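
The sketch below is plain Python, not the actual chia.plotting.cache code; the per-plot blob size and the uint32 length prefix are borrowed from the maintainer's findings further down in this thread, and the plot counts are hypothetical. It only illustrates why a failed cache write would keep one thread busy forever.

    # Hypothetical model: every refresh, the harvester re-serializes one cache
    # entry per plot, then writes the whole blob with a uint32 length prefix.
    # If the blob no longer fits into uint32, the write fails, and the next
    # refresh repeats the same expensive work -> one thread pegged indefinitely.
    import struct

    ENTRY_BYTES = 524_659                     # per-plot blob size reported later in this thread

    def refresh_once(plot_count: int) -> bool:
        blob_len = ENTRY_BYTES * plot_count   # stand-in for serializing every entry (the slow part)
        try:
            struct.pack(">I", blob_len)       # uint32 length prefix for the serialized cache
        except struct.error:
            return False                      # blob_len > 4_294_967_295 -> cache write fails
        return True

    print(refresh_once(4_000))    # True: the cache still fits into uint32
    print(refresh_once(11_000))   # False: the write fails, so the next refresh starts over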

Version

1.5.0

What platform are you using?

Linux

What ui mode are you using?

CLI

Relevant log output

No response

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 1
  • Comments: 21 (6 by maintainers)

Top GitHub Comments

3 reactions
xdustinface commented, Sep 19, 2022

Okay so… it turned out that the reason for all this is plots created via the bladebit RAM plotter, for which the DiskProver serializes into 524,659 bytes, which:

  • Obviously takes a very long time given the number of those plots.
  • Lets the cache grow like crazy, so that we end up with a number of bytes which doesn’t fit into uint32 -> Value 5794656522 does not fit into uint32 while we serialize the length of the bytes. (See the quick arithmetic check after this list.)
  • Leads to the refresh thread constantly working on the serialization; as soon as it’s done, it fails to write for the reason above, and then in the next refresh event it tries the same again. This seems to be the reason for the 100% peg.
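
A quick back-of-the-envelope check of the numbers above (the uint32 maximum is exact; the entry count is simply what the division implies):

    UINT32_MAX = 2**32 - 1            # 4_294_967_295
    cache_bytes = 5_794_656_522       # the value that failed to serialize above
    per_plot = 524_659                # bytes per affected DiskProver

    print(cache_bytes > UINT32_MAX)   # True -> the uint32 length prefix cannot be written
    print(cache_bytes / per_plot)     # ~11044.6, i.e. roughly 11k oversized entries' worth of data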

The reason why the DiskProver serializes into such a huge blob is that those plots seem to have 65,536 C2 entries. (The arithmetic after the two dumps below checks this.)

Table pointers from a plot in question, with table_begin_pointers[10] - table_begin_pointers[9] -> 262,144:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 262144
 [2] = {unsigned long long} 14839185408
 [3] = {unsigned long long} 28822208512
 [4] = {unsigned long long} 42911924224
 [5] = {unsigned long long} 57272958976
 [6] = {unsigned long long} 72367734784
 [7] = {unsigned long long} 89824165888
 [8] = {unsigned long long} 107538284544
 [9] = {unsigned long long} 107540119552
 [10] = {unsigned long long} 107540381696

Table pointers from a normally working plot with table_begin_pointers[10] - table_begin_pointers[9] -> 176:

table_begin_pointers = {std::vector<unsigned long long>} size=11
 [0] = {unsigned long long} 0
 [1] = {unsigned long long} 252
 [2] = {unsigned long long} 14839436976
 [3] = {unsigned long long} 28822365051
 [4] = {unsigned long long} 42911861451
 [5] = {unsigned long long} 57273202401
 [6] = {unsigned long long} 72368924901
 [7] = {unsigned long long} 89827257426
 [8] = {unsigned long long} 107543532882
 [9] = {unsigned long long} 107545250830
 [10] = {unsigned long long} 107545251006
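
The difference between the last two table pointers is the on-disk size of the C2 table, so the two dumps can be compared directly. The bytes-per-entry figure below is an assumption (roughly 4 bytes per C2 entry for a k=32 plot), used only to show that the numbers are consistent with 65,536 entries:

    # C2 table size = table_begin_pointers[10] - table_begin_pointers[9]
    bladebit_c2 = 107_540_381_696 - 107_540_119_552   # 262_144 bytes
    normal_c2 = 107_545_251_006 - 107_545_250_830     # 176 bytes

    print(bladebit_c2, normal_c2)   # 262144 vs 176, roughly 1500x larger
    print(bladebit_c2 // 4)         # 65536, matching the reported C2 entry count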

I’m going to talk with @harold-b about this and will post an update once we’ve figured this out.

1 reaction
malventano commented, Aug 26, 2022

“My system automatically deletes the C:\Users\Administrator.* folders on every startup. The cache issue mentioned doesn’t exist for me.”

It could still be a caching-related issue, since a new cache would be created on the next startup (and that cache is then used while the harvester runs). Either way, we won’t know unless we can figure out a way to tell what those pegged harvester threads are doing, for example by sampling them (a sketch follows).
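
One low-effort way to see what a pegged harvester thread is doing is to sample the process with py-spy (a third-party profiler, not part of chia; the PID below is a placeholder):

    # Install the sampling profiler, then dump the current Python stack of
    # every thread in the running chia_harvester process (no restart needed).
    pip install py-spy
    py-spy dump --pid <chia_harvester PID>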
