[Bug] Objects not being spilled in 100MB bundles

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Core

What happened + What you expected to happen

Ray setup: 1GB object store memory.

We put 16,000 objects of 1MB each into the object store using ray.put(), which causes most of them to be spilled to disk, and then schedule tasks that require these objects so that they are restored into memory.

Default RAY_min_spilling_size = 100 * 1024 * 1024.

Expected behavior: close to 16GB of objects gets spilled, in chunks of at least 100MB.

Reality: only the first 8 spills were over 100MB; all the subsequent spill requests are just 1–2MB. In the end, there were ~14000 spill requests, as opposed to ~160 as expected.
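The expected request count can be checked with a few lines of arithmetic. This is a rough sketch that assumes the script's default sizes and the default RAY_min_spilling_size, and that every object is spilled in a batch just large enough to reach the threshold; the constant names are illustrative, not Ray identifiers.

```python
import math

# Values taken from the issue text and the reproduction script defaults.
MIN_SPILLING_SIZE = 100 * 1024 * 1024  # default RAY_min_spilling_size, in bytes
TOTAL_DATA_SIZE = 16_000_000_000       # --total_data_size
NUM_OBJECTS = 16000                    # --num_objects

object_size = TOTAL_DATA_SIZE // NUM_OBJECTS

# Smallest number of objects whose combined payload reaches the threshold.
objects_per_batch = math.ceil(MIN_SPILLING_SIZE / object_size)

# Upper bound on full-size batches if every object were spilled.
expected_batches = math.ceil(NUM_OBJECTS / objects_per_batch)

print(object_size, objects_per_batch, expected_batches)
```

The 105 objects per batch agree with the "num objects 105" entries in the log, and the resulting batch count is in the same range as the roughly 160 requests the report expects, far below the roughly 14,000 observed.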

In raylet.out:

[2022-02-17 23:40:43,668 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,678 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,690 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,701 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,712 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,723 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,734 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,745 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105
[2022-02-17 23:40:43,756 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 26006500 num objects 26
[2022-02-17 23:40:43,767 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,778 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,789 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,801 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,812 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,823 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,834 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,845 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,856 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,867 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,878 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,889 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
[2022-02-17 23:40:43,900 D 753034 753034] (raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1
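To quantify the pattern over a whole log rather than by eye, the spill lines can be tallied with a short script. This is a minimal sketch that assumes the debug message format shown above; `summarize_spills` and `SPILL_RE` are hypothetical helper names, not part of Ray.

```python
import re

# Matches the debug line emitted by local_object_manager.cc in the excerpt above.
SPILL_RE = re.compile(r"Spilling objects of total size (\d+) num objects (\d+)")
MIN_SPILLING_SIZE = 100 * 1024 * 1024  # default RAY_min_spilling_size, in bytes


def summarize_spills(lines, min_bytes=MIN_SPILLING_SIZE):
    """Count spill requests at or above min_bytes, and those below it."""
    full = small = 0
    for line in lines:
        match = SPILL_RE.search(line)
        if match is None:
            continue  # Not a spill line.
        if int(match.group(1)) >= min_bytes:
            full += 1
        else:
            small += 1
    return full, small


sample = [
    "(raylet) local_object_manager.cc:191: Spilling objects of total size 105026250 num objects 105",
    "(raylet) local_object_manager.cc:191: Spilling objects of total size 26006500 num objects 26",
    "(raylet) local_object_manager.cc:191: Spilling objects of total size 1000250 num objects 1",
]
full_batches, small_batches = summarize_spills(sample)
print(f"full batches: {full_batches}, undersized batches: {small_batches}")
```

Passing an open file handle for raylet.out instead of `sample` would give the totals for a full run.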

Versions / Dependencies

1.10.0 and master

Reproduction script

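import argparse
import logging
import subprocess
import time

import numpy as np
import ray
import tqdm


def get_args(*args, **kwargs):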
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--total_data_size",
        default=16_000_000_000,
        type=int,
    )
    parser.add_argument(
        "--num_objects",
        default=16000,  # 1MB
        # default=16000 * 10,  # 100KB
        # default=16000 * 20,  # 50KB
        type=int,
    )
    parser.add_argument(
        "--num_objects_per_task",
        default=200,
        type=int,
    )
    parser.add_argument(
        "--object_store_memory",
        default=1 * 1024 * 1024 * 1024,
        type=int,
    )
    parser.add_argument(
        "--task_parallelism",
        default=1,
        type=int,
    )
    parser.add_argument(
        "--no_fusing",
        default=False,
        action="store_true",
    )
    parser.add_argument(
        "--no_prefetching",
        default=False,
        action="store_true",
    )
    args = parser.parse_args(args, **kwargs)
    args.object_size = args.total_data_size // args.num_objects
    args.num_tasks = args.num_objects // args.num_objects_per_task
    assert args.object_size * args.num_objects_per_task < args.object_store_memory, args
    return args

args = get_args()

@ray.remote
def consume(*xs):
    time.sleep(1)
    return sum(x.size for x in xs)

def consume_all(args, refs):
    tasks = [
        consume.remote(
            *refs[t * args.num_objects_per_task : (t + 1) * args.num_objects_per_task]
        )
        for t in range(args.num_tasks)
    ]
    with tqdm.tqdm(total=len(tasks)) as pbar:
        not_ready = tasks
        while not_ready:
            _, not_ready = ray.wait(not_ready, fetch_local=False)
            pbar.update(1)
    print(ray.get(tasks))


def produce_all(args):
    refs = []
    for i in tqdm.tqdm(range(args.num_objects)):
        obj = np.full(args.object_size, i % 256, dtype=np.uint8)
        refs.append(ray.put(obj))
    return refs

def microbenchmark(args):
    logging.info("Produce")
    refs = produce_all(args)

    logging.info("Dropping filesystem cache")
    subprocess.run("sudo bash -c 'sync; echo 3 > /proc/sys/vm/drop_caches'", shell=True)

    logging.info("Consume")
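    # The original snippet called consume_one_by_one(), which is not defined in
    # this excerpt; consume_all() is the consumer shown above.
    consume_all(args, refs)


if __name__ == "__main__":
    # Assumed entry point (omitted from the original snippet): start Ray with the
    # configured object store size, then run the benchmark.
    ray.init(object_store_memory=args.object_store_memory)
    microbenchmark(args)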

Anything else

The object store should always be under pressure since we are putting 16GB of objects into a 1GB object store, so the object manager shouldn’t have trouble finding eligible objects to spill. So why is it only spilling 1–2 objects at a time?

cc @stephanie-wang @rkooo567 @scv119 @ericl

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 2
  • Comments: 8 (8 by maintainers)

Top GitHub Comments

1 reaction
franklsf95 commented, Feb 19, 2022

It’s not blocking me.

0 reactions
stephanie-wang commented, Feb 23, 2022

I agree with @franklsf95. I also agree with @scv119 that we shouldn’t block waiting for new objects to appear, but the fact that this is happening on Frank’s repro script makes it seem like there is an actual bug.

Looking at the code, it looks like this might be a pretty simple fix. I think the problem is that we’re calling this function repeatedly until we reach the end of the objects list or we hit the min number of bytes to spill, so that means the last call is almost always going to spill less than the min number of bytes to spill. Looks like we could just change it to check if we are at the end of the object list and there are no spills pending.
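The batching rule described in this comment can be illustrated with a small model. This is a simplified Python sketch, not Ray's actual C++ code: `select_batch` and its parameters are hypothetical names, and the only assumption is the behavior stated above (collect objects until the minimum size is reached or the list ends, and, under the proposed fix, spill an undersized batch only when no spills are pending).

```python
MIN_SPILLING_SIZE = 100 * 1024 * 1024  # default RAY_min_spilling_size, in bytes


def select_batch(sizes, min_bytes, spills_pending, require_idle_for_partial):
    """Return how many leading objects to spill now; 0 means wait.

    sizes: sizes of the objects currently eligible for spilling, in order.
    spills_pending: True if earlier spill requests are still in flight.
    require_idle_for_partial: False models the reported behavior, True models
        the proposed fix.
    """
    total = 0
    for count, size in enumerate(sizes, start=1):
        total += size
        if total >= min_bytes:
            return count  # Threshold reached: spill a full batch.
    # End of the list reached with fewer than min_bytes collected.
    if not sizes:
        return 0
    if require_idle_for_partial and spills_pending:
        return 0  # Wait: in-flight spills will free memory, more objects may arrive.
    return len(sizes)  # Spill the undersized batch.


few = [1_000_250] * 3     # only a few objects eligible, as in the later log lines
many = [1_000_250] * 200  # plenty of objects eligible

reported = select_batch(few, MIN_SPILLING_SIZE, True, require_idle_for_partial=False)
proposed = select_batch(few, MIN_SPILLING_SIZE, True, require_idle_for_partial=True)
idle = select_batch(few, MIN_SPILLING_SIZE, False, require_idle_for_partial=True)
full = select_batch(many, MIN_SPILLING_SIZE, True, require_idle_for_partial=True)
print(reported, proposed, idle, full)
```

In this model the proposed rule still spills an undersized batch when nothing is in flight, so progress is not blocked waiting for new objects to appear, which is the concern attributed to @scv119 above.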
