How to work with database connections across processes?
I’m opening a new issue, but this is the same question as #10778.
I am new to Ray, so I may have some of the concepts wrong. I want to use Ray to process a large number of entries from an external database, e.g. PostgreSQL. Database connections (e.g. via SQLAlchemy) are typically not serializable: they are scoped to the process that created them and are not usable outside it. A DB session or SQLAlchemy object will break if it is serialized and reconstructed in another process.
The typical way to get around this problem is to pass the object’s (integer) ID to the multiprocessing code and, on the receiving end, query the database with a process-local session to get the object. The issue to avoid is creating and destroying n sessions to manage n objects. Ideally, one session would be created per process and then reused for every task handled by that process.
Is it possible to do this in Ray? Initialize and reuse an object to be scoped locally to a single process?
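A Ray actor maps onto this requirement: each actor runs in its own worker process, so a connection created in the actor’s constructor is process-local and is reused by every method call. Below is a minimal sketch assuming SQLAlchemy 1.4+; the `User` model and the connection URL are hypothetical stand-ins, not from the original thread.

```python
import ray
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    # Hypothetical model; stands in for whatever rows you need to process.
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

@ray.remote
class DbWorker:
    """One actor = one worker process = one long-lived session."""

    def __init__(self, db_url: str):
        # Built inside the actor process, so nothing unpicklable
        # ever has to cross a process boundary.
        engine = create_engine(db_url)
        self.session = sessionmaker(bind=engine)()

    def process(self, user_id: int) -> str:
        # Only the integer ID was sent to this process; re-fetch the
        # row with the process-local session.
        user = self.session.get(User, user_id)
        return user.name

if __name__ == "__main__":
    ray.init()
    worker = DbWorker.remote("postgresql://localhost/mydb")  # hypothetical URL
    print(ray.get(worker.process.remote(42)))
```

The point of the design is that only cheap, picklable IDs travel between processes, while the session stays put inside the actor for its entire lifetime.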
@andreapiso I reread the opening comment and realized I had misunderstood the request; what I had been trying to do was run a bunch of tasks in parallel while limiting the number of database connections created.
Just in case this is what you are also trying to do (and not share database connections between processes, as the original request was), a pattern that I’ve found super helpful is to create a pool of actors and call `map_unordered` (or `map` if you care about order), as in the sketch after this comment.

So, any chance you could show us how you solved it? Thanks.