How to work with database connections across processes?
I’m opening a new issue, but this is the same question as #10778.
I am new to Ray, so I may have some of the concepts wrong. I want to use Ray to process a large number of entries from an external database, e.g. PostgreSQL. Database connections (e.g. via SQLAlchemy) are typically not serializable: they are scoped to the process that created them and are not usable outside it. A DB session or SQLAlchemy object will break if it is serialized and reconstructed in another process.
The typical way to get around this problem is to pass the object’s (integer) ID to the multiprocessing code and, on the receiving end, query the database with a process-local session to get the object. The issue to avoid is creating and destroying n sessions to manage n objects. Ideally, one session would be created per process and then reused for every task handled by that process.
Is it possible to do this in Ray? Initialize and reuse an object to be scoped locally to a single process?
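A Ray actor maps onto this requirement: each actor runs in its own worker process, so a connection created in the actor’s constructor is process-local and is reused by every method call. Below is a minimal sketch assuming SQLAlchemy 1.4+; the `User` model and the connection URL are hypothetical stand-ins, not from the original thread.

```python
import ray
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    # Hypothetical model; stands in for whatever rows you need to process.
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String)

@ray.remote
class DbWorker:
    """One actor = one worker process = one long-lived session."""

    def __init__(self, db_url: str):
        # Built inside the actor process, so nothing unpicklable
        # ever has to cross a process boundary.
        engine = create_engine(db_url)
        self.session = sessionmaker(bind=engine)()

    def process(self, user_id: int) -> str:
        # Only the integer ID was sent to this process; re-fetch the
        # row with the process-local session.
        user = self.session.get(User, user_id)
        return user.name

if __name__ == "__main__":
    ray.init()
    worker = DbWorker.remote("postgresql://localhost/mydb")  # hypothetical URL
    print(ray.get(worker.process.remote(42)))
```

The point of the design is that only cheap, picklable IDs travel between processes, while the session stays put inside the actor for its entire lifetime.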
@andreapiso I reread the opening comment and realized I had misunderstood the request; what I had been trying to do was run a bunch of tasks in parallel while limiting the number of database connections created.
Just in case this is what you are also trying to do (and not share database connections between processes, as the original request was), a pattern that I’ve found super helpful is to create a pool of actors and call `map_unordered` (or `map` if you care about order), as in the sketch after this comment.

So, any chance you could show us how you solved it? Thanks.