How to work with database connections across processes?

I’m opening a new issue, but this is the same question as #10778.

I am new to Ray, so I may have some of the concepts incorrect. I want to use Ray to process a large number of entries from an external database, e.g. PostgreSQL. Database connections (e.g. via SQLAlchemy) are not typically serializable - they are scoped to the process they are created in and not applicable outside of that process. A db session or SQLAlchemy object will break when serialized and created in another process.

The typical way around this problem is to pass the object’s (integer) ID to the multiprocessing code and then, on the receiving end, query the database with a process-local session to get the object. The issue to avoid is creating and destroying n sessions to manage n objects. Ideally, one session would be created per process and then reused for every task handled by that process.

Is it possible to do this in Ray? Initialize and reuse an object to be scoped locally to a single process?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 8 (1 by maintainers)

Top GitHub Comments

fwong03 commented, Sep 12, 2022 (1 reaction)

@andreapiso I reread the opening comment and realized I had misunderstood the request. What I had been looking to do was run a bunch of tasks in parallel while limiting the number of database connections created.

In case that is also what you are trying to do (rather than share database connections between processes, as in the original request), a pattern I’ve found very helpful is to create a pool of actors and call map_unordered (or map if you care about order), like this:

import ray
from ray.util import ActorPool


@ray.remote
class ActorThatQueries:
    def __init__(self):
        # Configure the database connection here. It is created once per actor
        # and reused by every query_db call on that actor.
        self.conn = None  # placeholder for a real connection or session

    def query_db(self, val):
        # Replace this with code that queries the database through self.conn.
        res = val  # placeholder result
        return res


actors = [ActorThatQueries.remote() for _ in range(5)]
pool = ActorPool(actors)

# Process 100 values with at most 5 running at a time (one per actor),
# which limits the number of database connections in use.
results = list(pool.map_unordered(fn=lambda a, v: a.query_db.remote(v), values=list(range(100))))

wjrforcyber commented, Oct 9, 2022 (0 reactions)

Is there any chance you could show us how you solved it? Thanks.
