Unexpected timeouts in logical replication

Hi!

I am trying out the logical replication feature (https://www.npgsql.org/doc/replication.html) and I have a few questions. I hope you can help.

  1. I have created a table, a publication and a replication slot. Then I copied the code from the documentation:
await foreach (var message in connection.StartReplication(slot, options, cancellationToken))
{
    Console.WriteLine(message);
}

But every time I run the application, I get all messages from the beginning. Is there some way to confirm the processing of the message? I’ve tried using SendStatusUpdate but it doesn’t work:

await foreach (var message in connection.StartReplication(slot, options, cancellationToken))
{
    Console.WriteLine(message);

    await connection.SendStatusUpdate(cancellationToken);
}
  2. When the application does not receive messages for a long time, I get an exception:
Npgsql.NpgsqlException (0x80004005): Exception while reading from stream
 ---> System.TimeoutException: Timeout during reading attempt
   at Npgsql.NpgsqlConnector.<ReadMessage>g__ReadMessageLong|194_0(NpgsqlConnector connector, Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at Npgsql.Replication.ReplicationConnection.StartReplicationInternal(String command, Boolean bypassingStream, CancellationToken cancellationToken)+MoveNext()
   at Npgsql.Replication.ReplicationConnection.StartReplicationInternal(String command, Boolean bypassingStream, CancellationToken cancellationToken)+MoveNext()
   at Npgsql.Replication.ReplicationConnection.StartReplicationInternal(String command, Boolean bypassingStream, CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Npgsql.Replication.PgOutput.PgOutputAsyncEnumerable.StartReplicationInternal(CancellationToken cancellationToken)+MoveNext()
   at Npgsql.Replication.PgOutput.PgOutputAsyncEnumerable.StartReplicationInternal(CancellationToken cancellationToken)+MoveNext()
   at Npgsql.Replication.PgOutput.PgOutputAsyncEnumerable.StartReplicationInternal(CancellationToken cancellationToken)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()

I’ve tried using an infinite loop like this:

while (true)
{
    try
    {
        await foreach (var message in connection.StartReplication(slot, options, cancellationToken))
        {
            Console.WriteLine(message);
        }
    }
    catch (NpgsqlException ex)
    {
        Console.WriteLine(ex);
        continue;
    }
}

But this again reads all the messages from the beginning. How should I handle this situation correctly?

  3. Is there some way to get the old values of updated and deleted rows? As it stands, I cannot tell which row was deleted, so I cannot process it:
if (message is DeleteMessage deleteMessage)
{
    // How to process this message?
}

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 16 (13 by maintainers)

Top GitHub Comments

Brar commented, Mar 25, 2021

@Chakrygin I just want to let you know that we’ve released 5.0.4 which contains the fix for the problem described above.

Brar commented, Mar 11, 2021

> What is the difference between LastAppliedLsn and LastFlushedLsn? In what scenarios will the LastAppliedLsn update be useful?

It’s essentially two different levels of persistence that you can report back to the server.

Above I wrote that “I’d advise you to keep track of their log sequence number (LSN) in your consuming application”, but since I have no idea what your application will do and what consistency guarantees it needs, I didn’t go any further. You might process the transactions you received from the server in memory and report back via LastAppliedLsn that you’ve successfully applied the transaction in your system (e.g. that it’s visible to users). On the other hand, you may not want to persist the transaction to disk storage immediately (e.g. for performance reasons) using fsync (or FileStream.Flush()), but once you do so, you can report this back to the server via LastFlushedLsn.
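
To make the two levels concrete, here is a minimal self-contained sketch. The ReplicationProgress class and its members are hypothetical and not part of Npgsql. LSNs are modeled as plain ulong values, and the sketch assumes the scenario described above, where changes are applied in memory first and flushed to storage later.

```csharp
using System;

// Hypothetical helper (not an Npgsql type): tracks the two persistence
// levels a consuming application can report back to the server.
public sealed class ReplicationProgress
{
    // Highest LSN whose changes have been applied (e.g. visible to users).
    public ulong LastAppliedLsn { get; private set; }

    // Highest LSN whose changes have been durably written to storage.
    public ulong LastFlushedLsn { get; private set; }

    // Call after a transaction has been processed in memory.
    // Math.Max keeps the value from ever moving backwards.
    public void MarkApplied(ulong lsn) =>
        LastAppliedLsn = Math.Max(LastAppliedLsn, lsn);

    // Call after everything applied so far has been persisted to disk.
    public void MarkFlushed() => LastFlushedLsn = LastAppliedLsn;
}
```

In this model, LastFlushedLsn trails LastAppliedLsn until the application decides to flush, which is the gap between the two values that the consumer would report.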

In synchronous replication you can use the synchronous_commit server configuration option to configure the guarantees the server will await from the replication standby (your application) for transaction commits.

You can have a look at our SynchronousReplication test if you want to look at the details.

> Am I correct in understanding that updating LastAppliedLsn is optional?

I’d say yes, for asynchronous replication scenarios, but if you look at the documentation around synchronous_commit you’ll probably see that it’s pretty confusing. Personally I’d always assign both of them, either at the same time or independently, depending on whether the client has applied the transaction or has flushed it to the storage system.
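
As a rough sketch of "assigning both at the same time", the loop from the question could look like the following. It reuses the connection, slot, options and cancellationToken variables from the question. The ReplicationConsumer class and ConsumeAsync method are illustrative names, and the sketch assumes that printing a message is all the processing the application needs, so each message counts as both applied and flushed. It also assumes that Npgsql then reports the assigned values to the server with its regular status updates.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Npgsql.Replication;
using Npgsql.Replication.PgOutput;

public static class ReplicationConsumer
{
    // Illustrative wrapper around the loop from the question.
    public static async Task ConsumeAsync(
        LogicalReplicationConnection connection,
        PgOutputReplicationSlot slot,
        PgOutputReplicationOptions options,
        CancellationToken cancellationToken)
    {
        await foreach (var message in connection.StartReplication(slot, options, cancellationToken))
        {
            Console.WriteLine(message);

            // Assumption: once printed, the message is fully processed,
            // so both levels are reported at the same position.
            connection.LastAppliedLsn = message.WalEnd;
            connection.LastFlushedLsn = message.WalEnd;
        }
    }
}
```

An application that buffers changes before writing them to disk would instead assign LastFlushedLsn later, only after the flush has actually happened.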
