neobolt.exceptions.ServiceUnavailable in ecr.load_ecr_repository_images() when connecting to remote Neo4j

Description:

What issue is being seen? Describe what should be happening instead of the bug, for example: Cartography should not crash, the expected value isn’t returned, the data schema is wrong, etc.

When loading 50MB worth of data to a remote Neo4j server (i.e. not located on the same machine), ecr.load_ecr_repository_images() crashes with a neobolt.exceptions.ServiceUnavailable error after running for 2 hours.

To Reproduce:

Steps to reproduce the behavior. Provide all data and inputs required to reproduce the issue.

Run ecr.load_ecr_repository_images() with 50MB of data.

POC code:

from neo4j import GraphDatabase
import cartography.intel.aws.ecr
import time
# You will need to provide your own data here.
# data shape = [{'repo_uri': 'uri', 'repo_images': [{'imageDigest': 'mydigest', 'imageTag': 'mytag'}, ...]},...]
from image_data import image_list

neo4j_driver = GraphDatabase.driver("bolt://your-remote-endpoint:7687")
neo4j_session = neo4j_driver.session()
account_id = '1234'
aws_update_tag = int(time.time())
region = 'us-east-1'

common_job_parameters = {
    "UPDATE_TAG": aws_update_tag,
    "AWS_ID": account_id,
}

cartography.intel.aws.ecr.load_ecr_repository_images(neo4j_session, image_list, region, aws_update_tag)
cartography.intel.aws.ecr.cleanup(neo4j_session, common_job_parameters)

Logs:

If applicable, copy and paste your console log with the failing stack trace.

Traceback (most recent call last):
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 54] Connection reset by peer
Exception ignored in: 'neobolt.bolt._io.ChunkedInputBuffer.receive'
Traceback (most recent call last):
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 1071, in recv_into
    return self.read(nbytes, buffer)
  File "/Users/achantavy/.pyenv/versions/3.7.9/lib/python3.7/ssl.py", line 929, in read
    return self._sslobj.read(len, buffer)
ConnectionResetError: [Errno 54] Connection reset by peer
Traceback (most recent call last):
  File "load_ecr_list_images.py", line 17, in <module>
    cartography.intel.aws.ecr.load_ecr_repository_images(neo4j_session, image_list, region, aws_update_tag)
  File "/Users/achantavy/lyftsrc/cartography/cartography/util.py", line 63, in timed
    return method(*args, **kwargs)
  File "/Users/achantavy/lyftsrc/cartography/cartography/intel/aws/ecr.py", line 122, in load_ecr_repository_images
    Region=region,
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neo4j/__init__.py", line 503, in run
    self._connection.fetch()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 414, in fetch
    return self._fetch()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 431, in _fetch
    self._receive()
  File "/Users/achantavy/.virtualenvs/env11/lib/python3.7/site-packages/neobolt/direct.py", line 472, in _receive
    raise self.Error("Failed to read from defunct connection {!r}".format(self.server.address))
neobolt.exceptions.ServiceUnavailable: Failed to read from defunct connection Address(host={IP}, port=7687)

Please complete the following information:

  • Cartography release version or commit hash [e.g. 0.12.0 or 95e8e11913e2a44a4d4682506d8364a638ceac69]

0c9a662672cf4925ac8214b44451b1c50374aa97

  • Python version: [e.g. 3.7.4]

3.7.9

  • OS (feel free to omit this if you don’t think it’s relevant to your issue): [e.g. Ubuntu bla bla, OSX bla bla]

I have observed this in a Docker container based on Debian as well as on my OSX laptop. Neither appears to be resource constrained: CPU usage is around 0%, and memory usage of the Python process is about 200-300MB.

Additional context:

Add any other context about the problem here.

This appears related to

all of these issues involve sending fairly large objects over the Bolt connection.


Update:

I’ve also observed this issue on load_ecr_repositories().

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 11 (3 by maintainers)

Top GitHub Comments

2 reactions
achantavy commented, Jul 23, 2021

@voutilad @jexp Thanks for the help so far. I saw this issue happen on a different section of internal code and was able to resolve it by using explicit transactions!

Deployment information: we have a k8s cronjob running neo4j python driver 1.7.6, writing data to a Neo4j enterprise 3.5.19 database across an AWS Network Load Balancer.

To summarize, the code would:

  1. Instantiate a single neo4j session
  2. Use that session to write some data to the graph
  3. Do some more parsing/data transforms
  4. Go to (2) until we are done

Most of the time, step (3) would take longer than 380 seconds and the code would still work fine, which is not what I would expect, because this is longer than both our AWS NLB's timeout and our neo4j driver's max_connection_lifetime value. Anyway, we found that running this code with a specific set of data would cause step (4) to reliably fail with a ConnectionResetError, resulting in a neo4j.ServiceUnavailable exception.

To fix, I changed the code to explicitly use session.write_transaction() instead of auto-commit transactions with session.run() and now the code seems to work magically! I have not implemented any retry logic myself at all.
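
For readers who want to see the shape of that change, here is a minimal sketch, assuming a session object that exposes `write_transaction()` as the 1.7 driver does (newer driver releases name the equivalent method `execute_write`). The query, the `ECRImage` label, and the helper names `_merge_image_tx` and `merge_image` are hypothetical and are not taken from Cartography's code.

```python
# Illustrative query only; this is NOT Cartography's actual ECR query.
MERGE_IMAGE_QUERY = """
MERGE (img:ECRImage {id: $digest})
SET img.region = $region, img.lastupdated = $update_tag
"""


def _merge_image_tx(tx, digest, region, update_tag):
    """Transaction function (hypothetical name).

    The driver may call this more than once if a retryable error occurs,
    so it should be safe to repeat.
    """
    result = tx.run(
        MERGE_IMAGE_QUERY,
        digest=digest,
        region=region,
        update_tag=update_tag,
    )
    # Consume the result inside the transaction function so that any
    # error surfaces while the driver can still retry the unit of work.
    return result.consume()


def merge_image(neo4j_session, digest, region, update_tag):
    """Write one image node through a managed transaction.

    The auto-commit version this replaces would have been:
        neo4j_session.run(MERGE_IMAGE_QUERY, digest=..., region=..., update_tag=...)
    """
    return neo4j_session.write_transaction(
        _merge_image_tx, digest, region, update_tag,
    )
```

The only structural difference from the auto-commit version is that the query runs inside a function handed to the session, which is what lets the driver re-run it instead of surfacing the first connection error.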

To get to this solution, I stumbled upon this section of the current driver doc: https://neo4j.com/docs/api/python-driver/current/api.html#managed-transactions-transaction-functions

[Managed transactions] allow a function object representing the transactional unit of work to be passed as a parameter. This function is called one or more times, within a configurable time limit, until it succeeds.

Sure enough, this seemed encouraging and it worked! Prior to this I had only been reading the docs for the 1.7 driver but it seems that the docs have become more thorough for the current drivers.

I’ll push out a similar fix to address this specific issue and other related ones. I guess to summarize the Python driver best practices that we’ve learned in this project:

  • Use the unwind pattern for speed and batching
  • Use explicit transaction functions
  • Consume the results within the transaction functions
  • Be aware of max_connection_lifetime, especially if you’re forced to deal with load balancers
  • Ensure the size of the transaction is not too large
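
A rough sketch of how the first, second, third, and last points could fit together follows. It assumes the same `write_transaction()` session method as the 1.7 driver; the query, the `ECRImage` label, the helper names (`chunked`, `_load_batch_tx`, `load_in_batches`), and the default batch size of 1000 are all assumptions chosen for illustration, not values from this issue.

```python
from itertools import islice

# Illustrative query only; the label and properties are assumptions.
UNWIND_QUERY = """
UNWIND $rows AS row
MERGE (img:ECRImage {id: row.imageDigest})
SET img.tag = row.imageTag, img.lastupdated = $update_tag
"""


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def _load_batch_tx(tx, rows, update_tag):
    """Transaction function: write one batch and consume the result."""
    tx.run(UNWIND_QUERY, rows=rows, update_tag=update_tag).consume()


def load_in_batches(neo4j_session, rows, update_tag, batch_size=1000):
    """Write `rows` in several small transactions instead of one large one.

    Returns the number of batches (and therefore transactions) used.
    """
    batches = 0
    for batch in chunked(rows, batch_size):
        neo4j_session.write_transaction(_load_batch_tx, batch, update_tag)
        batches += 1
    return batches
```

Each batch is its own transaction, so a retry only repeats one batch rather than the whole load. The fourth point is configured on the driver rather than in this code: `max_connection_lifetime` is passed as a keyword argument to `GraphDatabase.driver()`, and a value below the load balancer's idle timeout is one plausible choice.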

This has been bugging me for months and I’m glad to finally have forward movement on this problem. 😃

1 reaction
jexp commented, Jul 2, 2021

@achantavy I’m also happy to help if @voutilad is busy. If you want to we can have a look at the code together, just drop me an email to schedule a call.
