
Neo4jCsvPublisher Speed Optimization (Parallelism)


Hi team, I’m wondering if there’s a plan to apply multiprocessing to the publishers. We have a large amount of metadata in production, which ends up running about 3 million queries against Neo4j. A full publish takes about 90 minutes to finish.

To investigate the bottleneck, I looked into the code and logged the elapsed time for each step of a single iteration of the _publish_node function. The results:

  • Neo4j query: 0.1 ms
  • Create statement: 1 ms
  • Others: super fast, doesn’t matter

Surprisingly, the bottleneck is not the DB query but statement creation. The process is basically:

  1. loop over each row in the CSV
  2. parse the row into a dictionary
  3. loop through each key/value pair in the dictionary to get the props
  4. fill the statement’s Jinja template with the props
  5. execute the query with the statement (a sketch of this flow follows the list)
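
As a rough illustration, here is a minimal sketch of that per-row flow. The names here (CREATE_NODE, publish_nodes, the session argument) are illustrative, not the actual Neo4jCsvPublisher internals:

import csv

from jinja2 import Template

# Minimal sketch of the per-row flow described above; names are illustrative,
# not the actual Neo4jCsvPublisher internals. The template render is the step
# measured at ~1 ms, i.e. roughly 10x the cost of the query itself.
CREATE_NODE = Template(
    "CREATE (n:{{ label }} {"
    "{% for k in props %} {{ k }}: ${{ k }}{{ ',' if not loop.last else '' }}{% endfor %}"
    " })"
)

def publish_nodes(csv_path, label, session):
    with open(csv_path) as f:
        for row in csv.DictReader(f):          # 1. loop over each row
            props = dict(row)                  # 2-3. parse the row into props
            statement = CREATE_NODE.render(    # 4. fill the Jinja template
                label=label, props=list(props))
            session.run(statement, **props)    # 5. execute the query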

I’m thinking that instead of reading a row and creating a node in the graph DB one at a time, maybe we could use multiprocessing to speed up the process; a sketch of the idea follows. I believe there will be no dependency issues as long as we publish all the nodes before publishing relations, which the current codebase already handles. I’m planning on implementing multiprocessing for this. Is there any potential problem, like dependencies, graph DB load, etc.?
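
A minimal sketch of that idea, assuming hypothetical publish_node_file / publish_relation_file helpers that each publish one CSV file:

import multiprocessing

def publish_node_file(path):
    """Hypothetical helper: publish every row of one node CSV to Neo4j."""

def publish_relation_file(path):
    """Hypothetical helper: publish every row of one relation CSV."""

def publish(node_files, relation_files, processes=8):
    with multiprocessing.Pool(processes=processes) as pool:
        # map() blocks until every node file is published, so no relation
        # can reference a node that does not exist yet.
        pool.map(publish_node_file, node_files)
        pool.map(publish_relation_file, relation_files)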

Expected Behavior or Use Case

Improve the performance of the publisher. Currently, a 90-minute sync is not acceptable for our use case 😢

Service or Ingestion ETL

Ingestion ETL, publisher

Possible Implementation

Thanks to @dkunitsk’s idea, I think there are three possible implementations:

  1. Multiprocessing on the caller side
  2. Multiprocessing in the Neo4j publisher
  3. Neo4j UNWIND (batch processing); see the sketch after this list

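For option 3, the point of UNWIND is to send one query per batch of rows instead of one query per row, which amortizes both statement creation and network round trips. A minimal sketch with the official neo4j Python driver; the URI, credentials, label, and batch size are placeholders:

from neo4j import GraphDatabase

# Sketch of option 3 (UNWIND batching): one query creates a whole batch of
# nodes. URI, credentials, label, and batch size are placeholders.
UNWIND_CREATE = """
UNWIND $batch AS row
CREATE (n:Table)
SET n = row
"""

def publish_in_batches(rows, batch_size=1000):
    driver = GraphDatabase.driver('bolt://localhost:7687',
                                  auth=('neo4j', 'password'))
    with driver.session() as session:
        for i in range(0, len(rows), batch_size):
            session.run(UNWIND_CREATE, batch=rows[i:i + batch_size])
    driver.close()

Option 1, caller-side sharding, is what the snippet below implements.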

import multiprocessing

from pyhocon import ConfigFactory

from databuilder.extractor.hive_table_metadata_extractor import HiveTableMetadataExtractor
from databuilder.job.job import DefaultJob
from databuilder.loader.file_system_neo4j_csv_loader import FsNeo4jCSVLoader
from databuilder.publisher import neo4j_csv_publisher
from databuilder.publisher.neo4j_csv_publisher import Neo4jCsvPublisher
from databuilder.task.task import DefaultTask


class HiveParallelIndexer:
    # Shim for adding all node labels to the NEO4J_DEADLOCK_NODE_LABELS config,
    # which enables retries for those node labels. This is important for parallel
    # writing, since we see intermittent Neo4j deadlock errors relatively often.
    class ContainsAllList(list):
        def __contains__(self, item):
            return True

    def __init__(self, publish_tag: str, parallelism: int):
        self.publish_tag = publish_tag
        self.parallelism = parallelism

    def __call__(self, worker_index: int):
        # Sharding:
        #   - take the md5 hash of the schema.table_name
        #   - convert the first 3 characters of the hash to decimal (3 chosen arbitrarily)
        #   - mod by the total number of processes
        where_clause_suffix = """
            WHERE MOD(CONV(LEFT(MD5(CONCAT(d.NAME, '.', t.TBL_NAME)), 3), 16, 10), {total_parallelism}) = {worker_index}
            AND t.TBL_TYPE IN ('EXTERNAL_TABLE', 'MANAGED_TABLE', 'VIRTUAL_VIEW')
            AND (t.VIEW_EXPANDED_TEXT != '/* Presto View */' OR t.VIEW_EXPANDED_TEXT is NULL)
        """.format(total_parallelism=self.parallelism,
                   worker_index=worker_index)

        # configs relevant for multiprocessing
        job_config = ConfigFactory.from_dict({
            'extractor.hive_table_metadata.{}'.format(HiveTableMetadataExtractor.WHERE_CLAUSE_SUFFIX_KEY):
                where_clause_suffix,
            # keeping this relatively low, in our experience, reduces neo4j deadlocks
            'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_TRANSACTION_SIZE):
                100,
            'publisher.neo4j.{}'.format(neo4j_csv_publisher.NEO4J_DEADLOCK_NODE_LABELS):
                HiveParallelIndexer.ContainsAllList(),
        })
        job = DefaultJob(conf=job_config,
                         task=DefaultTask(
                             extractor=HiveTableMetadataExtractor(),
                             loader=FsNeo4jCSVLoader()),
                         publisher=Neo4jCsvPublisher())
        job.launch()


parallelism = 16
indexer = HiveParallelIndexer(
    publish_tag='2021-12-03',
    parallelism=parallelism)

with multiprocessing.Pool(processes=parallelism) as pool:
    def callback(_):
        # fail fast in case of an exception in any process
        print('terminating due to exception')
        pool.terminate()
    res = pool.map_async(indexer, range(parallelism), error_callback=callback)
    res.get()

Screenshots of Slack Discussion (images not preserved)

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 22 (8 by maintainers)

Top GitHub Comments

zacr commented on Feb 25, 2022 (5 reactions)

I wasn’t sure if I’m allowed to make a branch in the main repo or if I’m supposed to fork first. Hopefully you can see this; I did it as a new file so you can run both side by side: https://github.com/SaltIO/amundsen/blob/neo4j_csv_publisher_apoc/databuilder/databuilder/publisher/neo4j_csv_publisher_apoc.py

I didn’t do any query optimization or tuning; the goal was just to get functional Cypher queries using apoc.periodic.iterate, which has support for parallelization. I suspect there is some meat on the bone in investigating and tuning these new Cypher queries. The most recent version looks more like 3.5x faster instead of 5x, but I didn’t do any serious timing. If you turn on parallel: true and retries: 1 it goes 5x, but locks will cause you to drop maybe 0.3% of the inserts. These Cypher queries may be on the path to 7x+ with tuning, unless I did them wrong.

After you investigate, if you think this APOC-based solution has promise and you or someone else wants to run with it, that is great by me. In the meantime I’m just going to let it run in a few projects with different extractors over the next few weeks to surface issues.
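
For reference, this is roughly the shape of an apoc.periodic.iterate call for batched node writes; the inner Cypher, label, batch size, and connection details here are illustrative, not the exact queries from that branch:

from neo4j import GraphDatabase

# Illustrative shape of an apoc.periodic.iterate call; not the exact query
# from the linked branch. parallel: false avoids the lock-induced dropped
# inserts mentioned above, and retries covers transient deadlocks.
APOC_BATCH_CREATE = """
CALL apoc.periodic.iterate(
  'UNWIND $batch AS row RETURN row',
  'CREATE (n:Table) SET n = row',
  {batchSize: 8000, parallel: false, retries: 1, params: {batch: $batch}}
)
"""

rows = [{'name': 'schema.table'}]  # placeholder property maps
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
with driver.session() as session:
    session.run(APOC_BATCH_CREATE, batch=rows)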

afuzzyriff commented on Mar 23, 2022 (3 reactions)

Hello - I stumbled across this issue as I was facing a similar problem of poor performance with large Neo4j writes.

@zacr Thank you for sharing your Neo4j CSV Publisher with APOC code. I reviewed the class and started to test this in my team’s Amundsen POC implementation for Neo4j CSV publishing. The custom class provided seems to be working nicely.

I have tested a batch of data which resulted in writing 4,848,906 nodes and 9,715,678 relationships in 5704.64 seconds (~95 minutes) using an 8k batch size. I don’t have good non-APOC query timing results as that scenario was running rather slowly, but I do know this is a vast timing improvement.


