Architecture: Use multiple cores to run link archiving in parallel


Add a --parallel=8 CLI option that uses multiprocessing to download a large number of links in parallel. Default to the number of cores on the machine, and allow --parallel=1 to restrict it to a single core.
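As a rough illustration of the requested behavior (not ArchiveBox's actual implementation), the option could be wired up along these lines. The archive_link function and the overall script layout are hypothetical stand-ins; only argparse, os.cpu_count, and multiprocessing.Pool are real standard-library APIs.

```python
import argparse
import os
from multiprocessing import Pool


def archive_link(url):
    """Hypothetical stand-in for the work of archiving one link."""
    return (url, "archived")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="parallel archiving sketch")
    parser.add_argument(
        "--parallel",
        type=int,
        default=os.cpu_count() or 1,  # default: one worker per core
        help="number of worker processes (1 disables parallelism)",
    )
    parser.add_argument("urls", nargs="*")
    return parser.parse_args(argv)


def archive_all(urls, parallel):
    # --parallel=1 falls back to a plain sequential loop
    if parallel <= 1:
        return [archive_link(url) for url in urls]
    # Otherwise spread the links across a pool of worker processes
    with Pool(processes=parallel) as pool:
        return pool.map(archive_link, urls)


if __name__ == "__main__":
    args = parse_args()
    for url, status in archive_all(args.urls, args.parallel):
        print(url, status)
```

Saved under an assumed name such as sketch.py, it would be invoked as python sketch.py --parallel=8 followed by the URLs to archive.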

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 6
  • Comments: 15 (10 by maintainers)

Top GitHub Comments

4 reactions
pirate commented, May 7, 2021

With v0.6 now released, we’ve taken another step towards the goal of using a message-passing architecture to fully support parallel archiving. v0.6 moves the last bit of ArchiveResult state into the SQLite3 DB, where it can be managed with migrations and kept ACID compliant.

The next step is to implement a worker queue for DB writes, so that all writes to the Snapshot/ArchiveResult models are processed in a single thread. That frees the other threads to work in parallel without locking the DB. Message passing is a big change though, so expect it to come in increments, with about 3 to 6 months of work to go depending on how much free time I have for ArchiveBox.
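The single-writer idea described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not ArchiveBox's code: the results table, the worker function, and the run helper are all hypothetical, and the real models are Django models rather than raw sqlite3 calls.

```python
import queue
import sqlite3
import threading

_STOP = object()  # sentinel that tells the writer thread to exit


def db_writer(db_path, write_queue):
    """Sole owner of the write connection; applies queued writes in order."""
    conn = sqlite3.connect(db_path)
    # Hypothetical table standing in for the ArchiveResult model
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, status TEXT)")
    while True:
        item = write_queue.get()
        if item is _STOP:
            break
        conn.execute("INSERT INTO results (url, status) VALUES (?, ?)", item)
        conn.commit()
    conn.close()


def worker(url, write_queue):
    # The (hypothetical) archiving work would happen here, in parallel.
    # Workers never touch the DB; they only send a message to the writer.
    write_queue.put((url, "succeeded"))


def run(db_path, urls):
    write_queue = queue.Queue()
    writer = threading.Thread(target=db_writer, args=(db_path, write_queue))
    writer.start()
    workers = [threading.Thread(target=worker, args=(u, write_queue)) for u in urls]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    write_queue.put(_STOP)  # all workers are done, so shut the writer down
    writer.join()
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT url, status FROM results ORDER BY url").fetchall()
    conn.close()
    return rows
```

Because only one thread ever writes, the workers cannot contend for the SQLite write lock, which is the property the comment is after.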

Side note: the UX of v0.6 is >10x faster in many other ways (web UI, indexing, management tasks, etc.), so only archiving itself remains to be sped up. You can also still run archivebox add commands in parallel. It’s safe and already speeds up archiving a lot, but you may encounter occasional "database locked" warnings, which mean you have to restart stuck additions manually.

2 reactions
pirate commented, Dec 10, 2020

A quick update for everyone watching this: v0.5.0 is going to be released soon with improvements to how ArchiveResults are stored (we moved them into the SQLite DB). This was a blocker that had to be fixed before we can get around to parallel archiving in the next version.

v0.5.0 will be faster, but it won’t have built-in concurrent archiving support yet; that will be the primary focus for v0.6.0. The plan is to add a background task queue handler like dramatiq or, more likely, huey (because it has sqlite3 support, so we don’t need to run redis).

Once we have the background task worker system in place, we can implement a worker pool for Chrome/playwright and each of the other extractor methods. Then archiving can run in parallel by default, handling around 5-10 sites at a time depending on the system resources available and how well the worker pool system performs for each extractor type. Huey and dramatiq both have built-in rate limiting systems that will allow us to cap the number of concurrent requests going to each site or being handled by each extractor. There’s still quite a bit of work left, but we’re getting closer!
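To make the idea of capping concurrency per extractor concrete, here is a small sketch that uses only the standard library as a stand-in. It does not use Huey's or dramatiq's rate limiting APIs; the extractor names, the limits, and the run_extractor function are assumptions chosen purely for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-extractor caps on simultaneous jobs
EXTRACTOR_LIMITS = {"chrome": 2, "wget": 4}
_semaphores = {name: threading.Semaphore(n) for name, n in EXTRACTOR_LIMITS.items()}


def run_extractor(extractor, url):
    # The semaphore blocks once the cap for this extractor type is reached
    with _semaphores[extractor]:
        # A real extractor (Chrome, wget, ...) would be launched here
        return (extractor, url, "succeeded")


def archive_many(urls, max_workers=8):
    # One job per (extractor, url) pair, all sharing a single thread pool
    jobs = [(extractor, url) for url in urls for extractor in EXTRACTOR_LIMITS]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_extractor, e, u) for e, u in jobs]
        return [f.result() for f in futures]
```

The shared pool bounds total parallelism, while the semaphores keep a heavyweight extractor such as Chrome from taking every slot.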

Having a background task system will also enable us to do many other cool things, like building the scheduled import system into the UI #578, using a single shared chrome process instead of relaunching chrome for each link, and many other small improvements to performance.


