Improve DB query performance for rendering snapshot_icons on snapshot index pages
Describe the bug
Steps to reproduce
on a modern laptop with NVMe drive running Debian Testing:
archivebox init
cat 3000urls.txt | archivebox add
- cancel the addition, doesn’t need to complete
archivebox server
- visit /public/
page load takes 15+ seconds.
Screenshots or log output
i’m including a run with Django profiling middleware when hitting /public/?prof. the majority of cumtime is spent in snapshot_icons:
12839151 function calls (12677025 primitive calls) in 15.232 seconds
Ordered by: internal time
List reduced from 698 to 100 due to restriction <100>
ncalls tottime percall cumtime percall filename:lineno(function)
12453 3.313 0.000 3.314 0.000 {function SQLiteCursorWrapper.execute at 0x7fbfd54e65e0}
14787 0.611 0.000 0.647 0.000 {built-in method posix.stat}
63827 0.249 0.000 0.415 0.000 pathlib.py:63(parse_parts)
6226 0.238 0.000 0.238 0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
3113 0.189 0.000 15.134 0.005 html.py:118(snapshot_icons)
155650 0.179 0.000 0.849 0.000 functional.py:218(wrapper)
778294/728484 0.176 0.000 0.505 0.000 {built-in method builtins.hasattr}
18680 0.170 0.000 0.170 0.000 {method 'close' of 'sqlite3.Cursor' objects}
49810 0.165 0.000 0.390 0.000 local.py:46(_get_context_id)
892363/892362 0.156 0.000 0.181 0.000 {built-in method builtins.isinstance}
193006 0.155 0.000 0.200 0.000 safestring.py:50(mark_safe)
9339 0.150 0.000 1.610 0.000 query.py:1203(build_filter)
28019 0.135 0.000 0.295 0.000 query.py:1423(names_to_path)
155650 0.135 0.000 0.220 0.000 __init__.py:12(escape)
348721/323816 0.126 0.000 0.341 0.000 {built-in method builtins.getattr}
34243 0.121 0.000 1.324 0.000 html.py:107(format_html)
46695 0.121 0.000 0.139 0.000 {method 'format' of 'str' objects}
155650 0.119 0.000 0.477 0.000 html.py:33(escape)
158763 0.117 0.000 1.007 0.000 html.py:92(conditional_escape)
60720 0.116 0.000 0.546 0.000 pathlib.py:671(_parse_args)
17894 0.112 0.000 0.692 0.000 schema.py:262(link_dir)
...
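The profile above shows 12453 SQLiteCursorWrapper.execute calls against 3113 calls to snapshot_icons, roughly four queries per snapshot, which suggests a per-row query pattern. The sketch below illustrates that pattern and a batched alternative using only the standard sqlite3 module. The table names, columns, and extractor list are assumptions made for illustration and are not ArchiveBox's actual schema or code.

```python
import sqlite3

# Hypothetical extractor names; four are used only to mirror the ~4 queries/row ratio.
EXTRACTORS = ("wget", "singlefile", "pdf", "screenshot")


def make_db(n_snapshots=50):
    """Build a small in-memory database with an assumed two-table layout."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT)")
    db.execute(
        "CREATE TABLE archiveresult (snapshot_id INTEGER, extractor TEXT, status TEXT)"
    )
    for i in range(n_snapshots):
        db.execute("INSERT INTO snapshot VALUES (?, ?)", (i, f"https://example.com/{i}"))
        if i % 2 == 0:  # only even-numbered snapshots get a successful wget result
            db.execute("INSERT INTO archiveresult VALUES (?, 'wget', 'succeeded')", (i,))
    return db


def icons_per_row(db):
    """Slow pattern: one query per (snapshot, extractor) pair."""
    queries = 1
    ids = [row[0] for row in db.execute("SELECT id FROM snapshot ORDER BY id")]
    result = {}
    for sid in ids:
        done = set()
        for ext in EXTRACTORS:
            row = db.execute(
                "SELECT 1 FROM archiveresult "
                "WHERE snapshot_id = ? AND extractor = ? AND status = 'succeeded' LIMIT 1",
                (sid, ext),
            ).fetchone()
            queries += 1
            if row:
                done.add(ext)
        result[sid] = done
    return result, queries


def icons_batched(db):
    """Batched pattern: two queries in total, grouped in Python."""
    result = {sid: set() for (sid,) in db.execute("SELECT id FROM snapshot")}
    rows = db.execute(
        "SELECT snapshot_id, extractor FROM archiveresult WHERE status = 'succeeded'"
    )
    for sid, ext in rows:
        result[sid].add(ext)
    return result, 2


db = make_db()
slow, slow_queries = icons_per_row(db)
fast, fast_queries = icons_batched(db)
```

Both functions return the same mapping of snapshot id to finished extractors; only the number of round trips to the database differs. In a Django project the batched version would typically be expressed with prefetching or annotations rather than raw SQL.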
ArchiveBox version
$ archivebox version
ArchiveBox v0.5.4
Cpython Linux Linux-5.10.0-2-amd64-x86_64-with-glibc2.31 x86_64 (not in Docker)
[i] Dependency versions:
√ ARCHIVEBOX_BINARY v0.5.4 valid /home/...local/bin/archivebox
√ PYTHON_BINARY v3.9.1 valid /usr/bin/python3.9
√ DJANGO_BINARY v3.1.3 valid /home/.../.local/lib/python3.9/site-packages/django/bin/django-admin.py
√ CURL_BINARY v7.74.0 valid /usr/bin/curl
√ WGET_BINARY v1.21 valid /usr/bin/wget
√ NODE_BINARY v15.0.0 valid /home/.../.nvm/versions/node/v15.0.0/bin/node
√ SINGLEFILE_BINARY v0.3.6 valid ./node_modules/archivebox/node_modules/single-file/cli/single-file
√ READABILITY_BINARY v0.1.0 valid ./node_modules/archivebox/node_modules/readability-extractor/readability-extractor
√ MERCURY_BINARY v1.0.0 valid ./node_modules/archivebox/node_modules/@postlight/mercury-parser/cli.js
√ GIT_BINARY v2.30.0 valid /usr/bin/git
√ YOUTUBEDL_BINARY v2021.01.08 valid /usr/bin/youtube-dl
√ CHROME_BINARY v88.0.4324.96 valid /usr/bin/chromium
√ RIPGREP_BINARY v12.1.1 valid /usr/bin/rg
[i] Source-code locations:
√ PACKAGE_DIR 23 files valid /home/.../.local/lib/python3.9/site-packages/archivebox
√ TEMPLATES_DIR 3 files valid /home/.../.local/lib/python3.9/site-packages/archivebox/templates
[i] Secrets locations:
- CHROME_USER_DATA_DIR - disabled
- COOKIES_FILE - disabled
[i] Data locations:
√ OUTPUT_DIR 9 files valid /home/.../scratch/archivebox
√ SOURCES_DIR 1 files valid ./sources
√ LOGS_DIR 0 files valid ./logs
√ ARCHIVE_DIR 20 files valid ./archive
√ CONFIG_FILE 164.0 Bytes valid ./ArchiveBox.conf
√ SQL_INDEX 1.3 MB valid ./index.sqlite3
Issue Analytics
- Created 3 years ago
- Reactions: 2
- Comments: 7 (3 by maintainers)
Top GitHub Comments
The new performance improvements will be out with the new v0.6 release (coming soon).
this may not be idiomatic for django, but i couldn’t find any really clear documentation on how the overall call flow is meant to be for pagination. however, this seems to work and improves performance significantly:
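The commenter's patch is not reproduced in this copy of the thread. As a rough illustration of the general idea of paginating before doing per-row work, here is a minimal sketch in plain Python. The function names and the page size of 40 are assumptions for illustration; this is not the commenter's code and not ArchiveBox's implementation.

```python
from math import ceil


def paginate(items, page, per_page=40):
    """Return the slice of items for a 1-based page number, plus the page count."""
    total_pages = max(1, ceil(len(items) / per_page))
    page = min(max(page, 1), total_pages)  # clamp out-of-range page numbers
    start = (page - 1) * per_page
    return items[start:start + per_page], total_pages


def render_index(snapshots, page, render_row, per_page=40):
    """Run the expensive per-row renderer only for the rows on the requested page."""
    visible, total_pages = paginate(snapshots, page, per_page)
    return [render_row(s) for s in visible], total_pages


# Stand-in for an expensive renderer such as snapshot_icons; it records its calls.
calls = []


def fake_render(snapshot):
    calls.append(snapshot)
    return f"row {snapshot}"


rows, total = render_index(list(range(3000)), 2, fake_render)
```

With 3000 snapshots and 40 rows per page, the renderer runs 40 times per request instead of 3000, so the per-row query cost stops scaling with the size of the whole index.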