Improve DB query performance for rendering snapshot_icons on snapshot index pages

Describe the bug

Steps to reproduce

on a modern laptop with NVMe drive running Debian Testing:

  1. archivebox init
  2. cat 3000urls.txt | archivebox add
  3. cancel the addition, doesn’t need to complete
  4. archivebox server
  5. visit /public/

page load takes 15+ seconds.

Screenshots or log output

i’m including a run with Django profiling middleware when hitting /public/?prof.

the majority of cumtime is spent in snapshot_icons:

         12839151 function calls (12677025 primitive calls) in 15.232 seconds

   Ordered by: internal time
   List reduced from 698 to 100 due to restriction <100>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    12453    3.313    0.000    3.314    0.000 {function SQLiteCursorWrapper.execute at 0x7fbfd54e65e0}
    14787    0.611    0.000    0.647    0.000 {built-in method posix.stat}
    63827    0.249    0.000    0.415    0.000 pathlib.py:63(parse_parts)
     6226    0.238    0.000    0.238    0.000 {method 'fetchone' of 'sqlite3.Cursor' objects}
     3113    0.189    0.000   15.134    0.005 html.py:118(snapshot_icons)
   155650    0.179    0.000    0.849    0.000 functional.py:218(wrapper)
778294/728484    0.176    0.000    0.505    0.000 {built-in method builtins.hasattr}
    18680    0.170    0.000    0.170    0.000 {method 'close' of 'sqlite3.Cursor' objects}
    49810    0.165    0.000    0.390    0.000 local.py:46(_get_context_id)
892363/892362    0.156    0.000    0.181    0.000 {built-in method builtins.isinstance}
   193006    0.155    0.000    0.200    0.000 safestring.py:50(mark_safe)
     9339    0.150    0.000    1.610    0.000 query.py:1203(build_filter)
    28019    0.135    0.000    0.295    0.000 query.py:1423(names_to_path)
   155650    0.135    0.000    0.220    0.000 __init__.py:12(escape)
348721/323816    0.126    0.000    0.341    0.000 {built-in method builtins.getattr}
    34243    0.121    0.000    1.324    0.000 html.py:107(format_html)
    46695    0.121    0.000    0.139    0.000 {method 'format' of 'str' objects}
   155650    0.119    0.000    0.477    0.000 html.py:33(escape)
   158763    0.117    0.000    1.007    0.000 html.py:92(conditional_escape)
    60720    0.116    0.000    0.546    0.000 pathlib.py:671(_parse_args)
    17894    0.112    0.000    0.692    0.000 schema.py:262(link_dir)
...
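
A note on reading the profile above: the call counts suggest roughly four SQL executes per snapshot_icons call (12453 / 3113 ≈ 4), which looks like a query-per-row pattern. The sketch below reproduces that pattern with the standard-library sqlite3 module and compares it with a single batched query. The snapshot and archiveresult tables, their columns, the extractor names, and all function names here are hypothetical stand-ins for illustration, not ArchiveBox's actual schema or code.

```python
import sqlite3

EXTRACTORS = ("pdf", "screenshot", "wget")  # hypothetical extractor names


def build_db(n_snapshots=50):
    """Create an in-memory database with a made-up snapshot/result schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshot (id INTEGER PRIMARY KEY, url TEXT)")
    conn.execute(
        "CREATE TABLE archiveresult ("
        "id INTEGER PRIMARY KEY, snapshot_id INTEGER, extractor TEXT, status TEXT)"
    )
    for sid in range(n_snapshots):
        conn.execute(
            "INSERT INTO snapshot (id, url) VALUES (?, ?)",
            (sid, f"https://example.com/{sid}"),
        )
        for pos, extractor in enumerate(EXTRACTORS):
            status = "succeeded" if (sid + pos) % 2 == 0 else "failed"
            conn.execute(
                "INSERT INTO archiveresult (snapshot_id, extractor, status) "
                "VALUES (?, ?, ?)",
                (sid, extractor, status),
            )
    return conn


def succeeded_per_row(conn):
    """Query-per-row pattern: one extra query for every snapshot."""
    ids = [row[0] for row in conn.execute("SELECT id FROM snapshot")]
    queries = 1
    result = {}
    for sid in ids:
        rows = conn.execute(
            "SELECT extractor FROM archiveresult "
            "WHERE snapshot_id = ? AND status = 'succeeded' ORDER BY extractor",
            (sid,),
        ).fetchall()
        queries += 1
        result[sid] = [r[0] for r in rows]
    return result, queries


def succeeded_batched(conn):
    """Batched pattern: two queries in total, grouped in Python."""
    result = {row[0]: [] for row in conn.execute("SELECT id FROM snapshot")}
    batched_rows = conn.execute(
        "SELECT snapshot_id, extractor FROM archiveresult "
        "WHERE status = 'succeeded' ORDER BY snapshot_id, extractor"
    )
    for sid, extractor in batched_rows:
        result[sid].append(extractor)
    return result, 2


conn = build_db()
per_row, n_per_row = succeeded_per_row(conn)
batched, n_batched = succeeded_batched(conn)
print(f"per-row: {n_per_row} queries, batched: {n_batched} queries")
```

Both functions return the same mapping; only the number of round trips to SQLite differs, and that number grows with the snapshot count in the per-row version.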

ArchiveBox version

$ archivebox version
ArchiveBox v0.5.4
Cpython Linux Linux-5.10.0-2-amd64-x86_64-with-glibc2.31 x86_64 (not in Docker)

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.5.4          valid     /home/...local/bin/archivebox                                         
 √  PYTHON_BINARY         v3.9.1          valid     /usr/bin/python3.9                                                          
 √  DJANGO_BINARY         v3.1.3          valid     /home/.../.local/lib/python3.9/site-packages/django/bin/django-admin.py 
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.0.0         valid     /home/.../.nvm/versions/node/v15.0.0/bin/node                           
 √  SINGLEFILE_BINARY     v0.3.6          valid     ./node_modules/archivebox/node_modules/single-file/cli/single-file          
 √  READABILITY_BINARY    v0.1.0          valid     ./node_modules/archivebox/node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/archivebox/node_modules/@postlight/mercury-parser/cli.js     
 √  GIT_BINARY            v2.30.0         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.01.08     valid     /usr/bin/youtube-dl                                                         
 √  CHROME_BINARY         v88.0.4324.96   valid     /usr/bin/chromium                                                  
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /home/.../.local/lib/python3.9/site-packages/archivebox                 
 √  TEMPLATES_DIR         3 files         valid     /home/.../.local/lib/python3.9/site-packages/archivebox/templates       

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            9 files         valid     /home/.../scratch/archivebox                                            
 √  SOURCES_DIR           1 files         valid     ./sources                                                                   
 √  LOGS_DIR              0 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           20 files        valid     ./archive                                                                   
 √  CONFIG_FILE           164.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.3 MB          valid     ./index.sqlite3  

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 2
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
pirate commented, Apr 6, 2021

The new performance improvements will be out with the new v0.6 release (coming soon).

1 reaction
khimaros commented, Feb 6, 2021

this may not be idiomatic for django, but i couldn’t find any really clear documentation on how the overall call flow is meant to be for pagination. however, this seems to work and improves performance significantly:

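    from django.db.models import Q  # module-level import needed for the filter below

    def get_queryset(self, **kwargs):
        qs = super().get_queryset(**kwargs)
        query = self.request.GET.get('q')
        if query:
            # match the search term against title, URL, timestamp, or tag name
            qs = qs.filter(
                Q(title__icontains=query)
                | Q(url__icontains=query)
                | Q(timestamp__icontains=query)
                | Q(tags__name__icontains=query)
            )

        # paginate first, then compute icons only for the snapshots on the current page
        _, _, pqs, _ = self.paginate_queryset(qs, self.paginate_by)
        for snapshot in pqs:
            snapshot.icons = snapshot_icons(snapshot)
        return qs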
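
To illustrate the idea behind the snippet above (paginate first, then do the expensive per-snapshot work only for the visible page), here is a minimal framework-free sketch. render_page, fake_icons, and the page size of 40 are hypothetical names and values chosen for illustration; this is not ArchiveBox's or Django's API.

```python
from math import ceil


def render_page(snapshot_ids, page_number, per_page, compute):
    """Return (rows, total_pages), calling `compute` only for the requested page."""
    total_pages = max(1, ceil(len(snapshot_ids) / per_page))
    start = (page_number - 1) * per_page
    page_ids = snapshot_ids[start:start + per_page]
    rows = [(sid, compute(sid)) for sid in page_ids]
    return rows, total_pages


calls = []  # records every snapshot id that the expensive step was run for


def fake_icons(snapshot_id):
    """Hypothetical stand-in for expensive per-snapshot work."""
    calls.append(snapshot_id)
    return f"icons for snapshot {snapshot_id}"


# 3113 snapshots (the count in the profile), assumed page size of 40
rows, total_pages = render_page(list(range(3113)), 1, 40, fake_icons)
print(len(rows), total_pages, len(calls))
```

With this shape the expensive step runs once per row on the requested page rather than once per snapshot in the whole index, so its cost is bounded by the page size.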