KeyError: 'link' in capture

When I ran my test script, it eventually failed with a KeyError raised inside capture:

$ python3.8 test_save.py
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 11:59:50 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture: 42.963274240493774 sec.
{'Server': 'nginx/1.15.8', 'Date': 'Wed, 15 Jul 2020 12:01:37 GMT', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache'}
capture_or_cache: 97.4388906955719 sec.
Traceback (most recent call last):
  File "test_save.py", line 28, in <module>
    main()
  File "test_save.py", line 24, in main
    measure(fun, url)
  File "test_save.py", line 8, in measure
    print(f(*arg))
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/savepagenow/api.py", line 55, in capture
    header_links = parse_header_links(response.headers['Link'])
  File "/home/eggplants/.pyenv/versions/3.8.0/lib/python3.8/site-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'link'
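
The traceback shows savepagenow indexing response.headers['Link'] directly, and neither set of headers printed above contains a Link entry. Below is a minimal sketch of a more defensive lookup. MissingLinkHeader and get_link_header are hypothetical names used only for illustration (they are not part of savepagenow or requests), and the plain dict stands in for the case-insensitive headers object that requests returns.

```python
class MissingLinkHeader(Exception):
    """Hypothetical error: the save response carried no Link header."""


def get_link_header(headers):
    """Return the raw Link header value, or raise a descriptive error.

    `headers` is any mapping with a .get() method, for example the
    `response.headers` object from requests.
    """
    link = headers.get("Link")
    if link is None:
        raise MissingLinkHeader(
            f"No Link header in response; got headers: {sorted(headers)}"
        )
    return link


# Headers copied from the failing run above: note there is no Link entry.
failing_headers = {
    "Server": "nginx/1.15.8",
    "Content-Type": "text/html",
    "Transfer-Encoding": "chunked",
    "Connection": "keep-alive",
    "Cache-Control": "no-cache",
}

try:
    get_link_header(failing_headers)
except MissingLinkHeader as err:
    print(f"capture failed: {err}")
```

Using .get() and raising a named exception does not make the capture succeed, but it turns the bare KeyError into a message that says what was missing and what the server actually sent.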

Top GitHub Comments

tlcaputi commented, Jul 31, 2020

I needed to figure out a quick fix for this same problem, and I ended up writing this. It’s not the most exact or beautifully written piece of code in the world, but it works for my purposes. Maybe it’ll work for yours.


# MIT License

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
from time import sleep


def archive_url(
    url, 
    timeout=100, 
    user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    ):

    """Submits a URL to WebArchive's Save Page Now Feature (working as of 2020-07-31 on Python 3.6.10)
    
    Keyword arguments:
    url -- The url you want to archive
    timeout -- Max number of seconds you're willing to wait
    user_agent -- You can pass a custom user agent here

    """

    # POST Request
    headers = {
        'authority': 'web.archive.org',
        'cache-control': 'max-age=0',
        'upgrade-insecure-requests': '1',
        'origin': 'https://web.archive.org',
        'content-type': 'application/x-www-form-urlencoded',
        'user-agent': user_agent,
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-origin',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'referer': 'https://web.archive.org/save',
        'accept-language': 'en-US,en;q=0.9,de;q=0.8',
    }

    data = {
        'url': url,
        'capture_all': 'on'
    }

    r = requests.post(f'https://web.archive.org/save/{url}', headers=headers, data=data)

    # BS4 get SCRIPTS and find watchJob arguments
    soup = BeautifulSoup(r.content, 'html.parser')
    scripts = soup.find_all("script")

    job_id = None
    for script in scripts:
        string = script.string
        if string and "watchJob" in string:
            args_string_list = string.strip().split('"')
            job_id = args_string_list[1]
            break

    # Raise explicitly: assert statements are skipped when Python runs with -O
    if job_id is None:
        raise RuntimeError("Couldn't find job_id in html")


    # Request status of the job
    out_url = None
    was_pending = False
    wait_time = 0
    while wait_time < timeout:
        r = requests.get(f"https://web.archive.org/save/status/{job_id}?_t={datetime.datetime.now().timestamp()}", headers=headers)
        rj = r.json()
        status = rj.get('status', 'none')

        # A pending status means a fresh capture is in progress
        if status == "pending":
            was_pending = True

        # On success, build the archive URL from the reported timestamp
        if status == "success":
            original_url = rj['original_url']
            ext_url = f"/web/{rj['timestamp']}/{original_url}"
            out_url = urljoin('https://web.archive.org', ext_url)
            break

        seconds_to_wait = int(r.headers.get("Retry-After", 5))
        print(f"[{wait_time} seconds elapsed] Waiting for archive to complete...")
        wait_time += seconds_to_wait
        sleep(seconds_to_wait)

    if out_url is None:
        raise TimeoutError(f"Process did not complete after {timeout} seconds")

    out = {
        "original_url": original_url,
        "archive_url": out_url,
        "from_cache": not was_pending
    }

    return out

if __name__ == "__main__":
    url = "https://ultimateframedata.com/"
    print(archive_url(url))
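
The job ID lookup above takes the first double-quoted string in whichever script mentions watchJob. A slightly stricter variant, sketched here, anchors the match to the watchJob( call itself. The sample script text is an assumption, shaped only like what the quote-splitting code implies; the real page markup may differ, and extract_job_id is a hypothetical helper name.

```python
import re

# Hypothetical inline script text; the real Save Page Now markup may differ.
sample_script = 'spn.watchJob("spn2-0123abcd", "/_static/", 6000);'

# Capture the first double-quoted argument passed to watchJob(...)
WATCH_JOB_RE = re.compile(r'watchJob\(\s*"([^"]+)"')


def extract_job_id(script_text):
    """Return the first quoted watchJob argument, or None if absent."""
    match = WATCH_JOB_RE.search(script_text)
    return match.group(1) if match else None


print(extract_job_id(sample_script))                  # spn2-0123abcd
print(extract_job_id("console.log('no job here');"))  # None
```

Returning None instead of raising lets the caller decide how to report a page that has no job ID.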

palewire commented, Sep 8, 2020

I’ve released a change along the lines proposed here in version 1.1.0. @dannguyen and @eggplants, tell me if it fixes things for you.

https://pypi.org/project/savepagenow/1.1.0/