All Internet Archive links, used in reference testing, are broken
Apparently the Internet Archive, which we depend on for a very large number of (linked) reference test-cases, has recently changed how they serve PDF files.
Previously, a URL such as http://web.archive.org/web/20160112115354/http://www.fao.org/fileadmin/user_upload/tci/docs/2_About Stacks.pdf would return a PDF file directly. However, now an HTML file is returned instead (which then points to the actual PDF file). For someone cloning the PDF.js repo and attempting to set up testing for the first time, this means that all linked test-cases will now fail. Furthermore, it also means that we cannot use the Internet Archive when adding new test-cases.
Since the HTML file returned does contain a direct link to the PDF file, embedded in an <iframe> tag, we could perhaps add special-casing for Internet Archive URLs in test/downloadutils.js, such that the HTML file is first downloaded and parsed to obtain a direct PDF link.
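As a rough illustration of the parsing step proposed above, the sketch below pulls the src attribute out of the first <iframe> tag in a downloaded HTML string and resolves it against the page URL. The function name extractIframeSrc is hypothetical and not part of PDF.js, and the shape of the wrapper page is an assumption: the issue only states that the PDF link sits in an <iframe> tag. A regular expression is used here for brevity; it does not decode HTML entities, so a real implementation may need a proper parser.

```javascript
// Hypothetical helper, not taken from PDF.js. Assumes the wrapper page
// contains an <iframe> whose quoted src attribute points at the PDF file.
function extractIframeSrc(html, pageUrl) {
  // Match the first <iframe ... src="..."> (single or double quotes).
  const match = /<iframe\b[^>]*\bsrc\s*=\s*["']([^"']+)["']/i.exec(html);
  if (!match) {
    return null; // No iframe found; the caller can fall back to the original URL.
  }
  // Resolve relative links against the URL the HTML was downloaded from.
  return new URL(match[1], pageUrl).href;
}
```

A caller in the download code could try this on any response that turns out to be HTML rather than PDF, and keep the original URL when null is returned.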
Issue Analytics
- Created 6 years ago
- Comments:8 (5 by maintainers)
Top GitHub Comments
I have one more idea for this. It’s a bit of a hybrid approach for the two solutions. How about in test/downloadutils.js we detect that we are dealing with an Internet Archive URL and perform the if_ transformation there? That way we don’t have to touch the link files (search/replace) and can easily adjust the code if the Internet Archive were to change its format again (or implement HTML parsing there later on if it happens often). It will be quick and keep the option for HTML parsing open (while we avoid it for now).

Yes, I’m hoping I can take a look at this before or during the weekend.
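The if_ transformation suggested in the first comment above could look roughly like the sketch below: the if_ modifier is appended to the timestamp segment of an Internet Archive URL before downloading. The function name toDirectArchiveUrl and the regular expression are hypothetical, not taken from PDF.js, and the sketch assumes that linked URLs follow the web.archive.org/web/<timestamp>/<original URL> pattern shown in the issue.

```javascript
// Hypothetical helper, not taken from PDF.js. Assumes archive links have
// the form http(s)://web.archive.org/web/<digits>/<original URL>.
const ARCHIVE_URL = /^(https?:\/\/web\.archive\.org\/web\/)(\d+)(\/.*)$/;

function toDirectArchiveUrl(url) {
  // Only plain-timestamp archive URLs match. URLs that already carry a
  // modifier such as "if_", and non-archive URLs, are returned unchanged.
  return url.replace(ARCHIVE_URL, "$1$2if_$3");
}
```

Keeping the rewrite in one small function means the link files stay untouched, and the pattern can be adjusted in a single place if the Internet Archive changes its URL format again.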