Searching with Ctrl+F doesn't work with two words
Attach (recommended) or Link to PDF file here: dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf
Configuration:
- Web browser and its version: Firefox 60/Chromium 66
- Operating system and its version: Linux/Windows 7
- PDF.js version: v1.10.88 or v1.9.426 or the version built into Firefox 60
Steps to reproduce the problem:
- Hit ctrl+f and search for “pioneer of”
- Pioneer will be highlighted, but as soon as you type a space no results are found
pdftotext shows the correct text:
in nit ris hington 1D C
boerge W lacan a pioneer of butali
and an influential man aw at richfield last walk
It works in Chrome’s built-in PDF viewer, so it’s not a problem with the PDF.
Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): https://newspapers.lib.utah.edu/pdfjs1.9/web/viewer.html?file=/udn_files/de/e7/dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf
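The pdftotext check mentioned above can be scripted. The following is a minimal sketch, not part of the original report: it assumes the `pdftotext` command-line tool is on the PATH, and the helper names (`normalize_whitespace`, `contains_phrase`, `extract_text`) are made up for illustration. It only checks whether the phrase is present in the extracted text once whitespace is normalized; it says nothing about why the viewer's search fails.

```python
import re
import subprocess


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) to one space."""
    return re.sub(r"\s+", " ", text).strip()


def contains_phrase(text: str, phrase: str) -> bool:
    """Case-insensitive phrase search that ignores line breaks and extra spaces."""
    return normalize_whitespace(phrase).lower() in normalize_whitespace(text).lower()


def extract_text(pdf_path: str) -> str:
    """Run `pdftotext <pdf> -` and return what it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Lines copied from the pdftotext output quoted in the report.
sample = """in nit ris hington 1D C
boerge W lacan a pioneer of butali
and an influential man aw at richfield last walk"""

print(contains_phrase(sample, "pioneer of"))
```

To run the same check against the attached file, pass the result of `extract_text("dee752ed0f726d8785abf360ca783d91f96f9a2e.pdf")` to `contains_phrase` instead of `sample`.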
Issue Analytics
- State:
- Created 5 years ago
- Comments:5 (2 by maintainers)
Top GitHub Comments
I would love to work on this. @timvandermeij, could you please point me to where I should start on this issue?
Hello guys, as I’m sure you’re aware, other PDF rendering projects suffer from this as well. I am currently using a web app (Nextcloud) that employs pdf.js as a PDF renderer for its browser application.
Here’s an example of a file that I have worked with on other utilities. This is a scanned excerpt from an aircraft’s autopilot service manual, originally printed in the 1970s on unknown equipment.
CenturyIIB-origscan.pdf CenturyIIB-tesseract_hocr-uncleaned.pdf CenturyIIB-tesseract_hocr-cleaned.pdf
The first file is the original scan without a text layer. The second (hocr-uncleaned) is a PDF/A that has been processed with Tesseract (v4.0) to create a hidden text layer. The third (hocr-cleaned) has been de-skewed with unpaper (v6.1), then OCR’d with the same version of Tesseract and output as a PDF/A as well. In both PDF/A cases the original scan has been transcoded to 300 dpi JPEG for the final output.
In both the second and third cases, the ‘hocr’ rendering option was used for Tesseract’s OCR rendering stage (Tesseract has multiple internal renderers). If you take a look at Tesseract’s issue tracker on GitHub, you’ll see they have made some changes to their more recent renderer in an attempt to tackle this issue as well.
Here are some excerpts copied/pasted from various utilities…
hocr-uncleaned on Safari 11.1 (13605.1.33.1.4)
The Century IIB Autopilot is an "Open Loop" system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.
hocr-uncleaned on Chrome 66.0.3359.181
The Century IIB Autopilot is an "Open Loop" system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.
hocr-uncleaned on Adobe Acrobat Pro X
hocr-uncleaned on pdf.js (Firefox 60.0.1)
hocr-cleaned on the same version of Safari above
The Century IIB Autopilot is an "Open Loop’ system which responds only to the dynamics of the aircraft in flight, thus the only ground checks that can be accomplished are functional checks as described in this bulletin.
hocr-cleaned on the same version of Chrome above
hocr-cleaned on the same version of Adobe Acrobat Pro above
hocr-cleaned on the same version of pdf.js (Firefox) above
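Pasted excerpts like these are easier to compare programmatically than by eye. The sketch below is an illustration only, assuming Python's standard `difflib` module; the function name `differing_spans` is hypothetical. It reports the character spans where two extractions disagree, using a short fragment of the Safari excerpts above as input.

```python
import difflib
from typing import List, Tuple


def differing_spans(a: str, b: str) -> List[Tuple[str, str]]:
    """Return (text_in_a, text_in_b) pairs for every span where a and b differ."""
    # autojunk=False keeps difflib from treating frequent characters as junk
    # when the inputs are 200 characters or longer.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Fragments of the hocr-uncleaned and hocr-cleaned excerpts copied from Safari.
uncleaned = 'an "Open Loop" system'
cleaned = 'an "Open Loop\u2019 system'

for left, right in differing_spans(uncleaned, cleaned):
    print(repr(left), "->", repr(right))
```

The same function can be fed the full text copied from each viewer, which makes missing or reordered words stand out.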
For anyone who might want to reproduce my toolchain for other sample files (main/dependency)…
- tesseract 4.00.00alpha (for OCR)
  - leptonica 1.76.0
  - libjpeg-turbo 1.5.3
  - libpng 1.6.34+apng
  - libtiff 4.0.9
- unpaper 6.1 (for de-skew, de-noise, etc)
  - libav 12.1
  - opencv 2.4.13.1
  - freetype2 2.8
- qpdf 8.0.1 (for inspection/modification/creation of pdfs)
  - ghostscript 9.16
- OCRmyPDF 6.2.0 (python v3 wrapper for the above utilities)
All of the above are in virtually any common Linux package repo, OCRmyPDF is available through pip, and modern builds of all of them are in Homebrew for OSX as well (Tesseract must be installed from its git HEAD since v4.0 is still marked beta). I have also run them all on FreeBSD (Tesseract, Leptonica, and unpaper must be built from source). Tesseract/Leptonica is a great baseline to use for making such test files, in my opinion. They’ve brought open source OCR forward by leaps and bounds. Here is an example from a scan of an 18th-century document on which it does an admirable job, even though it does not recognize the long s and transcribes it as a lowercase f.
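For readers who want to generate similar test files, here is a rough sketch of how the OCRmyPDF wrapper might be invoked for the two processed variants. The exact flags used for the sample files are not stated in this thread, so the combination below is an assumption: `--pdf-renderer hocr` and `--output-type pdfa` follow the description of the files, while `--deskew` and `--clean` (which relies on unpaper) are a guess at the "cleaned" variant. The function name `build_ocrmypdf_command` is hypothetical.

```python
import shlex
import subprocess
from typing import List


def build_ocrmypdf_command(src: str, dst: str, clean: bool = False) -> List[str]:
    """Build an ocrmypdf command line; flag choice is an assumption, see above."""
    cmd = ["ocrmypdf", "--pdf-renderer", "hocr", "--output-type", "pdfa"]
    if clean:
        # Assumed flags for the "cleaned" variant: deskew, then unpaper cleanup.
        cmd += ["--deskew", "--clean"]
    cmd += [src, dst]
    return cmd


def run_ocr(src: str, dst: str, clean: bool = False) -> None:
    """Run ocrmypdf; requires ocrmypdf and its dependencies to be installed."""
    subprocess.run(build_ocrmypdf_command(src, dst, clean), check=True)


# Print the two command lines without executing them.
for clean, dst in [
    (False, "CenturyIIB-tesseract_hocr-uncleaned.pdf"),
    (True, "CenturyIIB-tesseract_hocr-cleaned.pdf"),
]:
    print(shlex.join(build_ocrmypdf_command("CenturyIIB-origscan.pdf", dst, clean)))
```

Calling `run_ocr` instead of printing would produce the output files, provided the tools listed above are installed.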