Search: Extract text from documents

By the looks of it, this should be quick and easy to implement now that there’s a good Python module that can do most of the work for us (https://github.com/deanmalmgren/textract).

Notes:

textract has many dependencies (including lxml), so it would be a good idea to make it an optional dependency only.
For large PDFs (e.g. ebooks), textract can take a few minutes to extract the text.
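
A minimal sketch of how the optional dependency could be handled, assuming textract's `process()` function is used as the entry point. The helper names `load_textract` and `try_extract_text` are hypothetical and not part of Wagtail or textract:

```python
import importlib


def load_textract():
    """Return the textract module, or None if it is not installed."""
    try:
        return importlib.import_module("textract")
    except ImportError:
        return None


def try_extract_text(path, backend=None):
    """Extract text from the file at `path`, or return None if no backend is available.

    `backend` is any object with a `process(path)` callable; it defaults to
    the textract module when that can be imported.
    """
    if backend is None:
        backend = load_textract()
    if backend is None:
        # Optional dependency is missing: skip extraction instead of failing.
        return None
    raw = backend.process(path)
    # textract returns bytes on Python 3; decode defensively.
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw
```

Because extraction can take minutes for large files, a helper like this is better called from a background job than from a request handler.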

Implementation ideas:

Add a method to Document model called extract_text. Calling this will use the textract module to extract text from the document file and return it as a string.
Add a new field to Document called extracted_text. This stores the extracted text in the database so the document can be indexed faster. This field should be nullable (a null value would indicate that the text hasn’t been extracted from the document yet).
Create some kind of background task to extract text from documents.
Add a custom manager for Document and override the get_queryset method to run .defer('extracted_text') on the queryset (https://docs.djangoproject.com/en/dev/ref/models/querysets/#defer). This prevents Django from selecting the extracted text when it’s not needed, so large documents don’t slow anything down.
Add extracted_text into search_fields of Document.
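
The nullable field and the background task from the list above could fit together roughly as follows. This is a framework-free sketch, not Wagtail code: the `Document` dataclass is a hypothetical stand-in for the real model, `extract_pending` is an invented helper, and a thread pool stands in for whatever task queue a project would actually use.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Document:
    """Hypothetical stand-in for the Document model (not Wagtail's class)."""
    path: str
    # None means "text has not been extracted yet", as proposed above.
    extracted_text: Optional[str] = None


def extract_pending(
    documents: List[Document],
    extractor: Callable[[str], str],
    max_workers: int = 2,
) -> int:
    """Fill extracted_text for every document that does not have it yet.

    Returns the number of documents that were processed.
    """
    pending = [doc for doc in documents if doc.extracted_text is None]
    paths = [doc.path for doc in pending]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so each text lines up with its document.
        for doc, text in zip(pending, pool.map(extractor, paths)):
            doc.extracted_text = text
    return len(pending)
```

Documents that already carry text are skipped, so the task can be re-run safely after new uploads.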

Issue Analytics

State:
Created 9 years ago
Reactions: 2
Comments: 12 (10 by maintainers)

Top GitHub Comments

2 reactions

khink commented, May 8, 2018

Just a short note: We did an alpha release of https://github.com/fourdigits/wagtail_textract today. We’re hoping this may scratch other people’s itch as well. Maybe this helps pave the way for getting this functionality into Wagtail core, although there should be a fallback when Textract’s installation requirements aren’t met. We welcome any comments, hints, PRs and other feedback. If at some point the package is deemed good enough that the repository can be placed in the Wagtail organisation on GitHub, I’d welcome that.

1 reaction

BertrandBordage commented, May 1, 2018

@khink transcription is the term used by librarians & digital humanities researchers for a plain text version of a document, whether that document is a photograph, a video or an audio recording.

In lots of cases in digital humanities, we want to manually write transcriptions instead of using OCR or extracting the already OCRed text from a PDF. For example, medievalists transcribe documents almost impossible to OCR, even today. Or musicians transcribe scores using languages such as LilyPond, again almost impossible to OCR.

That’s why I think it’s better to have an editable transcription field. And for consistency, use the verb transcribe for the functions/commands that automatically fill the transcription. The transcription method itself should be configurable, of course, so we can specify the backend and its options, for example stating that we want Tesseract with these letters only and this dictionary, etc.
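
A configurable transcription method of the kind described here might look like the following sketch. Everything in it is hypothetical: the registry, the `register_backend` decorator, the `transcribe` function and the `dummy` backend are illustrative names only, and a real Tesseract backend with its own options would have to be registered the same way.

```python
from typing import Callable, Dict

# Hypothetical registry: backend name -> callable(path, **options) -> str
_BACKENDS: Dict[str, Callable[..., str]] = {}


def register_backend(name: str):
    """Decorator that registers a transcription backend under `name`."""
    def decorator(func: Callable[..., str]) -> Callable[..., str]:
        _BACKENDS[name] = func
        return func
    return decorator


@register_backend("dummy")
def dummy_backend(path: str, prefix: str = "", **options) -> str:
    """Placeholder backend for illustration; a real one might call an OCR engine."""
    return f"{prefix}{path}"


def transcribe(path: str, backend: str = "dummy", **options) -> str:
    """Transcribe the file at `path` with the named backend and its options."""
    try:
        func = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"Unknown transcription backend: {backend!r}") from None
    return func(path, **options)
```

The result would only pre-fill the editable transcription field, so a manual transcription can still replace it.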
