Search: Extract text from documents
By the looks of it, this should be quick and easy to implement now that there’s a good Python module that can do most of the work for us (https://github.com/deanmalmgren/textract).
Notes:
- textract has many dependencies (including lxml), so it would be a good idea to make it only an optional dependency.
- For large PDFs (e.g., ebooks), textract can take a few minutes to extract the text.
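One way to keep textract optional is to guard the import and degrade gracefully when it is missing. The following is only a minimal sketch: the function name extract_text comes from the issue, but the `backend` parameter and the fall-back-to-None behaviour are assumptions made for illustration, not an agreed design.

```python
# Optional import: textract may not be installed, so fall back to None.
try:
    import textract  # third-party package with many dependencies (e.g. lxml)
except ImportError:
    textract = None


def extract_text(path, backend=textract):
    """Return the text of the file at `path`, or None if no backend is available.

    `backend` is a hypothetical parameter (not part of textract) that lets
    callers and tests substitute another object with a `process(path)` method.
    """
    if backend is None:
        return None
    # textract.process returns bytes, so decode to get a str.
    return backend.process(path).decode("utf-8", errors="replace")
```

Because extraction can take minutes on large PDFs, a function like this would normally be called from a background job rather than during a request.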
Implementation ideas:
- Add a method to the Document model called extract_text. Calling this will use the textract module to extract text from the document file and return it as a string.
- Add a new field to Document called extracted_text. This is to store the extracted text in the database to make indexing the document into the database faster. This field should be nullable (which would indicate that the text hasn’t been extracted from the document yet).
- Create some kind of background task to extract text from documents.
- Add a custom manager for Document and override the get_queryset method to run .defer('extracted_text') on the queryset (https://docs.djangoproject.com/en/dev/ref/models/querysets/#defer). This prevents Django from selecting the extracted text when it’s not needed so large documents don’t slow anything down.
- Add extracted_text into search_fields of Document.
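The nullable-field and background-task ideas can be sketched without any framework. Everything below is a stand-in written for illustration: the `Document` dataclass is not the real model, and `extract_pending` and its `extractor` argument are hypothetical names. In a Django project the field would presumably be a nullable text field and the loop would run in whatever task runner the project uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Document:
    """Stand-in for the Document model; not the real class."""
    path: str
    # None means "not extracted yet"; "" would mean "extracted, no text found".
    extracted_text: Optional[str] = None


def extract_pending(documents: Iterable[Document],
                    extractor: Callable[[str], str]) -> int:
    """Fill in extracted_text where it is still None; return how many were updated."""
    updated = 0
    for doc in documents:
        if doc.extracted_text is not None:
            continue  # already extracted, so skip the slow step
        doc.extracted_text = extractor(doc.path)
        updated += 1
    return updated
```

Keeping None distinct from an empty string lets the background pass tell "never processed" apart from "processed, but the file contained no text", so empty documents are not re-extracted on every run.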
Issue Analytics
- State:
- Created 9 years ago
- Reactions:2
- Comments:12 (10 by maintainers)
Top GitHub Comments
Just a short note: We did an alpha release of https://github.com/fourdigits/wagtail_textract today. We’re hoping this may scratch other people’s itch as well. Maybe this helps pave the way for getting this functionality into Wagtail core, although there should be a fallback when Textract’s installation requirements aren’t met. We welcome any comments, hints, PRs and other feedback. If at some point the package is deemed good enough that the repository can be placed in the Wagtail organisation on GitHub, I’d welcome that.
@khink transcription is the term used by librarians and digital humanities researchers for a plain-text version of a document, whether that document is a photograph, a video, or an audio recording.
In lots of cases in digital humanities, we want to write transcriptions manually instead of using OCR or extracting the already OCRed text from a PDF. For example, medievalists transcribe documents that are almost impossible to OCR, even today. Musicians transcribe scores using languages such as LilyPond, because scores are again almost impossible to OCR.
That’s why I think it’s better to have an editable transcription field and, for consistency, to use the verb transcribe for the functions/commands that automatically fill in the transcription. The transcription method itself should be configurable, of course, so we can specify the backend and its options, for example that we want Tesseract with only these letters and this dictionary.
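A configurable transcription method could look roughly like the registry below. This is a sketch under assumptions: `BACKENDS`, `register_backend`, the "dummy" backend, and its `prefix` option are all hypothetical names chosen for illustration. A real backend would call textract or Tesseract with its own options.

```python
from typing import Callable, Dict

# Hypothetical registry mapping backend names to callables.
BACKENDS: Dict[str, Callable[..., str]] = {}


def register_backend(name):
    """Decorator that registers a transcription backend under `name`."""
    def decorator(func):
        BACKENDS[name] = func
        return func
    return decorator


@register_backend("dummy")
def dummy_backend(path, **options):
    """Placeholder backend that only formats a string."""
    prefix = options.get("prefix", "")
    return f"{prefix}transcription of {path}"


def transcribe(path, backend="dummy", **options):
    """Run the named backend on `path`, passing backend-specific options through."""
    try:
        func = BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown transcription backend: {backend!r}") from None
    return func(path, **options)
```

Since the transcription field is meant to stay editable, code that calls transcribe would presumably fill the field only when it is empty, so manual transcriptions are never overwritten.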