Search: Extract text from documents


By the looks of it, this should be quick and easy to implement now that there’s a good Python module that can do most of the work for us (https://github.com/deanmalmgren/textract).

Notes:

  • textract has many dependencies (including lxml), so it would be a good idea to make it an optional dependency only
  • For large PDFs (e.g., ebooks), textract can take a few minutes to extract the text
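
A minimal sketch of how the optional dependency could be handled, assuming extraction should simply be skipped when textract is not installed. The names `load_textract` and `extract_text`, and the `backend` parameter, are hypothetical and exist only for illustration; the only textract call used is `textract.process(path)`, which returns bytes.

```python
import importlib


def load_textract():
    """Return the textract module, or None when it is not installed."""
    try:
        return importlib.import_module("textract")
    except ImportError:
        return None


def extract_text(path, backend=None):
    """Extract text from the file at `path`; return None if no backend exists.

    `backend` is a hypothetical injection point (useful in tests). It only
    needs a `process(path)` callable returning bytes, as textract.process does.
    """
    if backend is None:
        backend = load_textract()
    if backend is None:
        return None  # optional dependency missing: skip extraction
    raw = backend.process(path)
    # Replace undecodable bytes rather than failing on a single bad character.
    return raw.decode("utf-8", errors="replace")
```

Returning `None` here matches the idea of a nullable field: callers can treat a missing backend the same way as "text not extracted yet".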

Implementation ideas:

  • Add a method to Document model called extract_text. Calling this will use the textract module to extract text from the document file and return it as a string.
  • Add a new field to Document called extracted_text. This is to store the extracted text in the database to make indexing the document into the database faster. This field should be nullable (which would indicate that the text hasn’t been extracted from the document yet).
  • Create some kind of background task to extract text from documents.
  • Add a custom manager for Document and override the get_queryset method to run .defer('extracted_text') on the queryset (https://docs.djangoproject.com/en/dev/ref/models/querysets/#defer). This prevents Django from selecting the extracted text when it’s not needed, so large documents don’t slow anything down.
  • Add extracted_text into search_fields of Document.
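
The nullable-field convention and the background pass can be illustrated without Django. The sketch below is framework-free and uses a plain dataclass as a stand-in for the Document model; `extract_pending` and the `extractor` callable are hypothetical names. In a real project the loop would presumably run over a queryset filtered with `extracted_text__isnull=True` and save each document, which is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Document:
    """Stand-in for the Document model; this is not a Django model."""
    path: str
    extracted_text: Optional[str] = None  # None means "not extracted yet"


def extract_pending(documents: Iterable[Document],
                    extractor: Callable[[str], str]) -> int:
    """Fill extracted_text for documents that do not have it yet.

    Returns the number of documents updated. A document whose extraction
    fails is left as None so that a later run can retry it.
    """
    updated = 0
    for doc in documents:
        if doc.extracted_text is not None:
            continue  # already extracted; do not repeat slow work
        try:
            doc.extracted_text = extractor(doc.path)
        except Exception:
            continue  # keep None so the next pass picks it up again
        updated += 1
    return updated
```

Because extraction of a large PDF can take minutes, a pass like this belongs in a background job or management command rather than in the request that uploads the document.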

Issue Analytics

  • State: open
  • Created: 9 years ago
  • Reactions: 2
  • Comments: 12 (10 by maintainers)

Top GitHub Comments

2 reactions
khink commented, May 8, 2018

Just a short note: We did an alpha release of https://github.com/fourdigits/wagtail_textract today. We’re hoping this may scratch other people’s itch as well. Maybe this helps pave the way for getting this functionality into Wagtail core, although there should be a fallback when Textract’s installation requirements aren’t met. We welcome any comments, hints, PRs and other feedback. If at some point the package is deemed good enough that the repository can be placed in the Wagtail organisation on GitHub, I’d welcome that.

1 reaction
BertrandBordage commented, May 1, 2018

@khink transcription is the term used by librarians & digital humanities researchers for a plain text version of a document, whether it is a photograph, a video or an audio document.

In lots of cases in digital humanities, we want to manually write transcriptions instead of using OCR or extracting the already OCRed text from a PDF. For example, medievalists transcribe documents almost impossible to OCR, even today. Or musicians transcribe scores using languages such as LilyPond, again almost impossible to OCR.

That’s why I think it’s better to have an editable transcription field. And for consistency, use the verb transcribe for the functions/commands that automatically fill the transcription. The transcription method itself should be configurable, of course, so we can specify the backend and its options, for example to say that we want Tesseract with these letters only and this dictionary, etc.

