Messed order of parsed texts

Hi Guys,

I found something odd about the parsed text: the bulk paragraphs are wrongly placed at the bottom of the output. Could anyone help me correct the order?

Thanks so much. Luke

The screenshot below shows the input PDF file on the left and the output text on the right:

[image]

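Below is the code:

    from tika import parser

    # pdf_file and pdf_file_nm are assumed to be defined earlier in the script
    if pdf_file.lower().endswith('.pdf'):
        # Send the PDF to the local Tika server and take the extracted text
        parsed_curr_pg = parser.from_file(pdf_file, 'http://localhost:9998/tika')
        curr_pg_text = parsed_curr_pg['content']
        # Append the text to the output file, followed by a page-break marker
        with open('%s.txt' % pdf_file_nm, 'a', encoding='utf-8') as curr_pg:
            curr_pg.write(curr_pg_text)
            curr_pg.write('\\page_break')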

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 13 (2 by maintainers)

Top GitHub Comments

1 reaction
jfrfonseca commented, Mar 8, 2021

I managed to get it to work! I used a Docker container to run Tika as a server. Following the instructions in the official apache tika-docker repo (https://github.com/apache/tika-docker#custom-config), I created a custom config file setting the sortByPosition property in the PDF parser, and ran the container with a volume mapped to the local config file. Results as expected!
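The comment above does not include the config file itself. As a rough sketch only, assuming the Tika 1.x XML config layout described on the Tika configuration page, a file that turns on sortByPosition for the PDF parser might be generated like this. The file name `tika-config.xml` and the helper `write_tika_config` are hypothetical names chosen for illustration; the parser class names and the `param` syntax should be checked against the Tika version in use.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed Tika 1.x config layout (not taken from the thread):
# the default parser handles everything except PDFs, and the PDF parser
# is declared separately with sortByPosition switched on.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_tika_config(path="tika-config.xml"):
    """Write the sketch config to disk and return its path (hypothetical helper)."""
    config_path = Path(path)
    config_path.write_text(TIKA_CONFIG, encoding="utf-8")
    # Parse the file back so malformed XML is caught before Tika sees it.
    ET.parse(str(config_path))
    return config_path


if __name__ == "__main__":
    print(write_tika_config())
```

The resulting file would then be mounted into the container and passed to the Tika server as its config, as described in the tika-docker README section linked above. The Python code from the original post should not need to change, since it only talks to the server endpoint.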

1 reaction
paconius commented, Nov 17, 2019

Sorry, I’m in the same boat as you WRT having no concrete idea of how to fix this. I am simply passing along new leads as I find them. AFAIK, you need to create a separate config file to provide to the parser. The format of the file is described here: https://tika.apache.org/1.18/configuring.html. However, I’m not sure what string should go in the config file to set the sort preference. Also, I’m not 100% sure changing the sort order will fix the problem. However, it appears to be the best option based on what I’ve read in other threads.
