Messed order of parsed texts

Hi Guys,

I found something odd, and not in a pleasant way, about the parsed text: the bulk paragraphs are wrongly placed at the bottom of the output. Could anyone help correct this order?

Thanks so much. Luke

Screenshot (not reproduced here): input PDF file on the left, output text on the right.

Below is the code:

    from tika import parser

    # pdf_file and pdf_file_nm are set earlier in the script (not shown)
    if pdf_file.endswith('.pdf') or pdf_file.endswith('.PDF'):
        parsed_curr_pg = parser.from_file(pdf_file, 'http://localhost:9998/tika')
        curr_pg_text = parsed_curr_pg['content']
        with open('%s.txt' % pdf_file_nm, 'a', encoding='utf-8') as curr_pg:
            curr_pg.write(curr_pg_text)
            # Escaped backslash: writes the literal marker \page_break
            curr_pg.write('\\page_break')

Issue Analytics

Created 4 years ago
Comments: 13 (2 by maintainers)

Top GitHub Comments

1 reaction

jfrfonseca commented, Mar 8, 2021

I managed to get it to work! I used a Docker container to run Tika as a server. Following the instructions in the official apache tika-docker repo (https://github.com/apache/tika-docker#custom-config), I created a custom config file that sets the sortByPosition property in the PDF parser, and ran the container with a volume mapped to the local config file. Results as expected!
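The custom config file described in this comment might look something like the following. This is only a sketch, assuming the Tika 1.x config layout in which the PDF parser is excluded from the default parser and re-declared with a `sortByPosition` parameter. The commenter's actual file is not shown in the thread, the file name `tika-config.xml` and the helper `write_tika_config` are hypothetical, and the layout should be checked against the Tika version in use.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed layout for a Tika config that turns on sortByPosition
# for the PDF parser; verify it against your Tika version's docs.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_tika_config(path="tika-config.xml"):
    """Write the config file (hypothetical helper) and return its path."""
    # Fail early if the XML above is malformed.
    ET.fromstring(TIKA_CONFIG.encode("utf-8"))
    target = Path(path)
    target.write_text(TIKA_CONFIG, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_tika_config())
```

The resulting file would then be mounted into the container and handed to the Tika server as the tika-docker README describes.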

1 reaction

paconius commented, Nov 17, 2019

Sorry, I’m in the same boat as you WRT having no concrete idea of how to fix this. I am simply passing along new leads as I find them. AFAIK, you need to create a separate config file to provide to the parser. The format of the file is described here: https://tika.apache.org/1.18/configuring.html. However, I’m not sure what string should go in the config file to set the sort preference. Also, I’m not 100% sure changing the sort order will fix the problem; however, it appears to be the best option based on what I’ve read in other threads.
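For a Tika server that is already running, one possible alternative to a config file is a per-request header. The sketch below rests on two assumptions: that the server accepts PDF parser settings through `X-Tika-PDF`-prefixed headers (support varies by Tika version), and that the installed tika-python `parser.from_file` accepts `serverEndpoint` and `headers` arguments. The helper names `sort_by_position_headers` and `parse_pdf_sorted` are hypothetical, and the endpoint is a placeholder.

```python
def sort_by_position_headers(enabled=True):
    """Build request headers asking the PDF parser to sort text by position.

    Assumption: the server maps X-Tika-PDF<option> headers onto its
    PDF parser configuration; check this for your Tika version.
    """
    return {"X-Tika-PDFsortByPosition": "true" if enabled else "false"}


def parse_pdf_sorted(pdf_file, endpoint="http://localhost:9998"):
    """Parse one PDF through a running Tika server and return its text."""
    # Imported lazily so the header helper works without tika installed.
    from tika import parser

    parsed = parser.from_file(
        pdf_file, serverEndpoint=endpoint, headers=sort_by_position_headers()
    )
    # 'content' can be None when no text is extracted.
    return parsed.get("content") or ""
```

If the header has no effect on a given server version, the config file route remains the fallback.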
