Messed order of parsed texts
Hi guys,
I found something odd, and not in a pleasant way, about the parsed text: the bulk paragraphs are wrongly placed at the bottom of the output. Could anyone help correct this order?
Thanks so much. Luke
The screenshot below shows the input PDF file (left) and the output text (right):
Below is the code:
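from tika import parser

# Match the extension case-insensitively (.pdf, .PDF, .Pdf, ...)
if pdf_file.lower().endswith('.pdf'):
    parsed_curr_pg = parser.from_file(pdf_file, 'http://localhost:9998/tika')
    # 'content' may be None when no text is extracted
    curr_pg_text = parsed_curr_pg['content'] or ''
    with open('%s.txt' % pdf_file_nm, 'a', encoding='utf-8') as curr_pg:
        curr_pg.write(curr_pg_text)
        # Escape the backslash so the literal marker \page_break is written
        curr_pg.write('\\page_break')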
Issue Analytics
- State:
- Created 4 years ago
- Comments:13 (2 by maintainers)
Top GitHub Comments
I managed to get it to work! I used a Docker container to run Tika as a server. Following the instructions in the official Apache tika-docker repo (https://github.com/apache/tika-docker#custom-config), I created a custom config file that sets the sortByPosition property in the PDF parser, and ran the container with a volume mapped to the local config file. Results as expected!
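For anyone landing here later, a minimal sketch of what such a config file might look like, written out from Python so it can sit next to the parsing script. The commenter did not post their actual file, so the XML layout below is an assumption based on the Tika configuration docs (exclude the PDF parser from the default parser, then re-declare it with sortByPosition set to true). The file name tika-config.xml and the helper write_tika_config are hypothetical.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# Assumed layout (not taken from the thread): the PDF parser is excluded
# from the default parser and re-declared with sortByPosition enabled.
TIKA_CONFIG = """<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="sortByPosition" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
"""


def write_tika_config(path="tika-config.xml"):
    """Write the sketch config to `path` and return it as a Path.

    Hypothetical helper: the name and default path are illustrative only.
    """
    # Parse first so a malformed template fails before anything is written.
    ET.fromstring(TIKA_CONFIG.encode("utf-8"))
    config_path = Path(path)
    config_path.write_text(TIKA_CONFIG, encoding="utf-8")
    return config_path


if __name__ == "__main__":
    print(write_tika_config())
```

The resulting file would then be mapped into the container as described in the custom-config section of the tika-docker repo linked above. Whether the exact parameter syntax matches your Tika version is worth checking against the configuration docs for that version.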
Sorry, I’m in the same boat as you WRT having no concrete idea of how to fix this. I am simply passing along new leads as I find them. AFAIK, you need to create a separate config file and provide it to the parser. The format of the file is described here: https://tika.apache.org/1.18/configuring.html. However, I’m not sure what string should go in the config file to set the sort preference. Also, I’m not 100% sure changing the sort order will fix the problem. However, it appears to be the best option based on what I’ve read in other threads.