
Two single-row tables in two separate PDFs don't get read by camelot as tables


Windows-10-10.0.19043-SP0 Python 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)] NumPy 1.21.2 OpenCV 4.5.3 Camelot 0.10.1

Describe the bug

2 of 55 PDFs with K-12 education table data contain one-row tables that are not processed as tables: https://www.dpi.nc.gov/media/8350/open https://www.dpi.nc.gov/media/8325/open

Camelot doesn't find a table in either file, likely because each table has only one data row.

Steps to reproduce the bug

Ran tables = camelot.read_pdf(weburl, pages='all') in a loop, with weburl set to each of the two URLs above.

Expected behavior

Should have one row table output for these two separate 1 page pdfs.

Code

import camelot

# weburl is set to each of the two URLs listed under "PDF", in a loop
tables = camelot.read_pdf(weburl, pages='all')

PDF

https://www.dpi.nc.gov/media/8350/open https://www.dpi.nc.gov/media/8325/open

Screenshots

Environment

  • OS: Windows-10-10.0.19043-SP0
  • Python version: 3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)]
  • Numpy version: 1.21.2
  • OpenCV version: 4.5.3
  • Ghostscript version: (not reported)
  • Camelot version: 0.10.1

Additional context

There are 55 educator prep URLs being cycled through in a loop. These two failed to produce tables, and I bet it's related to their having only one data row after the headers.
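The loop described above could be sketched roughly as follows, to record which URLs yield no tables. This is a minimal sketch, not the reporter's actual script: the helper name find_urls_without_tables is hypothetical, and the reader is passed in as a parameter so the loop can be exercised without camelot installed.

```python
from typing import Callable, Iterable, List


def find_urls_without_tables(
    urls: Iterable[str], read_pdf: Callable[..., object]
) -> List[str]:
    """Return the URLs for which read_pdf detected no tables.

    read_pdf is expected to behave like camelot.read_pdf: it takes a
    path or URL plus keyword arguments and returns a sized collection.
    """
    failed = []
    for url in urls:
        tables = read_pdf(url, pages="all")
        # A length of 0 means no table was detected in this PDF.
        if len(tables) == 0:
            failed.append(url)
    return failed


# Usage with camelot (assumes camelot-py is installed and the URLs are reachable):
#   import camelot
#   urls = [
#       "https://www.dpi.nc.gov/media/8350/open",
#       "https://www.dpi.nc.gov/media/8325/open",
#   ]
#   print(find_urls_without_tables(urls, camelot.read_pdf))
```

Collecting the failing URLs in one pass makes it easier to retry only those files with different extraction settings.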

Issue Analytics

  • State:open
  • Created 2 years ago
  • Comments:5 (3 by maintainers)

Top GitHub Comments

2 reactions
myrhillion commented, Oct 1, 2021

Yeah it wasn’t returning a table on those two examples, I thought it may have been due to one line tables. It was odd that out of 55 pdfs with similar table formatting, the only two that failed to return tables were the one data row tables in those 2 pdfs.

I’ll try the suggestions you provided and see if that works when I can. Thank you.

On Fri, Oct 1, 2021 at 7:29 PM Tiago Samaha Cordeiro < @.***> wrote:

If it simply doesn't return the table, I guess it's because of the table's background color.

Try the line_scale argument (docs https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines) to fine-tune line detection.

tables = camelot.read_pdf(weburl, line_scale=45, pages='all')

Another thing to try is process_background (docs https://camelot-py.readthedocs.io/en/master/user/advanced.html#process-background-lines), because the header row is completely black.


– Doug Taggart
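The two suggestions quoted above (line_scale and process_background) could be tried one after another until a table is detected. The sketch below is an assumption about how one might automate that, not something confirmed to fix these PDFs: read_with_fallbacks and DEFAULT_ATTEMPTS are hypothetical names, the value 45 is taken from the comment above, and the reader is passed in as a parameter so the logic can be checked without camelot.

```python
from typing import Any, Callable, Dict, Optional, Sequence, Tuple

# Keyword sets to try in order. The empty dict is the default call;
# line_scale and process_background are the options suggested in the thread.
DEFAULT_ATTEMPTS: Sequence[Dict[str, Any]] = (
    {},
    {"line_scale": 45},
    {"process_background": True},
    {"line_scale": 45, "process_background": True},
)


def read_with_fallbacks(
    source: str,
    read_pdf: Callable[..., Any],
    attempts: Sequence[Dict[str, Any]] = DEFAULT_ATTEMPTS,
) -> Tuple[Optional[Any], Optional[Dict[str, Any]]]:
    """Return (tables, kwargs) for the first attempt that finds a table.

    Returns (None, None) if every attempt comes back empty.
    """
    for kwargs in attempts:
        tables = read_pdf(source, pages="all", **kwargs)
        if len(tables) > 0:
            return tables, dict(kwargs)
    return None, None


# Usage with camelot (assumes camelot-py is installed):
#   import camelot
#   tables, used = read_with_fallbacks(weburl, camelot.read_pdf)
#   print(used)  # the settings that produced a table, or None
```

Returning the keyword arguments alongside the tables shows which setting, if any, made the difference for a given PDF.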

0 reactions
myrhillion commented, Nov 9, 2021

Sorry, I haven't had time to try the fix on this project yet. I had to put it on the back burner for a bit.

On Tue, Nov 9, 2021 at 10:53 AM Tiago Samaha Cordeiro < @.***> wrote:

@myrhillion (https://github.com/myrhillion) did it work?


– Doug Taggart

