Incorrectly Parsed Object on Microsoft invoice PDF

Hi!

Thanks for a really welcome module.

I process thousands of different kinds of PDFs generated by other people, and ran into trouble with one specific PDF from Microsoft, which produces the following error:

Incorrectly parsed object contents

These are the PDFs I am trying to combine. I think the offending one is the first, as it’s the only one not generated by Puppeteer: Din_Microsoft-fakturaoversikt.pdf and 3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf

Presumably the PDF doesn’t follow the standards, though there’s little I can do about that.

My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don’t really need to parse it any more than what’s needed to append it to my PDF.

My code looks as follows:

const fs = require('fs');
const { PDFDocumentFactory, PDFDocumentWriter } = require('pdf-lib');

// pdfsToMerge is an array of file paths; filePath is where the merged PDF is written.
// `logger` is the application's own logger, defined elsewhere.
async function mergePdfs(pdfsToMerge, filePath) {
  const mergedPdf = PDFDocumentFactory.create();
  for (const pdfFilePath of pdfsToMerge) {
    const pdf = fs.readFileSync(pdfFilePath);
    const pagesToMerge = PDFDocumentFactory.load(pdf).getPages();
    for (const page of pagesToMerge) {
      mergedPdf.addPage(page);
    }
  }
  const mergedPdfFile = await PDFDocumentWriter.saveToBytes(mergedPdf);
  // writeFileSync is synchronous, so it needs no await.
  fs.writeFileSync(filePath, mergedPdfFile);
  logger.verbose("Merged PDFs", { mergedPdfs: pdfsToMerge, filePath });
}
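
One way to confirm which input triggers the error is to parse each file on its own and collect the failures. The following is a minimal sketch, not part of pdf-lib: `findUnparseable` and its `parse` parameter are hypothetical names, and the parser is passed in so the helper does not depend on a particular library version.

```javascript
// Hypothetical helper: try to parse each input separately and report
// which ones throw. `inputs` is an array of { name, bytes } records and
// `parse` is any function that throws when it cannot parse `bytes`.
function findUnparseable(inputs, parse) {
  const failures = [];
  for (const { name, bytes } of inputs) {
    try {
      parse(bytes);
    } catch (err) {
      // Keep the file name next to the message so the culprit is obvious.
      failures.push({ name, message: err.message });
    }
  }
  return failures;
}
```

With the API used above, `parse` could be `bytes => PDFDocumentFactory.load(bytes)` and each `bytes` value the result of `fs.readFileSync` on one path.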

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
Hopding commented, Jun 9, 2019

@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it’s working well for you, then there shouldn’t be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes).

It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference) it is still necessary to parse some objects.

Implementing this sort of “lazy parsing” would take more than just writing a function, though. It would be necessary to modify some of pdf-lib’s parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
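
The difference between the two strategies can be illustrated with a toy model. This is only a sketch and not pdf-lib's parser: the "file" is a `Map` from object id to raw text, an object counts as valid if it is JSON, and its `refs` array lists the ids it points to. All names here (`parseObject`, `parseEagerly`, `parseReachable`, `rawObjects`) are made up for the illustration.

```javascript
// Toy model, NOT pdf-lib's parser. Parsing one object throws if it is malformed.
function parseObject(raw) {
  return JSON.parse(raw);
}

// Eager: parse every object in file order. A single malformed object
// aborts the whole load, even if nothing else refers to it.
function parseEagerly(rawObjects) {
  const parsed = new Map();
  for (const [id, raw] of rawObjects) {
    parsed.set(id, parseObject(raw));
  }
  return parsed;
}

// Lazy: parse only the objects reachable from a starting object,
// following references as they are discovered.
function parseReachable(rawObjects, startId) {
  const parsed = new Map();
  const pending = [startId];
  while (pending.length > 0) {
    const id = pending.pop();
    if (parsed.has(id)) continue; // already visited
    const obj = parseObject(rawObjects.get(id));
    parsed.set(id, obj);
    pending.push(...(obj.refs || []));
  }
  return parsed;
}

const rawObjects = new Map([
  [1, '{"type":"Page","refs":[2]}'],
  [2, '{"type":"Contents","refs":[]}'],
  [3, '{"type":"Metadata", broken'], // malformed, and not referenced by the page
]);
```

Here `parseEagerly(rawObjects)` throws on object 3, while `parseReachable(rawObjects, 1)` parses only objects 1 and 2. The lazy version still fails if a malformed object is reachable from the page, which is why some parsing remains unavoidable.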

If this is something you’d be interested in working on, I’d be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you’d like to continue the discussion further!

1 reaction
DanielJackson-Oslo commented, Jun 9, 2019

> @DanielJackson-Oslo I’d like to add the Din_Microsoft-fakturaoversikt.pdf file you shared to the pdf-lib GitHub repo to create a regression test for this issue. Do you mind? Does the file contain any sensitive information?

@Hopding Feel free to use it! It’s a bill for my own Office 365, presumably the same one they generate for all customers.

Thanks for the quick follow up. Looking forward to 0.6.4 releasing. How stable is the rc?
