Incorrectly Parsed Object on Microsoft invoice PDF
Hi!
Thanks for a really welcome module.
I’m encountering thousands of different kinds of PDFs generated by other people, and got into some trouble with one specific one from Microsoft, getting the following error:
Incorrectly parsed object contents
These are the PDFs that I’m trying to combine. I think the offending one is the top one, as it’s the only one not generated by Puppeteer: Din_Microsoft-fakturaoversikt.pdf, 3e63ebd0-8775-11e9-888e-1f95e38b402c.pdf
Presumably the PDF doesn’t follow the standards, though there’s little I can do about that.
My use case is to combine this PDF with a generated page that gives some info about it, for accounting purposes. As such, I don’t really need to parse it any more than what’s needed to append it to my PDF.
My code looks as follows:
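const fs = require('fs');
const { PDFDocumentFactory, PDFDocumentWriter } = require('pdf-lib');
// `logger` is the application's own logger, defined elsewhere.

// pdfsToMerge is an array of file paths
async function mergePdfs(pdfsToMerge, filePath) {
  const mergedPdf = PDFDocumentFactory.create();
  pdfsToMerge.forEach(pdfFilePath => {
    const pdf = fs.readFileSync(pdfFilePath);
    // Parsing happens in load(); this is where the error is thrown.
    const pagesToMerge = PDFDocumentFactory.load(pdf).getPages();
    pagesToMerge.forEach(page => {
      mergedPdf.addPage(page);
    });
  });
  const mergedPdfFile = await PDFDocumentWriter.saveToBytes(mergedPdf);
  // writeFileSync is synchronous, so it needs no await.
  fs.writeFileSync(filePath, mergedPdfFile);
  logger.verbose("Merged PDFs", { mergedPdfs: pdfsToMerge, filePath });
}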
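One way to confirm which input triggers the error is to load each file on its own and record the failures. The following is only a minimal sketch: `findUnparseableInputs` and the injected `loadPdf` callback are hypothetical names, not part of pdf-lib. In the setup above, `loadPdf` would presumably be something like `bytes => PDFDocumentFactory.load(bytes)`; a stand-in loader is used here so the example runs without any PDF files.

```javascript
// Hypothetical helper: tries to load each input separately and collects the
// ones that throw. `loadPdf` is injected, so this does not depend on any
// particular pdf-lib version.
function findUnparseableInputs(inputs, loadPdf) {
  const failures = [];
  for (const { name, bytes } of inputs) {
    try {
      loadPdf(bytes);
    } catch (err) {
      // Record the failure and keep going so every input gets checked.
      failures.push({ name, message: err.message });
    }
  }
  return failures;
}

// Stand-in loader (an assumption for illustration): rejects one input with
// the error message from the report above.
const fakeLoad = bytes => {
  if (bytes === 'bad') throw new Error('Incorrectly parsed object contents');
  return { pages: [] };
};

const report = findUnparseableInputs(
  [
    { name: 'generated.pdf', bytes: 'ok' },
    { name: 'invoice.pdf', bytes: 'bad' },
  ],
  fakeLoad
);
console.log(report);
```

With real files, each entry would carry the bytes from `fs.readFileSync`, and the returned list would name the file or files that fail to parse.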
@DanielJackson-Oslo The RC should be perfectly stable. The only change it includes is the fix for this issue. And of course, it passed all the unit and integration tests before I cut it. So if it’s working well for you, then there shouldn’t be anything to worry about. (I always cut RCs for every release, no matter how trivial the changes).
It would certainly be possible to get away with less object parsing (and therefore tolerate more invalid objects) if you just want to copy pages. However, in order to find and copy the page objects (and any other objects they reference) it is still necessary to parse some objects.
Implementing this sort of “lazy parsing” would take more than just writing a function, though. It would be necessary to modify some of pdf-lib’s parsing code. The parser currently scans input PDFs from start to finish, parsing each object it encounters along the way.
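To make the difference concrete, here is a toy sketch of eager versus lazy parsing. It is not pdf-lib's parser and does not use real PDF syntax: JSON strings stand in for raw objects, and `parseAll` and `parseReachable` are hypothetical names. A real implementation would also have to locate objects in the file, presumably via the cross-reference table, which this sketch ignores.

```javascript
// Toy object store: object number -> raw, unparsed text.
const rawObjects = new Map([
  [1, '{"type":"Page","refs":[2]}'],
  [2, '{"type":"Contents","refs":[]}'],
  [3, '{"type": broken'], // malformed, but nothing reachable from the page refers to it
]);

// Eager: parse every object in order, so one malformed object fails the load.
function parseAll(raw) {
  const parsed = new Map();
  for (const [id, text] of raw) parsed.set(id, JSON.parse(text));
  return parsed;
}

// Lazy: parse only the objects reachable from the given root object.
function parseReachable(raw, rootId) {
  const parsed = new Map();
  const pending = [rootId];
  while (pending.length > 0) {
    const id = pending.pop();
    if (parsed.has(id)) continue; // already parsed; also guards against cycles
    const obj = JSON.parse(raw.get(id));
    parsed.set(id, obj);
    pending.push(...obj.refs);
  }
  return parsed;
}

let eagerFailed = false;
try {
  parseAll(rawObjects);
} catch (err) {
  eagerFailed = true;
}

const reachable = parseReachable(rawObjects, 1);
console.log(eagerFailed, [...reachable.keys()]);
```

In this toy setup the eager pass fails on object 3, while the lazy pass never touches it and still recovers the page and its contents. A malformed object that a page does reference would still fail in both approaches.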
If this is something you’d be interested in working on, I’d be open to working with you on it. Just note that it would require learning about the structure of PDF files. Please open a new issue if you’d like to continue the discussion further!
@Hopding Feel free to use it! It’s a bill for my own Office 365, presumably the same one they generate for all customers.
Thanks for the quick follow-up. Looking forward to the 0.6.4 release. How stable is the RC?