Embedded pages are uncompressed

Hi,

I’m migrating our pdf generation tool from HummusJS to pdf-lib.

I have to merge two PDF pages into a new PDF. To achieve that, I embed the pages into a new PDF document. The generated file is significantly larger than the original one. When I compress the generated PDF, the file size is closer to what I expect.

I did a test case to show the issue:

import {readFileSync, writeFileSync} from 'fs'
import {PDFDocument} from 'pdf-lib'

(async () => {
    // Destination document that will receive the embedded page
    const pdfDoc = await PDFDocument.create();

    // Load the source PDF and embed its first page into the destination
    const pdfSource = await PDFDocument.load(readFileSync('./Lorem ipsum dolor sit amet.pdf'));
    const embeddedPage = await pdfDoc.embedPage(pdfSource.getPage(0));

    // Draw the embedded page onto a new page
    const page = pdfDoc.addPage();
    page.drawPage(embeddedPage);

    writeFileSync('embedded.pdf', await pdfDoc.save());
})();

The source file is a simple 200 KB PDF made with Word: Lorem ipsum dolor sit amet.pdf. The file generated by this script is 1.5 MB, which is 7.5x larger: embedded.pdf.

I think that the LZW stream from the source file is decompressed before being embedded into the new PDF file.

Am I correct?

EDIT: I inspected both PDFs and found a FlateDecode stream in the source PDF that is stored decoded in the destination PDF.
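
One rough way to repeat this inspection without a PDF viewer is to scan the raw bytes of both files and count how many stream objects declare a FlateDecode filter. The sketch below is a heuristic, not a PDF parser: the function name `countStreamFilters` is made up for illustration, and it assumes stream dictionaries are written in plain text directly before the `stream` keyword, which is how PDF files normally store them.

```javascript
// Heuristic scan of raw PDF bytes (hypothetical helper, not part of pdf-lib).
// Counts stream objects by whether their dictionary declares /FlateDecode.
function countStreamFilters(pdfBytes) {
  // latin1 maps every byte to exactly one character, so binary data decodes safely
  const text = Buffer.from(pdfBytes).toString('latin1');

  // "N G obj", then the object header up to the "stream" keyword,
  // without ever crossing an "endobj" (objects that have no stream are skipped)
  const streamObject = /\d+\s+\d+\s+obj((?:(?!endobj)[\s\S])*?)stream\r?\n/g;

  // /Filter /FlateDecode or /Filter [ ... /FlateDecode ... ]
  const flate = /\/Filter\s*(?:\/FlateDecode|\[[^\]]*\/FlateDecode)/;

  const counts = { flate: 0, notFlate: 0 };
  for (const match of text.matchAll(streamObject)) {
    if (flate.test(match[1])) counts.flate += 1;
    else counts.notFlate += 1;
  }
  return counts;
}
```

Running it on the bytes of the source file and of embedded.pdf (for example via `readFileSync`) should show whether the destination file contains more streams without a FlateDecode filter. Note that `notFlate` also counts streams that use a different filter, such as JPEG images.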

Regards,

Julien

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 5

Top GitHub Comments

8 reactions
tobiasfuhlroth commented, Jul 20, 2021

I am also facing a problem related to uncompressed embedded pages. In my scenario, I want to apply a page background (letter paper) to every page of a content PDF. The only way I found to achieve this is to embed both the page background and all content pages into a new PDF and then draw them in the correct order on newly created pages.

export const applyPageBackground = async function (contentPdfBuffer, pageBackgroundPdfBuffer) {
    const contentPdfDocument = await PDFDocument.load(contentPdfBuffer)
    const pageBackgroundPdfDocument = await PDFDocument.load(pageBackgroundPdfBuffer)

    const pdfDocument = await PDFDocument.create()
    const embeddedLayoutPage = await pdfDocument.embedPage(pageBackgroundPdfDocument.getPage(0))
    const embeddedContentPages = await pdfDocument.embedPdf(contentPdfDocument, contentPdfDocument.getPageIndices())

    contentPdfDocument.getPages().forEach((contentPage, index) => {
        const page = pdfDocument.addPage()
        page.drawPage(embeddedLayoutPage)
        page.drawPage(embeddedContentPages[index])
    })

    return await pdfDocument.save()
}

So far, so good. But because the embedded pages are not compressed, the final PDF is approx. 750 KB instead of 85 KB. Those values are for a 5-page PDF. If I do the same with 50 pages, I end up with over 8 MB (with compression it’s down to 975 KB).

With the proposed change in the PR, I end up with approx. 125 KB for the 5-page PDF, which is fine.
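
To get a feel for why uncompressed page content inflates the file this much, the sketch below deflates a made-up, text-like content stream and compares sizes. It is only an illustration: the sample operators are invented, `deflateSize` is a hypothetical helper, and it assumes Node 18 or later, where `Blob`, `Response` and `CompressionStream` are available as globals. The `'deflate'` format used here is the zlib format, which is what a FlateDecode stream contains.

```javascript
// Sketch: measure how small a text-like content stream becomes after deflate.
// Assumes Node 18+ (global Blob, Response and CompressionStream).
async function deflateSize(bytes) {
  // 'deflate' in the Compression Streams API is the zlib format used by FlateDecode
  const compressed = new Blob([bytes])
    .stream()
    .pipeThrough(new CompressionStream('deflate'));
  const buffer = await new Response(compressed).arrayBuffer();
  return buffer.byteLength;
}

// Hypothetical page content: the same text-drawing operators repeated many times
const line = 'BT /F1 12 Tf 72 700 Td (Lorem ipsum dolor sit amet) Tj ET\n';
const raw = new TextEncoder().encode(line.repeat(500));

deflateSize(raw).then((size) => {
  console.log(`raw: ${raw.length} bytes, deflated: ${size} bytes`);
});
```

Real page content is far less repetitive than this sample, so the ratio here is exaggerated. The direction of the effect is the same as in the figures reported above.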

@Hopding Any chance this PR gets accepted and released? Is there anything I can do to get this done?

0 reactions
momijizukamori commented, Aug 25, 2021

I’m also seeing huge file size increases with embedded pages. My use case is splitting an input PDF into sections and rearranging the pages for bookbinding. With a 3 MB, 296-page input file, for example, I’m getting outputs of 12 MB for 16-page sections (each of which is made up of 32 pages of the original doc, arranged two-to-a-sheet).

EDIT: After some more digging, it looks like part of my problem is embedded fonts, though the space embedded fonts take up in the generated documents is still much higher than in the original (1.3 MB versus 7 MB+).

EDIT2: After even more digging, it turns out some of my issue was user error. Every time you call PDFDocument.embedPdf(), it embeds a new copy of the fonts, and I was calling it for each page instead of embedding all the pages in one call and then reorganizing them afterwards. Rewriting my logic knocked down my file size significantly.
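
The restructuring described in EDIT2 could look roughly like the sketch below. It assumes the finding above holds (each `embedPdf()` call copies shared resources such as fonts again) and that `targetDoc` and `sourceDoc` are pdf-lib `PDFDocument` instances. The helper name `drawPagesInOrder` and the `pageOrder` parameter are made up for illustration.

```javascript
// Hypothetical helper: embed every needed source page in ONE embedPdf() call,
// then draw the embedded pages in whatever order the layout requires.
async function drawPagesInOrder(targetDoc, sourceDoc, pageOrder) {
  // Embed each distinct source page once, even if it is drawn several times
  const uniqueIndices = [...new Set(pageOrder)];
  const embeddedPages = await targetDoc.embedPdf(sourceDoc, uniqueIndices);

  // Map source page index -> embedded page for lookup while rearranging
  const byIndex = new Map(uniqueIndices.map((pageIndex, i) => [pageIndex, embeddedPages[i]]));

  for (const pageIndex of pageOrder) {
    const page = targetDoc.addPage();
    page.drawPage(byIndex.get(pageIndex));
  }
  return targetDoc;
}
```

For example, `await drawPagesInOrder(pdfDocument, contentPdfDocument, [3, 0, 1, 2])` would produce four pages in that order while calling `embedPdf()` only once.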

(Sorry for the extra notifs - I thought it was probably better to keep my findings here instead of just deleting my comment, in case someone else hits the same problem)
