Embedded pages are uncompressed

Hi,

I’m migrating our PDF generation tool from HummusJS to pdf-lib.

I have to merge two PDF pages into a new PDF. To achieve that, I embed the pages into a new PDF document. The generated file is significantly larger than the original one. When I compress the generated PDF, the file size is closer to what I expect.

I wrote a test case to show the issue:

import {readFileSync, writeFileSync} from 'fs'
import {PDFDocument} from 'pdf-lib'

(async () => {

    const pdfDoc = await PDFDocument.create();

    const pdfSource = await PDFDocument.load(readFileSync('./Lorem ipsum dolor sit amet.pdf'));
    const embeddedPage = await pdfDoc.embedPage(pdfSource.getPage(0));

    const page = pdfDoc.addPage();
    page.drawPage(embeddedPage);

    writeFileSync('embedded.pdf', await pdfDoc.save());

})();

The source file is a simple 200 KB PDF made with Word: Lorem ipsum dolor sit amet.pdf. The file generated by this script is 1.5 MB, which is 7.5x larger: embedded.pdf

I think that the LZW stream from the source file is decompressed before being embedded into the new PDF file.

Am I correct?

EDIT: I inspected both PDFs and found a FlateDecode stream in the source PDF that is stored decoded in the destination PDF.

Regards,

Julien
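
A note on the FlateDecode observation above: FlateDecode is the PDF name for zlib/deflate compression, so a stream written without that filter takes up its full raw size in the file. The following is only a rough sketch of that effect using Node's built-in zlib module. The content stream is made up for illustration (the operators and the repeat count are assumptions, not taken from the attached files), and it says nothing about how pdf-lib handles streams internally.

```javascript
const zlib = require('node:zlib');

// Hypothetical page content: repetitive text-drawing operators, similar in
// spirit to what a word processor emits. Not taken from the attached PDFs.
const line = 'BT /F1 12 Tf 72 700 Td (Lorem ipsum dolor sit amet) Tj ET\n';
const rawStream = Buffer.from(line.repeat(2000), 'latin1');

// A FlateDecode stream holds zlib-wrapped deflate data.
const flateStream = zlib.deflateSync(rawStream);

// Decoding must give back the original bytes exactly.
const roundTrip = zlib.inflateSync(flateStream);

console.log('raw bytes:     ', rawStream.length);
console.log('deflated bytes:', flateStream.length);
console.log('round trip ok: ', roundTrip.equals(rawStream));
```

Real page content is less repetitive than this sample, so the ratio will differ, but the direction matches the report: the same stream stored decoded is larger than its deflated form.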

tobiasfuhlroth commented, Jul 20, 2021

I also face a problem related to uncompressed embedded pages. In my scenario I want to apply a page background (letter paper) to every page of a content PDF. The only way I found to achieve this is to embed both the page background and all content pages into a new PDF and then draw them in the correct order on newly created pages.

export const applyPageBackground = async function (contentPdfBuffer, pageBackgroundPdfBuffer) {
    const contentPdfDocument = await PDFDocument.load(contentPdfBuffer)
    const pageBackgroundPdfDocument = await PDFDocument.load(pageBackgroundPdfBuffer)

    const pdfDocument = await PDFDocument.create()
    const embeddedLayoutPage = await pdfDocument.embedPage(pageBackgroundPdfDocument.getPage(0))
    const embeddedContentPages = await pdfDocument.embedPdf(contentPdfDocument, contentPdfDocument.getPageIndices())

    contentPdfDocument.getPages().forEach((contentPage, index) => {
        const page = pdfDocument.addPage()
        page.drawPage(embeddedLayoutPage)
        page.drawPage(embeddedContentPages[index])
    })

    return await pdfDocument.save()
}

So far, so good. But because the embedded pages are not compressed, the final PDF has a size of approx. 750 KB instead of 85 KB. Those values are for a 5-page PDF. If I do the same with 50 pages I end up with over 8 MB (with compression it’s down to 975 KB).

With the proposed change in the PR I end up with approx. 125 KB for the 5-page PDF, which is fine.

@Hopding Any chance this PR gets accepted and released? Is there anything I can do to get this done?

momijizukamori commented, Aug 25, 2021

I’m also seeing huge file size increases with embedded pages. My use case is splitting an input PDF into sections and rearranging the pages for bookbinding. With a 3 MB, 296-page input file, for example, I’m getting outputs of 12 MB for each 16-page section (each section is made up of 32 pages of the original doc, arranged two to a sheet).

EDIT: After doing some more digging, it looks like part of my problem is embedded fonts, though the amount of space embedded fonts take up in the generated documents is still much higher than in the original (1.3 MB versus 7 MB+).

EDIT2: After even more digging, it turns out some of my issue was user error. Every time you call PDFDocument.embedPdf() it embeds a new copy of the fonts, and I was calling it once per page instead of embedding all the pages in one call and then reorganizing them afterwards. Rewriting my logic knocked down my file size significantly.

(Sorry for the extra notifs - I thought it was probably better to keep my findings here instead of just deleting my comment, in case someone else hits the same problem)
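
The approach described in EDIT2 could look roughly like the sketch below. It is an illustration, not the commenter's actual code: `embedAllPagesOnce` and `pageOrder` are hypothetical names, and `targetDoc` and `sourceDoc` are assumed to be pdf-lib `PDFDocument` instances (for example from `PDFDocument.create()` and `PDFDocument.load(bytes)`). Only calls that already appear in this thread are used: `embedPdf`, `getPageIndices`, `addPage` and `drawPage`.

```javascript
// Hypothetical helper: embed every page of sourceDoc into targetDoc with a
// single embedPdf() call, then draw the pages in the requested order.
// targetDoc and sourceDoc are assumed to be pdf-lib PDFDocument instances.
async function embedAllPagesOnce(targetDoc, sourceDoc, pageOrder) {
  // One call for the whole source document. Per the comment above, calling
  // embedPdf() once per page embedded a fresh copy of the fonts each time.
  const embeddedPages = await targetDoc.embedPdf(
    sourceDoc,
    sourceDoc.getPageIndices()
  );

  // Rearranging happens afterwards, on the already-embedded pages.
  for (const sourceIndex of pageOrder) {
    const page = targetDoc.addPage();
    page.drawPage(embeddedPages[sourceIndex]);
  }

  return embeddedPages.length;
}
```

A call such as `await embedAllPagesOnce(outDoc, inDoc, [3, 0, 1, 2])` would draw the source pages in that order on new pages of `outDoc`. Laying out two pages per sheet would additionally need position and size options on `drawPage`, which this sketch leaves out.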
