
Why do PDFs generated from HTML with images consume significantly more storage than the images themselves, and how can I avoid this?

E.g.

This image is about 1.6 MB:
https://cdn.wallpapersafari.com/21/63/kGOzq7.jpg

<!DOCTYPE html>
<html lang="de">
    <head>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>Example</title>
        <style>
            body {
                position: relative;
                width: 148mm;
                height: 210mm;
                margin: 0;
                display: flex;
                flex-direction: column;
                font-family: "Times New Roman", Times, serif;
                border: 1px dashed black;
                user-select: none;
            }
            div.textarea {
                position: relative;
                flex-grow: 1;
                margin: 12mm 15mm 20mm 15mm;
            }
            img {
                width: 100%;
            }
            @media print {
                @page {
                    size: A5;
                    margin: 0;
                }
            }
        </style>
    </head>
    <body>
        <div class="textarea odd">
            <img src="https://cdn.wallpapersafari.com/21/63/kGOzq7.jpg" />
        </div>
    </body>
</html>

Please take a look at this HTML file. It includes the image linked above. If I print this page to a PDF file with Chrome/Chromium, the PDF is 5.3 MB, but the image is only 1.6 MB.

Why is the PDF so much larger?

2 Answers


  1. Generating a PDF from the provided image in Firefox seems to give a similar result (5.7 MiB), and looking at a hexdump reveals a possible cause:

    hexdump of the pdf, showing the image is encoded

    PDF stores images in its own stream encodings, and the browser apparently converts the image to one of them before embedding it. This can make the image several times larger, since it is no longer stored as a JPEG. You can confirm this isn’t a JPEG anymore: a JPEG stream starts with ff d8 ff (usually ff d8 ff e0 ..., followed by "JFIF"), which is nowhere to be found in this case.
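
The hexdump check above can also be scripted. The following is a minimal sketch, not a real PDF parser: it scans the raw bytes for image objects and reports which `/Filter` each one declares. `/DCTDecode` means the JPEG data was embedded as JPEG, while `/FlateDecode` means the pixels were re-encoded with zlib compression. The function name `image_filters` is hypothetical, and the textual scan assumes an unencrypted PDF whose image dictionaries appear directly before their `stream` keyword.

```python
import re


def image_filters(pdf_bytes):
    """Return the filter names declared by each image object in a PDF.

    Rough textual scan: an image XObject has "/Subtype /Image" in the
    dictionary that precedes its "stream" keyword.
    """
    found = []
    for obj in pdf_bytes.split(b"endobj"):
        # Everything before the first "stream" keyword is the object header.
        header, sep, _data = obj.partition(b"stream")
        if not sep or not re.search(rb"/Subtype\s*/Image\b", header):
            continue
        # Filter names end in "Decode" (DCTDecode, FlateDecode, ...).
        names = re.findall(rb"/(\w+Decode)\b", header)
        found.append([name.decode("ascii") for name in names])
    return found
```

To try it on a printed file, read the file in binary mode (for example `open("example.pdf", "rb").read()`, where the file name is a placeholder) and pass the bytes to `image_filters`.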

  2. JPEG images in particular should not change in size or content (and thus not in quality) when embedded in HTML. The only change that may be needed is conversion to a data URL because of CORS, which makes the image about 33% larger as Base64 text.

    When a "baseline" JPEG is wrapped for insertion into a PDF page, it should retain all its metadata, with no change to image rotation or native density.

    What can change during placement is the image frame: it may be scaled or rotated, or its aspect ratio changed, as here from widescreen to narrow.


    So the question is why PDF generation changes the quality or compression upwards or downwards (thus altering the byte size). The many methods boil down to one question: is the embedded image retained or changed?

    The "Print as PDF" process naturally changes the quality, but reusing the raw image in a freshly built PDF should maintain image quality.

    Thus the answer has to be: if preserving the source image is important, use true PDF generation only, not canvas reprinting. The important point about JPEG extraction from a PDF is that it is byte-for-byte perfect: what went in comes out exactly as it was. Other formats such as PNG, however, are converted during insertion, so there is never a GIF / XXX / PNG in a PDF, but there can be a JPEG in a PDF.

    So the narrowed image I show above will look like this when exported. But don’t use a graphics app like Paint, as it has to resave the image in its own format. Use a dedicated image extraction tool such as pdfimages, without specifying any density.


    Side note: you can save a lossless PNG byte for byte as a lossless JPEG (thus 100% quality), and that PNG-as-JPEG compression can be inserted in the PDF, so it is academic which format is bigger or smaller inside a PDF.
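
To make the "what went in comes out exactly as it was" point concrete, here is a minimal sketch of building a one-page PDF around an existing JPEG without re-encoding it, and then reading the same bytes back out. It is an illustration under stated assumptions, not how any browser or PDF library works internally: the function names `build_pdf_with_jpeg` and `extract_first_jpeg` are hypothetical, the JPEG is assumed to be 8-bit RGB, the caller supplies the pixel dimensions, and the page is sized at one point per pixel.

```python
import re


def build_pdf_with_jpeg(jpeg, width, height):
    """Wrap existing JPEG bytes in a one-page PDF, untouched (/DCTDecode)."""
    # Page content: scale the 1x1 image space to the page size and draw it.
    content = f"q {width} 0 0 {height} 0 0 cm /Im0 Do Q".encode()
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        (f"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 {width} {height}] "
         f"/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>").encode(),
        # Assumption: 8-bit RGB JPEG. The stream body is the file as-is.
        (f"<< /Type /XObject /Subtype /Image /Width {width} /Height {height} "
         f"/ColorSpace /DeviceRGB /BitsPerComponent 8 "
         f"/Filter /DCTDecode /Length {len(jpeg)} >>\nstream\n").encode()
        + jpeg + b"\nendstream",
        f"<< /Length {len(content)} >>\nstream\n".encode() + content + b"\nendstream",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset recorded for the xref table
        out += f"{number} 0 obj\n".encode() + body + b"\nendobj\n"
    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode()
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode()
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode()
    return bytes(out)


def extract_first_jpeg(pdf):
    """Return the first /DCTDecode stream, assuming a flat dictionary
    with a direct (non-referenced) /Length entry, as written above."""
    match = re.search(rb"<<[^<>]*/Filter\s*/DCTDecode[^<>]*>>\s*stream\r?\n", pdf)
    if match is None:
        raise ValueError("no DCTDecode stream found")
    length = int(re.search(rb"/Length\s+(\d+)", match.group(0)).group(1))
    return pdf[match.end():match.end() + length]
```

With this approach the PDF is only a few hundred bytes larger than the JPEG, because the image data is copied rather than decoded and recompressed. The extractor is deliberately narrow and is only meant for files shaped like the ones this sketch writes; for PDFs from other tools, a dedicated extractor such as pdfimages is the safer choice.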
