runlocally

runlocally engineering notes

Split PDF

How Split PDF is built

By Geppetto

These are the engineering notes for Split PDF: the technologies it is built on, what each one is, and how it is used in the tool.

Tech used

The PDF document model

A PDF is not a flat image of pages — it is a tree of objects: page objects that reference shared resources (fonts, images, content streams). Splitting one isn’t slicing bytes at a page boundary; it’s copying the selected page objects, together with everything they reference, into a new document. That’s why the tool uses a library that understands the object graph rather than cutting the file.
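To make the "everything they reference" point concrete, here is a toy sketch of collecting the objects reachable from a selected page. It is not pdf-lib's internal code: ToyObject, reachableFrom and the object ids are hypothetical names invented for illustration. A real copier also has to deal with back-references such as a page's Parent entry, which this toy ignores.

```typescript
// Toy model: each object lists the ids of the objects it references.
// The names here are hypothetical and are not pdf-lib types.
interface ToyObject {
  refs: string[];
}

// Returns the ids of the roots plus everything reachable from them,
// i.e. the set of objects that must travel with the selected pages.
function reachableFrom(graph: Map<string, ToyObject>, roots: string[]): Set<string> {
  const seen = new Set<string>();
  const stack = [...roots];
  while (stack.length > 0) {
    const id = stack.pop() as string;
    if (seen.has(id)) continue; // shared resources are visited only once
    const obj = graph.get(id);
    if (obj === undefined) throw new Error(`dangling reference: ${id}`);
    seen.add(id);
    stack.push(...obj.refs);
  }
  return seen;
}

// Two pages sharing one font; only page2 uses the image.
const toyPdf = new Map<string, ToyObject>([
  ["page1", { refs: ["fontA", "content1"] }],
  ["page2", { refs: ["fontA", "imageX", "content2"] }],
  ["fontA", { refs: [] }],
  ["imageX", { refs: [] }],
  ["content1", { refs: [] }],
  ["content2", { refs: [] }],
]);

const needed = reachableFrom(toyPdf, ["page2"]);
// needed holds page2, fontA, imageX and content2; page1 and content1 stay behind.
```

Cutting the file at a byte offset would have no way to know that fontA, stored elsewhere in the file, belongs with page2.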

pdf-lib

The work is done by pdf-lib, a pure-JavaScript PDF library. The source bytes come from file.arrayBuffer() and are parsed with PDFDocument.load(bytes); getPageCount() then reports the page total. To extract, a fresh document is made with PDFDocument.create(), and the chosen pages are deep-copied across with out.copyPages(src, indices), which returns page objects detached from the source and bound to the new document. Each copied page is appended with out.addPage(p). Finally, out.save() serializes the result to a Uint8Array, which the tool wraps in a Blob.
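The sequence of calls can be sketched as follows. This is a minimal sketch, not the tool's actual source. To keep it free of imports, the pdf-lib entry point is passed in as a parameter. The interfaces PdfDoc and PdfEntry and the function extractPages are hypothetical names. Only the method names (load, create, getPageCount, copyPages, addPage, save) come from pdf-lib. The bounds check is an added safeguard for the sketch.

```typescript
// Structural subset of pdf-lib used below. The interface names are
// hypothetical; the method names mirror pdf-lib's PDFDocument.
interface PdfDoc<Page> {
  getPageCount(): number;
  copyPages(src: PdfDoc<Page>, indices: number[]): Promise<Page[]>;
  addPage(page: Page): Page;
  save(): Promise<Uint8Array>;
}

interface PdfEntry<Page> {
  load(bytes: ArrayBuffer | Uint8Array): Promise<PdfDoc<Page>>;
  create(): Promise<PdfDoc<Page>>;
}

// Copies the pages at the given 0-based indices into a new document
// and resolves to that document's serialized bytes.
async function extractPages<Page>(
  pdf: PdfEntry<Page>,
  bytes: ArrayBuffer | Uint8Array,
  indices: number[],
): Promise<Uint8Array> {
  const src = await pdf.load(bytes);
  const total = src.getPageCount();

  // Validate before creating or touching the output document.
  for (const i of indices) {
    if (!Number.isInteger(i) || i < 0 || i >= total) {
      throw new RangeError(`page index ${i} is outside 0..${total - 1}`);
    }
  }

  const out = await pdf.create();
  const pages = await out.copyPages(src, indices); // copies bound to `out`
  for (const p of pages) out.addPage(p);
  return out.save();
}
```

Assuming pdf-lib's types line up with this structural subset, a call would look like extractPages(PDFDocument, await file.arrayBuffer(), indices), with PDFDocument imported from pdf-lib and the returned bytes wrapped as new Blob([bytes], { type: "application/pdf" }).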

Parsing the page range

The selection is a typed range like 1-3, 5. A small parser turns that 1-based spec into ascending, de-duplicated 0-based indices (via a Set): it accepts singletons and ranges, normalizes a descending range, and rejects anything malformed or out of range before any pages are touched.
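A parser with the behaviour described above might look like the sketch below. It is an illustration and not the tool's actual code: the function name parsePageRange, the exact grammar (comma-separated parts, optional spaces around the hyphen) and the error messages are assumptions.

```typescript
// Turns a 1-based spec such as "1-3, 5" into ascending, de-duplicated
// 0-based indices. Throws before returning anything if any part is
// malformed or falls outside 1..pageCount.
function parsePageRange(spec: string, pageCount: number): number[] {
  const picked = new Set<number>();

  for (const raw of spec.split(",")) {
    const part = raw.trim();
    // Either a single page ("5") or a range ("1-3").
    const m = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(part);
    if (m === null) {
      throw new Error(`Malformed page range: "${part}"`);
    }

    const a = Number(m[1]);
    const b = m[2] === undefined ? a : Number(m[2]);
    const lo = Math.min(a, b); // normalizes a descending range like "3-1"
    const hi = Math.max(a, b);

    if (lo < 1 || hi > pageCount) {
      throw new Error(`Page out of range: "${part}" (document has ${pageCount} pages)`);
    }
    for (let page = lo; page <= hi; page++) {
      picked.add(page - 1); // 1-based page number to 0-based index
    }
  }

  return Array.from(picked).sort((x, y) => x - y);
}
```

For a five-page document, parsePageRange("1-3, 5", 5) yields [0, 1, 2, 4], and "3-1" yields the same indices as "1-3". Because the bounds check runs before the inner loop, an absurd range such as "1-999999999" is rejected without iterating over it.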

Shell

Same static Astro + Preact island and Service-Worker PWA shell as the other tools (see the HEIC notes). pdf-lib runs on the main thread — there is no Web Worker, and, because the tool has no page-thumbnail rendering, no pdf.js either.

Implementation & operational notes

A fresh document, so it opens anywhere. Because the output is a new PDFDocument holding only the copied pages, it is a clean, valid PDF. The trade-off: source document-level metadata (title, author) isn’t carried over — only the pages are.

Encrypted PDFs are reported, not cracked. A load failure whose message matches /encrypt/i is surfaced as “password-protected”; any other failure as “not a readable PDF.” pdf-lib does not decrypt.
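The mapping from a load failure to a user-facing message can be sketched like this. The function name classifyLoadError and the LoadFailure type are hypothetical; only the /encrypt/i test and the two messages come from the notes above.

```typescript
type LoadFailure = "password-protected" | "not a readable PDF";

// Classifies whatever the PDF load step threw into one of the two
// user-facing messages, based only on the error text.
function classifyLoadError(err: unknown): LoadFailure {
  const message = err instanceof Error ? err.message : String(err);
  return /encrypt/i.test(message) ? "password-protected" : "not a readable PDF";
}
```

A caller would wrap the load in try/catch and show classifyLoadError(e) to the user. Matching on the message text is simple but brittle: if the library's wording changed, encrypted files would fall through to the generic "not a readable PDF" case.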

In memory. The whole file is read into memory and the output is built there before download (an object URL plus a synthetic <a download>), so the largest PDF the tool can handle is bounded by available RAM; there is no streaming.
