How to split a PDF file or PDF stream data using HummusJS and Node.js? (Part 1)

Mahek Chhabra
Jun 16, 2020


With the growing focus on data science and data analysis, the amount of data produced every day has grown from a few bytes to quintillions of bytes.

We work with files and PDFs all the time, and artificial intelligence plays a significant role in this area. AI is increasingly reducing human workload by analysing files and extracting data from them.

Each page of a document may serve a different purpose, so it is often crucial to split a document into individual pages and process each one without losing data. A single page can then be used for anything from training ML models to performing OCR, and much more.

What are we going to do

In this tutorial we use Node.js, a JavaScript runtime, to read, write and split a PDF file.

To follow along, you need the prerequisites listed below.

  • Node 8+
  • A Linux OS, Mac or Windows

Installing Node on your machine

Follow the step-by-step guide at https://nodejs.org/en/download/ to install Node and set up your environment.

Installing HummusJS node package

HummusJS is a Node.js module for creating, parsing and manipulating PDF files and streams.

Figure 1: Install HummusJS globally
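
The install command from the figure is not reproduced here. HummusJS is published on npm under the package name hummus, so a global install would typically be `npm install -g hummus`. Keep in mind that Node does not look in the global package folder when resolving require() unless NODE_PATH points to it, so a local `npm install hummus` inside the project may be needed. The small check below is only a sketch for confirming that the module can be found; the helper name isModuleInstalled is made up for this example.

```javascript
// Hypothetical helper: returns true when `moduleName` can be resolved
// from the location of this script, false otherwise.
function isModuleInstalled(moduleName) {
  try {
    require.resolve(moduleName);
    return true;
  } catch (err) {
    return false;
  }
}

console.log('hummus installed:', isModuleInstalled('hummus'));
```

If this prints false after a global install, installing the package locally in the project folder is the simplest fix.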

Read a PDF file and split into individual pages

Step 1: Import the ‘hummus’ module into your file.

Step 2: Open the input file with hummus's createReader() function, passing it the path of the input file.

Step 3: Call getPagesCount() to get the number of pages in the file.

Step 4: Loop over the total number of pages and create a new output file for each page.

Step 5: Use the createPDFCopyingContext() and appendPDFPageFromPDF() functions to copy each page from the input file into its output file.

Figure 2: Read and Split PDF File
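
The code from the figure is not reproduced here, so the five steps can be sketched roughly as follows. This is a minimal sketch, not the original listing: the function name splitPdfIntoPages, the page-N.pdf naming scheme and the optional hummusLib parameter are assumptions made for this example. The copying context is opened from the input file path, and page indexes passed to appendPDFPageFromPDF() are zero-based.

```javascript
const path = require('path');

// Split the PDF at inputPath into one file per page inside outputDir.
// `hummusLib` is a hypothetical optional parameter: it defaults to the real
// 'hummus' package (Step 1) and exists only so the function can be tried
// with a stand-in object when the native module is not available.
function splitPdfIntoPages(inputPath, outputDir, hummusLib = require('hummus')) {
  // Step 2: open the source PDF for parsing.
  const reader = hummusLib.createReader(inputPath);

  // Step 3: total number of pages in the source file.
  const pageCount = reader.getPagesCount();

  const outputPaths = [];
  // Step 4: create one output file per page.
  for (let pageIndex = 0; pageIndex < pageCount; pageIndex++) {
    const outputPath = path.join(outputDir, `page-${pageIndex + 1}.pdf`);
    const writer = hummusLib.createWriter(outputPath);

    // Step 5: copy this single page from the source into the new file.
    writer.createPDFCopyingContext(inputPath).appendPDFPageFromPDF(pageIndex);

    // Finish writing the output file.
    writer.end();
    outputPaths.push(outputPath);
  }
  return outputPaths;
}

// Example call (hypothetical paths; the output directory should already exist):
// splitPdfIntoPages('./input.pdf', './pages');
```

The function returns the list of files it wrote, which makes it easy to hand each page to a later step such as OCR.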

Result

By following these steps, you can split a PDF file into individual pages without any external splitting tool.

To protect the security and integrity of data, the same splitting can also be performed on stream data held in a buffer rather than on a file on disk. That approach is covered separately.
