How to split a PDF file or PDF stream data using HummusJS and Node.js? (Part 1)
As data science and data analysis have gained focus, the volume of data produced every day has grown from a few bytes to quintillions of bytes.
We work with files, and PDFs in particular, all the time, and artificial intelligence is playing a significant role in this area. AI is increasingly taking over manual work by analysing files and extracting data from them.
Each page of a document can serve a different purpose, so it becomes crucial to split a document into individual pages and process each one without loss of data. Individual pages can be used for anything from training ML models to performing OCR, and much more.
What are we going to do
In this tutorial we use Node.js, a JavaScript runtime, to read, write and split a PDF file.
To do so, we need the following prerequisites:
- Node 8+
- A Linux OS, Mac or Windows
Installing Node on your machine
Follow the step-by-step guide at https://nodejs.org/en/download/ to install Node and set up your environment.
Installing HummusJS node package
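HummusJS is a Node.js module for creating, parsing and manipulating PDF files and streams. It is published on npm under the name hummus, so it can be installed from your project directory with the command: npm install hummus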
Read a PDF file and split into individual pages
Step 1: Import the ‘hummus’ module into a file.
Step 2: Read the input file with the createReader() function of hummus, passing it the path of the input file.
Step 3: Call getPagesCount() on the reader to get the number of pages in the file.
Step 4: Loop over the total number of pages and create an output file for each page.
Step 5: Use the createPDFCopyingContext() and appendPDFPageFromPDF() functions to copy each page from the source file into its output file.
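The five steps above can be put together roughly as follows. This is a minimal sketch rather than a definitive implementation: the function name splitPdf, the page-N.pdf naming scheme and the optional third parameter are hypothetical choices made for illustration, and the output directory is assumed to exist already. The third parameter defaults to the installed hummus package (Step 1) and is only there so the function can be tried with a stand-in object.

```javascript
const path = require('path');

// Splits the PDF at inputPath into one single-page PDF per page and
// returns the list of files written.
// Assumption of this sketch: `hummus` is a parameter (defaulting to the
// installed 'hummus' package, Step 1) so a stand-in can be passed in tests.
function splitPdf(inputPath, outputDir, hummus = require('hummus')) {
  // Step 2: open the source PDF for parsing.
  const reader = hummus.createReader(inputPath);

  // Step 3: find out how many pages the source has.
  const pageCount = reader.getPagesCount();

  const written = [];
  for (let i = 0; i < pageCount; i++) {
    // Step 4: create one output file per page.
    // The page-N.pdf naming scheme is hypothetical; outputDir must exist.
    const outputPath = path.join(outputDir, `page-${i + 1}.pdf`);
    const writer = hummus.createWriter(outputPath);

    // Step 5: copy page i (page indexes are zero-based) from the source
    // into the new file, then finish writing it.
    writer.createPDFCopyingContext(reader).appendPDFPageFromPDF(i);
    writer.end();

    written.push(outputPath);
  }
  return written;
}
```

Calling splitPdf('./input.pdf', './pages') would then be expected to leave one file per page of input.pdf in the pages folder, with both paths being placeholders for your own.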
Result
By following these steps, you can split a PDF file into individual pages without using any external split tool.
To preserve the security and integrity of the data, the same splitting can also be performed on stream data held in a buffer rather than on files on disk; that approach is covered separately.