The key to unlocking your PDFs

Daniel Mitchell


What do the latest government legislation, your employment contract, and the menu from your local takeaway have in common?

Well, they all contain vital information, certainly; but they are also all more than likely to be distributed as PDFs.

PDFs are everywhere. I interact with them every day, as many office workers do. Within them is the pulse that keeps the world going round: invoices, purchase orders, price lists, product catalogues, personnel forms, shipping notes, company reports, presentations, and press releases.

And no wonder PDFs are popular: they're essentially the closest the digital world has come to a replacement for paper. They work irrespective of operating system, don't require specialist software, ensure fidelity with the original design, and protect (a bit) against the data being changed.

PDFs are a great way for people and companies to share information, but they are a terrible way of transferring data. They contain all this juicy information but it’s painful to access and use analytically; for a start, there’s no easy way to extract it accurately into spreadsheets or databases. As a result, it’s tough to bring modelling and analytical techniques to bear on the most popular file format in the world.

Why is it so hard? Well, essentially there are three problems:

1. It’s not really one file format, but many. Most strikingly, there’s a difference between native PDFs, which embed the text itself, and image-based PDFs, which are essentially photographs of pages and have to be run through OCR before any text can be read. This is a frustration rather than a real problem, but it needs to be remembered.

2. The information isn’t always in the same place. There are often few rules about even the order of information in a PDF, let alone exactly where it sits on the page. For instance, there’s no single structure for an invoice or a product catalogue; each company will produce them differently, and even over a short period of time one company can change the way it produces them.

3. The information can arrive in any of a number of different formats. The same or similar information may be presented as plain text, tables, charts, or infographics; and again, there’s little consistency between documents.
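To make the first of these problems concrete: a native PDF carries a selectable text layer, while an image-based PDF yields little or no text until it has been OCRed. A minimal triage sketch, assuming a library such as pdfplumber has already pulled whatever text layer exists (the threshold below is an illustrative assumption, not a standard value):

```python
def classify_pdf_page(extracted_text: str, min_chars: int = 20) -> str:
    """Crude heuristic: if a page yields almost no extractable text,
    treat it as image-based (a scan) and route it to OCR instead."""
    if len(extracted_text.strip()) >= min_chars:
        return "native"
    return "image-based"
```

In practice a single document can mix both kinds of page, so a check like this is usually run per page rather than per file.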

Each of these problems creates uncertainty. Writing code which can deal with so much uncertainty in the input is extremely challenging: it can take a long time, is very unlikely to deal with every eventuality, and even then it’s common to need human oversight to check the quality of the output. The other option - getting a team of people to transcribe the information manually - seems too resource-intensive, slow, and expensive to contemplate.

Our approach is a judicious combination of the two: looking to combine the flexibility of the human transcriber with the speed of an automated process. It’s a basic three-step system:

1. Use a human workforce to identify and label the key information you want to extract. While a purely workforce-led approach to extracting information from PDFs is slow because of all the rote transcription involved, a human workforce is well suited to finding and categorising the information quickly.

2. Use automated methods to extract what you can. By locating and categorising the information (e.g. as text or a table), you make the job of automatically transcribing it much easier. You can use relatively simple techniques which can be implemented in minutes.

3. Clean up and fill in the gaps with a workforce. The reality is that it will be necessary to have a layer of human oversight: to check the output of the automated extraction, to normalise across the various naming conventions or units found in different documents so the data from multiple PDFs can stack up for analysis, and because there’s no parser on the planet that can deal with some of the infographics and more baroque structures you find in some PDFs.
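The hand-off between steps 1 and 2 can be sketched as a dispatch table: once a worker has labelled a region as text or as a table, a deliberately simple parser is chosen for it. The region format and parser conventions below are illustrative assumptions, not a real API:

```python
from dataclasses import dataclass

@dataclass
class LabelledRegion:
    """A region a human worker has located and categorised (step 1)."""
    category: str   # e.g. "text" or "table"
    raw: str        # the raw content captured from that region

def parse_text(raw: str) -> str:
    # Simple text regions just need their whitespace tidied up.
    return " ".join(raw.split())

def parse_table(raw: str) -> list[list[str]]:
    # Assume one row per line, cells separated by "|" -- an
    # illustrative convention, not something PDFs guarantee.
    return [[cell.strip() for cell in line.split("|")]
            for line in raw.strip().splitlines()]

PARSERS = {"text": parse_text, "table": parse_table}

def extract(region: LabelledRegion):
    # The label chosen by the workforce picks the parser (step 2).
    return PARSERS[region.category](region.raw)
```

The point of the dispatch is that each parser stays trivial; the human label carries all the hard judgment about what kind of thing the region is.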

The principle here is to treat data extraction as a series of simple tasks rather than a single complex one. This allows the workforce and the automation to concentrate on the parts of the process they do well; the workforce triages, and then in a separate step provides oversight, while the computer can perform the transcription much faster than a human workforce can.
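As a sketch of the oversight step, automated output from different documents can be normalised onto one shared schema, with anything the normaliser doesn't recognise flagged for a human rather than guessed at. The synonym table here is an invented example; in practice it grows as new document layouts are encountered:

```python
# Map the field names different companies use onto one canonical schema.
# This table is an invented example, not a standard vocabulary.
FIELD_SYNONYMS = {
    "qty": "quantity", "quantity": "quantity",
    "amount": "total", "total": "total", "total due": "total",
}

def normalise(record: dict) -> tuple[dict, list[str]]:
    """Return the normalised record plus a list of fields that need
    human review because no mapping was found."""
    clean, needs_review = {}, []
    for key, value in record.items():
        canonical = FIELD_SYNONYMS.get(key.strip().lower())
        if canonical is None:
            needs_review.append(key)   # route to the workforce (step 3)
        else:
            clean[canonical] = value
    return clean, needs_review
```

Flagging rather than guessing is what lets the automated layer stay simple: every hard case becomes a small, well-defined task for a person.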

A well-designed data extraction process along these lines can provide a dataset which is accurate and complete, is flexible to the variety inherent in PDFs, and doesn’t require months of development time.

If you’d like to learn more about our methods, or to try them out on your own PDF documents, get in touch.