A session on rescuing usable data supplied in PDFs, led by Martin.
A client of one of the session participants needed an automated process to check which PDFs contained changed data, and which didn’t. They had been doing it manually. However, a computational solution isn’t as easy as it looks. For example, software often finds it hard to spot a table. It’s relatively easy to extract data from a table in a PDF if it clearly looks like a table, with borders around the “cells”. However, many tables in PDFs are clear to humans but not to computers. Extracting those sorts of tables is much trickier.
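As a rough illustration of the "which PDFs have changed data" check, here is a minimal sketch. It assumes the tables have already been pulled out of each PDF as lists of rows by one of the tools below. The function names `table_fingerprint` and `data_changed` are made up for this example and are not from the session.

```python
import hashlib


def table_fingerprint(rows):
    """Return a SHA-256 hex digest for a table given as a list of rows of cells."""
    lines = []
    for row in rows:
        # Treat missing cells as empty and collapse runs of whitespace, so
        # layout-only differences are not reported as data changes.
        cells = [" ".join(("" if cell is None else str(cell)).split()) for cell in row]
        lines.append("\x1f".join(cells))  # unit separator between cells
    return hashlib.sha256("\x1e".join(lines).encode("utf-8")).hexdigest()


def data_changed(old_rows, new_rows):
    """True if the two extracted tables differ after whitespace normalisation."""
    return table_fingerprint(old_rows) != table_fingerprint(new_rows)
```

Storing one fingerprint per PDF and comparing it against the next release would flag the files worth a human look. It only works as well as the extraction step does: if the tool misreads a table, the fingerprint will change even when the data has not.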
Some tools to extract table data
- AWS has a Textract tool which pulls out tables from PDFs. Azure has one, too. The Google equivalent is not as good.
- Docparser proved to be the solution that was needed.
- PDFPlumber is a Python library that works well in a Jupyter notebook, allowing you to extract elements from PDFs.
- Excalibur (may not be in active development)
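For the PDFPlumber route, a minimal sketch of pulling tables out of a PDF and turning them into CSV text might look like the following. The `borderless` flag and both function names are assumptions made for this example. The `"text"` strategies ask pdfplumber to infer rows and columns from text alignment rather than drawn lines, which may help with tables that only look like tables to humans, though results vary from document to document.

```python
import csv
import io


def extract_tables(pdf_path, borderless=False):
    """Return every table pdfplumber finds in the PDF, each as a list of rows."""
    import pdfplumber  # third-party: pip install pdfplumber

    # Without ruled borders, infer the grid from how the text lines up.
    settings = (
        {"vertical_strategy": "text", "horizontal_strategy": "text"}
        if borderless
        else {}
    )
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables(table_settings=settings))
    return tables


def rows_to_csv(rows):
    """Serialise one extracted table to CSV text, writing missing cells as empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()
```

In a notebook you would typically call `extract_tables("report.pdf")` (a placeholder filename), inspect the first few tables by eye, and only then switch on `borderless=True` if the default settings miss them.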
Why do we end up with PDFs?
People settle on using PDFs because it’s easy. But what do we move them to? Markdown? HTML has the advantage that there are lots of ways to export to it. People don’t use PDF because they think it’s the best format, but because it’s the best option on their dropdown menu.
Perhaps dual release of PDF and CSV (for the original data), for example, is a better option.
Publishing is a service. If you don’t design it as a service, you will end up with people doing their best from a place of relative ignorance or, at worst, passive-aggressive publishing. Let’s be a bit kinder to publishers. There’s some service design work needed here to make the publishing right.
However, once you get to hand-annotated data, you have a problem, as happened with one set of data about Manchester transport.
Open data needs to be mandated at the start of a contract, not three years in. There was some talk of standard open data clauses by the Institute for Government. They have been implemented in some departments, but not others.
There’s more than one type of PDF
The three types of PDFs you might encounter:
- Image PDFs
- Image PDFs with a text layer from OCR or other text extraction
- Text PDFs, created from the original source
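One rough way to guess which of the three types a page belongs to is to look at how much extractable text it has and whether a single image covers most of the page. The sketch below is a heuristic under stated assumptions, not a reliable classifier. The 0.9 threshold is arbitrary, and the image dictionaries are assumed to carry `x0`, `x1`, `top` and `bottom` keys, as the objects in pdfplumber's `page.images` do to the best of our knowledge.

```python
def image_coverage(images, page_width, page_height):
    """Fraction of the page area covered by the largest image (0.0 to 1.0)."""
    if not images or page_width <= 0 or page_height <= 0:
        return 0.0
    largest = max(
        (img["x1"] - img["x0"]) * (img["bottom"] - img["top"]) for img in images
    )
    return min(largest / (page_width * page_height), 1.0)


def classify_page(char_count, coverage, full_page=0.9):
    """Guess the page type from its text character count and image coverage."""
    scanned = coverage >= full_page  # assumed threshold for "a scan of the page"
    if scanned and char_count == 0:
        return "image"
    if scanned:
        return "image with text layer"
    if char_count > 0:
        return "text"
    return "unknown"
```

With pdfplumber, the inputs could come from `len(page.chars)`, `page.images`, `page.width` and `page.height`. A text PDF with a large background image would be misclassified, so treat the result as a hint for choosing a tool rather than a fact.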
Tools for creating the text layer
Does the file actually move to the service’s cloud? That creates some problems with sensitive data. So, there’s a need for local solutions, like PDFPlumber and Tesseract. But those come at a price: they take more configuration and more understanding to use.
Rekey errors – how prevalent are they? We’re not sure – but it’s a good argument for technical solutions, rather than rekeying data. If the extraction tools are more accurate than rekeying, it’s worth pursuing that route. Are there academic reports out there?
Could we use neural networks to extract patterns from large numbers of documents? Possibly – but we’re not there yet. It would be amazing if we were.
A couple more tools
- Import.io is a tool that some have used to extract data from HTML, but it can be fiddly and require constant changes if people change their publication formats.
- Google has an API that can do entity analysis on any document: Cloud Natural Language API