Extracting Open Data from PDFs in usable formats

A session on rescuing usable data supplied in PDFs, led by Martin.

A client of one of the session participants needed an automated process to check which PDFs contained changed data and which didn’t. They had been doing the check manually. However, a computational solution isn’t as easy as it looks. For example, software often struggles to recognise a table at all. Extracting data from a table in a PDF is relatively easy when it clearly looks like a table, with borders around the “cells”. But many tables in PDFs are obvious to human readers and not to computers, for instance when only spacing and alignment separate the columns. Extracting those tables is much trickier.
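To make the "which PDFs changed" check concrete, here is a minimal sketch in Python. It is not what the participant's client used. It assumes the text of each PDF has already been extracted by some other tool, and the function names `fingerprint` and `changed_documents` are made up for illustration. The idea is to hash the extracted text after collapsing whitespace, so that a re-export with different spacing is not reported as a data change.

```python
from __future__ import annotations

import hashlib
import re


def fingerprint(extracted_text: str) -> str:
    """Hash a document's text after collapsing runs of whitespace.

    Assumption: the text was already pulled out of the PDF elsewhere.
    """
    normalised = re.sub(r"\s+", " ", extracted_text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def changed_documents(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the names of documents that are new or whose text differs.

    Both arguments map a file name to that file's extracted text.
    """
    return sorted(
        name
        for name, text in current.items()
        if name not in previous or fingerprint(previous[name]) != fingerprint(text)
    )


if __name__ == "__main__":
    # Hypothetical sample data: a.pdf only changed its spacing,
    # b.pdf changed a figure, c.pdf is new.
    old = {"a.pdf": "Total  10\nCount 3", "b.pdf": "Total 5"}
    new = {"a.pdf": "Total 10 Count 3", "b.pdf": "Total 6", "c.pdf": "New"}
    print(changed_documents(old, new))
```

A comparison like this only says that something in the text changed. It does not say which table cell changed, and it is only as reliable as the text extraction feeding it, which is exactly where borderless tables cause trouble.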
