Open Data Horror Stories: 2017 Edition

There’s a tendency to focus on personal data as the major risk of open data. But there is more to it than that.

ODI Devon has made a policy of holding its meetings around the county. This avoids everything becoming Exeter-centric, but there is a cost to hiring the meeting rooms, and as they publish their spending as open data, it’s led to some criticism.

There’s lots of work going on around databases of GPs. That could be used to rank GPs on a simple scale – but that could be too simplistic. And there’s not really consumer choice in GPs, so how useful would that be? Could you end up with property price issues, as you do with schools?

Fun fact: there is no such thing as school catchments; there are only derived areas when the school is over-subscribed…

Trafford has a selective education system, with an exam splitting pupils between grammar and high schools. The net result? The grammars are full of children whose parents can afford tutors. So, people started looking at the ward by ward data, to move the discussion beyond anecdote, through use of a visualisation people could explore. The Labour councillors could see that their wards were being discriminated against in favour of people from outside Trafford – but then nothing really happened.

Data does not come with intent. But it can then enable dynamics which lead to inequality or gaming the system. Is it right, ethically, to withhold the data because of that? The instinct seems to be “no” – but the system needs to be looked at.

Personal data problems

If we cock up and release personal data, that’s on us. It’s not the fault of the open data system. It’s good that people examine how we spend money – because it’s their money! But it should be a dialogue, not a broadcast – let them come back and discuss what they find in the data.

Does open data make accidental personal data releases more likely?

Well, possibly, if you put deadline and statutory pressure on people, without the resources and expertise to do it well.

Matching data sets is one concern: you can de-anonymise data by matching sets together. It’s very complex to deal with. You don’t only have to think about your own data, but also be aware of what else is out there. That’s challenging. Pollution data is very easily connected with location and individual farms, for example. The converse risk is aggregating it upwards until it becomes meaningless.
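The matching risk can be sketched in a few lines. This is an invented illustration – the field names, records, and join keys are all assumptions – showing how an “anonymised” release with names stripped can be re-identified by joining it with a separate public dataset on shared quasi-identifiers:

```python
# Hedged sketch: how two individually "harmless" datasets can re-identify
# people. All records and field names here are invented for illustration.

# An "anonymised" release: names removed, but quasi-identifiers kept.
health_records = [
    {"postcode": "EX1 1AA", "birth_year": 1972, "sex": "F", "condition": "asthma"},
    {"postcode": "EX2 4BB", "birth_year": 1985, "sex": "M", "condition": "diabetes"},
]

# A separate public dataset (e.g. an electoral-roll-style register).
public_register = [
    {"name": "A. Smith", "postcode": "EX1 1AA", "birth_year": 1972, "sex": "F"},
    {"name": "B. Jones", "postcode": "EX2 4BB", "birth_year": 1985, "sex": "M"},
]

def link(records, register):
    """Join the two sets on the shared quasi-identifiers."""
    matches = []
    for r in records:
        key = (r["postcode"], r["birth_year"], r["sex"])
        candidates = [p for p in register
                      if (p["postcode"], p["birth_year"], p["sex"]) == key]
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((candidates[0]["name"], r["condition"]))
    return matches

print(link(health_records, public_register))
# When each quasi-identifier combination is unique, every record is re-identified.
```

The point is that neither dataset is sensitive on its own; the harm emerges only from the combination, which is why a publisher has to think about what else is already out there.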

There’s also the risk of releasing data that harms people economically.

Analysing the extent of risk

Astrophysics is rarely front-page news. Medical research is. Medical researchers can’t self-publish; in physics you can. Open data needs this – a sense of the potential damage a dataset can do. For some it will be negligible, for some it will be serious.

There are two dimensions worth considering:

  • Likelihood of risk, from unlikely to almost certain
  • Severity of risk, from minor boo boo to full-scale zombie outbreak

In some organisations, no data is released until it’s been analysed through that process. However, it assumes that you have experts with the knowledge to do it well. You also have issues of impartiality – reputational risk shouldn’t be a factor, but it will be for some organisations. Innate bias, whether political, racial or sexual, could influence the person making the decisions or scoring.
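The screening step described above can be sketched as a simple likelihood × severity matrix. The scales, scores, and threshold below are invented for illustration – they are not any real organisation’s policy:

```python
# Hedged sketch of a likelihood x severity risk screen for dataset release.
# Scale labels follow the two dimensions above; numeric scores and the
# threshold are illustrative assumptions.

LIKELIHOOD = {"unlikely": 1, "possible": 2, "likely": 3, "almost certain": 4}
SEVERITY = {"minor boo boo": 1, "moderate": 2, "serious": 3, "zombie outbreak": 4}

def risk_score(likelihood, severity):
    """Combine the two dimensions into a single score."""
    return LIKELIHOOD[likelihood] * SEVERITY[severity]

def release_decision(likelihood, severity, threshold=6):
    """Datasets scoring at or above the threshold need expert review first."""
    score = risk_score(likelihood, severity)
    return "release" if score < threshold else "expert review"

print(release_decision("unlikely", "minor boo boo"))  # low-risk dataset
print(release_decision("likely", "serious"))          # flagged for review
```

Even a toy version like this makes the impartiality problem visible: whoever sets the scores and the threshold is making the real decision.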

How do you balance this against the opportunity cost of NOT releasing the data?

There is a small number of high-walled reservoirs at high risk of catastrophic damage if they fail. The government won’t release which they are, because they could become terrorist targets – but equally, the people who live in the area at risk have no idea and can’t prepare.

Session Notes