Generative AI, large language models and open data

Alex Ivanov, a data scientist from Faculty, wanted to talk about some of the technology that has been making waves in the press recently.

Usefully, he started by defining a few terms. “LLMs are a subset of AI models,” he said. “They are trained on vast amounts of text data and they can learn the intricacies of human language to do things like answer questions or search databases. At heart, they are trained to predict the next piece of text.

“Generative AI is a wider thing that can create things that are new, including text, and images, and even drugs: they are very broad. So, in any AI, we are talking about a machine learning from data. And the main difference between normal AI and generative AI is the output.

“In traditional AI, we focus on data and classification, to predict things like whether someone will develop diabetes, or even house prices. Whereas with generative AI we create data that was not there already.

“Where open data comes in is that these models are often trained on big datasets, so it can provide the raw material. However, there are certain challenges. One is data quality. If you just pick up lots of data without thinking about its quality that can cause problems.

“Then, there is privacy. Most open data doesn’t identify individuals, but there are some cases where that can happen. You need standardisation to bring all these sources together. Scalability can be an issue. There are legal issues.

“And we need to think about transparency: some of these AIs are like black boxes, their outputs are almost like magic, so we need to understand what kind of output they are likely to have, and what impact that is likely to make.

“So, I’d like to think about how open data works in this context, and how we address some of these issues around transparency and bias.”

The first issue raised by people at the session was what kind of open data is available to feed generative AIs. It’s not just text: other types of data, such as images or audio, can also be used.

This led to a discussion of what kind of generative AIs would be most useful, in the sense of likely to make activities more efficient. One computer scientist pointed out that any development in technology is likely to displace existing activities and jobs: “So we should be asking how different this really is.”

However, another participant argued that in some contexts, such as healthcare delivery, there will need to be a discussion about where people should be used, and where remote delivery models are more appropriate.

Others felt this point may apply in many areas in which generative AI can displace experts. “An issue that is live in my organisation is using AI to develop code,” one of the public sector participants chipped in.

“But if you don’t employ coders to check what it is doing, that will cause problems. So we need to be careful about where we are going.” In particular, she added, because generative AI can produce “hallucinations”, or errors, or invented “facts”, it requires a different governance model: one that looks at outputs, rather than data inputs and coding processes.

Another participant looked back to the early days of social media, when people got together on platforms like Twitter to have generative conversations, and compared this with the situation today, when “people just shout at each other, and use it for marketing.”

To keep generative AI from following the same trajectory, it will be necessary to make sure that it is used to deliver valuable outputs, and not just to cut corners. In fact, there was considerable concern that the sheer volume of online and social media material that is being generated by bots and AIs will make it very hard to pick out what is real, correct and true.

And Alex suggested the problem could get worse as generative AIs are fed this material, creating a feedback loop. Using AI for specific things, instead of “employing these behemoth models”, might help to address the problem.

A further issue is that widespread and indiscriminate use of generative AI could lead to a “homogenisation” of popular outputs. One session participant suggested that AIs trained to produce predictable music that people will like will produce “drek.” There needs to be a way to maintain distinctiveness, and this is one area in which distinctive open data sets could help.

Summing up, Alex said, “I think we agree that this AI can be useful, but we need to be careful to check its outputs, and about the impact of those.” Perhaps, he suggested, there should be a “scary” commissioner for AI to oversee this new field, and make sure that data, processes, and checks are being properly applied.