Is any web content basically freeware that people can do anything they want with? That was the position of the Microsoft AI CEO Mustafa Suleyman.
How do Open Data Camp 9 attendees feel about this?
For some people, it wasn’t clearcut. What’s our responsibility around using and crediting content on the open web?
Others have been through this before — having to educate people about the copyright status of web images. (Copyright remains with the creator, unless an explicit licence is given.) The problem with the approach of many AI companies is that they’ve already done it: the content has been used without any such licence.
The ethics of AI training
Is it unethical? Some people think so — and they think it’s unfortunate that big corporate players using content this way normalises it. But how do you resolve this after they’ve already ingested millions of points of data? The horse has long bolted.
Should we be actively trying to steer people towards the tools that have trained their models ethically, on data they have legal permission to use? Even those companies are facing pressure to use more data. There’s a desperation to catch up with the market leaders, and ethics is getting lost along the way. But if what the market leaders are doing is made illegal, they can’t do it any more, and we could start to address these issues. If we stop it being possible to scrape the whole internet, the edge you get from doing that disappears.
However, we’re talking about huge, rich and powerful companies here. They can probably delay legislation for years through lobbying and legal challenges. One attendee has been on the receiving end of that before. You need to differentiate between a corporate being wilfully ignorant of concepts, and being unintentionally so.
The permissions you gave for your data a decade ago make little sense now, because you had no idea of the sorts of tools that could be applied to it today.
Have we already lost this battle? Some people thought so.
Is using AI risky?
Even personal productivity tools like Microsoft’s Copilot could be training themselves on corporate data that the user doesn’t actually have permission to use. Will they amplify the biases of staff members? In which case, where does liability lie? Can they guarantee that the data will stay, for example, within the UK? No.
There’s a push in artistic circles to poison their work against AI training. If this spreads, will companies start pressuring politicians to allow them to use the data they already store for companies, as long as it is anonymised? Certainly, people can see that happening.
There’s an inherent tension between the open data community, a loose aggregation of people seeking public good through opening things up, and big corporates seeking competitive advantage.
But some people are content creators too, and their work is being taken and used without permission or compensation. What might it take to get the big AI companies to step back from this? It might be something huge — but this conversation needs to be had.
The limits of LLMs
Here’s a heretical view: these LLMs are not very good. They throw up results like recommending glue as a pizza topping. Could this be a bubble that will burst in a couple of years, with the companies maximising shareholder profit by turning off those expensive servers…?
Perhaps — but even if it is a bubble, they’ve already normalised taking content in this way. We need to address that at the very least.
If we know that AI is wrong 15% of the time, we have to persuade people who don’t really understand data, but who want AI, that this is a problem. You can’t have people making decisions based on this data — what if you have to go to court to defend that decision? What if an FOI request comes in that requires you to give the justification for a decision?
We’ve already seen examples of machine learning algorithms making terrible decisions.
People still underestimate how often they just get things wrong. And they’re black boxes — we can’t see how they made a decision; we can only see the inputs and outputs. Do we want digital sociopaths living among us?