
AI and the open web: what the hell is happening?

Is any web content basically freeware that people can do anything they want with? That was the position of the Microsoft AI CEO Mustafa Suleyman.

How do Open Data Camp 9 people feel about this?

For some people, it wasn’t clearcut. What’s our responsibility around using and crediting content on the open web?

Others have been through this before — having to educate people about the copyright status of web images. (It remains the copyright of the creator, unless it is released under an explicit licence.) The problem with the approach of many AI companies is that they’ve already done it: the content has already been ingested.

The ethics of AI training

Is it unethical? Some people think so — and they think it’s unfortunate that big corporate players using content this way normalises it. But how do you resolve this after they’ve already ingested millions of points of data? The horse has long bolted.

Should we be actively trying to steer people towards the tools that have trained their models ethically on data they have legal permission for? Even they are facing pressure to use more data. There’s a desperation to catch up with the market leaders and ethics is getting lost on the way. But, if what the market leaders are doing is made illegal, they can’t do it any more. So we could start to address these issues. If we stop it being possible to scrape the whole internet, the edge you get from doing that disappears.

However, we’re talking about huge, rich and powerful companies here. They can probably delay legislation for years through lobbying and legal challenges. One attendee has been on the receiving end of that before. You need to differentiate between a corporate being wilfully ignorant of concepts, and being unintentionally so.

The permissions you gave for your data a decade ago make little sense now, because you had no idea of the sorts of tools that could be applied to it.

Have we already lost this battle? Some people thought so.

Is using AI risky?

Even personal productivity tools like Microsoft’s Copilot could be training themselves on corporate data that the user doesn’t actually have permission to use. Will it amplify the biases of staff members? In which case, where does liability lie? Can they guarantee that the data will stay, for example, within the UK? No.

There’s a push in artistic circles to poison their work against AI training. If this spreads, will companies start pressuring politicians to allow them to use the data they already store for companies, as long as it is anonymised? Certainly, people can see that happening.

There’s an inherent tension between the open data community, a loose aggregation of people seeking public good through opening things up, and big corporates seeking competitive advantage.

But some people are content creators too, and their work is being taken and used without permission and compensation. What might it take to get the big AI companies to step back from this? It might be something huge — but this conversation needs to be had.

The limits of LLMs

Here’s a heretical view: these LLMs are not very good. They throw up results like recommending glue as a pizza topping. Could this be a bubble that will burst in a couple of years, with the companies maximising shareholder profit by turning off those expensive servers…?

Perhaps — but even if it is a bubble, they’ve already normalised taking content in this way. We need to address that at the very least.

If we know that AI is wrong 15% of the time, we have to persuade other people, who don’t really understand data but want AI, that this is a problem. You can’t have people making decisions based on this data — what if you have to go to court to defend that decision? What if an FOI request comes in that requires you to give the justification for a decision?

We’ve already seen examples of machine learning algorithms making terrible decisions.

People still underestimate how often they just get things wrong. And they’re black boxes — we can’t see how they made the decision, we can only see the inputs and outputs. Do we want digital sociopaths living among us?

AI: Ethics and implications for open data

It’s almost impossible to talk about data without talking about AI now. And in some contexts, the large language models that underlie Generative AI can be very impressive. One attendee had some personal finance data in Excel. He took a screenshot of it, popped it into ChatGPT, and asked it if he was getting better or worse at saving. It worked — and was right. It had extracted the numbers from the screenshot, written some Python code, and run that to create the plot.
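
As a rough sketch, the code the model wrote behind the scenes might have looked something like this (the column names and figures below are invented for illustration, not the attendee’s actual data):

```python
# Hypothetical reconstruction of the kind of script ChatGPT might generate
# after pulling figures out of a screenshot. All numbers here are invented.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

savings = pd.DataFrame({
    "month": ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05", "2024-06"],
    "saved": [120, 150, 90, 200, 240, 260],  # amount saved each month (£)
})

# Fit a simple trend line: a positive slope suggests saving is improving
slope = np.polyfit(range(len(savings)), savings["saved"], 1)[0]
print(f"Trend: {slope:+.1f} per month", "(improving)" if slope > 0 else "(getting worse)")

savings.plot(x="month", y="saved", marker="o", title="Monthly savings")
plt.show()
```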

Obviously, whenever you’re working with AI you need to check it — they do hallucinate. For most use cases, there’s a need for editors and checkers. You’re not necessarily replacing people, just allowing them to produce more by letting the GenAI do the work, and checking it. One attendee always asks the AI to show him how it did the work.

To some degree, you need the same skills the AI was using to check it. So, is it actually helping people without these skills? Is this just an awkward stage the technology is going through? Will it ever end? There’s no understanding in these tools, they just give what they statistically predict is the correct answer based on what they’ve seen before.

The hidden ethical costs of AI

Remember that AI uses both electricity and water. To what extent is what we’re doing with them actually needed? There’s a climate cost. The biggest part of the cost comes from training, though, so the new race is towards the same quality of response from smaller training data, and thus smaller models. And smaller models mean you can fit them on phones.

Currently, these costs are partially being disguised by the fact that the companies aren’t — yet — passing those costs onto the consumer. And those emissions can never be clawed back. They’re out there now. And the climate cost gets worse with each new model.

The big models are stuck in time — at the moment training stopped. Each time they come out with a new model, they have to train from scratch again. And it’s a big assumption that things get better — people are finding ways of preventing their content being used for training, or even “poisoning” the data, so it hurts the training.

The data risks of AI

We need to make people more aware of the risks: everyone’s heard of AI, everyone wants to use it to make their lives easier. And so they upload, say, a legal document to get it summarised. What happens to that file then? Who knows?

The Excel example above was inherently anonymous. But if you’re going to upload data to ChatGPT, you need to strip out personally identifying information, if you’re going to stay ethical.
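
What does “stripping out” look like in practice? As a minimal sketch (and only a sketch: the patterns below are illustrative, and real anonymisation takes far more care than a few regexes), something like this catches the obvious identifiers before anything leaves your machine:

```python
# Minimal, illustrative scrub of obvious identifiers before sharing text with
# an external service. Real anonymisation needs far more than pattern matching:
# names, addresses and rare combinations of attributes all slip through this.
import re

PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "uk_phone": re.compile(r"\b(?:\+44\s?|0)\d{4}\s?\d{6}\b"),
    "ni_number": re.compile(r"\b[A-Z]{2}\d{6}[A-Z]\b"),  # UK National Insurance format
}

def redact(text: str) -> str:
    """Replace anything matching the patterns above with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label.upper()} REDACTED]", text)
    return text

print(redact("Call Jane on 07700 900123 or email jane.doe@example.com"))
# -> Call Jane on [UK_PHONE REDACTED] or email [EMAIL REDACTED]
# Note that "Jane" survives: names need more than a regex.
```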

The example of Air Canada’s AI chatbot giving a customer a discount that the company was then legally obliged to honour shows how the way you tune an AI can have an impact. There’s been a rapid switch from chatbots as public-facing tools to internal co-pilots, because of these legal risks.

Bias or oppression?

The world is biased — so training LLMs on general data makes them biased. It just exacerbates the existing bias in the data. Because of this, it’s very difficult to buy products off the shelf without knowing how they were trained. Many companies can’t afford to buy black boxes they can’t understand.

However, as one attendee pointed out, humans are black boxes even to themselves. We all make decisions every day based on biased data.

One attendee suggested that we should avoid the word “bias” — it’s actually a form of oppression. She recommends the book Data Feminism, which explores this. People are trying to address these issues, but occasionally they end up over-correcting. And it remains a persistent problem: the USA produces the most data in the world, so the models will lean towards US ways of being.

We need AI seatbelts

We’re in the phase where we’re driving cars without seatbelts. What will be the digital equivalent of the crashes that led to seatbelt legislation? It may already be happening: LLMs are being used to produce misinformation and extremist content. And so the companies behind the LLMs are working to stop them being used for that. The machines have no sense of ethics, so will produce material statistically likely to be harmful if asked.

There are plenty of ways of producing harmful materials online. But AI accelerates and scales that — one person can rapidly produce vast volumes of propaganda.

But, to go back to the seatbelt example, they’re an open design anyone can use. If OpenAI found a way of vaccinating its AI against producing propaganda, it probably wouldn’t share that with the market. Will we end up with a situation like we have with the internet, where there’s a web and a dark web, for things illegal on the “main” web? AI and DarkAI…?