Why regulators in Canada and Italy are digging into ChatGPT's use of personal information
OpenAI's chatbot is trained using data scraped from the open web
As governments rush to address concerns about the rapidly advancing generative artificial intelligence industry, experts in the field say greater oversight is needed over what data is used to train the systems.
Earlier this month, Italy's data protection agency launched a probe of OpenAI and temporarily banned ChatGPT, the company's AI-powered chatbot. On Tuesday, Canada's privacy commissioner also announced an investigation of OpenAI. Both agencies cited concerns around data privacy.
"You might say, 'Oh, maybe it feels a bit heavy handed,'" said Katrina Ingram, founder of Edmonton-based consulting company Ethically Aligned AI.
"On the other hand, a company decided that it was just going to drop this technology onto the world and let everybody deal with the consequences. So that doesn't feel very responsible as well."
Concerns about ChatGPT, transparency
Since it was released late last year, ChatGPT's ability to write everything from tweets to computer code has raised questions about its potential use in education and business. Similar AI products have been launched by Microsoft and Google in recent weeks.
These generative systems are trained to provide responses or generate output using data that is openly available on the internet, and it's not always clear what kind of information is included, experts say.
"One of the challenges right now is that I think we may not know enough about what's going on under the hood. An investigation can help to clarify that," said Teresa Scassa, Canada Research Chair in Information Law and Policy and a law professor at the University of Ottawa.
The lack of transparency has prompted organizations and governments to call for a slowdown, and even a pause, on launches of new generative AI projects.
OpenAI complied with Italy's request, and CEO Sam Altman tweeted, "we think we are following all privacy laws." European Union countries including France and Ireland have said they will examine Italy's findings on the issue, while Germany said it could block the service. Sweden has ruled out a ban on ChatGPT.
OpenAI published a blog post on Wednesday outlining its approach to safety and accuracy. The post also stated that "some" training data includes personal information. The data is not used to track users or advertise to them, but to make products more "helpful," according to the post.
The company said in the post that steps it has taken "minimize the possibility that our models might generate responses that include the personal information of private individuals."
Late last month, OpenAI said it fixed a "significant issue" that exposed some users' conversation history to a small subset of other users.
What data is scooped up?
Experts say there has been a lack of transparency around what data companies are using to train the large language models that underpin systems like OpenAI's ChatGPT.
According to Ingram, the systems are being trained with data that users have not specifically provided to the company. OpenAI says it uses a "broad corpus" of data, including licensed content, "content generated by human reviewers" and content publicly available on the internet.
"We didn't consent to any of this," Ingram said. "But as a byproduct of living in a digital age, we are entangled in this."
Information provided directly to OpenAI through ChatGPT may also be used to train AI, but that is disclosed in the product's terms of service, she said.
CBC News asked OpenAI what is included in the data used to train its products. In response, the company provided a link to the blog post published Wednesday.
'New version of an old controversy'
Philip Dawson, head of policy for Armilla AI — a tech company providing risk-mitigation products to companies using AI — says emerging concerns about data privacy in AI are a continuation of long-standing worries over online tracking by social networks and web companies.
"It's a new version of an old controversy. And it really calls into question some of the building blocks of large language models, which is really all about the vast amounts of data that these models are trained on and the computing power that enables that training," he said.
Dawson noted that companies are beginning to provide more information on the data sets used to train AI systems — especially as companies employing AI seek to avoid potential risk — but there's no requirement for them to do so.
Chatbot may provide inaccurate info
Whether sensitive personal data could appear in the output of a generative AI system is unclear. However, concerns have been raised about ChatGPT providing inaccurate information in response to queries.
In one example, an Australian mayor said on Wednesday that he may sue OpenAI if it does not correct false information shared about him by ChatGPT.
Brian Hood, the mayor of Hepburn Shire, became worried about his reputation after members of the public informed him that the chatbot named him as a guilty party in a foreign bribery scandal involving the Reserve Bank of Australia.
Lawyers representing Hood said that while he did work for a subsidiary of the bank, he was the person who notified authorities about the payment of bribes to foreign officials to win currency printing contracts.
OpenAI cautions that ChatGPT "may produce inaccurate information about people."
Is a ban on AI needed?
There's already precedent for cases of internet data harvesting violating privacy law, said Scassa. In 2021, Canadian privacy regulators found that American technology firm Clearview AI violated Canadian privacy laws by collecting photos of Canadians without their knowledge or consent.
Part of the challenge for tech companies, regulators and consumers is that laws vary from one jurisdiction to the next. While an American company scraping online data to train large language models may be legitimate in the U.S., the same rules may not apply in Europe.
"We can have whatever law we want in Canada, but we're ultimately dealing with a technology that's coming from another country and that may be operating by different norms," said Scassa.
Canada considers stronger rules on personal data use
A proposed Canadian law, Bill C-27, which is currently at second reading in the House of Commons, aims to strengthen rules about how personal data is used by tech companies. The Artificial Intelligence and Data Act, tabled alongside C-27, would also require technology companies to provide documentation on how their AI systems are developed and report compliance with prescribed safeguards.
The EU is also developing a regulatory framework for artificial intelligence that outlines high- and unacceptable-risk use scenarios with the aim of protecting users.
But many experts say that a ban on generative AI, or a moratorium like the one suggested last week in an open letter signed by a group of artificial intelligence experts, industry executives and Tesla CEO Elon Musk, is not necessarily the solution.
"I think a ban is a short-term solution at best," said Ingram, noting that a slowdown in new product releases may be warranted.
"We need to speed up the regulatory process and move a bit faster on that front. And we need to have more conversations with stakeholders, including just regular people who are encountering AI in various ways in their daily life."
Addressing AI threats a challenge
On Thursday, in response to the ban by Italy's privacy regulator, OpenAI said it had no intention of putting a brake on developing AI, but it reiterated the importance of respecting rules aimed at protecting the personal data of citizens in both that country and the EU.
Until stronger regulations are in place, Scassa, the law professor, worries that addressing AI's potential threats will be a challenge.
"There is a need for government to put in place something that will structure our response so that we set legal parameters that will help us govern AI," she said.
"I certainly think that this is a very pressing issue that until we have those frameworks in place, it will be very difficult to respond to and to shape AI."
With files from CBC News and Reuters