But because it used billions of conversations from real people, its problems soon went beyond sexually explicit comments and “verbally abusive” language:
It also soon became clear that the huge training dataset included personal and sensitive information. This revelation emerged when the chatbot began exposing people’s names, nicknames, and home addresses in its responses. The company admitted that its developers “failed to remove some personal information depending on the context,” but still claimed that the dataset used to train chatbot Lee-Luda “did not include names, phone numbers, addresses, and emails that could be used to verify an individual.” However, A.I. developers in South Korea rebutted the company’s statement, asserting that Lee-Luda could not have learned how to include such personal information in its responses unless they existed in the training dataset. A.I. researchers have also pointed out that it is possible to recover the training dataset from the AI chatbot. So, if personal information existed in the training dataset, it can be extracted by querying the chatbot.
To make things worse, it was also discovered that ScatterLab had, prior to Lee-Luda’s release, uploaded a training set of 1,700 sentences, which was a part of the larger dataset it collected, on Github. Github is an open-source platform that developers use to store and share code and data. This Github training dataset exposed names of more than 20 people, along with the locations they have been to, their relationship status, and some of their medical information…
[T]his incident highlights the general trend of the A.I. industry, where individuals have little control over how their personal information is processed and used once collected. It took almost five years for users to recognize that their personal data were being used to train a chatbot model without their consent. Nor did they know that ScatterLab shared their private conversations on an open-source platform like Github, where anyone can gain access.
What makes this unusual, the article points out, is how the users became aware of just how much their privacy had actually been compromised. “[B]igger tech companies are usually much better at hiding what they actually do with user data, while restricting users from having control and oversight over their own data.”
And: “Once you give, there’s no taking back.”