Security researchers are warning that data exposed to the internet, even for a moment, can linger in online generative AI chatbots like Microsoft Copilot long after the data is made private.
Thousands of once-public GitHub repositories from some of the world’s biggest companies are affected, including Microsoft’s, according to new findings from Lasso, an Israeli cybersecurity company focused on emerging generative AI threats.
Lasso co-founder Ophir Dror told TechCrunch that the company found content from its own GitHub repository appearing in Copilot because it had been indexed and cached by Microsoft’s Bing search engine. Dror said the repository, which had been mistakenly made public for a brief period, had since been set to private, and accessing it on GitHub returned a “page not found” error.
“On Copilot, surprisingly enough, we found one of our own private repositories,” said Dror. “If I was to browse the web, I wouldn’t see this data. But anyone in the world could ask Copilot the right question and get this data.”
After realizing that any data exposed on GitHub, even briefly, could potentially be surfaced by tools like Copilot, Lasso investigated further.
Lasso extracted a list of repositories that were public at any point in 2024 and identified the repositories that had since been deleted or set to private. Using Bing’s caching mechanism, the company found more than 20,000 since-private GitHub repositories still had data accessible through Copilot, affecting more than 16,000 organizations.
Affected organizations include Amazon Web Services, Google, IBM, PayPal, Tencent, and Microsoft itself, according to Lasso. For some affected companies, Copilot could be prompted to return confidential GitHub archives that contain intellectual property, sensitive corporate data, access keys, and tokens, the company said.
Lasso noted that it used Copilot to retrieve the contents of a GitHub repo — since deleted by Microsoft — that hosted a tool allowing the creation of “offensive and harmful” AI images using Microsoft’s cloud AI service.
Dror said that Lasso reached out to all companies that were “severely affected” by the data exposure and advised them to rotate or revoke any compromised keys.
None of the affected companies named by Lasso responded to TechCrunch’s questions. Microsoft also did not respond to TechCrunch’s inquiry.
Lasso informed Microsoft of its findings in November 2024. Microsoft told Lasso that it classified the issue as “low severity,” stating that this caching behavior was “acceptable.” Microsoft stopped including links to Bing’s cache in its search results starting in December 2024.
However, Lasso says that disabling the caching feature was only a temporary fix: Copilot still had access to the data, even though it was no longer visible through traditional web searches.