The Common Crawl dataset, which is used to train many artificial intelligence models, contains nearly 12,000 valid secrets, including API keys and passwords. Common Crawl, a non-profit organisation, maintains a vast open archive of petabytes of web data collected since 2008, free for anyone to use.
Because of the dataset's sheer scale, artificial intelligence initiatives including OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability may rely on the archive to train large language models (LLMs).
Truffle Security researchers discovered the live secrets after scanning 400 terabytes of data from 2.67 billion web pages in Common Crawl's December 2024 archive. They uncovered 11,908 secrets that authenticated successfully and had been hardcoded by developers, highlighting that LLMs could be trained on insecure code.
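Truffle Security's open-source TruffleHog scanner works broadly along these lines: pattern-match candidate credentials in raw content, then verify each candidate to weed out fake or revoked keys. The Python sketch below is a simplified, hypothetical illustration of that two-step idea, not the tool's actual code; the embedded key is AWS's own documented non-functional example.

```python
import re

# Hypothetical snippet of crawled page content with a hardcoded credential.
# The key below is AWS's published placeholder, not a real secret.
page_content = """
<script>
  // Developer mistake: credentials embedded directly in client-side code.
  const client = new AWS.S3({
    accessKeyId: "AKIAIOSFODNN7EXAMPLE",
    secretAccessKey: "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
  });
</script>
"""

# Well-known prefix pattern for AWS access key IDs; production scanners
# ship hundreds of detectors like this, one per credential type.
AWS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")

for match in AWS_KEY_ID.finditer(page_content):
    # A real scanner would now *verify* the candidate, e.g. by attempting
    # a harmless authenticated API call, to separate live secrets from
    # expired or fabricated ones -- the "authenticated successfully" step
    # described in the research.
    print(f"candidate secret found: {match.group(1)}")
```

The verification step is what makes the 11,908 figure notable: these were not merely strings that looked like keys, but credentials that still worked when tested.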
It should be noted that LLM training data is not used in its raw form; instead, it is cleaned and filtered to remove duplicate, malicious, irrelevant, and sensitive content. Despite these preprocessing efforts, it is difficult to guarantee that sensitive data is removed completely from a corpus of this scale, as the Truffle Security findings illustrate.
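As a minimal sketch of what one such cleaning pass might look like, the following hypothetical Python example redacts detected credentials before text enters a training corpus. The detector list and function name are assumptions for illustration; real pipelines combine deduplication, quality filtering, and many more detector types.

```python
import re

# Hypothetical detector list; real pipelines use far more patterns.
DETECTORS = [
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),     # AWS access key IDs
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),  # GitHub personal access tokens
]

def redact_secrets(text: str) -> str:
    """Replace any detected credential with a placeholder token."""
    for pattern in DETECTORS:
        text = pattern.sub("[REDACTED]", text)
    return text

# Usage: the AWS documentation placeholder key is scrubbed before training.
sample = 'api_key = "AKIAIOSFODNN7EXAMPLE"'
print(redact_secrets(sample))  # -> api_key = "[REDACTED]"
```

Pattern-based redaction like this only catches credential formats the pipeline already knows about, which is one reason live secrets can still slip through at web scale.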