Increase in Leaked Secrets Reported by GitGuardian in 2023
Quick take - GitGuardian reported more than 12 million secrets leaked in public GitHub repositories in 2023, a 113% increase from 2021. A new study responds with AssetHarvester, a static analysis tool that detects secret-asset pairs, reporting 97% precision and 90% recall while addressing the high false positive rates of existing detection tools.
Fast Facts
- GitGuardian reported over 12 million leaked secrets in public GitHub repositories in 2023, a 113% increase from 2021, with 1.7 million developers inadvertently exposing sensitive information.
- Commonly leaked secrets include database credentials and API keys, often hard-coded into applications, increasing the risk of exposure to malicious actors.
- Current secret detection tools struggle with high false positive rates (25% to 99%), complicating the identification of genuine threats and leading to developer alert fatigue.
- A new tool, AssetHarvester, was introduced to improve secret detection, achieving a precision of 97% and recall of 90%, with 0% false positives in its data flow analysis component.
- The study emphasizes the importance of asset information in secret management and suggests that AssetHarvester could be adapted for other programming languages and secret types.
Increase in Leaked Secrets on GitHub
In 2023, GitGuardian reported a significant increase in the number of secrets leaked in public GitHub repositories. Over 12 million instances of leaked secrets were documented, marking a 113% rise from 2021. Approximately 1.7 million developers, out of 14.9 million who contributed code, were found to have inadvertently exposed sensitive information. These secrets include database credentials and API keys, which are crucial for integrating with external services. Many developers resort to hard-coding them into application packages and version control systems, raising the risk of exposure to malicious actors.
Challenges with Current Detection Tools
Current secret detection tools report false positive rates ranging from 25% to 99%. At those rates, developers may start to ignore alerts and miss genuine warnings. These tools also typically do not report asset information, meaning the resource a secret protects, which makes it harder for developers to filter out false positives.
To address this gap, a new study introduced AssetHarvester, a static analysis tool designed to detect secret-asset pairs within repositories. The research identified four co-location patterns between secrets and their corresponding assets and used them to guide detection.
AssetHarvester’s Effectiveness
AssetHarvester combines three approaches: pattern matching, data flow analysis, and fast-approximation heuristics. To evaluate it, the researchers curated a benchmark dataset, AssetBench, consisting of 1,791 secret-asset pairs sourced from 188 public GitHub repositories. On this benchmark, AssetHarvester achieved a precision of 97%, a recall of 90%, and an F1-score of 94%. Notably, the data flow analysis component recorded 0% false positives, and the study reports that it can help improve the recall of secret detection tools.
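As a quick check on how these metrics relate, the sketch below computes the F1-score as the harmonic mean of precision and recall. The function name is illustrative and not taken from the study. Plugging in the rounded figures quoted above gives roughly 0.93; the small gap from the reported 94% is plausibly because the study computed F1 from unrounded precision and recall values, though that is an assumption.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the article: precision 97%, recall 90%.
print(round(f1_score(0.97, 0.90), 3))  # 0.934
```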
The research highlights the critical role of asset information in alleviating developer alert fatigue, emphasizing that secrets can protect both sensitive and non-sensitive assets. Database credentials emerged as the most frequently leaked secret type, posing detection challenges due to their varied identifier formats. The study specifically examined four database types: PostgreSQL, MySQL, MongoDB, and SQL Server, while excluding SQLite due to its file-based nature.
AssetHarvester's development involved formulating regular expressions for database connection strings and employing data flow analysis to track asset interactions. The study also examined how useful neighboring lines are for identifying the asset linked to a secret, finding that in a high percentage of cases the asset appears within three lines of the detected secret. When evaluated against existing secret detection tools, AssetHarvester generally achieved superior precision and recall.
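The two ideas in this paragraph can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AssetHarvester's implementation: the regular expressions, the function names, and the sample values are all hypothetical, and the sketch only covers URI-style connection strings and simple host assignments.

```python
import re
from urllib.parse import urlsplit

# Hypothetical pattern for URI-style connection strings
# (PostgreSQL, MySQL, MongoDB); not the study's actual expressions.
CONNECTION_URI = re.compile(
    r"\b(?:postgres(?:ql)?|mysql|mongodb(?:\+srv)?)://[^\s'\"<>]+",
    re.IGNORECASE,
)

# Hypothetical pattern for a host-like assignment such as DB_HOST = "example".
HOST_ASSIGNMENT = re.compile(
    r"(?:host|server)\w*\s*[:=]\s*['\"]?([\w.-]+)",
    re.IGNORECASE,
)


def find_secret_asset_pairs(text):
    """Pattern matching: pull (password, host) pairs out of connection URIs."""
    pairs = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in CONNECTION_URI.finditer(line):
            parts = urlsplit(match.group(0))
            if parts.password is None or parts.hostname is None:
                continue  # no embedded credential, so no secret-asset pair
            pairs.append(
                {"line": line_number, "secret": parts.password, "asset": parts.hostname}
            )
    return pairs


def find_nearby_asset(lines, secret_index, window=3):
    """Neighboring lines: return the closest host-like value within `window` lines."""
    start = max(0, secret_index - window)
    end = min(len(lines), secret_index + window + 1)
    candidates = []
    for i in range(start, end):
        match = HOST_ASSIGNMENT.search(lines[i])
        if match:
            candidates.append((abs(i - secret_index), match.group(1)))
    return min(candidates)[1] if candidates else None


# Made-up sample inputs for illustration only.
sample = 'DATABASE_URL = "postgresql://admin:s3cret@db.example.com:5432/orders"'
print(find_secret_asset_pairs(sample))

config = ['DB_HOST = "db.internal.example.com"', 'DB_USER = "admin"', 'DB_PASSWORD = "s3cret"']
print(find_nearby_asset(config, secret_index=2))  # db.internal.example.com
```

The window of three lines mirrors the observation reported above. A real tool would also need to handle other connection-string formats and cases where the secret and asset are joined only through data flow.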
The authors suggest that AssetHarvester could be adapted to other programming languages and to non-database secret types. Because the dataset is sensitive, the study addressed ethical concerns by distributing it selectively. The study also acknowledged limitations, including possible bias in manual analysis and the inherent challenges of data flow analysis across diverse programming languages.
Overall, this research contributes to the broader understanding of secret management practices and underscores the necessity for improved tools to assist developers in effectively managing sensitive information.