Disclaimer: I am the CEO of GitGuardian, which offers solutions for detecting, alerting and remediating secrets leaked within GitHub, therefore this article may contain some biases. GitGuardian has been monitoring public GitHub for over 3 years which is why we are uniquely qualified to share our views on this important security issue.
Security professionals are often overwhelmed by an army of vendors, many of which are equipped with disputable facts and figures, and favor the use of scare tactics. These professionals therefore prefer to leverage their network or peer recommendations to make buying decisions.
The information in this guide can be backed up by solid and objective evidence. If you’d like to share your comments on it, please email me directly at jeremy [dot] thomas [at] gitguardian [dot-com].
We’ve classified features in functional categories:
We will go through these requirements one by one and explain why they are important.
Monitoring your perimeter requires the ability to automatically associate repositories, developers and published code with your organization. There are millions of commits per day on public GitHub, how can organizations look through the noise and focus exclusively on the information that is of direct interest to them?
Organization repositories monitoring
These are the repositories that are listed under your company’s GitHub Organization, if your company has one. This only concerns companies which have open-source projects. Less than 20% of corporate leaks on GitHub occur within public repositories owned by organizations. The majority of the remaining leaks occur on developers’ personal repositories, and a small portion also occurs on IT service providers' or other suppliers’ repositories.
Developers’ personal public repositories monitoring
Around 80% of corporate leaks on GitHub occur on their developers’ personal public repositories. And yes, I’m really talking about corporate leaks, not personal ones.
In the vast majority of the cases, these leaks are unintentional, not malevolent. They happen for many reasons:
Developers typically have one GitHub account that they use both for personal and professional purposes, sometimes mixing the repositories.It is easy to misconfigure git and push wrong data.It is easy to forget that the entire git history is still publicly visible even if sensitive data has since been deleted from the actual version of source code.
Sensitive information that is leaked on the platform generally falls under two categories:
Secret: anything that gives access to a system: API keys, database connection strings, private keys, usernames and passwords. Secrets can give access to cloud infrastructure, databases, payment systems, messaging systems, file sharing systems, CRMs, internal portals, ...
It is very rare, in our experience, to see valid PII leaked on the platform, although we often see secrets giving access to systems containing PII.
Precision answers the question: “What is the percentage of sensitive information that you detect that is actually sensitive?”. This question is perfectly legitimate, especially in the context of SOCs being overwhelmed with too many false positive alerts.
Precision is easily measurable: the vendor sends alerts, and users can give feedback through a “true alert” / “false alert” button. Your vendor should be able to present precision metrics, backed by strong evidence records and well-defined methodology.
This one is a bit tougher than precision. Recall answers the question: "What is the percentage of sensitive information you failed to detect?". Having a high recall means having a small number of missed secrets. This question is also very important, considering the impact that a single undetected credential can have for an organization.
Recall is more complicated to measure than precision. This is because finding sensitive information in source code is like finding needles in a haystack: there are a lot more sticks than there are needles. You need to manually go through thousands of sticks in order to realize that you’ve missed a needle or two. A decent proxy for recall is the number of individual API keys and additional sensitive information supported by your vendor.
A good algorithm is able to achieve excellence in precision AND recall.
Ability to detect unprefixed credentials
Some secrets are easier to find than others, especially prefixed credentials that are strictly defined by a distinctive, unambiguous pattern.
The majority of published credentials however, don’t fall into this category. Therefore, any solution based entirely on prefix detection will miss a lot of leaked credentials. Your vendor must be able to detect Datadog keys or OAuth tokens for example, using techniques involving a combination of entropy statistics and sophisticated pattern matching applied not on the presumed key itself, but on its context.
When choosing keywords that you would like to be alerted on, make sure keywords are distinctive enough to be uniquely linked to your company. Good keywords are typically: internal project names (providing they are not common), internal URLs or a reserved IP address range (although not technically a keyword).
Do you want to know if a given keyword is distinctive enough? Try using the GitHub built-in search for an estimation of the results it would bring (this is just an estimation, as the GitHub search is rather limited).
A concrete example with “docker.com”. With over 597K source codes containing the keyword, it is not a good candidate for keyword matching.
When remediating GitHub leaks, you are in a race against time.
This is especially true when discussing leaked credentials (as opposed to Intellectual Property):
Any detection that involves human operators (typically filtering too many false positives) and is not fully automated is probably already too long to react. You must expect your vendor to come up with strong evidence that their reaction time is counted in minutes, not hours or days.
Facing a leak can be a tough process that requires speed, and knowledge from multiple people (typically Threat Response / Application Security / Developers).
Since the developer responsible for the leak is at the forefront of the issue, they can be your first responders. This is especially the case if your solution raised the alert fast enough after the leak occurred, so that your developer is still in front of their computer.
Integration with your remediation workflow
It is quite obvious that the solution should be integrated with your preferred SIEM, ITSM, ticketing system or chat.
One thing to keep in mind: if your organization is spread over multiple geographies / time zones, automatically associating a GitHub leak with a geography for incident dispatching purposes might not be always possible without first providing additional information to your vendor . This means that you will either have to provide your vendor with a list of your developers and their geographies, or indicate a single entry point that has global responsibilities for your vendor to alert you.
Collaboration with developers
Potential damage can rarely be estimated just by looking at the code surrounding the leaked credential, and sensitive corporate credentials are often leaked on developers’ personal projects. When alerted about a GitHub leak, your first remediation step will always be to check whether or not the developer is still working in your company, and to reach out to them with a questionnaire to gather input for impact assessment and incident prioritization.
Logging can answer multiple needs, depending of course on the data that is logged: post-incident analysis, reporting to management, demonstrating compliance to customers or auditors, security (audit trail), or transparency in what the solution is really doing.
I’d like to briefly insist on this last point: ask your vendor for proof points! An ideal solution would provide a detailed list of every monitored developer and repository, as well as logs of every single commit that was analyzed, and reproducible results of conducted scans.
As the CEO of GitGuardian, I’m always extremely grateful for the time and trust that security professionals give us. We’ve built a sales organization that is thoroughly trained to behave with extreme respect and professionalism. This is how we sell cybersecurity software at GitGuardian:
Previously published at https://blog.gitguardian.com/github-security/