Modernizing Secrets Scanning: Part 2–the Semantic Eureka

by Nikolai Khechumov, April 24th, 2023

Too Long; Didn't Read

Nikolai Khechumov is a security expert at Avito. He has been working on a project to improve secrets detection. In this post, he explains how to build a new type of secrets detection tool.


Hi! I’m Nikolai Khechumov from Avito’s Application Security team. This is the second part of our journey, where we are trying to improve secrets detection. In the previous part, we examined different types of secrets, understood the core problems, and hit a dead end. Now we are going to make a breakthrough and do something useful.

Imitating SAST

Okay, since we cannot easily use the existing tools, let’s try to understand how they work and review the first two steps any SAST tool takes to build an AST.


A very simplified scheme is shown below:


These two steps are “lexing” and “parsing.”


The lexing stage (also known as ‘tokenization’) receives code as just a stream of characters. It then finds very basic language-specific syntax constructions and outputs them as a set of typed tokens: small strings with a semantic interpretation. For example, the token ‘def’ is not just three characters but a Python keyword reserved for function declarations. We now have valuable insight into the purpose of every token, but the context itself is still missing.


The parsing stage adds valuable context. It combines tokens into higher-level language constructions and outputs them in a form called an “abstract syntax tree.” Now we have more structural detail: groups of tokens become variables, functions, and so on.
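To make the two stages tangible, Python ships both out of the box. This tiny sketch (purely illustrative, not part of any SAST product) shows the kind of structure a parser produces from a single assignment:

import ast

# Parse a single assignment into an abstract syntax tree
tree = ast.parse("b = 'hello'")

# The dump (simplified) shows tokens grouped into structure:
# Module(body=[Assign(targets=[Name(id='b')], value=Constant(value='hello'))])
print(ast.dump(tree))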

For us, the question is still the same: how do we get this kind of structure without building a full SAST tool ourselves?

Eureka!

One day, while making a presentation for a talk, I had a code snippet. I wanted to format it nicely, opened a browser, and typed “syntax highlighting online.” Then I opened the first link, pasted my code, pressed “highlight,” and… a lightning strike hit somewhere near me. What I had found was a tool that:


  • definitely performs tokenization under the hood
  • is light enough to run inside a web browser
  • supports dozens of languages and formats out of the box


Forget about the presentation. Let’s find something similar that runs in Python. And yes, I found it.

538 languages supported

I found Pygments, a fantastic library that solved the most significant problem: language-dependent tokenization. It still uses regexes under the hood, but those regexes are not written by me.


That’s a killer feature!


The library is about syntax highlighting, but it features RawTokenLexer, so we can output a raw stream of tokens together with their semantic meaning.



Our first problem — decreasing the number of strings for analysis — is solved. Now we understand the type of every token and can just ignore useless ones: keywords, punctuation, numbers, operators, etc., leaving only literals and comments. But we are still unable to understand the names and values of variables.
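As a minimal sketch of this idea (my own illustration, not the exact DeepSecrets code), Pygments can pick a lexer by file name, tokenize the content, and let us keep only strings and comments:

from pygments.lexers import get_lexer_for_filename
from pygments.token import Comment, String

code = "def main():\n    b = 'hello'  # greeting\n"

# A language-specific lexer chosen by file name
lexer = get_lexer_for_filename('example.py')

# get_tokens() yields (token_type, value) pairs
for ttype, value in lexer.get_tokens(code):
    # Keep only string literals and comments; ignore keywords,
    # punctuation, numbers, operators, and whitespace
    if ttype in String or ttype in Comment:
        print(ttype, repr(value))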

Variable Detection

The lexing problem seems to be solved. Now let's move on to parsing. Building a true AST looks a bit redundant for our problem, so let’s optimize.


Further research showed that the type of a token is more important than its value for variable detection. We need to look for patterns inside a stream of token types to detect variables. Once a pattern is found, we can run additional checks against the values of the tokens inside it.


To create a Token Type Stream (TTS), I took one character from the type name of each token. Thanks to Pygments, token types are common across languages and formats.
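Here is a rough sketch of how such a stream could be built on top of the Pygments token stream; the particular character mapping below is my own and only illustrates the idea:

from pygments.lexers import get_lexer_for_filename
from pygments.token import (Comment, Keyword, Name, Number,
                            Operator, Punctuation, String, Text)

# One character per major token category (illustrative mapping)
CATEGORIES = [(Name, 'n'), (Keyword, 'k'), (String, 's'), (Number, 'd'),
              (Operator, 'o'), (Punctuation, 'p'), (Comment, 'c'), (Text, 't')]

def to_tts(tokens):
    chars = []
    for ttype, _value in tokens:
        for category, char in CATEGORIES:
            if ttype in category:
                chars.append(char)
                break
        else:
            chars.append('x')  # anything we did not classify
    return ''.join(chars)

lexer = get_lexer_for_filename('example.py')
print(to_tts(lexer.get_tokens("b = 'hello'\n")))  # something like 'ntotssst'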


Eventually, a Variable Detection Rule (VDR) may contain the following (a rough sketch in code follows the list):


  • Reference pattern to look inside a TTS (regex with match groups)
  • MatchRules for match groups that check for expected values inside the pattern (e.g., useful for finding assignment punctuation)
  • MatchSemantics to clarify a group’s semantic purpose
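Put together, a VDR could be modeled roughly like this; the class and field names below are mine and only illustrate the structure, not the actual DeepSecrets implementation:

from dataclasses import dataclass, field

@dataclass
class VariableDetectionRule:
    # Regex with match groups, applied to the TTS
    reference_pattern: str
    # Expected token values per match group,
    # e.g. the assignment group must literally be '=' or ':'
    match_rules: dict = field(default_factory=dict)
    # Semantic purpose of selected groups
    match_semantics: dict = field(default_factory=dict)

assignment_rule = VariableDetectionRule(
    reference_pattern=r'(n)t*(o|p)t*(s)(s)(s)t',
    match_rules={2: {'=', ':'}},
    match_semantics={1: 'name', 4: 'value'},
)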


Example!

So let’s take our simple code snippet:

def main(arg: str):
  a = 3 
  b = 'hello'
  return f'{a}{b}'



Let’s represent the snippet as an array of tokens together with its TTS.


Basically, the variable of interest (b) sits between the 20th and the 25th token.

The reference pattern here should be something like this:

(n)t*(o|p)t*(s)(s)(s)t



Let’s apply it to the TTS:
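A small sketch of this step, assuming the TTS fragment for the line b = 'hello' comes out as ntotssst (one character per token, using the illustrative mapping above):

import re

# Hypothetical TTS fragment and the aligned token values for: b = 'hello'
tts = 'ntotssst'
values = ['b', ' ', '=', ' ', "'", 'hello', "'", '\n']

match = re.search(r'(n)t*(o|p)t*(s)(s)(s)t', tts)
if match:
    # Each token contributes exactly one TTS character,
    # so match positions map one-to-one back to token values
    name = values[match.start(1)]   # 'b'     -> variable name
    value = values[match.start(4)]  # 'hello' -> variable value
    print(name, '=', value)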






Match groups help us to apply additional verification logic.

And finally, a rule may be supplemented with our interpretation of each group’s semantics.









Summarizing everything

The cool part here is that this approach allows you to cover any language with variable detection using only 4-6 rules.

New Secrets Detection Rules

Semantic information about a given file enables deeper analysis in the right context. For example, we can now calculate the entropy of a variable’s value, knowing for sure that this specific substring really is a variable’s value.
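For reference, here is a generic Shannon entropy helper of the kind such a check could use (the exact formula and thresholds in DeepSecrets may differ):

import math
from collections import Counter

def shannon_entropy(value: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not value:
        return 0.0
    total = len(value)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(value).values())

print(shannon_entropy('hello'))                 # low: ordinary word
print(shannon_entropy('AKIAIOSFODNN7EXAMPLE'))  # higher: looks random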


Variable names are also important for analysis. I introduced two new rules:

  • Suspicious variable naming (low confidence)
  • Suspicious variable naming + high entropy of a variable's value

Now we can also use hashed secrets as a source of new rules. A hashed secret rule can contain information about the original secret length to improve performance.
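A hedged sketch of that idea: keep the hash together with the original length and hash a candidate substring only when the lengths match (the storage format here is purely my assumption):

import hashlib

# Hypothetical hashed-secret rules: (sha256 of the secret, original length)
HASHED_SECRETS = {
    ('2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824', 5),  # sha256('hello')
}

def matches_hashed_secret(candidate: str) -> bool:
    for digest, length in HASHED_SECRETS:
        # Cheap length check first, so most candidates are never hashed
        if len(candidate) != length:
            continue
        if hashlib.sha256(candidate.encode()).hexdigest() == digest:
            return True
    return False

print(matches_hashed_secret('hello'))  # True
print(matches_hashed_secret('world'))  # False: same length, different hash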

Outcomes

This approach became a giant step forward for Avito: we got a noticeable performance boost (up to 60%) and 200% more findings that we could not see before. Of course, the false-positive rate has also grown, but mostly due to test secrets; the findings themselves are semantically correct.

Limitations

  • Plaintext files without extensions or with incorrect extensions are obviously unsupported; tokenization falls back to “blind mode”
  • Variable Detection Rules (VDRs) are still a kind of compromise; they’re language-specific and may not guarantee absolute correctness (although we have not faced any problems with that)

We’re ready to share it with you!

Our code scanning model requires every scanner to be a web service that supports our internal protocol, so we decided to extract the core functionality and open-source it as a CLI tool called DeepSecrets.


You can find it here: https://github.com/avito-tech/deepsecrets

Release notes are in a separate article here.

Thank you!



The featured image for this piece was generated with stable diffusion v2.1

Prompt: Illustrate a computer screen with the caution symbol.