Understaning Chrome V8 - Chapter 23: Compiler Workflow Parse, AST, and Token

Welcome to other chapters of Let’s Understand Chrome V8

In this article, I will talk about how the Parse, AST, and Token work together, and walk you through the compilation as well as watch and see the detail. Specifically, you will see how to scan tokens, and how Parse consumes tokens for generating an AST tree.

1. Parse_Info

Below is the Parse_Info class. It’s the kernel class of the Parser, that manages JavaScript source code and an AST tree. Simply, the input of the Parser is JavaScript code and the output is an AST tree. But note that from the moment V8 loaded the JavaScript source code, the code has been encoded as UTF-16 already.

Let’s step into the Parse_Info.

1.  // A container for the inputs, configuration options, and outputs of parsing.
2.  class V8_EXPORT_PRIVATE ParseInfo {
3.  public:
4.  //omit...
5.  AstValueFactory* ast_value_factory() const {
6.    DCHECK(ast_value_factory_.get());
7.    return ast_value_factory_.get();
8.  }
9.  const AstRawString* function_name() const { return function_name_; }
10.  void set_function_name(const AstRawString* function_name) {
11.    function_name_ = function_name;
12.  }
13.  FunctionLiteral* literal() const { return literal_; }
14.  void set_literal(FunctionLiteral* literal) { literal_ = literal; }
15.  private:
16.   //------------- Inputs to parsing and scope analysis -----------------------
17.   const UnoptimizedCompileFlags flags_;
18.   UnoptimizedCompileState* state_;
19.   std::unique_ptr<Zone> zone_;
20.     v8::Extension* extension_;
21.     DeclarationScope* script_scope_;
22.     uintptr_t stack_limit_;
23.     int parameters_end_pos_;
24.     int max_function_literal_id_;
25.     //----------- Inputs+Outputs of parsing and scope analysis -----------------
26.     std::unique_ptr<Utf16CharacterStream> character_stream_;
27.     std::unique_ptr<ConsumedPreparseData> consumed_preparse_data_;
28.     std::unique_ptr<AstValueFactory> ast_value_factory_;
29.     const AstRawString* function_name_;
30.     RuntimeCallStats* runtime_call_stats_;
31.     SourceRangeMap* source_range_map_;  // Used when block coverage is enabled.
32.     //----------- Output of parsing and scope analysis ------------------------
33.     FunctionLiteral* literal_;
34.     bool allow_eval_cache_ : 1;
35.   #if V8_ENABLE_WEBASSEMBLY
36.     bool contains_asm_module_ : 1;
37.   #endif  // V8_ENABLE_WEBASSEMBLY
38.     LanguageMode language_mode_ : 1;
39.   };

In the above code, line 5 is the AST factory method; Line 13 the literal returns the AST tree; Line 17 the flags_ shows the reason why the function is not optimized; Line 33 the literal_ is responsible for holding the completed AST tree. As I mentioned the AST is an output of the parser, so if you want to watch the workflow of the parser in detail, you just debug it and keep watching the member variable literal_, in this way you are going to see the AST generation and how the factory method is called.

2. AST Tree

As I mentioned above, the variable literal_ represents an AST tree, which its declaration class is the FunctionLiteral.

1.  class FunctionLiteral final : public Expression {
2.   public:
3.    template <typename IsolateT>//省略很多代码...............
4.    MaybeHandle<String> GetName(IsolateT* isolate) const {  }
5.    const AstConsString* raw_name() const { return raw_name_; }
6.    DeclarationScope* scope() const { return scope_; }
7.    ZonePtrList<Statement>* body() { return &body_; }
8.    bool is_anonymous_expression() const {
9.      return syntax_kind() == FunctionSyntaxKind::kAnonymousExpression;
10.    }
11.    V8_EXPORT_PRIVATE LanguageMode language_mode() const;
12.    void add_expected_properties(int number_properties) {}
13.    std::unique_ptr<char[]> GetDebugName() const;
14.    Handle<String> GetInferredName(Isolate* isolate) {  }
15.    Handle<String> GetInferredName(LocalIsolate* isolate) const {}
16.    const AstConsString* raw_inferred_name() { return raw_inferred_name_; }
17.    FunctionSyntaxKind syntax_kind() const {}
18.    FunctionKind kind() const;
19.    bool IsAnonymousFunctionDefinition() const {  }
20.    int suspend_count() { return suspend_count_; }
21.    int function_literal_id() const { return function_literal_id_; }
22.    void set_function_literal_id(int function_literal_id) {  }
23.   private:
24.    const AstConsString* raw_name_;
25.    DeclarationScope* scope_;
26.    ZonePtrList<Statement> body_;
27.    AstConsString* raw_inferred_name_;
28.    Handle<String> inferred_name_;
29.    ProducedPreparseData* produced_preparse_data_;
30.  };

I would emphasize that the AST granularity is one function, in other words, a JavaScript function corresponds to only one AST tree. In the above code, line 26 member variable body_ holds the AST body, go through the body_ if you wonder about the details. Lines 4–15 set language mode as strict or sloppy.

Let’s look more at line 18 FunctionKind.

1.  enum FunctionKind : uint8_t {
2.    // BEGIN constructable functions
3.    kNormalFunction,kModule,kAsyncModule,kBaseConstructor,kDefaultBaseConstructor,
4.    kDefaultDerivedConstructor,
5.    kDerivedConstructor,
6.  //omit....
7.    kLastFunctionKind = kClassStaticInitializerFunction,
8.  };

You should see that the function kinds are not the types in the JavaScript books because the function kinds just represent the function kinds in a compiler. See compiler books for details.

3. AST Node Type

The following macro templates are non-exhaustive node type lists.

1.  #define DECLARATION_NODE_LIST(V) \
2.    V(VariableDeclaration)         \
3.    V(FunctionDeclaration)
4.  #define ITERATION_NODE_LIST(V) \
5.    V(DoWhileStatement)          \
6.    V(WhileStatement)            \
7.    V(ForStatement)              \
8.    V(ForInStatement)            \
9.    V(ForOfStatement)
10.  #define BREAKABLE_NODE_LIST(V) \
11.    V(Block)                     \
12.    V(SwitchStatement)
13.  #define STATEMENT_NODE_LIST(V)       \
14.    ITERATION_NODE_LIST(V)             \
15.    BREAKABLE_NODE_LIST(V)             \
16.    V(ExpressionStatement)             \
17.    V(EmptyStatement)                  \
18.    V(SloppyBlockFunctionStatement)    \
19.    V(IfStatement)                     \
20.    V(InitializeClassStaticElementsStatement)//omit
21.  #define LITERAL_NODE_LIST(V) \
22.    V(RegExpLiteral)           \
23.    V(ObjectLiteral)           \
24.    V(ArrayLiteral)
25.  #define EXPRESSION_NODE_LIST(V) \
26.    LITERAL_NODE_LIST(V)          \
27.    V(Assignment)                 \
28.    V(Await)                      \
29.    V(BinaryOperation)            \
30.    V(NaryOperation)              \
31.    V(Call)                       \
32.    V(YieldStar)//omit
33.  #define FAILURE_NODE_LIST(V) V(FailureExpression)
34.  #define AST_NODE_LIST(V)                        \
35.    DECLARATION_NODE_LIST(V)                      \
36.    STATEMENT_NODE_LIST(V)                        \
37.    EXPRESSION_NODE_LIST(V)

In the above code, see line 34, it organizes all the templates together to generate an AST tree for your JavaScript function in the parser which is a part of the compiler pipeline. By the macro AST_NODE_LIST, we should see that there are three types in an AST, and every type also has more subtypes. All of these types correspond to our JavaScript code and represent semantics accurately.

4. Token Type

The following macro templates are non-exhaustive token lists.

1.  #define TOKEN_LIST(T, K)                                           \
2.    T(TEMPLATE_SPAN, nullptr, 0)                                     \
3.    T(TEMPLATE_TAIL, nullptr, 0)                                     \
4.    /* BEGIN Property */                                             \
5.    T(PERIOD, ".", 0)                                                \
6.    T(LBRACK, "[", 0)                                                \
7.    /* END Property */                                               \
8.    /* END Member */                                                 \
9.    T(QUESTION_PERIOD, "?.", 0)                                      \
10.    T(LPAREN, "(", 0)                                                \
11.    /* END PropertyOrCall */                                         \
12.    T(RPAREN, ")", 0)                                                \
13.    T(RBRACK, "]", 0)                                                \
14.    T(LBRACE, "{", 0)                                                \
15.    T(COLON, ":", 0)                                                 \
16.    T(ELLIPSIS, "...", 0)                                            \
17.    T(CONDITIONAL, "?", 3)                                           \
18.  //omit.

The token is used in the scanner which is a part of the V8 compiler, the scanner is responsible for converting the JavaScript code to tokens. Let’s say a simple example, see line 5, the token type is PERIOD and corresponds to the dot in JavaScript code, which means that the scanner will convert every dot that it sees to a PERIOD token. Samely, lines 10 and 12 are light paren and right paren, which are used to describe the ( and ) in JavaScript code. So in every line, in every macro which starts with a T, the first parameter is the token type, the second is lexical corresponding to the type, and the last number is the token priority, the number is smaller, the token is higher, and will be converted early.

5. Finite-state Automation

In most compilers, the lexer and parser use the FSA(finite-state automation) to analyze the source code. In V8, it uses the switch-case to implement the FSA, let’s have a look at the lexer FSA, namely the token scanner.

1.  V8_INLINE Token::Value Scanner::ScanSingleToken() {
2.   Token::Value token;
3.   do {
4.     next().location.beg_pos = source_pos();
5.     if (V8_LIKELY(static_cast<unsigned>(c0_) <= kMaxAscii)) {
6.       token = one_char_tokens[c0_];
7.       switch (token) {
8.         case Token::LPAREN:
9.         case Token::RPAREN:
10.          case Token::LBRACE:
11.          case Token::RBRACE:
12.          case Token::LBRACK:
13.          case Token::RBRACK:
14.          case Token::COLON:
15.          case Token::SEMICOLON:
16.          case Token::COMMA:
17.          case Token::BIT_NOT:
18.          case Token::ILLEGAL:
19.            // One character tokens.
20.            return Select(token);
21.          case Token::CONDITIONAL:
22.            // ? ?. ?? ??=
23.            Advance();
24.            if (c0_ == '.') {
25.              Advance();
26.              if (!IsDecimalDigit(c0_)) return Token::QUESTION_PERIOD;
27.              PushBack('.');
28.            } else if (c0_ == '?') {
29.              return Select('=', Token::ASSIGN_NULLISH, Token::NULLISH);
30.            }
31.            return Token::CONDITIONAL;
32.  //omit
33.          case Token::WHITESPACE:
34.            token = SkipWhiteSpace();
35.            continue;
36.          case Token::NUMBER:
37.            return ScanNumber(false);
38.          case Token::IDENTIFIER:
39.            return ScanIdentifierOrKeyword();
40.          default:
41.            UNREACHABLE();
42.        }
43.      }
44.      if (IsIdentifierStart(c0_) ||
45.          (CombineSurrogatePair() && IsIdentifierStart(c0_))) {
46.        return ScanIdentifierOrKeyword();
47.      }
48.    } while (token == Token::WHITESPACE);
49.    return token;
50.  }

The ScanSingleToken() is responsible for generating a token. In lines 8–40, every case is a token.

The following code is the parser FSA, which consumes tokens and generates nodes for an AST tree.

1.  ParserBase<Impl>::ParseStatementListItem() {
2.    switch (peek()) {
3.      case Token::FUNCTION:
4.        return ParseHoistableDeclaration(nullptr, false);
5.      case Token::CLASS:
6.        Consume(Token::CLASS);
7.        return ParseClassDeclaration(nullptr, false);
8.      case Token::VAR:
9.      case Token::CONST:
10.        return ParseVariableStatement(kStatementListItem, nullptr);
11.      case Token::LET:
12.        if (IsNextLetKeyword()) {
13.          return ParseVariableStatement(kStatementListItem, nullptr);
14.        }
15.        break;
16.      case Token::ASYNC:
17.        if (PeekAhead() == Token::FUNCTION &&
18.            !scanner()->HasLineTerminatorAfterNext()) {
19.          Consume(Token::ASYNC);
20.          return ParseAsyncFunctionDeclaration(nullptr, false);
21.        }
22.        break;
23.      default:
24.        break;
25.    }
26.    return ParseStatement(nullptr, nullptr, kAllowLabelledFunctionStatement);
27.  }

In end, I would argue that FSA is an important fundamental in a compiler, which is worth examining in a little more depth.

Okay, that wraps it up for this share. I’ll see you guys next time, take care!

Please reach out to me if you have any issues. WeChat: qq9123013 Email: [email protected]

This story was first published here.