47,904 讀數

如何为乐趣和利润构建 WebAssembly 语言：解析

经过 Courier9 分钟 read2022/08/26

太長; 讀書

AST 是一种树状数据结构，它将标记组织成一个逻辑层次结构，可以更容易地转换为机器代码。因为 wispy 是一种 S 表达式语言，所以我们的代码本质上是一个 AST。在这篇文章中，我们将介绍编译器的下一阶段，解析。

Company Mentioned

featured image - 如何为乐趣和利润构建 WebAssembly 语言：解析

在里面最后发帖在这个关于如何构建 WebAssembly 编程语言的系列文章中，我们构建了一个词法分析器。在这篇文章中，我们将介绍编译器的下一阶段，解析。解析是我们编译器的一部分，它获取词法分析器生成的令牌流并将其转换为抽象语法树（AST）。

AST 是一种树状数据结构，它将标记组织成一个逻辑层次结构，可以更容易地转换为机器代码。值得庆幸的是，因为 wispy 是一种 S 表达式语言，所以我们的代码本质上已经是 AST。采用以下令牌流：

 “(“, “add”, “3”, “(“, “sub”, “2”, “1”, “)”, “)”

每组括号代表一个子树，其中第一个标记是操作符节点，后面的标记是它的操作数。如果我们在当前集合关闭之前遇到另一个左括号，我们知道它表示一个操作数，它本身就是一个子树。上面的令牌流将被组织成一个看起来像这样的树：

如果您有兴趣为更复杂的类 C 语法编写解析器，请参阅我以前的构建编程语言系列。

构建 AST

现在我们已经定义了 AST，是时候构建一个了。

创建一个新的src/parser.mts文件并添加我们将使用的所有导入：

 // src/parser.mts import { Token, IdentifierNode, TypedIdentifierNode, IdentifierToken, TypedIdentifierToken, FloatToken, FloatNode, IntToken, IntNode, AstNode, BlockNode, } from "./types/index.mjs";

现在我们可以定义我们的顶级解析函数。 parse 函数获取词法分析器生成的标记并返回一个BlockNode作为我们树的根。

 // src/parser.mts export const parse = (tokens: Token[]): BlockNode => { const blocks: BlockNode[] = []; // This loop is run as long as there are tokens to consume while (tokens.length) { // consumeTokenTree converts an array of tokens into a tree of tokens, more on that later. const tree = consumeTokenTree(tokens); // parseBlock turns our new tree of tokens into an actual BlockNode, recursively. More on that later as well. blocks.push(parseBlock(tree)); } // Finally we return the top level BlockNode return { type: "block", expressions: blocks, }; };

接下来，我们定义consumeTokenTree函数。 consumeTokenTree将一个扁平的令牌数组转换为一个令牌树。

鉴于这个纤细的表达：

 (add (sub 3 1) (sub 5 2))

词法分析器将产生这个标记数组：

 // Note: I've simplified the Token format to just be strings to keep things short ["(", "add", "(", "sub", "3", "1", ")", "(", "sub", "5", "2", ")", ")"];

consumeTokenTree将采用该平面数组并将其变成一棵树。这就像将一组括号标记()之间的每个标记放入一个数组中一样简单。所以我们上面的令牌数组变成了这个令牌树：

 ["add", [, "sub", "3", "1"], ["sub", "5", "2"]];

这是consumeTokenTree的实际定义：

 // src/parser.mts // This is token besides for the bracket tokens export type NonBracketToken = Exclude<Token, "parenthesis" | "square-bracket">; // The token tree is made of NonBracketTokens and other TokenTrees export type TokenTree = (NonBracketToken | TokenTree)[]; const consumeTokenTree = (tokens: Token[]): TokenTree => { const tree: TokenTree = []; // Ensures the first token is a left bracket and then discards it, defined below this function. consumeLeftBracket(tokens); while (tokens.length) { // Preview the next token const token = tokens[0]; // Check to see if the next token is a left bracket. if (token.type === "bracket" && getBracketDirection(token) === "left") { // If it is, we just ran into a sub-TokenTree. So we can simply call this function within // itself. Gotta love recursion. tree.push(consumeTokenTree(tokens)); continue; } // Check to see if the next token is a right bracket if (token.type === "bracket" && getBracketDirection(token) === "right") { // If it is, we just found the end of the tree on our current level tree.shift(); // Discard the right bracket break; // Break the loop } // If the token isn't a bracket, it can simply be added to the tree on this level tree.push(token); // Consume / discard the token from the main tokens array tokens.shift(); } // Return the tree. Don't forget to check out the helper functions below! return tree; }; const consumeLeftBracket = (tokens: Token[]) => { const bracketDirection = getBracketDirection(tokens[0]); if (bracketDirection !== "left") { throw new Error("Expected left bracket"); } return tokens.shift(); }; const getBracketDirection = (token: Token): "left" | "right" => { if (token.type !== "bracket") { throw new Error(`Expected bracket, got ${token.type}`); } // If we match a left bracket return left if (/[\(\[]/.test(token.value)) return "left"; // Otherwise return right return "right"; };

现在我们有了一个令牌树，我们需要把它变成一个块。为此，我们创建了一个parseBlock函数，该函数将树作为其输入并返回一个BlockNode ：

 1const parseBlock = (block?: TokenTree): BlockNode => { return { type: "block", // This is where the recursive magic happens expressions: block.map(parseExpression), }; };

您可能已经注意到， parseBlock使用尚未编写的parseExpression函数映射树的每个项目。 parseExpression采用TokenTree或NonBracketToken并将其转换为相应的AstNode类型。

这是定义：

 const parseExpression = (expression?: TokenTree | NonBracketToken): AstNode => { // If the expression is an Array, we were passed another TokenTree, so we can // pass the expression back to the parseBlock function if (expression instanceof Array) { return parseBlock(expression); } // The mapping here is pretty straight forward. Match the token type and pass the // expression on to a more specific expression parser. if (isTokenType(expression, "identifier")) return parseIdentifier(expression); if (isTokenType(expression, "typed-identifier")) return parseTypedIdentifier(expression); if (isTokenType(expression, "float")) return parseFloatToken(expression); if (isTokenType(expression, "int")) return parseIntToken(expression); throw new Error(`Unrecognized expression ${JSON.stringify(expression)}`); };

让我们定义isTokenType函数。这个函数非常简洁，展示了 TypeScript 最强大的特性之一，自定义类型守卫.简单地说， isTokenType测试表达式并将类型缩小到特定的TokenType 。这允许 TypeScript 确定我们正在将正确的标记传递给它们相应的解析器函数。

这是定义：

 export const isTokenType = <T extends Token["type"]>( item: TokenTree | NonBracketToken | undefined, type: T ): item is Extract<Token, { type: T }> => { return isToken(item) && item.type === type; }; const isToken = (item?: TokenTree | NonBracketToken): item is NonBracketToken => { return !(item instanceof Array); };

那里发生了很多事情，所以让我们来看看它。首先，我们有一个通用定义， <T extends Token["type"]> 。这本质上是说 T 必须是Token.type字段的可能值之一。 Typescript 足够聪明，知道这意味着 T 必须是"int" | "float" | "identifier" | "typed-identifier" | "bracket 。

下一段有趣的代码是返回类型谓词item is Extract<Token, { type: T }> 。该谓词告诉 TypeScript，如果isTokenType的返回值为 true，则item必须是其类型与作为type参数传递的字符串匹配的Token 。

在实践中，这意味着如果我们将未知的Token传递给isTokenType ，打字稿将能够正确地将值缩小为更具体的标记，例如IntToken 。

现在我们已经定义了自定义类型保护，我们可以定义实际的令牌解析器。前三个很简单；他们基本上只是返回令牌的副本或稍作修改的副本：

 const parseFloatToken = (float: FloatToken): FloatNode => ({ ...float }); const parseIntToken = (int: IntToken): IntNode => ({ ...int }); const parseIdentifier = (identifier: IdentifierToken): IdentifierNode => { return { type: "identifier", identifier: identifier.value, }; };

最后的解析器是parseTypedIdentifier 。请记住，类型化标识符的形式为identifier:type 。解析它就像用冒号分割字符串一样简单。返回数组的第一个值是identifier ，第二个是type 。这是定义：

 const parseTypedIdentifier = (identifier: TypedIdentifierToken): TypedIdentifierNode => { const vals = identifier.value.split(":"); return { type: "typed-identifier", identifier: vals[0], typeIdentifier: vals[1], }; };

这是完成的文件

这就是一个工作解析器所需的所有代码。在继续之前，让我们更新主src/index.mts文件以查看解析器的输出：

 // src/index.mts #!/usr/bin/env node import { readFileSync } from "fs"; import { lex } from "./lexer.mjs"; import { parse } from "./parser.mjs"; const file = process.argv[2]; const input = readFileSync(file, "utf8"); const tokens = lex(input); const ast = parse(tokens); console.log(JSON.stringify(tokens, undefined, 2));

构建并运行项目：

 npx tsc wispy example.wispy

如果一切顺利，输出应该如下所示这个.

这样，解析器就完成了。我们现在可以将来自词法分析器的标记流转换为 AST。在下一篇文章中，我们可以进入有趣的部分：生成和运行机器可读的代码。

立即申请演示，了解 Courier 如何使产品通知变得令人愉悦！

L O A D I N G
. . . comments & more!

About Author

Courier@courier

Courier simplifies triggering and sending notifications from your app with one API and drag and drop UI.

Read my stories

这篇文章刊登在...

Mentioned in this story

companies

Lexer

Join HackerNoon

Latest technology trends. Customized Experience. Curated Stories. Publish Your Ideas

如何为乐趣和利润构建 WebAssembly 语言：解析

Courier

@courier

太長; 讀書

Company Mentioned

Courier

更多关于 AST

构建 AST

About Author

標籤

这篇文章刊登在...

Mentioned in this story

如何为乐趣和利润构建 WebAssembly 语言：解析

Courier

@courier

太長; 讀書

Company Mentioned

Courier

更多关于 AST

构建 AST

About Author

標籤

这篇文章刊登在...

Mentioned in this story

相關故事