The Parsing Pipeline: Tokenizers and ASTs

This is the first chapter of a series on building a parser. We’ll construct a parser for a full programming language, with syntax reminiscent of Java or JavaScript.

Why Build a Parser?

Before we dive in, let’s address why you might want to build a parser from scratch:

To gain a deep understanding of language processing
For creating domain-specific languages (DSLs)
To build custom development tools
As a foundation for creating your own programming language

The Parsing Pipeline

Let’s break down the parsing process into its core components:

1. Tokenizer (Lexer/Scanner)

The tokenizer is the first stage of our parsing pipeline. Its job is to perform lexical analysis, breaking down the input string into a stream of tokens. Each token typically has a type and a value.

For example, given the input print "hello", the tokenizer produces:

Token 1: {type: "IDENTIFIER", value: "print"}
Token 2: {type: "STRING", value: "hello"}

The tokenizer doesn’t determine if a program is syntactically valid. It simply identifies and categorizes the basic elements of the language.

2. Parser

The parser takes the stream of tokens and performs syntactic analysis. It validates that the sequence of tokens follows the language’s grammar rules and constructs an Abstract Syntax Tree (AST) that represents the structure of the program.

Abstract Syntax Trees (ASTs)

An AST is a tree representation of the abstract syntactic structure of source code. In the tree above, the root node is the assignment operator (=). The left child is the variable X. The right child is an addition expression, which itself contains a multiplication expression — correctly representing that * has higher precedence than +.

This structure is what makes ASTs powerful: the tree shape encodes precedence and associativity, eliminating the need for parentheses or precedence tables during evaluation.

Tools of the Trade

Regular Expressions

While regular expressions aren’t sufficient for parsing entire programming languages, they play a crucial role in tokenization. We use patterns like \d+ to match numbers and "[^"]*" to match string literals.

Backus-Naur Form (BNF)

BNF is a notation technique used to describe the syntax of programming languages. Each rule defines how a construct can be built from simpler constructs, ultimately reaching terminal symbols (actual tokens). The grammar above shows how a Program is a list of statements, and each statement is an expression followed by a semicolon.

Types of Parsers

Parsers fall into two broad categories. Hand-written parsers — particularly recursive descent parsers — are what we’ll build. Each grammar rule maps directly to a function. Auto-generated parsers (LL, LR, PEG) are produced by parser generators from grammar specifications.

We’re choosing recursive descent because it’s the most intuitive: the code structure mirrors the grammar structure, making it easy to read, debug, and extend.

Our Initial Parser Implementation

Let’s dive into our initial parser that handles single numbers:

class Parser {
  constructor() {
    this.string = '';
  }

  parse(string) {
    this.string = string;
    return this.Program();
  }

  /**
   * Program
   *   : NumericLiteral
   *   ;
   */
  Program() {
    return this.NumericLiteral();
  }

  /**
   * NumericLiteral
   *   : NUMBER
   *   ;
   */
  NumericLiteral() {
    return {
      type: 'NumericLiteral',
      value: Number(this.string),
    };
  }
}

The Parsing Process

Initialization: When parse(string) is called, it stores the input string and initiates parsing by calling Program()
Grammar Rules: The parser implements two simple rules: Program -> NumericLiteral -> NUMBER
AST Construction: The NumericLiteral method creates a simple AST node with a type and value

Example Usage

const parser = new Parser();
const ast = parser.parse('42');
// Output: { type: 'NumericLiteral', value: 42 }

Limitations and Next Steps

This initial implementation can only parse single numbers. In the next chapter, we’ll add proper tokenization, support multiple expression types, implement operator precedence, and handle whitespace. Each addition builds upon this foundation, gradually transforming our simple number parser into a full-featured language parser.