Building a Parser: Numbers | Strings
In our previous post, we laid the groundwork with a basic parser and tokenizer. Now, let’s dive deeper into the architecture of recursive descent parsing.
The parser we’re building follows a two-stage approach: first breaking input into tokens through tokenization, then constructing an Abstract Syntax Tree (AST) from those tokens during parsing. Recursive descent parsing is particularly elegant because it maps grammar rules directly to parser methods, breaking down complex expressions into manageable pieces - like solving a puzzle systematically.
Today, we’ll extend this architecture to handle both numeric and string literals, making our parser more practical and powerful.
The Tokenizer and Token Types
Our tokenizer now handles two main token types:
- NUMBER: Sequences of digits
- STRING: Characters enclosed in double quotes
Enhanced Tokenizer Implementation
Let’s start by upgrading our tokenizer to handle both numbers and strings. Our tokenizer needs to recognize and extract these basic types from the input stream.
class Tokenizer {
init(string) {
this._string = string;
this._cursor = 0;
}
isEOF() {
return this._cursor >= this._string.length;
}
hasMoreTokens() {
return this._cursor < this._string.length;
}
getNextToken() {
if (!this.hasMoreTokens()) {
return null;
}
let char = this._string[this._cursor];
// Skip any whitespace
while (char === ' ' || char === '\t' || char === '\n') {
this._cursor++;
if (this.isEOF()) return null;
char = this._string[this._cursor];
}
// Numbers:
if (!Number.isNaN(Number(char))) {
let number = '';
while (!Number.isNaN(Number(this._string[this._cursor])) && !this.isEOF()) {
number += this._string[this._cursor++];
}
return {
type: 'NUMBER',
value: number,
};
}
// Strings:
if (char === '"') {
let str = '';
this._cursor++; // Skip opening quote
while (this._string[this._cursor] !== '"' && !this.isEOF()) {
str += this._string[this._cursor++];
}
this._cursor++; // Skip closing quote
return {
type: 'STRING',
value: str,
};
}
// Handle unexpected characters
throw new SyntaxError(`Unexpected token: "${char}"`);
}
}
Parser Grammar Rules
Our parser’s grammar has evolved to handle both types of literals:
Program
: Literal
;
Literal
: NumericLiteral
| StringLiteral
;
NumericLiteral
: NUMBER
;
StringLiteral
: STRING
;
Implementing the Parser
Let’s implement these grammar rules in our parser:
class Parser {
constructor() {
this._string = '';
this._tokenizer = new Tokenizer();
}
parse(string) {
this._string = string;
this._tokenizer.init(string);
// Prime the tokenizer to obtain the first token
this._lookahead = this._tokenizer.getNextToken();
return this.Program();
}
/**
* Program
* : Literal
* ;
*/
Program() {
return {
type: 'Program',
body: this.Literal(),
};
}
/**
* Literal
* : NumericLiteral
* | StringLiteral
* ;
*/
Literal() {
switch (this._lookahead.type) {
case 'NUMBER':
return this.NumericLiteral();
case 'STRING':
return this.StringLiteral();
default:
throw new SyntaxError('Literal: unexpected literal production');
}
}
/**
* NumericLiteral
* : NUMBER
* ;
*/
NumericLiteral() {
const token = this._eat('NUMBER');
return {
type: 'NumericLiteral',
value: Number(token.value),
};
}
/**
* StringLiteral
* : STRING
* ;
*/
StringLiteral() {
const token = this._eat('STRING');
return {
type: 'StringLiteral',
value: token.value,
};
}
_eat(tokenType) {
const token = this._lookahead;
if (token == null) {
throw new SyntaxError(
`Unexpected end of input, expected: "${tokenType}"`
);
}
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.type}", expected: "${tokenType}"`
);
}
// Advance to next token
this._lookahead = this._tokenizer.getNextToken();
return token;
}
}
Understanding the Implementation
The Lookahead Token
Our parser uses a lookahead token to predict which production to use. This makes it an LL(1) parser, meaning it needs to look ahead exactly one token to make parsing decisions.
// Prime the tokenizer to obtain the first token
this._lookahead = this._tokenizer.getNextToken();
Token Consumption
The _eat method is crucial for token consumption:
- Validates the current token type
- Advances to the next token
- Returns the consumed token
_eat(tokenType) {
const token = this._lookahead;
if (token == null) {
throw new SyntaxError(`Unexpected end of input, expected: "${tokenType}"`);
}
if (token.type !== tokenType) {
throw new SyntaxError(
`Unexpected token: "${token.type}", expected: "${tokenType}"`
);
}
this._lookahead = this._tokenizer.getNextToken();
return token;
}
Testing Our Parser
Let’s test our parser with various inputs:
const parser = new Parser();
// Testing numeric literals
console.log(parser.parse('42'));
// Output:
// {
// type: 'Program',
// body: {
// type: 'NumericLiteral',
// value: 42
// }
// }
// Testing string literals
console.log(parser.parse('"Hello, world!"'));
// Output:
// {
// type: 'Program',
// body: {
// type: 'StringLiteral',
// value: 'Hello, world!'
// }
// }
Error Handling
Our parser includes robust error handling for common cases:
- Unexpected end of input
- Unexpected token types
- Invalid literal productions
// This will throw: Unexpected token: "a", expected: "NUMBER"
parser.parse('abc');
// This will throw: Unexpected end of input, expected: "STRING"
parser.parse('"unclosed string');
Next Steps
While our parser can now handle basic numeric and string literals, there are several improvements we could make:
- Support for single quotes in strings
- Handling of escape sequences in strings
- Support for floating-point numbers
- Adding support for hexadecimal and octal numbers
In our next post, we’ll explore adding support for mathematical expressions and operator precedence.
Exercise for Readers
Try extending the parser to support single-quoted strings. Here’s a hint:
// In the tokenizer's getNextToken method:
if (char === '"' || char === "'") {
// Handle both quote types
}