A parser type is a category of parsing techniques used in compiler design and computer science to analyze a string of input (like source code) and determine its grammatical structure.
The parser, or syntax analyzer, takes the sequence of tokens generated by the lexical analyzer and builds a data structure, typically a parse tree or abstract syntax tree (AST), that represents the syntactic structure of the input based on a formal grammar.
Parsers are broadly classified into two main types based on how they construct the parse tree: top-down and bottom-up. These two approaches represent fundamentally different strategies for analyzing the input and verifying its conformance to the language's grammar.
1. Top-Down Parsers
Top-down parsers build the parse tree starting from the top, or the root of the tree, which represents the grammar's start symbol. The parser expands the non-terminals by repeatedly applying grammar productions to derive the terminal symbols, which form the input string. It is a goal-driven approach where the parser predicts the next production rule based on the current non-terminal and the next input token(s).
Key Characteristics:
- Derivation: Uses leftmost derivation to generate the input string.
- Approach: Expands non-terminals downwards toward the leaves of the parse tree.
- Predictive: In many variants, it predicts the next production based on a "lookahead" of one or more input symbols.
- Grammar Limitations: Left-recursive grammars can cause infinite loops, and they are generally less powerful than bottom-up parsers.
Sub-types of top-down parsers
Recursive Descent Parser
- Methodology: The most straightforward top-down parsing technique, it consists of a set of recursive functions, one for each non-terminal in the grammar.
- Process: Each function is responsible for matching the right-hand side of its corresponding production rule. If the grammar is non-deterministic, it may involve backtracking (brute-force).
- Predictive Recursive Descent: A non-backtracking version that uses a lookahead to select the correct production rule.
Non-Recursive Predictive Parser (LL Parser)
- Methodology: A table-driven, non-backtracking top-down parser that uses a stack to manage non-terminals and an LL parsing table to make decisions.
- Naming: The "LL" stands for scanning the input from Left-to-right and constructing a Leftmost derivation.
- Variations (LL(k)): The number k indicates the number of lookahead symbols used to make parsing decisions.
LL(1)is a common variant. - Process: It shifts tokens onto a stack and uses the parsing table and lookahead symbol to decide which grammar rule to apply next.
2. Bottom-Up Parsers
Bottom-up parsers build the parse tree starting from the input string (the leaves of the tree) and work their way up to the grammar's start symbol (the root). This technique is often referred to as "shift-reduce" parsing because its fundamental actions are shifting input tokens onto a stack and reducing a sequence of symbols on the stack to a non-terminal.
Key Characteristics:
- Derivation: Constructs a rightmost derivation in reverse order.
- Approach: Builds the parse tree upwards from the terminal symbols toward the root.
- Powerful: Can handle a larger class of grammars compared to top-down parsers.
Sub-types of bottom-up parsers
Shift-Reduce Parser
- Methodology: The general class of bottom-up parsers that uses a stack and an input buffer.
- Actions: The parser performs one of four actions:
- Shift: Move the next input token onto the stack.
- Reduce: When a handle (a right-hand side of a production rule) appears on top of the stack, replace it with the non-terminal on the left-hand side of the rule.
- Accept: The parse is successful when the input is consumed and only the start symbol remains on the stack.
- Error: An error occurs if no other action is possible.
LR Parsers
- Methodology: A highly efficient and powerful class of deterministic shift-reduce parsers. The "LR" stands for scanning the input from Left-to-right and constructing a Rightmost derivation in reverse.
- Process: LR parsers are table-driven, using a parsing table and a stack to make shift-reduce decisions. The lookahead symbol is crucial for this.
- Variations (LR(k)): The number k specifies the number of lookahead symbols.
LR(1)is more powerful thanLR(0).
Common LR Parser Variations:
- Simple LR (SLR(1)): A simpler and less powerful variant of LR parsing that requires a single lookahead symbol. It is easier to implement but works for a more restricted set of grammars.
- Canonical LR (CLR(1)): The most powerful type of LR parser. It uses a larger parsing table to make more precise parsing decisions based on the full lookahead symbol. It can handle a wider range of grammars but requires more memory.
- Look-Ahead LR (LALR(1)): A compromise between SLR and CLR. It is as powerful as CLR(1) for many practical grammars but has a smaller table size, making it a widely used choice for parser generators like YACC.
Operator-Precedence Parser
- Methodology: A simple, table-driven bottom-up parser used for a specific class of grammars where the precedence of operators can be explicitly defined.
- Limitation: It is a less powerful technique and is generally used for parsing expressions.
3. Other Parsing Techniques
Universal Parsers
- Earley Parser: A general-purpose top-down algorithm that can parse any context-free grammar, including ambiguous grammars. While powerful, it is typically slower than deterministic parsers.
- CYK Algorithm: A parsing algorithm for context-free grammars that can be transformed into Chomsky Normal Form. It is a versatile but less efficient technique.
- Chart Parser (Generalized LR - GLR): A parsing algorithm that can handle ambiguous grammars by exploring all possible parse trees simultaneously, resulting in a "parse forest." It acts as a hybrid of top-down and bottom-up methods.
Comparison of Top-Down and Bottom-Up Parsers
| Feature | Top-Down Parsing | Bottom-Up Parsing |
|---|---|---|
| Parsing Direction | Begins with the start symbol (root) and works downwards toward the terminals (leaves). | Begins with the input string (leaves) and works upwards toward the start symbol (root). |
| Derivation | Uses leftmost derivation. | Constructs a rightmost derivation in reverse. |
| Strategy | Goal-driven: Tries to guess the correct production rule to apply. | Data-driven: Identifies a right-hand side of a production rule and reduces it. |
| Grammar Class | Restricted to LL(k) grammars, which cannot be left-recursive or highly ambiguous. | More powerful, can handle a larger class of grammars, including left-recursive grammars. |
| Implementation | Often easier to hand-code, especially recursive descent. | Complex to implement by hand; typically generated by tools like YACC/Bison. |
| Examples | Recursive Descent, LL(1) Parser. | Shift-Reduce, LR Parsers (SLR, CLR, LALR). |
The Role of Lookahead
Lookahead is a crucial concept for deterministic parsers. The number of lookahead tokens (k) determines how much context the parser has to make a decision.
- LL(k): A top-down parser that uses k lookahead symbols.
- LR(k): A bottom-up parser that uses k lookahead symbols. LR parsers are generally more powerful for the same value of k because their decisions are made later in the process, with more context.
Applications of Parsers
Beyond compiler design, parsers are used in a vast array of computing applications:
- Web Browsers: Parsing HTML, CSS, and JavaScript to render web pages.
- Data Processing: Extracting data from structured formats like JSON, XML, and CSV.
- Natural Language Processing (NLP): Analyzing and understanding the grammatical structure of human language.
- Databases: Parsing SQL queries to execute commands.
- Configuration Files: Interpreting application settings and parameters.