Skip to main content

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Jump to: navigation, search

Difference between revisions of "JDT Core Programmer Guide/ECJ/Parse"

m
(recovery: moved & merged description from page Completion)
(3 intermediate revisions by the same user not shown)
Line 10: Line 10:
 
* record line ends (for translation between linear offsets and line based positions)
 
* record line ends (for translation between linear offsets and line based positions)
 
* '''disambiguate''' tokens which depend on context. Some examples, why scanning needs more context than it should according to text books (cf. [https://docs.oracle.com/javase/specs/jls/se14/html/jls-3.html#jls-3.9 JLS §3.9]):
 
* '''disambiguate''' tokens which depend on context. Some examples, why scanning needs more context than it should according to text books (cf. [https://docs.oracle.com/javase/specs/jls/se14/html/jls-3.html#jls-3.9 JLS §3.9]):
** token <code>-&gt;</code> is ambiguous with the two single tokens <code>-</code> and <code>&gt;</code>
+
** token <code>-&gt;</code> is ambiguous with respect to the two single tokens <code>-</code> and <code>&gt;</code>
 
** ''restricted keywords'' are opportunistically recognized when this enables the parser to recognize a legal module declaration (inside the modular compilation unit "module-info.java").
 
** ''restricted keywords'' are opportunistically recognized when this enables the parser to recognize a legal module declaration (inside the modular compilation unit "module-info.java").
 
** ''restricted identifiers'' like <code>var</code> and <code>yield</code> must also be recognized by the scanner in certain contexts.
 
** ''restricted identifiers'' like <code>var</code> and <code>yield</code> must also be recognized by the scanner in certain contexts.
Line 64: Line 64:
 
* fields Parser.methodRecoveryActivated and Parser.statementRecoveryActivated control whether or not attempts should be made to create AST despite syntax errors.
 
* fields Parser.methodRecoveryActivated and Parser.statementRecoveryActivated control whether or not attempts should be made to create AST despite syntax errors.
  
 
Upon any '''syntax error''', syntax recovery happens using the following concepts:
 
* field Parser.currentElement holds a "recovered" element which wraps a possibly incomplete AST node
 
* each class below '''RecoveredElement''' models a node in the recovered tree, with generic functions for adding new child nodes etc.
 
* field Parser.lastCheckpoint marking a position up-to which parsing is considered "done". When hitting a syntax error, parsing will restart at the last checkpoint.
 
* method Parser.resumeOnSyntaxError() applies a number of heuristics to put the parser into a state where parsing can resume.
 
  
 
{{tip|Peculiar parser rules|
 
{{tip|Peculiar parser rules|
 
Grammar '''java.g''' contains a few unexpected rules:
 
Grammar '''java.g''' contains a few unexpected rules:
 
* Rules starting with <code>Invalid</code> or invoking a method <code>consumeInvalid...</code> explicitly expect some common error situations, in order to provide specific error messages.
 
* Rules starting with <code>Invalid</code> or invoking a method <code>consumeInvalid...</code> explicitly expect some common error situations, in order to provide specific error messages.
* Rules starting with <code>Recovery</code> are only invoked during syntax recovery, i.e., those may expect a valid Parser.currentElement.
+
* Rules starting with <code>Recovery</code> are only invoked during syntax recovery, i.e., those may expect a valid Parser.currentElement (see below).
 
* Rules of the shape <code>ForceNoDiet ::&#61; &#36;empty</code> are markers in the grammar, which do not correspond to any input text. Whenever the parser reaches such an empty non-terminal, a side effect is performed in the parser.
 
* Rules of the shape <code>ForceNoDiet ::&#61; &#36;empty</code> are markers in the grammar, which do not correspond to any input text. Whenever the parser reaches such an empty non-terminal, a side effect is performed in the parser.
 
** E.g.: <code>EnumConstant ::&#61; EnumConstantHeader ForceNoDiet ClassBody RestoreDiet</code> ensures that the body of an enum constant is always parsed in non-diet mode.
 
** E.g.: <code>EnumConstant ::&#61; EnumConstantHeader ForceNoDiet ClassBody RestoreDiet</code> ensures that the body of an enum constant is always parsed in non-diet mode.
Line 91: Line 85:
 
* '''DocumentElementParser''': used for creating the ''obsolete'' JDOM (packages <code>org.eclipse.jdt.core.jdom</code>, <code>org.eclipse.jdt.internal.core.jdom</code>).
 
* '''DocumentElementParser''': used for creating the ''obsolete'' JDOM (packages <code>org.eclipse.jdt.core.jdom</code>, <code>org.eclipse.jdt.internal.core.jdom</code>).
 
* '''MatchLocatorParser''': used during Java Search to provide input to the '''MatchLocator''' (which refines index matches into resolved java matches).
 
* '''MatchLocatorParser''': used during Java Search to provide input to the '''MatchLocator''' (which refines index matches into resolved java matches).
 +
 +
===Stacks in detail===
 +
 +
'''astStack''' is the target: in absence of syntax errors everything will eventually be aggregated into this stack and finally combined into a single ASTNode. astStack will only take some higher level ASTNodes, while specific stacks exist for "smaller" nodes.
 +
 +
'''astLengthStack''' is a companion to the above that provides a view as a stack of '''lists'''. Each element on this stack is the number of elements on the main astStack that should be seen as siblings at the same level. Examples are members of the same class, statements of the same block. If, e.g., the astStack contains 2 TypeDeclarations and 3 MethodDeclarions then an astLengthStack [ 2 3 ] will indicate that the three methods are siblings in the same list of methods, and likewise the two types are siblings in the same list of types.
 +
* method <code>pushOnAstStack()</code> always starts a new list (by advancing astLengthPtr) of initial length 1.
 +
* method <code>concatNodeLists()</code> combines the top two lists into a single list, typically the top list is a singleton (the most recently produced node) that is being appended to the current list, following a grammar rule like ''TypeDeclarations ::= TypeDeclarations TypeDeclaration''
 +
 +
The same kind of companion stack (*LengthStack) exists for other stacks as well.
 +
 +
'''expressionStack''' as the name suggests, manages expressions while they are being assembled and before they are included into statements on the main astStack.
 +
 +
'''genericsStack''' is where type parameters and complex type references are assembled.
 +
 +
'''identifierStack''' is for identifiers that haven't even been integrated in any ASTNode.
 +
 +
'''intStack''' is a multipurpose stack mostly used for source positions. Other purposes include array dimensions, modifiers, and more. Due to the mix of ints with different meaning, bugs related to an unbalanced intStack are particularly difficult to analyze. So balancing pushs and pops for this stack for all cases should be done with great care.
 +
 +
====Automaton stack====
 +
While all the above stacks hold work-in-progress elements of the target AST, parsing itself is controlled by the master stack '''stack'''. Each value on this stack is a "state" capturing information what tokens have been seen, and which tokens can legally follow. Sometimes these states are called "act" or "action". The central method <code>tAction</code> takes the previous act plus the current token to produce a new act (its implementation directly leverages the tables generated by jikespg). acts are grouped as shift, shift-reduce, reduce and ERROR states. During '''shift''', a new token will be accepted from the scanner. When the parser has enough information for '''reduce''', the current act is passed to <code>consumeRule</code> where it will select which of the <code>consume*</code> methods is invoked (consumeRule being generated by jikespg as well). This is when contents of the other stacks will be modified.
 +
 +
To see the automaton in action consider setting <code>DEBUG_AUTOMATON</code> to true (but don't commit this change! :) ).
 +
===Recovery===
 +
 +
Upon any '''syntax error''', syntax recovery happens using the following concepts:
 +
* field <code>Parser.currentElement</code> holds a "recovered" element which wraps a possibly incomplete AST node.
 +
** each class below '''RecoveredElement''' models a node in the recovered tree, with generic functions for adding new child nodes etc.
 +
** <code>currentElement</code> acts like a cursor into the tree of recovered elements, with <code>parent</code> links to the rest of the tree.
 +
* field <code>Parser.lastCheckpoint</code> marks a position up-to which parsing is considered "done". When hitting a syntax error, parsing will restart at the last checkpoint.
 +
* method <code>Parser.resumeOnSyntaxError()</code> applies a number of heuristics to put the parser into a state where parsing can resume.
 +
** initially <code>buildInitialRecoveryState()</code> converts existing ASTNodes in <code>Parser.astStack</code> into <code>RecoveredElements</code>.
 +
** More elements will be added later using <code>RecoveredElement.add(..)</code>. Each execution of <code>add(..)</code> will return either
 +
*** <code>this</code> to signal that more elements should be added to the same node, or
 +
*** <code>parent</code> to signal that the current element is complete and subsequent details should be added to the enclosing element.
 +
** Expressions are explicitly omitted from the structure of <code>Recovered*</code>. To enable recovering details from a '''lambda body''', the following tweaks are applied:
 +
*** The lambda block is inserted as a direct child of the current element (omitting the lambda itself)
 +
*** If the current element is a local variable declaration (which cannot accept a block as its child), the local variable is considered as complete and the lambda block is inserted into the parent, i.e., as a sibling of the local variable declaration.
 +
* much of the incremental updating of recovered elements relates to source positions:
 +
** <code>updateOnClosingBrace()</code>, e.g., may set source end positions of a method being recovered.
 +
** source end positions, on the other hand, are used in <code>add(..)</code> to detect whether the current element is "complete".
 +
* recovery may proceed in several iterations starting at a slowly moving lastCheckpoint. Each iteration will start with empty stacks, but ast nodes contained in the recovered elements tree may survive this reset.
 +
* in the end <code>updateParseTree()</code> will collect the information from the recovered elements tree, and update the actual parse tree accordingly, such that <code>Parser.compilationUnit</code> will contain the fully recovered AST.
  
 
==HOWTO: Grammar Changes==
 
==HOWTO: Grammar Changes==

Revision as of 06:19, 27 April 2021

Parse

In classical compiler construction, the input text is tokenized by a Scanner. The resulting token stream is then fed into the Parser which creates the abstract syntax tree (AST). ECJ essentially follows this design, but in reality things are more complex, because the Java syntax is no longer amenable to a strict implementation of this approach.

Scanner

Responsibilities of the scanner:

  • interpret input unicode
  • recognize comments
    • recognize task tags within comments
  • record line ends (for translation between linear offsets and line based positions)
  • disambiguate tokens which depend on context. Some examples, why scanning needs more context than it should according to text books (cf. JLS §3.9):
    • token -> is ambiguous with respect to the two single tokens - and >
    • restricted keywords are opportunistically recognized when this enables the parser to recognize a legal module declaration (inside the modular compilation unit "module-info.java").
    • restricted identifiers like var and yield must also be recognized by the scanner in certain contexts.


Two main concepts are used for token disambiguation:

  • A VanguardParser, is a stripped-down parser that runs the parse automaton at a location of ambiguity to check if a specific goal can be reached using the remaining input. In particular, the VanguardParser does not produce any AST, and it does not really consume any input.
  • The scanner keeps track of the current ScanContext in order to decide if a restricted keyword should be classified as a keyword or as an identifier.

To support the above, the scanner allows limited look ahead and the option to "unget" an erroneously consumed token (in order to allow re-classification).

Warning2.png
Caveat
The above implies a few restrictions on the use of a scanner:
  • many locations in JDT request partial scanning starting from a given context. If the input could possibly be a modular compilation unit, then the scanner needs to actually "parse" starting from the beginning of the text, in order to determine the ScanContext at the requested offset. This is done using ScanContextDetector which is a special VanguardParser.
  • ...
Matters are further complicated in variant CompletionScanner as used for code completion.


Scanner itself is 100% handcoded, but ScannerHelper contains several tables for unicode handling. Olivier Thomann is the expert for unicode handling.

Scanner has a public variant, PublicScanner which must be updated after each relevant change in Scanner. PublicScanner shields clients from generated constants in interface TerminalTokens and maps those to stable constants in ITerminalSymbols.

Parser

Constituents of the parser:

  • java.g is the grammar definition that is used by jikespg to create the following:
    • method Parser.consumeRule(int)
    • constants in interface TerminalTokens: every token detected by the scanner is represented using one of the constants in this interface. Constant names are composed of TokenName + the terminal name used in the grammar.
    • constants in interface ParserBasicInformation (used mostly by the parser automaton).


Generating the parser (see also bug 562044):

  • install jikespg on your local machine
  • define a per-workspace String Substitution "JIKESPG" pointing to the location where jikespg is installed
  • run scripts/build-parser.launch


Main parsing mechanism:

  • the parser state is captured in various stacks, each of which is implementated by a pair of fields:
    • a field of an array type to contain the stack data
    • an int field representing the pointer to the current top element (initially -1 for: empty stack)
  • method Parser.parse() contains the main loop of the parse automaton
  • method Parser.consumeToken(int) generically records some information purely based on a token being consumed
  • method Parser.consumeRule(int) (generated) dispatches into the individual consumeXYZ methods.
  • each consume method may peek or pop elements from some stacks and push elements on one or more other stacks


Parsing modes

  • in diet mode, any blocks like method bodies are skipped, which can later be filled in using dedicated methods like Parser.parse(MethodDeclaration,CompilationUnitDeclaration).
  • fields Parser.methodRecoveryActivated and Parser.statementRecoveryActivated control whether or not attempts should be made to create AST despite syntax errors.


Idea.png
Peculiar parser rules
Grammar java.g contains a few unexpected rules:
  • Rules starting with Invalid or invoking a method consumeInvalid... explicitly expect some common error situations, in order to provide specific error messages.
  • Rules starting with Recovery are only invoked during syntax recovery, i.e., those may expect a valid Parser.currentElement (see below).
  • Rules of the shape ForceNoDiet ::= $empty are markers in the grammar, which do not correspond to any input text. Whenever the parser reaches such an empty non-terminal, a side effect is performed in the parser.
    • E.g.: EnumConstant ::= EnumConstantHeader ForceNoDiet ClassBody RestoreDiet ensures that the body of an enum constant is always parsed in non-diet mode.
    • Don't confuse such marker non-terminals with those rules terminating a recursion, like Argumentsopt ::= $empty


Parser Variants:

  • DiagnoseParser: after detecting a syntax error, this parser heuristically tries several "repairs", and selects the repair with the smallest "distance" to the actual input. During this process, the parser automaton is repeatedly run, without producing any output, other then accept or reject.
  • VanguardParser: as mentioned above, this parser is used for nested parse attempts in order to disambiguate certain tokens, that could be classified in different ways according to context.
  • AssistParser with subclasses CompletionParser and SelectionParser: these parsers focus on a specific text location and implement incomplete parsing optimized for the task at hand.
  • CodeSnippetParser: parses incomplete code for the sake of evaluation
  • CommentRecorderParser: used for created DOM AST
  • SourceElementParser: used for building structure of Java Elements (see Java Model)
  • IndexingParser: used for creating the JDT/Core index
  • DocumentElementParser: used for creating the obsolete JDOM (packages org.eclipse.jdt.core.jdom, org.eclipse.jdt.internal.core.jdom).
  • MatchLocatorParser: used during Java Search to provide input to the MatchLocator (which refines index matches into resolved java matches).

Stacks in detail

astStack is the target: in absence of syntax errors everything will eventually be aggregated into this stack and finally combined into a single ASTNode. astStack will only take some higher level ASTNodes, while specific stacks exist for "smaller" nodes.

astLengthStack is a companion to the above that provides a view as a stack of lists. Each element on this stack is the number of elements on the main astStack that should be seen as siblings at the same level. Examples are members of the same class, statements of the same block. If, e.g., the astStack contains 2 TypeDeclarations and 3 MethodDeclarions then an astLengthStack [ 2 3 ] will indicate that the three methods are siblings in the same list of methods, and likewise the two types are siblings in the same list of types.

  • method pushOnAstStack() always starts a new list (by advancing astLengthPtr) of initial length 1.
  • method concatNodeLists() combines the top two lists into a single list, typically the top list is a singleton (the most recently produced node) that is being appended to the current list, following a grammar rule like TypeDeclarations ::= TypeDeclarations TypeDeclaration

The same kind of companion stack (*LengthStack) exists for other stacks as well.

expressionStack as the name suggests, manages expressions while they are being assembled and before they are included into statements on the main astStack.

genericsStack is where type parameters and complex type references are assembled.

identifierStack is for identifiers that haven't even been integrated in any ASTNode.

intStack is a multipurpose stack mostly used for source positions. Other purposes include array dimensions, modifiers, and more. Due to the mix of ints with different meaning, bugs related to an unbalanced intStack are particularly difficult to analyze. So balancing pushs and pops for this stack for all cases should be done with great care.

Automaton stack

While all the above stacks hold work-in-progress elements of the target AST, parsing itself is controlled by the master stack stack. Each value on this stack is a "state" capturing information what tokens have been seen, and which tokens can legally follow. Sometimes these states are called "act" or "action". The central method tAction takes the previous act plus the current token to produce a new act (its implementation directly leverages the tables generated by jikespg). acts are grouped as shift, shift-reduce, reduce and ERROR states. During shift, a new token will be accepted from the scanner. When the parser has enough information for reduce, the current act is passed to consumeRule where it will select which of the consume* methods is invoked (consumeRule being generated by jikespg as well). This is when contents of the other stacks will be modified.

To see the automaton in action consider setting DEBUG_AUTOMATON to true (but don't commit this change! :) ).

Recovery

Upon any syntax error, syntax recovery happens using the following concepts:

  • field Parser.currentElement holds a "recovered" element which wraps a possibly incomplete AST node.
    • each class below RecoveredElement models a node in the recovered tree, with generic functions for adding new child nodes etc.
    • currentElement acts like a cursor into the tree of recovered elements, with parent links to the rest of the tree.
  • field Parser.lastCheckpoint marks a position up-to which parsing is considered "done". When hitting a syntax error, parsing will restart at the last checkpoint.
  • method Parser.resumeOnSyntaxError() applies a number of heuristics to put the parser into a state where parsing can resume.
    • initially buildInitialRecoveryState() converts existing ASTNodes in Parser.astStack into RecoveredElements.
    • More elements will be added later using RecoveredElement.add(..). Each execution of add(..) will return either
      • this to signal that more elements should be added to the same node, or
      • parent to signal that the current element is complete and subsequent details should be added to the enclosing element.
    • Expressions are explicitly omitted from the structure of Recovered*. To enable recovering details from a lambda body, the following tweaks are applied:
      • The lambda block is inserted as a direct child of the current element (omitting the lambda itself)
      • If the current element is a local variable declaration (which cannot accept a block as its child), the local variable is considered as complete and the lambda block is inserted into the parent, i.e., as a sibling of the local variable declaration.
  • much of the incremental updating of recovered elements relates to source positions:
    • updateOnClosingBrace(), e.g., may set source end positions of a method being recovered.
    • source end positions, on the other hand, are used in add(..) to detect whether the current element is "complete".
  • recovery may proceed in several iterations starting at a slowly moving lastCheckpoint. Each iteration will start with empty stacks, but ast nodes contained in the recovered elements tree may survive this reset.
  • in the end updateParseTree() will collect the information from the recovered elements tree, and update the actual parse tree accordingly, such that Parser.compilationUnit will contain the fully recovered AST.

HOWTO: Grammar Changes

eg: Let us say you want to add a new statement, MyNewStatement. Grammar changes start with java.g - It should be a context free grammar since we use an LALR(1) parser You can add your MyNewStatement similar to that of EnhancedForStatement to the RHS of Statement, ie Statement -> MyNewStatement and then followed by the new rules (obeying the context free rule) with a few consumeMyNewStatement(), other consume* methods.

The parser generator jikespg is used to generate the parser from this grammar. You may want to take the latest bits of the jikespg from [1] since it contains a few bug fixes and then make. Additionally, the generation of parser has been made easier - ref [2]. If you want to look at the original documentation of how the parser is generated see [3].

Let us say, now you have the modified java.g and the jikespg parser ready. First run jikespg.exe java.g to check whether your grammar is LALR(1).

Assuming that it is, then you need to just run org.eclipse.jdt.core/scripts/build-parser.launch. In order to provide the path to jikespg you need to define a per-workspace String Substitution "JIKESPG", see also the usage notes inside build-parser.xml.

This would generate the files as described in [3], but what would interest you most would the Parser.java - here you will see your consumeMyNewStatement() and other consume* stubs inserted - both the method call as per the generated automata (which you should not touch) and the declaration - you would need to add code to the declaration, to reduce and possibly create a new AST Node.

The AST Node created would be the internal compiler AST Node and not the DOM AST node - they have similar/identical definitions - so you need to create something similar to org.eclipse.jdt.internal.compiler.ast.ForeachStatement, which is the enhanced for statement node of compiler AST - look at org.eclipse.jdt.internal.compiler.parser.consumeEnhancedForStatement() - internally, it creates a ForeachStatement - [we are not talking of DOM AST node - the conversion to the DOM AST Node EnhancedForStatement comes later in the ASTConverter method and is not relevant here - mentioning to avoid confusion]

The most tricky part about constructing new AST nodes is to correctly manage the various stacks of the parser. As a general rule, each consume method should consume all ast nodes mentioned on its RHS and replace them by one node representing the LHS. The first distinction to be made is between astStack and expressionStack. Less obvious is the use of intStack which is also modified inside consumeToken, e.g.


[1] https://github.com/jikespg/jikespg/tree/fixes-combined

[2] https://bugs.eclipse.org/bugs/show_bug.cgi?id=562044

[3] https://www.eclipse.org/jdt/core/howto/generate%20parser/generateParser.html

Note You can also run the jikespg separately and then bring the generated files. Once you get the files, set the env in Eclipse JIKESPG=JIKESPG_EXTERNAL to signal to the script to skip running the jikespg.

Back to the top