Skip to main content

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

Jump to: navigation, search

JDT Core Programmer Guide/ECJ/Parse

< JDT Core Programmer Guide‎ | ECJ
Revision as of 07:59, 30 June 2020 by Stephan.herrmann.berlin.de (Talk | contribs) (Created page with "=Parse= In classical compiler construction, the input text is tokenized by a '''Scanner'''. The resulting token stream is then fed into the '''Parser''' which creates the abs...")

(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

Parse

In classical compiler construction, the input text is tokenized by a Scanner. The resulting token stream is then fed into the Parser which creates the abstract syntax tree (AST). ECJ essentially follows this design, but in reality things are more complex, because the Java syntax is no longer amenable to a strict implementation of this approach.

Scanner

Responsibilities of the scanner:

  • interpret input unicode
  • recognize comments
    • recognize task tags within comments
  • record line ends (for translation between linear offsets and line based positions)
  • disambiguate tokens which depend on context. Some examples, why scanning needs more context than it should according to text books (cf. JLS §3.9):
    • token -> is ambiguous wrt the two single tokens - and >
    • restricted keywords are opportunistically recognized when this enables the parser to recognize a legal module declaration (inside the modular compilation unit "module-info.java").
    • restricted identifiers like var and yield must also be recognized by the scanner in certain contexts.


Two main concepts are used for token disambiguation:

  • A VanguardParser, is a stripped-down parser that runs the parse automaton at a location of ambiguity to check if a specific goal can be reached using the remaining input. In particular, the VanguardParser does not produce any AST, and it does not really consume any input.
  • The scanner keeps track of the current ScanContext in order to decide if a restricted keyword should be classified as a keyword or as an identifier.

To support the above, the scanner allows limited look ahead and the option to "unget" an erroneously consumed token (in order to allow re-classification).

Warning2.png
Caveat
The above implies a few restrictions on the use of a scanner:
  • many locations in JDT request partial scanning starting from a given context. If the input could possibly be a modular compilation unit, then the scanner needs to actually "parse" starting from the beginning of the text, in order to determine the ScanContext at the requested offset. This is done using ScanContextDetector which is a special VanguardParser.
  • ...
Matters are further complicated in variant CompletionScanner as used for code completion.


Scanner itself is 100% handcoded, but ScannerHelper contains several tables for unicode handling. Olivier Thomann is the expert for unicode handling.

'Scanner has a public variant, PublicScanner which must be updated after each relevant change in Scanner. PublicScanner shields clients from generated constants in interface TerminalTokens and maps those to stable constants in ITerminalSymbols.

Parser

Constituents of the parser:

  • java.g is the grammar definition that is used by jikespg to create the following:
    • method Parser.consumeRule(int)
    • constants in interface TerminalTokens: every token detected by the scanner is represented using one of the constants in this interface. Constant names are composed of TokenName + the terminal name used in the grammar.
    • constants in interface ParserBasicInformation (used mostly by the parser automaton).


Generating the parser (see also bug 562044):

  • install jikespg on your local machine
  • define a per-workspace String Substitution "JIKESPG" pointing to the location where jikespg is installed
  • run scripts/build-parser.launch


Main parsing mechanism:

  • the parser state is captured in various stacks, each of which is implementated by a pair of fields:
    • a field of an array type to contain the stack data
    • an int field representing the pointer to the current top element (initially -1 for: empty stack)
  • method Parser.parse() contains the main loop of the parse automaton
  • method Parser.consumeToken(int) generically records some information purely based on a token being consumed
  • method Parser.consumeRule(int) (generated) dispatches into the individual consumeXYZ methods.
  • each consume method may peek or pop elements from some stacks and push elements on one or more other stacks


Parsing modes

  • in diet mode, any blocks like method bodies are skipped, which can later be filled in using dedicated methods like Parser.parse(MethodDeclaration,CompilationUnitDeclaration).
  • fields Parser.methodRecoveryActivated and Parser.statementRecoveryActivated control whether or not attempts should be made to create AST despite syntax errors.


Upon any syntax error, syntax recovery happens using the following concepts:

  • field Parser.currentElement holds a "recovered" element which wraps a possibly incomplete AST node
  • each class below RecoveredElement models a node in the recovered tree, with generic functions for adding new child nodes etc.
  • field Parser.lastCheckpoint marking a position up-to which parsing is considered "done". When hitting a syntax error, parsing will restart at the last checkpoint.
  • method Parser.resumeOnSyntaxError() applies a number of heuristics to put the parser into a state where parsing can resume.
Idea.png
Peculiar parser rules
Grammar java.g contains a few unexpected rules:
  • Rules starting with Invalid or invoking a method consumeInvalid... explicitly expect some common error situations, in order to provide specific error messages.
  • Rules starting with Recovery are only invoked during syntax recovery, i.e., those may expect a valid Parser.currentElement.
  • Rules of the shape ForceNoDiet ::= $empty are markers in the grammar, which do not correspond to any input text. Whenever the parser reaches such an empty non-terminal, a side effect is performed in the parser.
    • E.g.: EnumConstant ::= EnumConstantHeader ForceNoDiet ClassBody RestoreDiet ensures that the body of an enum constant is always parsed in non-diet mode.
    • Don't confuse such marker non-terminals with those rules terminating a recursion, like Argumentsopt ::= $empty


Parser Variants:

  • DiagnoseParser: after detecting a syntax error, this parser heuristically tries several "repairs", and selects the repair with the smallest "distance" to the actual input. During this process, the parser automaton is repeatedly run, without producing any output, other then accept or reject.
  • VanguardParser: as mentioned above, this parser is used for nested parse attempts in order to disambiguate certain tokens, that could be classified in different ways according to context.
  • AssistParser with subclasses CompletionParser and SelectionParser: these parsers focus on a specific text location and implement incomplete parsing optimized for the task at hand.
  • CodeSnippetParser: parses incomplete code for the sake of evaluation
  • CommentRecorderParser: used for created DOM AST
  • SourceElementParser: used for building structure of Java Elements (see Java Model)
  • IndexingParser: used for creating the JDT/Core index
  • DocumentElementParser: used for creating the obsolete JDOM (packages org.eclipse.jdt.core.jdom, org.eclipse.jdt.internal.core.jdom).
  • MatchLocatorParser: used during Java Search to provide input to the MatchLocator (which refines index matches into resolved java matches).

Back to the top