Notice: This Wiki is now read only and edits are no longer possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.
CDT/Obsolete/designs/C99 and UPC Parser Overview
Contents
C99/UPC Parsers
Note: The C99 and UPC parsers are currently considered experimental features, and as such the framework, APIs and functionality may change at any time.
CDT contains two new parsers that support parsing C99 and UPC (Unified Parallel C) code. The main purpose of this initiative is to provide UPC support in CDT for consumption by the PTP (Parallel Tools Platform) project.
The UPC specification is defined as an extension to the C99 specification. UPC adds new language constructs that are designed specifically to support high-performance computing on large-scale parallel machines.
In order for CDT to be able to analyze UPC code it must be able to parse UPC code. Unfortunately the current DOM parsers that are built-in to CDT are not easily extended to support new syntax. The need for UPC support, and a desire to improve language-extensibility in CDT, led to the birth of a new parser framework. The C99 parser allows language extensions, such as UPC, to be added easily, and so the UPC parser is written as an extension to the C99 parser.
The C99/UPC parsers are internally quite different from the CDT C/C++ DOM parsers. The C99/UPC parsers are based on a parser generator called LPG, this is a very different approach from the CDT DOM parsers which were hand written from scratch. The grammar of C99 is contained in a grammar file which is processed by the LPG code generator to produce a working parser.
LPG
LPG is a bottom-up LALR parser generator and runtime framework. LPG is a product of IBM Watson Research, and is used by several Eclipse projects including...
- Model Development Tools (MDT)
- Graphical Modeling Framework (GMF)
- Generative Modeling Technologies (GMT, in particular the UMLX component)
- C/C++ Development Tools (CDT)
- Data Tools Platform (DTP)
- SAFARI
- Java Development Tools (JDT, in the bytecode compiler)
LPG consists of two parts:
- The generator (lpg.exe)
- Generates parse tables, token constants etc…
- Part of the LPG distribution.
- The runtime (java library)
- Contains the parser driver and runtime framework.
The LPG distribution is available for download on sourceforge. It contains the generator, some documentation, examples and template files.
Limited documentation can be found on sourceforge and on the IBM research SAFARI project web site.
LPG is also part of the Eclipse IMP project.
LPG has the following characteristics:
- Completely automatic error recovery, including the ability to automatically produce problem nodes in the AST.
- Deterministic and backtracking parser drivers. The backtracking parser is capable of handing some ambiguous grammars.
- Automatic calculation of AST node offsets (this is something that requires quite a bit of code in the DOM parser).
- Different sets of semantic actions can be plugged into a parser. Semantic actions are used to generate the AST.
- Clean separation between the parser (grammar file) and the semantic actions. The DOM parsers in contrast have parsing and AST building code tightly intermixed.
Running LPG on the grammar files
- Download and install the LPG distribution from sourceforge.
- Install the LPG eclipse plugin (from orbit).
- Configure lpg.exe as an external tool in eclipse using the external tools dialog.
- Use the following settings:
- Location
- \your\path\to\lpg.exe
- Working Directory
- ${container_loc}
- Arguments
- -list ${resource_loc}
- Additionally the following environment variables must be set up under the Environment tab.
- LPG_INCLUDE
- Set this to the full system path to where the C99Parser.g file is located.
- LPG_TEMPLATE
- Set this to the full path to the templates directory under the LPG distribution.
- LPG_INCLUDE
- Location
- Use the following settings:
- Once set up in this manner lpg.exe can be run on the C99Parser.g grammar file by selecting it and clicking on the run external tool toolbar button.
UPC
Unified Parallel C is an extension of the C99 language standard. It adds several language constructs that allow the creation of programs that run with a large number of threads. A quick overview of UPC syntax extensions is given here just to give an idea of the scope of what was needed to add UPC support to CDT.
An extended syntax for declaration specifiers is allowed. A shared variable is declared using the type qualifier “shared” optionally in conjunction with “relaxed” or “strict” and can be given a value that determines how the variable is shared between the running threads (this is usually used for arrays).
shared int i; shared [2] int a[200]; strict shared int i;
New for loop
A new type of for loop is added that is used as a work distribution mechanism for distributing operations amongst threads:
upc_forall(i=0; i<N; i++; i)
It is important to note that upc_forall requires four expressions in its declaration rather than three like a normal for loop. Also, the keyword continue can be used in the fourth position.
Barrier Statements
Barrier statements are used to synchronize the program.
upc_barrier value; upc_wait value;
New sizeofs
upc_localsizeof(p) upc_blocksizeof(p)
New keywords
New keywords that represent the number of active threads are added:
shared int b[100*THREADS];
Language Extensibility in CDT
CDT has the following language extensibility features:
- CDT allows new parsers, represented by an implementation of the ILanguage interface, to be added via an extension point.
- CDT provides a dialog for mapping content types to languages.
- The CDT editor allows syntax highlighting to be extended for new keywords.
- The DOM AST is extensible, it is possible to add new types of nodes.
- If a parser produces a well-formed AST using the DOM AST classes, then all of the CDT features such as content assist and indexing will work.
The C99 parser framework adds the following extensibility features:
- The C99 syntax is defined in a grammar file. It is possible to reuse this file in a new parser by using the LPG $Import directive.
- The semantic actions that build the AST can be extended or overridden to support building an AST with new types of nodes that are specific to the language extension.
- A (soon to be) reusable implementation of the preprocessor.
- Support for content assist built-in, a new parser extension does not have to do anything to get content assist to work.
The C99 Parser Framework
The C99 extensible parser framework consists of the following components:
Lexer
The lexer is generated by LPG from a grammar file (C99Lexer.g). Its role is to convert a character stream into a token stream. The lexer does not recognize keywords, instead they are parsed as identifier tokens which are converted into keyword tokens after the preprocessor has run. This approach allows new keywords to be added without editing the lexer grammar file. The lexer is capable of generating comment tokens.
Preprocessor
A new token-based preprocessor has been written from scratch. The strange semantics of the preprocessor makes generating one from a grammar file infeasible. Also, reusing CDTs preprocessor wasn’t an option because it has its own lexer built right into it.
The preprocessor is token-based, that means the input character buffer is first lexed into tokens and then the preprocessor runs on those tokens. Included files are recursively lexed and added to the preprocessor’s input. Comment tokens are filtered through the preprocessor and are passed straight to the parser.
Parser
The C99 parser is generated by LPG from a grammar file (C99Parser.g).
There are ambigites in the grammar file that are unavoidable without access to type information. The LPG backtracking parser is used so that when the parser encounters an ambiguity it can attempt to parse it one way, and if that doesn’t work it can automatically backtrack and try parsing another way. In other words, if an ambiguous piece of syntax is encountered, and if there is at least one possible correct way to parse it, then it will be successfully parsed.
LPG reports ambiguities by reporting conflicts during parser generation. All of the current conflicts boil down to the following causes.
- The sizeof ambiguity
- Dangling else
- Ambiguities between declarations and expressions, such as x * y.
- Ambiguities caused by the content assist grammar rules (described later in this document).
LPG can handle some conflicts by specifying precedence between grammar rules. This technique is used to solve the dangling else problem as well as to favor the content assist rules over all the other rules. Declarations are favored over expressions.
The parser will convert comment tokens into comment nodes in the AST.
AST Building Semantic Actions
Most of the parser grammar rules are associated with semantic actions, which are just methods that are called when a rule is reduced. The AST is built by the actions using a stack based algorithm (as described in chapter 5 of the Red Dragon Book).
This is a very straightforward and simple algorithm for building an AST. For example, when a binary expression is encountered its two operands will be on the stack. A binary expression node is created, the two operands are popped from the stack and added as children to the node, and then the new node is pushed onto the stack.
Preprocessor Expression Evaluation
The preprocessor is itself a kind of interpreter because it needs to be able to parse and evaluate expressions that occur as conditionals in #if and #elif directives. The preprocessor expression evaluator is implemented as a small LPG parser, generated from the C99ExprEvaluator.g grammar file. The grammar file contains rules for parsing conditional expressions and the associated semantic actions use a stack-based algorithm to perform the evaluation.
Keyword Maps
A keyword map is a class that implements the IKeywordMap interface. A keyword map serves two roles. Firstly, it specifies a mapping between the string representation of a keywords and their associated token types. This is used to convert identifier tokens into keyword tokens. Secondly, all keywords that are defined in the map will be syntax highlighted in the editor.
Node Factory
The AST building semantic actions depend on an implementation of IASTNodeFactory to provide newly constructed AST nodes.
Extending The C99 Parser
This section outlines the steps that were necessary to add UPC support as an extension to the C99 parser. Similar steps can be taken to add support for any C99 language extension.
Add new grammar rules and generate a parser
The UPC grammar file (UPCParser.g) is an extension of the C99 grammar file. The first clause of the UPC grammar file reads
$Import C99Parser.g $End
This pulls in all the grammar rules, terminals and Java code defined in the C99 grammar file. We will not go into the details of writing an LPG grammar file here, please see the LPG documentation.
Running the LPG generator (lpg.exe) on the grammar file will generate a whole new parser capable of handling UPC code. It will also generate the code for a lexer.
LPG also generates diagnostic files that end with a .l extension. These files contain very useful information such as rule precedence and the causes of conflicts.
Add new semantic actions
The new grammar rules are going to need new semantic actions associated with them. A new semantic actions class, UPCParserAction, was created that extends the C99 actions.
public class UPCParserAction extends C99ParserAction
Add new keywords to the keyword map
This part is very simple; UPCKeywordMap simply extends C99KeywordMap and adds the new mappings for the new UPC keywords.
The new actions class and keyword map must be specified in the UPC grammar file so that they will appear in the generated code.
$Define $action_class /. UPCParserAction ./ $keyword_map_class /. UPCKeywordMap ./ $End
Add new AST nodes
Several new AST nodes were added that represent new UPC constructs. Designing an extension to the CDT DOM AST is something that should be done carefully. Most of the AST additions were straight forward, for example a upc_forall statement is represtned by UPCASTForallStatement which extends CASTForStatement and adds a spot for the forth expression that is allowed by UPC. Declarations were a bit more involved because the DOM AST provides four types of declaration specifier node. Each one of these had to be extended to add support for the new type qualifiers “shared”, “relaxed” and “strict”.
Extend the node factory
The AST building semantic actions will need a way to create instances of the new AST nodes. This is done by creating a class UPCASTNodeFactory that extends C99ASTNodeFactory. New methods for creating the new types of nodes are added, also the methods that create declaration specifier nodes were overridden to return the UPC specific versions of those nodes.
Build the AST
The class UPCParserAction contains the new semantic actions that build the UPC specific parts of the AST. These actions are called from the grammar file and are associated with the new grammar rules. By simply manipulating the AST stack the new language extensions can be handled quite easily. The method createNodeFactory() is overridden to return an instance of UPCASTNodeFactory.
Create an ILanguage
A base implementation of ILanguage is provided by the class BaseExtensibleLanguage. It is sufficient to extend this class and implement its three abstract methods.
- getParser()
- return the UPC parser
- getKeywordMap()
- return the UPC keyword map
- getPreprocessorExtensionConfiguration()
- a preprocessor extension configuration contains any extra macros that may be define by the language definition. UPC does not add any macros and so this returns an empty configuration.
Content Assist Support
Content assist support is built into the base C99 parser. The algorithm for producing a completion node is essentially the same as for the DOM parsers, which is described in CDT/designs/Overview of Parsing.
The completion token is actually generated by the preprocessor because it is responsible for computing offsets and therefore it is aware when the cursor offset has been reached. The preprocessor will then begin returning end-of-completion tokens after it returns the completion token.
There are 5 simple grammar rules in the C99Parser.g grammar file for dealing with the completion and end-of-completion tokens:
ident ::= 'identifier' | 'Completion' ']' ::=? 'RightBracket' | 'EndOfCompletion' ')' ::=? 'RightParen' | 'EndOfCompletion' '}' ::=? 'RightBrace' | 'EndOfCompletion' ';' ::=? 'SemiColon' | 'EndOfCompletion'
The first rule says that a Completion token can occur anywhere an identifier token can occur. The next 4 rules allow the parse to complete successfully after a Completion token has been encountered.
Of course these rules lead to a huge amount of ambiguities reported by LPG. These rules are given precedence over all other rules so that the end-of-completion token will always match.
This solution is completely general and automatically applies to anyplace in the grammar where the ident rule is used. This means that as long as an extending grammar uses the ident non-terminal in places where identifiers are expected then content assist will automatically work.
The semantic action for consuming an identifier will check if the token is actually a completion token, and if it is a completion node is generated.