# CDT/designs/C99 and UPC Parser Overview


## C99/UPC Parsers

Note: The C99 and UPC parsers are currently considered experimental features, and as such the framework, APIs and functionality may change at any time.

CDT contains two new parsers that support parsing C99 and UPC (Unified Parallel C) code. The main purpose of this initiative is to provide UPC support in CDT for consumption by the PTP (Parallel Tools Platform) project.

The UPC specification is defined as an extension to the C99 specification. UPC adds new language constructs that are designed specifically to support high-performance computing on large-scale parallel machines.

In order for CDT to be able to analyze UPC code it must be able to parse UPC code. Unfortunately, the current DOM parsers that are built into CDT are not easily extended to support new syntax. The need for UPC support, and a desire to improve language extensibility in CDT, led to the birth of a new parser framework. The C99 parser allows language extensions, such as UPC, to be added easily, and so the UPC parser is written as an extension to the C99 parser.

The C99/UPC parsers are internally quite different from the CDT C/C++ DOM parsers. The C99/UPC parsers are based on a parser generator called LPG. This is a very different approach from the CDT DOM parsers, which were hand-written from scratch. The grammar of C99 is contained in a grammar file, which is processed by the LPG code generator to produce a working parser.

## LPG

LPG is a bottom-up LALR parser generator and runtime framework. LPG is a product of IBM Watson Research, and is used by several Eclipse projects, including:

• Model Development Tools (MDT)
• Graphical Modeling Framework (GMF)
• Generative Modeling Technologies (GMT, in particular the UMLX component)
• C/C++ Development Tools (CDT)
• Data Tools Platform (DTP)
• SAFARI
• Java Development Tools (JDT, in the bytecode compiler)

LPG consists of two parts:

• The generator (lpg.exe)
  • Generates parse tables, token constants etc…
  • Part of the LPG distribution.
• The runtime (Java library)
  • Contains the parser driver and runtime framework.

The LPG distribution is available for download on SourceForge. It contains the generator, some documentation, examples and template files.

Limited documentation can be found on SourceForge and on the IBM Research SAFARI project web site.

LPG is also part of the Eclipse IMP project.

LPG has the following characteristics:

• Completely automatic error recovery, including the ability to automatically produce problem nodes in the AST.
• Deterministic and backtracking parser drivers. The backtracking parser is capable of handling some ambiguous grammars.
• Automatic calculation of AST node offsets (this is something that requires quite a bit of code in the DOM parser).
• Different sets of semantic actions can be plugged into a parser. Semantic actions are used to generate the AST.
• Clean separation between the parser (grammar file) and the semantic actions. The DOM parsers in contrast have parsing and AST building code tightly intermixed.

### Running LPG on the grammar files

• Install the LPG Eclipse plugin (from Orbit).
• Configure lpg.exe as an external tool in Eclipse using the external tools dialog.
• Use the following settings:
  • Location
    • \your\path\to\lpg.exe
  • Working Directory
    • ${container_loc}
  • Arguments
    • -list ${resource_loc}
• Additionally, the following environment variables must be set up under the Environment tab:
  • LPG_INCLUDE
    • Set this to the full system path to where the C99Parser.g file is located.
  • LPG_TEMPLATE
    • Set this to the full path to the templates directory under the LPG distribution.
• Once set up in this manner, lpg.exe can be run on the C99Parser.g grammar file by selecting it and clicking on the run external tool toolbar button.

## UPC

Unified Parallel C is an extension of the C99 language standard. It adds several language constructs that allow the creation of programs that run with a large number of threads. A quick overview of UPC syntax extensions is given here just to give an idea of the scope of what was needed to add UPC support to CDT.

#### Shared declarations

An extended syntax for declaration specifiers is allowed. A shared variable is declared using the type qualifier “shared” optionally in conjunction with “relaxed” or “strict” and can be given a value that determines how the variable is shared between the running threads (this is usually used for arrays).

shared int i;
shared [2] int a[200];
strict shared int i;
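
To make the bracketed value concrete: in UPC it is the block size of the layout, and blocks of that many elements are dealt out to the threads round-robin. The following Java sketch only models that rule for illustration. It is not CDT or UPC runtime code, and the class and method names are made up for this page.

```java
// Illustrative model of UPC's block-cyclic array layout; not part of CDT.
final class SharedLayoutSketch {
    private SharedLayoutSketch() {}

    /**
     * Thread that element `index` has affinity to for a declaration of the
     * form `shared [blockSize] T a[...]` when the program runs with
     * `threads` threads.
     */
    static int ownerThread(int index, int blockSize, int threads) {
        if (index < 0 || blockSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException(
                    "index must be >= 0; blockSize and threads must be > 0");
        }
        return (index / blockSize) % threads;
    }

    public static void main(String[] args) {
        int threads = 4; // assumed value of THREADS for this demo
        // Models `shared [2] int a[200];` for the first few elements.
        for (int i = 0; i < 10; i++) {
            System.out.println("a[" + i + "] -> thread " + ownerThread(i, 2, threads));
        }
    }
}
```

With a block size of 2 and four threads, elements 0 and 1 belong to thread 0, elements 2 and 3 to thread 1, and the pattern wraps around at element 8.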


#### New for loop

A new type of for loop is added that is used as a work distribution mechanism for distributing operations amongst threads:

upc_forall(i=0; i<N; i++; i)


It is important to note that upc_forall requires four expressions in its declaration rather than three like a normal for loop. Also, the keyword continue can be used in the fourth position.
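
As a rough illustration of the work distribution, the sketch below models the case where the fourth expression is a non-negative integer: an iteration is executed by the thread whose number equals that value modulo THREADS. This is a simplified Java model written for this page (the names are hypothetical) and it ignores the pointer-to-shared and continue forms of the fourth expression.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of upc_forall(i=0; i<n; i++; i); not part of CDT.
final class ForallSketch {
    private ForallSketch() {}

    /** Iterations that thread `myThread` would execute out of 0..n-1. */
    static List<Integer> iterationsFor(int myThread, int threads, int n) {
        List<Integer> mine = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            // Integer affinity: the iteration runs where i % THREADS == MYTHREAD.
            if (i % threads == myThread) {
                mine.add(i);
            }
        }
        return mine;
    }

    public static void main(String[] args) {
        int threads = 4; // assumed value of THREADS for this demo
        for (int t = 0; t < threads; t++) {
            System.out.println("thread " + t + " runs " + iterationsFor(t, threads, 10));
        }
    }
}
```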

#### Barrier Statements

Barrier statements are used to synchronize the program.

upc_barrier value;
upc_wait value;


#### New sizeofs

upc_localsizeof(p)
upc_blocksizeof(p)


#### New keywords

shared int b[100*THREADS];


## Language Extensibility in CDT

CDT has the following language extensibility features:

• CDT allows new parsers, represented by an implementation of the ILanguage interface, to be added via an extension point.
• CDT provides a dialog for mapping content types to languages.
• The CDT editor allows syntax highlighting to be extended for new keywords.
• The DOM AST is extensible; it is possible to add new types of nodes.
• If a parser produces a well-formed AST using the DOM AST classes, then all of the CDT features such as content assist and indexing will work.

The C99 parser framework adds the following extensibility features:

$Import
C99Parser.g
$End

This pulls in all the grammar rules, terminals and Java code defined in the C99 grammar file. We will not go into the details of writing an LPG grammar file here; please see the LPG documentation.

Running the LPG generator (lpg.exe) on the grammar file will generate a whole new parser capable of handling UPC code. It will also generate the code for a lexer.

LPG also generates diagnostic files that end with a .l extension. These files contain very useful information such as rule precedence and the causes of conflicts.

#### Add new semantic actions

The new grammar rules are going to need new semantic actions associated with them. A new semantic actions class, UPCParserAction, was created that extends the C99 actions.

public class UPCParserAction extends C99ParserAction

#### Add new keywords to the keyword map

This part is very simple; UPCKeywordMap simply extends C99KeywordMap and adds the new mappings for the new UPC keywords.

The new actions class and keyword map must be specified in the UPC grammar file so that they will appear in the generated code.

$Define
$action_class /. UPCParserAction ./
$keyword_map_class /. UPCKeywordMap ./
$End
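
The keyword map extension can be pictured with the following minimal Java sketch. The two classes are stand-ins invented for this page; the real C99KeywordMap and UPCKeywordMap APIs are not shown here, and the token kind numbers are arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the C99 keyword map; the real class's API may differ.
class SketchC99KeywordMap {
    protected final Map<String, Integer> keywords = new HashMap<>();

    SketchC99KeywordMap() {
        // Arbitrary token kinds, for illustration only.
        keywords.put("int", 1);
        keywords.put("for", 2);
    }

    /** Returns the token kind for a keyword, or null for an ordinary identifier. */
    Integer getKeywordKind(String text) {
        return keywords.get(text);
    }
}

// Stand-in for the UPC keyword map: inherits the C99 entries and adds UPC ones.
class SketchUPCKeywordMap extends SketchC99KeywordMap {
    SketchUPCKeywordMap() {
        keywords.put("shared", 100);
        keywords.put("strict", 101);
        keywords.put("relaxed", 102);
        keywords.put("upc_forall", 103);
    }
}
```

Because the subclass only adds entries, every C99 keyword keeps its mapping and the lexer needs no other change to recognise the new words.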


Several new AST nodes were added that represent new UPC constructs. Designing an extension to the CDT DOM AST is something that should be done carefully. Most of the AST additions were straightforward; for example, a upc_forall statement is represented by UPCASTForallStatement, which extends CASTForStatement and adds a spot for the fourth expression that is allowed by UPC. Declarations were a bit more involved because the DOM AST provides four types of declaration specifier node. Each one of these had to be extended to add support for the new type qualifiers “shared”, “relaxed” and “strict”.
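
The shape of the upc_forall node extension might look roughly like the sketch below. These are simplified stand-in classes written for this page, with strings in place of real expression and statement nodes. The accessor names are assumptions and do not reflect the actual CASTForStatement or UPCASTForallStatement APIs.

```java
// Stand-in for CASTForStatement; strings stand in for child nodes.
class SketchForStatement {
    private final String init;
    private final String condition;
    private final String iteration;
    private final String body;

    SketchForStatement(String init, String condition, String iteration, String body) {
        this.init = init;
        this.condition = condition;
        this.iteration = iteration;
        this.body = body;
    }

    String getInit() { return init; }
    String getCondition() { return condition; }
    String getIteration() { return iteration; }
    String getBody() { return body; }

    String describe() {
        return "for(" + init + "; " + condition + "; " + iteration + ") " + body;
    }
}

// Stand-in for UPCASTForallStatement: reuses the three inherited slots
// and adds one more for the fourth expression.
class SketchForallStatement extends SketchForStatement {
    private final String affinity;

    SketchForallStatement(String init, String condition, String iteration,
                          String affinity, String body) {
        super(init, condition, iteration, body);
        this.affinity = affinity;
    }

    String getAffinity() { return affinity; }

    @Override
    String describe() {
        return "upc_forall(" + getInit() + "; " + getCondition() + "; "
                + getIteration() + "; " + affinity + ") " + getBody();
    }
}
```

Extending the existing for-statement node means that code written against the base node type can still handle the new node without modification.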

#### Extend the node factory

The AST building semantic actions will need a way to create instances of the new AST nodes. This is done by creating a class UPCASTNodeFactory that extends C99ASTNodeFactory. New methods for creating the new types of nodes are added, and the methods that create declaration specifier nodes are overridden to return the UPC-specific versions of those nodes.

#### Build the AST

The class UPCParserAction contains the new semantic actions that build the UPC specific parts of the AST. These actions are called from the grammar file and are associated with the new grammar rules. By simply manipulating the AST stack the new language extensions can be handled quite easily. The method createNodeFactory() is overridden to return an instance of UPCASTNodeFactory.
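
The interplay between the overridden createNodeFactory() method and the AST stack can be sketched as follows. All classes here are stand-ins invented for this page, with strings in place of AST nodes. Apart from createNodeFactory(), the method names are assumptions and are not taken from the real C99ParserAction or UPCParserAction classes.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for C99ASTNodeFactory; nodes are plain strings.
class SketchC99NodeFactory {
    String newDeclSpecifier(String type) {
        return "CDeclSpec(" + type + ")";
    }
}

// Stand-in for UPCASTNodeFactory: overrides one method and adds a new one.
class SketchUPCNodeFactory extends SketchC99NodeFactory {
    @Override
    String newDeclSpecifier(String type) {
        return "UPCDeclSpec(" + type + ")";
    }

    String newForall(String init, String cond, String iter, String affinity, String body) {
        return "Forall(" + init + "; " + cond + "; " + iter + "; " + affinity + ") " + body;
    }
}

// Stand-in for the C99 actions: owns the AST stack and obtains its factory lazily.
class SketchC99Action {
    protected final Deque<String> astStack = new ArrayDeque<>();
    private SketchC99NodeFactory nodeFactory;

    protected SketchC99NodeFactory createNodeFactory() {
        return new SketchC99NodeFactory();
    }

    protected SketchC99NodeFactory getNodeFactory() {
        if (nodeFactory == null) {
            nodeFactory = createNodeFactory();
        }
        return nodeFactory;
    }

    void consumeDeclSpecifier(String type) {
        astStack.push(getNodeFactory().newDeclSpecifier(type));
    }
}

// Stand-in for the UPC actions: swaps the factory and adds one new action.
class SketchUPCAction extends SketchC99Action {
    @Override
    protected SketchC99NodeFactory createNodeFactory() {
        return new SketchUPCNodeFactory();
    }

    /** Runs when the upc_forall rule is reduced; children were pushed left to right. */
    void consumeForall() {
        // Pop in the reverse order of pushing.
        String body = astStack.pop();
        String affinity = astStack.pop();
        String iter = astStack.pop();
        String cond = astStack.pop();
        String init = astStack.pop();
        SketchUPCNodeFactory factory = (SketchUPCNodeFactory) getNodeFactory();
        astStack.push(factory.newForall(init, cond, iter, affinity, body));
    }
}
```

The inherited consumeDeclSpecifier() action is untouched, yet it produces the UPC variant of the node because the factory it calls has been replaced.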

#### Create an ILanguage

A base implementation of ILanguage is provided by the class BaseExtensibleLanguage. It is sufficient to extend this class and implement its three abstract methods.

• getParser()
  • return the UPC parser
• getKeywordMap()
  • return the UPC keyword map
• getPreprocessorExtensionConfiguration()
  • a preprocessor extension configuration contains any extra macros that may be defined by the language definition. UPC does not add any macros and so this returns an empty configuration.
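
The pattern behind these three methods is sketched below. The base class here is a stand-in written for this page and uses strings and a map instead of CDT types, so the return types are assumptions. Only the three method names come from the list above.

```java
import java.util.Collections;
import java.util.Map;

// Stand-in for BaseExtensibleLanguage; the real class works with CDT types.
abstract class SketchBaseLanguage {
    protected abstract String getParser();
    protected abstract String getKeywordMap();
    protected abstract Map<String, String> getPreprocessorExtensionConfiguration();

    /** Shared logic lives in the base class and only calls the three hooks. */
    String describe() {
        return getParser() + " + " + getKeywordMap() + " + "
                + getPreprocessorExtensionConfiguration().size() + " extra macros";
    }
}

// Stand-in for the UPC language: supplies the three language-specific pieces.
class SketchUPCLanguage extends SketchBaseLanguage {
    @Override
    protected String getParser() {
        return "UPCParser";
    }

    @Override
    protected String getKeywordMap() {
        return "UPCKeywordMap";
    }

    @Override
    protected Map<String, String> getPreprocessorExtensionConfiguration() {
        return Collections.emptyMap(); // UPC adds no macros
    }
}
```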

## Content Assist Support

Content assist support is built into the base C99 parser. The algorithm for producing a completion node is essentially the same as for the DOM parsers, which is described in CDT/designs/Overview of Parsing.

The completion token is actually generated by the preprocessor because it is responsible for computing offsets and therefore it is aware when the cursor offset has been reached. The preprocessor will then begin returning end-of-completion tokens after it returns the completion token.
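
The token sequence seen by the parser can be modelled with a small sketch. This is a deliberately simplified Java model written for this page: it splits on whitespace instead of running a real preprocessor, and it assumes the word under the cursor becomes the prefix of the completion token.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the token stream during content assist; not CDT code.
final class CompletionStreamSketch {
    private final List<String> pending = new ArrayList<>();

    CompletionStreamSketch(String source, int cursorOffset) {
        // Only the text before the cursor is tokenized.
        String visible = source.substring(0, cursorOffset);
        boolean midWord = !visible.isEmpty()
                && !Character.isWhitespace(visible.charAt(visible.length() - 1));
        String trimmed = visible.trim();
        String[] words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");

        // Everything before the word under the cursor is an ordinary token.
        int normalCount = midWord ? words.length - 1 : words.length;
        for (int i = 0; i < normalCount; i++) {
            pending.add(words[i]);
        }
        String prefix = midWord ? words[words.length - 1] : "";
        pending.add("Completion(" + prefix + ")");
    }

    /** Ordinary tokens, then the completion token, then EndOfCompletion forever. */
    String next() {
        return pending.isEmpty() ? "EndOfCompletion" : pending.remove(0);
    }
}
```

For the source text `int x = fo + 1 ;` with the cursor directly after `fo`, the stream yields `int`, `x`, `=`, `Completion(fo)` and then only `EndOfCompletion` tokens.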

There are 5 simple grammar rules in the C99Parser.g grammar file for dealing with the completion and end-of-completion tokens:

ident ::= 'identifier' | 'Completion'

']' ::=? 'RightBracket' | 'EndOfCompletion'
')' ::=? 'RightParen'   | 'EndOfCompletion'
'}' ::=? 'RightBrace'   | 'EndOfCompletion'
';' ::=? 'SemiColon'    | 'EndOfCompletion'


The first rule says that a Completion token can occur anywhere an identifier token can occur. The next 4 rules allow the parse to complete successfully after a Completion token has been encountered.

Of course, these rules lead to a large number of ambiguities reported by LPG. These rules are given precedence over all other rules so that the end-of-completion token will always match.

This solution is completely general and automatically applies to any place in the grammar where the ident rule is used. This means that as long as an extending grammar uses the ident non-terminal in places where identifiers are expected, content assist will automatically work.

The semantic action for consuming an identifier will check if the token is actually a completion token, and if it is, a completion node is generated.
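
A minimal sketch of such an action is given below. The token and action types are stand-ins invented for this page, and the method names are assumptions and not those of the real C99 actions class.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Stand-in for the identifier-consuming semantic action; not CDT code.
final class IdentifierActionSketch {
    enum Kind { IDENTIFIER, COMPLETION }

    static final class Token {
        final Kind kind;
        final String text;

        Token(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    private final Deque<String> astStack = new ArrayDeque<>();
    private String completionPrefix; // stays null until a completion token is seen

    /** Runs whenever the ident rule is reduced. */
    void consumeIdentifier(Token token) {
        // A name node is built in both cases so the AST stays well-formed.
        astStack.push("Name(" + token.text + ")");
        if (token.kind == Kind.COMPLETION) {
            // Stands in for creating the completion node.
            completionPrefix = token.text;
        }
    }

    String getCompletionPrefix() {
        return completionPrefix;
    }

    String peekNode() {
        return astStack.peek();
    }
}
```

Because the check lives in the one action attached to the ident rule, no other action needs to know about completion tokens.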