CDT/designs/Overview of Parsing
CDT provides powerful features that make C/C++ development much easier, such as search, code navigation and content assist. What makes these features work is CDT’s ability to analyze the user’s code, and the first step in analyzing code is to parse it.
CDT contains two parsers, for C and C++, that are known as the DOM (Document Object Model) parsers. The role of the parsers is the following:
- To generate an Abstract Syntax Tree (AST) from source code. The AST is a data structure that serves as CDT’s internal representation of the user’s code. Many of CDT’s features are implemented as algorithms that process the AST.
- To generate “completion nodes”: special objects used by CDT’s content assist engine.
CDT is an IDE, not a compiler, so its parsers have different requirements from the kinds of parsers used as the front end of a compiler. In particular, the main design goal of the DOM parsers is performance.
- The parsers are able to skip header files that are included in a source file. This allows features that work on code in an open editor to be fast. For example, headers are skipped when generating an AST for displaying code structure in the outline view.
- The parsers do not perform any semantic analysis or type-checking during the parse. It is this characteristic that allows headers to be skipped, since information contained in headers, such as declared types, is not needed.
- The parsers are hand-written recursive descent parsers with arbitrary lookahead and backtracking capabilities.
The parsers are not used to build the user’s source code into an executable program. Instead, CDT calls a user-specified system compiler, such as GCC or xlC, to do the job of building. Every compiler has its own oddities and extensions with regard to the syntax it accepts. For CDT the decision was made to concentrate on compatibility with the GCC suite of compilers; CDT’s parsers accept a range of GCC extensions.
The phases of parsing include:
- Scanning and Preprocessing
- Scanning involves converting the input character stream into a stream of tokens. For example, the individual characters ‘i’, ‘n’, ‘t’ are converted into a single token object representing the int keyword.
- Preprocessing involves macro expansion, conditional compilation and inclusion of header files.
- Converting the token stream output by the Preprocessor into an AST.
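The scanning step can be illustrated with a deliberately tiny sketch. Everything here (the class name and the token spellings) is invented for illustration; CDT’s real scanner handles the full C/C++ token set and is far more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner sketch: recognizes only identifiers, integer literals,
// a couple of keywords, and single-character punctuation.
public class MiniScanner {
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            if (Character.isLetter(c) || c == '_') {
                int start = i;
                while (i < input.length()
                        && (Character.isLetterOrDigit(input.charAt(i)) || input.charAt(i) == '_')) i++;
                String word = input.substring(start, i);
                // Keywords become dedicated token kinds; everything else is an identifier.
                tokens.add(word.equals("int") || word.equals("return")
                        ? "KEYWORD_" + word : "IDENT(" + word + ")");
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add("INT_LIT(" + input.substring(start, i) + ")");
            } else {
                tokens.add("PUNCT(" + c + ")");
                i++;
            }
        }
        return tokens;
    }
}
```

For example, the input "int x = 1;" produces the tokens KEYWORD_int, IDENT(x), PUNCT(=), INT_LIT(1), PUNCT(;).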
CDT uses the concept of a Translation Unit which represents a single source code file (for example a .c or .cpp file) as well as all the header files that are included by that file. In this way a translation unit does not map one-to-one with source files located on disk. The root node of the AST is of type IASTTranslationUnit.
Scanning and Preprocessing
The preprocessor is responsible for the following:
- Resolving all preprocessor directives, including keeping track of defined macros and performing macro expansions, resolving #include directives and following the correct branches of conditional compilation directives.
- Maintaining an input character buffer. This buffer is actually implemented as a stack with each #include and macro expansion adding to the buffer by pushing a character stream onto the stack.
- Computing location information for tokens.
- Generating “comment tokens” which are used to populate the tasks view.
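The stacked input buffer described above can be sketched as follows. The class and method names are invented, and real preprocessing involves far more bookkeeping; the point is only the stack discipline.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the stacked input buffer: each #include or macro expansion
// pushes a new character stream; when the top stream runs out, reading
// resumes from the stream below it.
public class InputStack {
    private final Deque<String> streams = new ArrayDeque<>();
    private final Deque<Integer> positions = new ArrayDeque<>();

    public void push(String chars) {
        streams.push(chars);
        positions.push(0);
    }

    // Returns the next character, or -1 when every stream is exhausted.
    public int next() {
        while (!streams.isEmpty()) {
            String top = streams.peek();
            int p = positions.peek();
            if (p < top.length()) {
                positions.pop();
                positions.push(p + 1);
                return top.charAt(p);
            }
            streams.pop();      // top stream exhausted;
            positions.pop();    // fall back to the previous stream
        }
        return -1;
    }

    public String readAll() {
        StringBuilder sb = new StringBuilder();
        for (int c = next(); c != -1; c = next()) sb.append((char) c);
        return sb.toString();
    }
}
```

Pushing "XY" after reading the first character of "ab" simulates a macro expansion spliced into the stream: the reader sees a, X, Y, b.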
Many of CDT’s features require that a node in the AST be mapped back to the original file and location of the piece of code that the node represents. For example, when clicking on a search match the file that contains the match is opened in the editor and the exact location is highlighted.
Mapping AST nodes back to their original locations in the source code is a difficult problem because of the presence of preprocessing directives. The original location of a token in a file’s input buffer cannot be used because includes and macro expansions will alter the buffer as the preprocessor runs. The solution used by CDT is to maintain a “location map” which records all the additions that are made to the input buffer while the preprocessor runs. During preprocessing the location map is used to calculate new “global” offsets for the tokens that are produced. (Global offsets are relative to the translation unit.) These offsets are then used by the parser to set offsets on the AST nodes. After parsing is complete the location map is saved as part of the AST’s root node and is used to map the global offsets back to the original file locations on demand.
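A hypothetical, much-simplified location map might record buffer regions like this. All names are invented and CDT’s actual location map is considerably richer; the sketch only shows the mapping from global offsets back to original file locations.

```java
import java.util.ArrayList;
import java.util.List;

// Each time the preprocessor adds a region of characters to the translation
// unit's buffer, the region's origin is recorded. A global offset can later
// be mapped back to "file:offset".
public class LocationMap {
    static class Region {
        final int globalStart, length, fileOffset;
        final String file;
        Region(int globalStart, int length, String file, int fileOffset) {
            this.globalStart = globalStart;
            this.length = length;
            this.file = file;
            this.fileOffset = fileOffset;
        }
    }

    private final List<Region> regions = new ArrayList<>();
    private int globalLength = 0; // running length of the translation unit buffer

    // Record that the next 'length' characters came from 'file' at 'fileOffset'.
    public void addRegion(int length, String file, int fileOffset) {
        regions.add(new Region(globalLength, length, file, fileOffset));
        globalLength += length;
    }

    // Map a global (translation-unit-relative) offset back to its original location.
    public String resolve(int globalOffset) {
        for (Region r : regions) {
            if (globalOffset >= r.globalStart && globalOffset < r.globalStart + r.length) {
                return r.file + ":" + (r.fileOffset + (globalOffset - r.globalStart));
            }
        }
        return "<unknown>";
    }
}
```

If the first ten characters come from main.c, the next twenty are spliced in by an #include of header.h, and main.c then resumes, global offset 12 maps back to header.h at offset 2.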
Limitations Due To The Preprocessor
“In retrospect, maybe the worst aspect of Cpp is that it has stifled the development of programming environments for C. The anarchic and character-level operation of Cpp makes nontrivial tools for C and C++ larger, slower, less elegant, and less effective than one would have thought possible.”
- Bjarne Stroustrup in The Design and Evolution of C++
The preprocessor makes analyzing C/C++ source code incredibly difficult. This is due to several factors:
- The presence of macros means that the source code that the user sees can be dramatically different from what the parser sees.
- Conditional compilation constructs must be evaluated. This means that code regions that are inactive are often ignored by CDT’s various features. For example, code navigation and search do not work inside inactive code blocks.
- Includes impede performance since, traditionally, a header file must be parsed every time it is included in a translation unit. CDT is capable of skipping the inclusion of header files for performance gains, but this comes at a cost. Information from the header files may not be available during a parse leading to ambiguities and less than 100% parser accuracy.
- Strange interactions between macros and header files can cause inaccuracies during indexing.
Inside the parsers
Parsing C++ is an incredibly difficult problem; in particular, C++ is not LR(N) for any N, which means that bottom-up parsing techniques are usually not attempted for C++. The grammar of C++ is ambiguous, and this problem is made even worse by the fact that the CDT parser does not have access to type information to help disambiguate. For these reasons the CDT parsers are hand-written recursive descent parsers that take advantage of arbitrary lookahead and backtracking techniques. (These features are supported by linking together the tokens that are returned from the scanner.)
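The lookahead-and-backtrack mechanics can be illustrated with a toy parser over linked tokens. Everything here is invented; in particular, the toy pretends the scanner can already distinguish type names from other identifiers, which the real CDT scanner cannot (CDT instead builds ambiguity nodes, described below).

```java
// Toy recursive descent parser: tokens are linked together, so the parser
// can remember its current token and jump back to it when an attempt fails.
public class BacktrackingParser {
    static class Token {
        final String kind;
        Token next;
        Token(String kind) { this.kind = kind; }
    }

    // Build a linked token stream from a list of token kinds.
    static Token link(String... kinds) {
        Token head = null, tail = null;
        for (String k : kinds) {
            Token t = new Token(k);
            if (head == null) head = t; else tail.next = t;
            tail = t;
        }
        return head;
    }

    private Token current;

    BacktrackingParser(Token start) { current = start; }

    Token mark() { return current; }         // remember the current position
    void backtrack(Token t) { current = t; } // restore it after a failed attempt

    boolean accept(String kind) {
        if (current != null && current.kind.equals(kind)) {
            current = current.next;
            return true;
        }
        return false;
    }

    // Try the statement as a declaration first ("type * name ;"); if that
    // fails, backtrack and try it as a multiplication expression.
    public String parseStatement() {
        Token m = mark();
        if (accept("typename") && accept("star") && accept("ident") && accept("semi"))
            return "declaration";
        backtrack(m);
        if (accept("ident") && accept("star") && accept("ident") && accept("semi"))
            return "expression";
        backtrack(m);
        return "problem";
    }
}
```

Because the tokens stay linked together after being consumed, backtracking is just a pointer assignment, which is what makes arbitrary lookahead cheap.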
C and C++ are closely related programming languages, and so the C and C++ parsers both derive from the same base class.
The type hierarchy for the classes and types that make up the AST packages can be complex. The concrete classes for C and C++ AST nodes are actually kept separated, however there is a set of interfaces that is common to both ASTs. For example, there is an IASTBinaryExpression interface which is implemented by the CASTBinaryExpression and CPPASTBinaryExpression classes.
Once an AST is generated it can be analyzed using tree-traversal algorithms. The AST can be traversed in two ways:
- By calling the various get() methods on a node to access its child nodes. This approach is tedious and verbose.
- By using the visitor design pattern; this is the preferred way.
The visitor pattern is the object-oriented version of a functional “tree map”: it takes a visitor object, which encapsulates a set of operations, and applies those operations to the various nodes that make up the tree. Tree traversal with a visitor is depth first and can be done in both top-down and bottom-up manners. A visitor object must be a subtype of ASTVisitor, and can be applied to an IASTTranslationUnit by calling its accept(ASTVisitor) method.
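As a minimal illustration of visitor-based, depth-first traversal, consider the sketch below. The node and visitor types are invented; CDT’s real ASTVisitor offers many more hooks.

```java
import java.util.ArrayList;
import java.util.List;

// Traversal is depth first: visit() fires top-down (parent before children),
// leave() fires bottom-up (parent after children).
public class MiniAst {
    static class Node {
        final String label;
        final List<Node> children = new ArrayList<>();
        Node(String label, Node... kids) {
            this.label = label;
            for (Node k : kids) children.add(k);
        }
        // As in CDT, accept() drives the traversal.
        void accept(Visitor v) {
            v.visit(this);                       // top-down
            for (Node c : children) c.accept(v);
            v.leave(this);                       // bottom-up
        }
    }

    interface Visitor {
        void visit(Node n);
        default void leave(Node n) {}
    }

    // Collect node labels in the order visit() is called.
    public static List<String> preorderLabels(Node root) {
        List<String> out = new ArrayList<>();
        root.accept(n -> out.add(n.label));
        return out;
    }
}
```

Visiting the tree for "x = 1 + 2" yields the labels in preorder: =, x, +, 1, 2.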
A binding is an abstracted language element that encapsulates all the ways a particular identifier is used in a program. For example, if a program has a function called foo() then there will be a single binding object that represents foo(). This binding object stores all the information about how the foo identifier is used in the program, including the location of the declaration, the location of the definition and the locations of all the places where the function is called.
There are many different kinds of bindings representing the various language constructs that are “bound” to identifiers. For example there are class, struct, variable and macro bindings, to name a few. Bindings underpin many of CDT’s features and form the basis of how indexing works.
Binding Resolution is the name of the algorithm used to compute bindings. Binding resolution is performed on the AST after the code has been parsed. IASTName nodes, which represent identifiers in the AST, are the interface for retrieving resolved bindings. The binding resolution algorithm is quite complex, but in essence it searches the AST for uses of an identifier and generates an IBinding object which stores information about those uses.
An example of how bindings are used is the “open declaration” action (triggered by selecting an identifier in the editor and pressing F3). When this action is invoked an AST is generated for the code in the editor. Then offsets are used to determine which name node in the AST corresponds to the identifier selected in the editor. The resolveBinding() method is then called on that name node to retrieve its binding. The binding stores information on where the identifier is declared, this information is used to open the location of the declaration in the editor.
If there are errors in the source code that prevent binding resolution from succeeding, for example an identifier that is used without having been declared, then a “problem binding” object is returned.
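The idea that all uses of an identifier share one binding object, with undeclared identifiers producing problem bindings, can be sketched like this. The names are invented and this is not CDT’s API; CDT’s real resolution also handles scoping, overloading and much more.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// All occurrences of an identifier resolve to one shared binding object that
// counts its uses; an undeclared identifier yields a "problem" binding.
public class Bindings {
    static class Binding {
        final String name;
        final boolean problem; // true when the identifier was never declared
        int uses = 0;
        Binding(String name, boolean problem) {
            this.name = name;
            this.problem = problem;
        }
    }

    private final Map<String, Binding> table = new LinkedHashMap<>();

    public void declare(String name) {
        table.put(name, new Binding(name, false));
    }

    // Every resolution of the same identifier returns the same binding object.
    public Binding resolve(String name) {
        Binding b = table.computeIfAbsent(name, n -> new Binding(n, true));
        b.uses++;
        return b;
    }
}
```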
Parsing and binding resolution are slow processes. This is a problem because the user expects code-editing features such as content assist to be fast. For this reason CDT stores binding information in an on-disk cache called “the index” or “the PDOM” (Persisted Document Object Model), which allows features to respond quickly to user requests.
Building the index involves parsing all the code in a project, resolving all the bindings and writing those bindings to the index. The index is then incrementally updated every time the user edits a file.
Older versions of CDT supported three different indexing modes: fast indexing, full indexing and no indexing. The default setting was the fast indexer, because indexing a large project can be a time-consuming process. The difference between the fast and full indexers is that the fast indexer skips header files that have already been parsed once, while the full indexer re-parses a header file every time it is included. However, it is important to understand that the full indexer, despite its name, is still not fully accurate.
When a header file is included in a source file it is subject to any macros that have been defined at that point. Some library headers use macros in conjunction with preprocessor conditionals (#ifdefs) to partially include a header file. Sometimes such a header file is included more than once in a project; if the macros that the header depends on are different each time the header is included, then different parts of the header may be included in different source files. Neither indexer is accurate in this scenario, because the header is only indexed the first time it is encountered.
The full indexer re-parses headers it has already encountered, but it does not re-index them. Source files that include a header may therefore be parsed more accurately, but the header itself is only indexed once. The full indexer is much slower than the fast indexer because of the extra parsing it does, yet it is only marginally more accurate. For this reason the full indexer is not recommended and has been removed from the current version of CDT.
Each project has a single PDOM associated with it. The PDOM is stored on disk as a flat binary file. The indexer will only index headers that are included by source files, so if there is a .h file in the project that is not being included by any .c or .cpp file, then normally it won’t get indexed. However there is a preference setting for indexing all files in the project.
Dealing with Ambiguities
The CDT parsers do not compute type information, so some language constructs are ambiguous. For example, consider the following statement:
x * y;
This could be an expression of variable x multiplied by variable y, or it could be a declaration of a pointer variable y of type x. The result depends on how x and y have been declared.
Normally type information could be used to disambiguate this case; however, the CDT parsers do not maintain type information. Instead, the parser first parses the statement as an expression, yielding a piece of AST; then the parser backtracks and attempts to parse it as a declaration. If both attempts are successful, an “ambiguity node” is generated that contains both possible subtrees. Later, during binding resolution, the ambiguity can be resolved.
There are three types of ambiguity node.
- Represents ambiguities between expression statements and declaration statements, like the x * y ambiguity mentioned above.
- Represents ambiguous declarations that occur outside of statement context, only encountered in C++.
- Represents ambiguities caused by the ‘sizeof’ operator, which can be used on both variables and types, which the parser cannot tell apart.
(Several new types of ambiguity nodes have been added in later versions of CDT, increasing parser accuracy.)
Ambiguity nodes are automatically resolved when the AST is traversed by a visitor (in the accept() method of the CASTAmbiguity and CPPASTAmbiguity base classes). An ambiguity node resolves itself by attempting to resolve the bindings of all the names in each of its subtrees. The subtree that generates the fewest problem bindings is chosen as the “correct” interpretation and is linked back into the AST in place of the ambiguity node.
In the case of the x * y ambiguity described above, both x and y are resolved. If x is a type, it resolves to a type binding and the subtree that represents the declaration is chosen; on the other hand, if x and y are variables, they resolve to variable bindings and the expression interpretation is chosen as correct.
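A toy version of this disambiguation strategy (count the problem bindings each interpretation would produce and keep the one with fewer) might look like the sketch below. All names are invented, and the declared-types and declared-variables sets stand in for real binding resolution.

```java
import java.util.List;
import java.util.Set;

// Toy ambiguity resolution for "x * y;": score both interpretations by the
// number of problem bindings they would generate, and keep the cheaper one.
public class AmbiguityResolver {
    static int problemCount(List<String> names, Set<String> types, Set<String> vars, boolean asDeclaration) {
        int problems = 0;
        if (asDeclaration) {
            // Read as a declaration, the first name must resolve to a type;
            // the second name is being declared here, so it needs no prior binding.
            if (!types.contains(names.get(0))) problems++;
        } else {
            // Read as an expression, both names must resolve to variables.
            for (String n : names)
                if (!vars.contains(n)) problems++;
        }
        return problems;
    }

    public static String resolve(List<String> names, Set<String> types, Set<String> vars) {
        int asDecl = problemCount(names, types, vars, true);
        int asExpr = problemCount(names, types, vars, false);
        return asDecl < asExpr ? "declaration" : "expression";
    }
}
```

If x names a type, the declaration reading produces zero problem bindings and wins; if x and y are variables, the expression reading wins instead.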
Ambiguity nodes also play an important role in the content assist algorithm. In order to display the correct set of completion options, the context in which content assist was invoked is needed. If the piece of code is ambiguous then there may be more than one context (this is why ASTCompletionNode.getNames() returns an array of names instead of a single name). An example of this is the sizeof operator. Let’s say the user types sizeof(x and then invokes content assist right after the “x”. In this case the parser has no idea what the “x” represents; it could be part of a type name or a variable name. The parser will generate an ambiguity node that contains the AST subtrees for both possibilities. ASTCompletionNode.getNames() will then return two name nodes for “x”, one from each of the possible subtrees. The content assist engine then follows the parent pointers of each name node to get at the context in which “x” is used.
Dealing With Problems
When a syntax error is encountered in the source a “problem node” will be generated in the AST. There are four types of problem node.
The parser is capable of recovering from most syntax errors, generating a problem node and resuming the parse. Problem nodes may also be generated by the preprocessor, for example if a macro is not used properly or if an included file cannot be found.
CDT supports the addition of other parsers, or “languages”, via an extension point. The interface to a parser is represented by the ILanguage interface. It has three key methods:
- IASTCompletionNode getCompletionNode()
- Called when content assist is invoked.
- IASTTranslationUnit getASTTranslationUnit()
- Called to do a regular parse and return an AST.
- IASTName getSelectedNames()
- Used to retrieve name nodes based on the current selection in the editor. Used to implement Open Declaration.
These methods take several parameter objects:
- Encapsulates the character stream that makes up a source code file.
- Contains information on include paths, used by the preprocessor to resolve #include directives.
- Also contains predefined macros given by the toolchain.
- Used by the preprocessor to create CodeReaders for included files.
Content assist is invoked by the CTRL-Space hotkey sequence in the editor. The content assist algorithm is complex and represents a crosscutting concern; many parts of CDT including the editor, the preprocessor, the parser and the UI plugins are involved in making content assist work. The parts of the content assist engine that exist in the CDT core plug-in will be described here.
Content assist starts with the currently opened editor and the cursor location where content assist was invoked. At this point the code near the cursor may actually represent a syntax error because the user may not have completely finished typing a full statement. The parser must recover from the syntax error, generate a correct AST and generate a completion node.
The scanner is given the offset of the cursor so that it knows when that point in the source has been reached. When the scanner reaches the content assist offset it will generate a “completion token” if it can. After the completion token is generated the scanner stops processing the input character stream and starts repeatedly returning a special “end-of-completion token”. The completion token and end-of-completion tokens are handled specially by the parser.
A completion token is similar to an identifier token; the parser will accept a completion token at any point where an identifier token would be legal. When the parser encounters a completion token it treats it as an identifier and generates the AST as it normally would, including generating an IASTName node for the token. However, the parser also creates a completion node object that contains a reference to the name node. (In the case of ambiguities, more than one name may be added to the completion node.) Note that the name nodes generated may represent just a prefix of the identifier that the user wants, or may even be empty.
After returning the completion token the scanner will start returning only end-of-completion tokens. These special tokens allow the parser to complete successfully, even in the case of an incomplete statement in the source. An end-of-completion token will match punctuation that can be used to end statements and close expressions and scopes, including semi-colons, closing parentheses, closing braces and others.
For example if the user typed
int y = sizeof(x
and invoked content assist the parser would be given the following token stream,
int, identifier, assign, sizeof, left-paren, completion, end-of-completion, end-of-completion...
which it will interpret as,
int, identifier, assign, sizeof, left-paren, identifier, right-paren, semi-colon
and parsing will complete successfully.
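The effect of the end-of-completion tokens can be sketched as a small rewrite of the token stream. This is an invented simplification that only tracks parentheses, whereas the real parser accepts an end-of-completion token wherever various closing punctuation would be legal.

```java
import java.util.ArrayList;
import java.util.List;

// Rewrites a truncated token stream the way the parser effectively
// interprets it, supplying the closers that are still owed.
public class CompletionRepair {
    public static List<String> interpret(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int openParens = 0;
        for (String t : tokens) {
            if (t.equals("end-of-completion")) continue; // stands in for closers below
            if (t.equals("left-paren")) openParens++;
            if (t.equals("right-paren")) openParens--;
            // The completion token is accepted wherever an identifier is legal.
            out.add(t.equals("completion") ? "identifier" : t);
        }
        while (openParens-- > 0) out.add("right-paren"); // close what is still open
        out.add("semi-colon");                           // terminate the statement
        return out;
    }
}
```

Feeding in the token stream from the sizeof(x example above yields exactly the interpreted stream shown: the open parenthesis is closed and a semi-colon is supplied.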
Once a completion node is generated it will be processed by the content assist engine that is part of the CDT UI plugin. This engine will compute the actual suggestions that are presented to the user.