CDT/C editor enhancements/Camel-case completion
This is a problem page. Please treat it as a discussion page and feel free to insert your comments anywhere. My open questions are in bold--Tomasz Wesołowski 13:40, 17 May 2010 (UTC)
- 1 Description
- 2 Ranking
- 3 Solution
- 4 Design
- 4.1 Matching Rules
- 4.2 Performance Considerations
- 4.3 Required Core Modifications
- 4.4 Required UI Modifications
- 5 References
When using context assist, many symbols with a similar beginning of name may exist. A solution to select one of them quickly is to search the proposals by their abbreviation as an alternative to searching by beginning of the name. This allows to find the desired proposal more quickly often, especially for long names.
Importance: 3/5 --Tomasz Wesołowski 13:40, 17 May 2010 (UTC)
3/5 --Tobias Hahn 09:44, 18 May 2010 (UTC)
5/5 --Jens Elmenthaler I'm using it very often in Java, I really miss it in C++. It possibly also supports the filtering you discussed on another page.
3/5 --Kirstin Weber 08:49, 10 June 2010 (UTC)
3/5 --Gil Barash 18:15, 25 September 2010 (UTC)
Please append your opinion here.
This feature can be implemented in a way similar to JDT. It must also be taken into account that camelCase is only one convention in C++ and underscore_case is also popular.
An expected behaviour for giving context assist options of a text 'get' would be to give context assist results in a way similar to:
getSomething // match by name start giveExampleText // match by camel case give_example_text // match by underscore case GIVE_EXAMPLE_TEXT // match by capital underscore case
Matches by name start shall have higher priority than matches by abbreviation.
The size of characters shall be ignored while matching (i.e. only be used to find word boundaries to determine the abbreviation, but not for filtering the results.
This approach would be a bit different than JDT, but allows to select some_symbol and someSymbol in an uniform way - do you find this way preferred or do you prefer strict case sensitivity here, like sS for someSymbol and ss (or even s_s) for some_symbol? Please provide use cases.
Jens Elmenthaler For camelCase matching, we should use the same rules as JDT. That means someSymbol would not match ss. The question whether an underscore_case matches a given string or not is not so simple. My proposal would be to treat _ or an uppercase letter in the typed string as the beginning of a new word, and then match the canndidates accordingly. I.e. typing sS would result in someSymbol and some_symbol. s_s would result in some_symbol only.
Gil Barash I agree that a capital letter or an underscore should represent the beginning of a new word. But my examples would be a bit different:
- 'ab' should match only symbols which start with 'ab' (ignoring case).
- 'AB' should match symbols which start with 'ab' (ignoring case), plus 'addBlock', 'AddBlock' and 'add_block'
- 'a_b' or 'A_B' should match symbols which start with 'a_b' (ignoring case), plus 'addBlock', 'AddBlock' and 'add_block'
- 'adBl' or 'AdBl' or 'ad_bl' should also match 'addBlock, 'AddBlock' and 'add_block'
I would also suggest that the matching words won't have to be in the same order (with lower ranking, of course), meaning that 'a_b' would match 'add_block' with high ranking and 'block_add' with lower ranking.
This chapter was added by Jens Elmenthaler
To begin with, the matching rules provided by the JDT shall also be valid for the CDT. However, while the underscore notation is rarely used in Java, it is wide-spread in C/C++. Because of that, the matching rules cannot be copied from the JDT, but must be extended. For instance, in the JDT you cannot type "FB" in order to yield "FOO_BAR". For C/C++ this is a must (think of all the macro business).
The initial SegmentMatcher implemention did not match "FOO_BAR" when typing "FB". Trying to resolve this resulted in the following approach.
The string to be completed will be called the segment pattern. Using BNF, a segment pattern might be described as follows (no whitespace between the tokens allowed):
segmentPattern := EMPTY | segmentPattern segment segment := separatedSegment | camelCaseSegment | numberSegment separatedSegment := separator numberBody | separator textBody separator := NEITHER_DIGIT_NOR_LETTER | separator NEITHER_DIGIT_NOR_LETTER textBody := LETTER | textBody LOWER_CASE_LETTER numberBody := DIGIT | numberBody DIGIT camelCaseSegment := UPPER_CASE_LETTER | camelCaseSegment LOWER_CASE_LETTER numberSegment := DIGIT | numberSegment DIGIT
In a first step, the user provided pattern will be decomposed into pattern segments. The BNF above defines how the pattern can be decomposed into segments starting with one or multiples separators, camel case segments, or number segments.
In the second step, each segment is translated into a regular expression that segments in candidate names are matched against.
separatedSegment (any character that is not a letter or digit is considered separator): _foo -> _f[0o][Oo].* _Foo -> _F[0o][Oo].* _56 -> _56.* ___foo -> ___f[0o][Oo].* camelCaseSegment (if first of pattern): foo -> f[0o][Oo].* Foo -> F[0o][Oo].* camelCaseSegment (any following): Foo -> (_[fF]|F)[0o][Oo].* numberSegment: 42 -> 42.*
In the final step, all the regular expressions from the individual pattern segments are appended one after another. This results in a regular expression that complete names are matched against. (The prefix match is not considered done separately.)
Note: This matching scheme seems reasonably simple addressing all the needs. In addition, it avoids two drawbacks of the JDT camel case completion for free:
- in order to yield "IASTName", you cannot type "IAN", though everybody would aggree that "AST" should be considered a name segment on its own
- you can not leave out name segments. "IName" does not match "IASTName". In the proposed scheme, you can omit name segments (except the first, of course.), which makes you faster. It also tolerates user errors - I frequently forget segments inbetween.
For those who don't think like a computer, here examples to train your neural networks:
"fB" matches "fooBar", "fooXyzBar", "foo_bar", "foo_Bar", "fOO_BAR", "foo56_abc_bar" but not "FooBar", "FOO_BAR", "_fooBar" "FB" matches "FooBar", "FooXyzBar", "Foo_bar", "Foo_Bar", "FOO_BAR", "Foo56_abc_bar" but not "fooBar", "_FooBar" "f_b" matches "foo_bar", "foo_bar", "fOO_bAR", "fooXyz_bar", "foo_bar43" but not "fooBar", "FOO_BAR" "F_b" matches "Foo_bar", "Foo_bar", "FOO_bAR", "FooXyz_bar", "Foo_bar43" but not "FooBar", "FOO_BAR" "IAN" matches "IASTName", "IAstName", "ISomeASTName", "IASTName56", "I_AST_NAME", "I_SOME_AST_NAME" but not "i_ast_name", "iASTName", "iAstName" "pAN" matches "pASTName", "pAstName", "pSomeASTName", "pASTName1", "p_ast_name", "p_some_ast_name" but not "P_AST_NAME" "IAN2" matches "IASTName2", "IAstName234", "IExtendedAstName2" but not "IASTName12", "iAstName2", "IASTName" "_fB" matches "_fooBar", "_foo_bar", "_fOO_BAR", "_foo_xyz_bar" but not "fooBar", "_FOO_BAR" "_FB" matches "_FooBar", "_FOO_BAR", "_Foo_Bar", "_FOO_XYZ_BAR" but not "FooBar", "_foo_bar"
The most sophisticated proposal computation isn't any good, if the UI does not respond quick enough. This section lists the principles to be applied in the implementation:
Binary Search for the Large Number of Names
Searching the index utilizes a btree implementation enabling the ability for fast binary searches given a certain prefix. Using segment patterns, we still can utilize binary searches to narrow down a large number of names to something handy. Instead of using the entire string as prefix, we can only use the first segment.
Single Segment: Segment Matching is Subset of Prefix Matching
As long as there is just one segment, the segment match cannot pass, if the prefix match fails. In this case, there is no overhead at all since there is no difference to the situation before. (Well, it has to implemented that way, of course.)
A Name Shorter than the Segment Pattern Never Matches
Even having multiple segments in the pattern, a name that matches must have at least as many characters as the given segment pattern has.
Precompiled Regular Expressions
Eventully, the segments of the patterns have to be matched against the segments of the names that survived up to here. The implementation will use the regular expression implementation provided by java.util.regex. This implementation should be reasonably optimized. The pattern will be compiled once and re-used to match whole sets of names.
Results from Prototype
Applying all those principles in a prototype did not reveal any response time issues (around 100 small/medium sized projects in a workspace).
It also showed that the computation of the completion node - which is the basis for all parsing based prososal computers - is dominating the computation effort.
Required Core Modifications
Content assist is already rooted in the CDT core: the computation of the completion nodes, the implementers of IScope and similar interfaces, ...
This section describes the modifications that are required.
Add SegmentMatcher and IContentAssistMatcher
Thanks to Tomasz, there already is a SegmentMatcher class. This class matches a name against a segment pattern, and thus provides the foundation for a camel case/underscore completion.
In addition an interface IContentAssistMatcher will be added. This interface provides the facade for the matching algorithms exclusively for content assist-like features, i.e. all code for content assist-like features shall use this interface instead of the SegmentMatcher directly. This approach allows to turn camel case on and off, according to a user preference, without having each client to know about the preference. This would also allow to add further strategies for matching names in the future. A ContentAssistMatcherFactory is introduced to instantiate the correct implementation of IContentAssistMatcher.
SegmentMatcher, IContentAssistMatcher, as well as the ContentAssistMatcherFactory will become part of the API, such that other plugins can re-use them (e.g. other languages than C/C++). They are located in org.eclipse.cdt.core.parser.util.
Q: this location has been chosen because the current way of matching prefixes is located in
CharArrayUtils, which reside in exactly this package. Is there a better place?
The index provides IIndex and IIndexFragment (internal only) as API. Both interfaces already provide the means to find:
- bindings/macros with a given name
- bindings/macros starting with a given prefix
- bindings/macros matching a pattern (those with '*' and '?')
In order to maintain API compatibility, the semantics of these methods will not be changed. Instead, explicit methods for content assist will added:
- IIndexBinding findBindingsForContentAssist(char prefix, boolean fileScopeOnly, IndexFilter filter, IProgressMonitor monitor) throws CoreException
- IIndexFragmentBinding findBindingsForContentAssist(char prefix, boolean filescope, IndexFilter filter, IProgressMonitor monitor) throws CoreException
Note: for symmetry, one would expect a findMacrosForContentAssist to be added in both interfaces. However proposing macro names in DOMCompletionProposalComputer uses IASTTranslationUnit to retrieve applicable macros. This seems natural in a content assist-like application. There is most likely no client for it, so leave it out.
The following interfaces provide methods that support prefix matching:
The sole purpose of those methods is to enable content assist, so their implementations will simply be redirected to use the IContentAssistMatcher instead.
Q: Should the API documentation be updated accordingly?
The package org.eclipse.cdt.internal.core.pdom.dom contains the following classes to search bindings in the index:
These collectors support the following modes to match names:
- match full name, case sensitive or insensitive
- name starts with a given prefix, case sensitive or insensitive
These modes will be kept in order to support all existing applications of them. However, a new mode will be added for content assist. The new mode will be enabled in the constructors by adding a new boolean parameter contentAssistLookup. If contentAssistLookup is true, the paremeters prefixLookup and caseSensitive will be ignored. This forces any non-CDT plugin (if there are any, since the collector classes are not part of API) using the old constructors to explicitly check whether they are a content assist-like application or really want to do a prefix lookup.
Besides, a couple of minor adaptations are reqired:
- to implement the new methods in IIndex and IIndexFragment
- switch from CharArrayUtils to IContentAssistMatcher, in order to reflect the content assist characteristic
Required UI Modifications
Replace Hard-coded Prefix Matching by Content Assist Matching
The following classes will need to use an IContentAssistMatcher instead of hard-coded prefix checks:
Add a preference to turn Camel Case Matching On/Off
The JDT allows to turn off the camel case matching. So the CDT will do the same. The new boolean preferences will be placed on the CodeAssistPreferencePage in the "Sorting and Filtering" group, which is in accordance to the JDT. Special care needs to be taken because CodeAssistPreferencePage by now only handles preferences from the UI plugin. The camel case matching preference - since required by the CDT core - is owned by the CDT core plugin.