CDT/Obsolete/C editor enhancements/Camel-case completion

This is a problem page. Please treat it as a discussion page and feel free to insert your comments anywhere. My open questions are in bold--Tomasz Wesołowski 13:40, 17 May 2010 (UTC)

Description

When using context assist, many symbols with a similar beginning of name may exist. A solution to select one of them quickly is to search the proposals by their abbreviation as an alternative to searching by beginning of the name. This allows to find the desired proposal more quickly often, especially for long names.

Ranking

Category: editing

Importance: 3/5 --Tomasz Wesołowski 13:40, 17 May 2010 (UTC)

3/5 --Tobias Hahn 09:44, 18 May 2010 (UTC)

5/5 --Jens Elmenthaler I'm using it very often in Java, I really miss it in C++. It possibly also supports the filtering you discussed on another page.

3/5 --Kirstin Weber 08:49, 10 June 2010 (UTC)

3/5 --Gil Barash 18:15, 25 September 2010 (UTC)

Please append your opinion here.

Solution

This feature can be implemented in a way similar to JDT. It must also be taken into account that camelCase is only one convention in C++ and underscore_case is also popular.

An expected behaviour for giving context assist options of a text 'get' would be to give context assist results in a way similar to:

getSomething        // match by name start
giveExampleText     // match by camel case
give_example_text   // match by underscore case
GIVE_EXAMPLE_TEXT   // match by capital underscore case

Matches by name start shall have higher priority than matches by abbreviation.

The size of characters shall be ignored while matching (i.e. only be used to find word boundaries to determine the abbreviation, but not for filtering the results.

This approach would be a bit different than JDT, but allows to select some_symbol and someSymbol in an uniform way - do you find this way preferred or do you prefer strict case sensitivity here, like sS for someSymbol and ss (or even s_s) for some_symbol? Please provide use cases.

Jens Elmenthaler For camelCase matching, we should use the same rules as JDT. That means someSymbol would not match ss. The question whether an underscore_case matches a given string or not is not so simple. My proposal would be to treat _ or an uppercase letter in the typed string as the beginning of a new word, and then match the canndidates accordingly. I.e. typing sS would result in someSymbol and some_symbol. s_s would result in some_symbol only.

Gil Barash I agree that a capital letter or an underscore should represent the beginning of a new word. But my examples would be a bit different:

'ab' should match only symbols which start with 'ab' (ignoring case).
'AB' should match symbols which start with 'ab' (ignoring case), plus 'addBlock', 'AddBlock' and 'add_block'
'a_b' or 'A_B' should match symbols which start with 'a_b' (ignoring case), plus 'addBlock', 'AddBlock' and 'add_block'
'adBl' or 'AdBl' or 'ad_bl' should also match 'addBlock, 'AddBlock' and 'add_block'

I would also suggest that the matching words won't have to be in the same order (with lower ranking, of course), meaning that 'a_b' would match 'add_block' with high ranking and 'block_add' with lower ranking.

Design

This chapter was added by Jens Elmenthaler

Matching Rules

To begin with, the matching rules provided by the JDT shall also be valid for the CDT. However, while the underscore notation is rarely used in Java, it is wide-spread in C/C++. Because of that, the matching rules cannot be copied from the JDT, but must be extended. For instance, in the JDT you cannot type "FB" in order to yield "FOO_BAR". For C/C++ this is a must (think of all the macro business).

The initial SegmentMatcher implemention did not match "FOO_BAR" when typing "FB". Trying to resolve this resulted in the following approach.

The string to be completed will be called the segment pattern. Using BNF, a segment pattern might be described as follows (no whitespace between the tokens allowed):

 segmentPattern :=
   EMPTY |
   segmentPattern segment
 
 segment :=
   separatedSegment |
   camelCaseSegment |
   numberSegment
 
 separatedSegment :=
   separator numberBody |
   separator textBody
 
 separator :=
   NEITHER_DIGIT_NOR_LETTER |
   separator NEITHER_DIGIT_NOR_LETTER
 
 textBody :=
   LETTER |
   textBody LOWER_CASE_LETTER
 
 numberBody :=
   DIGIT |
   numberBody DIGIT
 
 camelCaseSegment :=
   UPPER_CASE_LETTER |
   camelCaseSegment LOWER_CASE_LETTER
 
 numberSegment :=
   DIGIT |
   numberSegment DIGIT

In a first step, the user provided pattern will be decomposed into pattern segments. The BNF above defines how the pattern can be decomposed into segments starting with one or multiples separators, camel case segments, or number segments.

In the second step, each segment is translated into a regular expression that segments in candidate names are matched against.

 separatedSegment (any character that is not a letter or digit is considered separator):
   _foo -> _f[0o][Oo].*
   _Foo -> _F[0o][Oo].*
   _56  -> _56.*
   ___foo -> ___f[0o][Oo].*
 
 camelCaseSegment (if first of pattern):
   foo  -> f[0o][Oo].*
   Foo  -> F[0o][Oo].*
 
 camelCaseSegment (any following):
   Foo  -> (_[fF]|F)[0o][Oo].*
 
 numberSegment:
   42   -> 42.*

In the final step, all the regular expressions from the individual pattern segments are appended one after another. This results in a regular expression that complete names are matched against. (The prefix match is not considered done separately.)

Note: This matching scheme seems reasonably simple addressing all the needs. In addition, it avoids two drawbacks of the JDT camel case completion for free:

in order to yield "IASTName", you cannot type "IAN", though everybody would aggree that "AST" should be considered a name segment on its own
you can not leave out name segments. "IName" does not match "IASTName". In the proposed scheme, you can omit name segments (except the first, of course.), which makes you faster. It also tolerates user errors - I frequently forget segments inbetween.

For those who don't think like a computer, here examples to train your neural networks:

 "fB" matches "fooBar", "fooXyzBar", "foo_bar", "foo_Bar", "fOO_BAR", "foo56_abc_bar"
      but not "FooBar", "FOO_BAR", "_fooBar"
 
 "FB" matches "FooBar", "FooXyzBar", "Foo_bar", "Foo_Bar", "FOO_BAR", "Foo56_abc_bar"
      but not "fooBar", "_FooBar"
 
 "f_b" matches "foo_bar", "foo_bar", "fOO_bAR", "fooXyz_bar", "foo_bar43"
       but not "fooBar", "FOO_BAR"
 
 "F_b" matches "Foo_bar", "Foo_bar", "FOO_bAR", "FooXyz_bar", "Foo_bar43"
       but not "FooBar", "FOO_BAR"
 
 "IAN" matches "IASTName", "IAstName", "ISomeASTName", "IASTName56", "I_AST_NAME", "I_SOME_AST_NAME"
       but not "i_ast_name", "iASTName", "iAstName"
 
 "pAN" matches "pASTName", "pAstName", "pSomeASTName", "pASTName1", "p_ast_name", "p_some_ast_name"
       but not "P_AST_NAME"
 
 "IAN2" matches "IASTName2", "IAstName234", "IExtendedAstName2"
        but not "IASTName12", "iAstName2", "IASTName"
 
 "_fB" matches "_fooBar", "_foo_bar", "_fOO_BAR", "_foo_xyz_bar"
       but not "fooBar", "_FOO_BAR"
 
 "_FB" matches "_FooBar", "_FOO_BAR", "_Foo_Bar", "_FOO_XYZ_BAR"
       but not "FooBar", "_foo_bar"

Performance Considerations

The most sophisticated proposal computation isn't any good, if the UI does not respond quick enough. This section lists the principles to be applied in the implementation:

Binary Search for the Large Number of Names

Searching the index utilizes a btree implementation enabling the ability for fast binary searches given a certain prefix. Using segment patterns, we still can utilize binary searches to narrow down a large number of names to something handy. Instead of using the entire string as prefix, we can only use the first segment.

Single Segment: Segment Matching is Subset of Prefix Matching

As long as there is just one segment, the segment match cannot pass, if the prefix match fails. In this case, there is no overhead at all since there is no difference to the situation before. (Well, it has to implemented that way, of course.)

A Name Shorter than the Segment Pattern Never Matches

Even having multiple segments in the pattern, a name that matches must have at least as many characters as the given segment pattern has.

Precompiled Regular Expressions

Eventully, the segments of the patterns have to be matched against the segments of the names that survived up to here. The implementation will use the regular expression implementation provided by java.util.regex. This implementation should be reasonably optimized. The pattern will be compiled once and re-used to match whole sets of names.

Results from Prototype

Applying all those principles in a prototype did not reveal any response time issues (around 100 small/medium sized projects in a workspace).

It also showed that the computation of the completion node - which is the basis for all parsing based prososal computers - is dominating the computation effort.

Required Core Modifications

Content assist is already rooted in the CDT core: the computation of the completion nodes, the implementers of IScope and similar interfaces, ...

This section describes the modifications that are required.

Add SegmentMatcher and ContentAssistMatcher

Thanks to Tomasz, there already is a SegmentMatcher class. This class matches a name against a segment pattern, and thus provides the foundation for a camel case/underscore completion.

In addition a ContentAssistMatcher will be added. This class provides the facade for the matching algorithms exclusively for content assist-like features, i.e. all code for content assist-like features shall use this class instead of the SegmentMatcher directly. This approaches allows to introduce user preferences for further tweaking without the need to touch all clients. This would also allow to add further strategies for matching names in the future.

Both the SegmentMatcher as well as the ContentAssistMatcher will become part of the API, such that other plugins can re-use them (e.g. other languages than C/C++). They are located in org.eclipse.cdt.core.parser.util.

Q: this location has been chosen because the current way of matching prefixes is located in CharArrayUtils, which reside in exactly this package. Is there a better place?

Index

The index provides IIndex and IIndexFragment(internal only) as API. Both interfaces already provide the means to find:

bindings/macros with a given name
bindings/macros starting with a given prefix
bindings/macros matching a pattern (those with '*' and '?')

In order to maintain API compatibility, the semantics of these methods will not be changed. Instead, explicit methods for content assist will added:

org.eclipse.cdt.core.index.IIndex:

IIndexBinding[] findBindingsForContentAssist(char[] prefix, boolean fileScopeOnly, IndexFilter filter, IProgressMonitor monitor) throws CoreException

org.eclipse.cdt.internal.core.index.IIndexFragment:

IIndexFragmentBinding[] findBindingsForContentAssist(char[] prefix, boolean filescope, IndexFilter filter, IProgressMonitor monitor) throws CoreException

Note: for symmetry, one would expect a findMacrosForContentAssist to be added in both interfaces. However proposing macro names in DOMCompletionProposalComputer uses IASTTranslationUnit to retrieve applicatable macros. This seems natural to do in a content assist-like application. There is most likely no client for it, so leave it out.

AST

The following interfaces provide methods that support prefix matching:

org.eclipse.cdt.core.dom.ast.IScope
org.eclipse.cdt.core.dom.ast.ICPPASTCompletionContext
org.eclipse.cdt.core.dom.ast.IASTCompletionContext

org.eclipse.cdt.internal.core.dom.parser.cpp.ICPPASTInternalScope

The sole purpose of those methods is to enable content assist, so their implementations will simply be redirected to use the ContentAssistMatcher instead.

Q: Should the API documentation be updated accordingly?

PDOM

The package org.eclipse.cdt.internal.core.pdom.dom contains the following classes to search bindings in the index:

NamedNodeCollector
BindingCollector
MacroContainerCollector

These collectors support the following modes to match names:

match full name, case sensitive or insensitive
name starts with a given prefix, case sensitive or insensitive

These modes will be kept in order to support all existing applications of them. However, a new mode will be added for content assist. The new mode will be enabled in the constructors by adding a new boolean parameter contentAssistLookup. If contentAssistLookup is true, the paremeters prefixLookup and caseSensitive will be ignored. This forces any non-CDT plugin (if there are any, since the collector classes are not part of API) using the old constructors to explicitly check whether they are a content assist-like application or really want to do a prefix lookup.

Besides, a couple of minor adaptations are reqired:

to implement the new methods in IIndex and IIndexFragment
switch from CharArrayUtils to ContentAssistMatcher, in order to reflect the content assist characteristic

References

Bug 173458 and Bug 223625 request this feature.

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

CDT/Obsolete/C editor enhancements/Camel-case completion

Contents

Description

Ranking

Solution

Design

Matching Rules

Performance Considerations

Binary Search for the Large Number of Names

Single Segment: Segment Matching is Subset of Prefix Matching

A Name Shorter than the Segment Pattern Never Matches

Precompiled Regular Expressions

Results from Prototype

Required Core Modifications

Add SegmentMatcher and ContentAssistMatcher

Index

AST

PDOM

References

Breadcrumbs

Notice: this Wiki will be going read only early in 2024 and edits will no longer be possible. Please see: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/wikis/Wiki-shutdown-plan for the plan.

CDT/Obsolete/C editor enhancements/Camel-case completion

Contents

Description

Ranking

Solution

Design

Matching Rules

Performance Considerations

Binary Search for the Large Number of Names

Single Segment: Segment Matching is Subset of Prefix Matching

A Name Shorter than the Segment Pattern Never Matches

Precompiled Regular Expressions

Results from Prototype

Required Core Modifications

Add SegmentMatcher and ContentAssistMatcher

Index

AST

PDOM

References