Commenting LexCSS.cxx

From Notepad++ Wiki
Dissecting an actual lexer

The CSS language has words and operators the highlighting of which depends on the context, that is, what sort of text is surrounding it. Hence this lexer is going to show some backtracking, which is not so common a technique.

As explained in the Scintilla documentation, there are two basic strategies to write a lexer: state based and character based. The majority of lexers out there adopt a state based strategy for highlighting. Each one has its drawbacks:

  • In state based lexers, special characters are repeatedly looked for if they can appear within different states;
  • In character based lexers, one has to keep track of state to decide what to do with the character.
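
To make the state based strategy concrete, here is a minimal sketch that is not taken from LexCSS.cxx: a toy routine that only tells default text from /* ... */ comments. The names ToyStyle and StyleToy, and the style numbers, are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical style numbers for this toy; they are not Scintilla constants.
enum ToyStyle { TOY_DEFAULT = 0, TOY_COMMENT = 1 };

// State based strategy: the current state decides which characters matter.
std::vector<int> StyleToy(const std::string &text) {
    std::vector<int> styles(text.size(), TOY_DEFAULT);
    int state = TOY_DEFAULT;
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (state == TOY_COMMENT) {
            styles[i] = TOY_COMMENT;
            // Only "*/" is of interest while inside a comment.
            if (text[i] == '*' && i + 1 < text.size() && text[i + 1] == '/') {
                styles[i + 1] = TOY_COMMENT;
                ++i;                 // consume the '/'
                state = TOY_DEFAULT; // the comment run ends here
            }
        } else if (text[i] == '/' && i + 1 < text.size() && text[i + 1] == '*') {
            state = TOY_COMMENT;
            styles[i] = TOY_COMMENT;
            styles[i + 1] = TOY_COMMENT;
            ++i; // consume the '*'
        }
    }
    return styles;
}
```

Note how the test for "/*" only runs in the default state and the test for "*/" only in the comment state; with more states, the same special characters end up being looked for in several branches, which is the drawback mentioned in the first bullet.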

The choice of CSS has been dictated by pedagogical considerations. The lexer shows the basic elements of any internal lexer, with a syntax that has its own quirks, but not distractingly many.

DISCLAIMER: this page exposes large portions of the source code in LexCSS.cxx, part of the Scintilla editing component by Neil Hodgson, with the intent of explaining how to design similar lexers, and facilitating said process. Such an exposure is seen as compatible with the GPL v2 license under which Scintilla is being released.

DISCLAIMER: While the page analyses code, it analyses working code, mostly legacy (pre v2.20 code more precisely), which is not always laid out and written like academic people would like. The code below does work and shows the logic to make a lexer. It does not intend to show how to write "good code" or "maintainable code". Not that there is a consensual definition of those either.

Standard headers

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <stdarg.h>
#include <assert.h>
#include <ctype.h>

Fixtures of many C programs.

#include "ILexer.h"
#include "Scintilla.h"
#include "SciLexer.h"
#include "PropSetSimple.h"
#include "WordList.h"
#include "LexAccessor.h"
#include "Accessor.h"
#include "StyleContext.h"
#include "CharacterSet.h"
#include "LexerModule.h"

While not all mandatory, these header files are very much standard. StyleContext is a helper class that encapsulates part of the Styler inner workings, so its use is recommended. PropSetSimple.h is very useful when using properties, which folding often does. CharacterSet.h makes it possible to define ranges of characters as C++ objects, though it is not much used here. Some other lexers make more extensive use of it.

Please also note that lexers for languages that have existed for so long, like CSS, have been written long before the lexer object interface was implemented in Scintilla v2.20. Predictably, most of the working code has been left untouched, even though it could be rewritten in a more concise, hence more maintainable form. Since this is a short piece of code, such considerations are not overwhelming either.

using namespace Scintilla;

Required for the Scintilla header files to work properly.

Utility functions


In order to highlight keywords or identifiers, it is necessary to know where they start and end. Hence, each language has its own definition of what a word character is. Usually, this encompasses the 52 Latin letters, 10 digits and the underscore, and may contain a few extra characters. Here, '-' is a word character; in many languages it is an operator, but CSS doesn't have algebraic operations.

Handling character codes in the 0x80..0xFF range is tricky at times, because not all of them may be valid. Since the strictly compliant handling of UTF-8 would entail a fair amount of rewrite for a minimal benefit - this is only a highlighter, remember -, it was chosen to leave the issue pending. At least there is an explicit comment about it.

static inline bool IsAWordChar(const unsigned int ch) {
   /* FIXME:
    * The CSS spec allows "ISO 10646 characters U+00A1 and higher" to be treated as word chars.
    * Unfortunately, we are only getting string bytes here, and not full unicode characters. We cannot guarantee
    * that our byte is between U+0080 - U+00A0 (to return false), so we have to allow all characters U+0080 and higher
    */
   return ch >= 0x80 || isalnum(ch) || ch == '-' || ch == '_';
}
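
As an illustration of how such a predicate delimits words, here is a small sketch. IsWordCharSketch restates the test above so that the example is self-contained, and WordEndSketch is a hypothetical helper that does not exist in LexCSS.cxx.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Same test as IsAWordChar() above, restated so the sketch is self-contained.
static bool IsWordCharSketch(unsigned int ch) {
    return ch >= 0x80 || std::isalnum(static_cast<int>(ch)) || ch == '-' || ch == '_';
}

// Returns the position just past the word starting at 'start'
// (returns 'start' itself if no word starts there).
std::size_t WordEndSketch(const std::string &text, std::size_t start) {
    std::size_t i = start;
    // Cast through unsigned char so bytes >= 0x80 do not become negative.
    while (i < text.size() && IsWordCharSketch(static_cast<unsigned char>(text[i])))
        ++i;
    return i;
}
```

With the text "margin-top: 0", the word starting at position 0 ends at position 10: the '-' is part of the word and the ':' is not.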


Operators are one- or two-character strings that are usually highlighted in a specific way. The following function is a straightforward way to tell whether a character is an operator. The C library would allow rewriting this as a single function call using a constant string made of all the operator characters.

inline bool IsCssOperator(const int ch) {
   if (!((ch < 0x80) && isalnum(ch)) &&
       (ch == '{' || ch == '}' || ch == ':' || ch == ',' || ch == ';' ||
        ch == '.' || ch == '#' || ch == '!' || ch == '@' ||
        /* CSS2 */
        ch == '*' || ch == '>' || ch == '+' || ch == '=' || ch == '~' || ch == '|' ||
        ch == '[' || ch == ']' || ch == '(' || ch == ')')) {
       return true;
   }
   return false;
}
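
The single-call rewrite mentioned above could look like the following sketch. IsCssOperatorViaStrchr is a hypothetical name; the operator string simply lists the characters tested in IsCssOperator().

```cpp
#include <cstring>

// Sketch only: one strchr() call over a constant string of operator characters.
inline bool IsCssOperatorViaStrchr(int ch) {
    // strchr() also matches the terminating '\0', so zero is excluded first;
    // values outside the ASCII range are excluded so they never alias a char.
    return ch > 0 && ch < 0x80 &&
           std::strchr("{}:,;.#!@*>+=~|[]()", ch) != nullptr;
}
```

The explicit range check is the price to pay for the shorter form: strchr() converts its second argument to char, so a careless call would accept 0 and could misread large values.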

The highlighter function

The signature

static void ColouriseCssDoc(unsigned int startPos, int length, int initStyle, WordList *keywordlists[], Accessor &styler) {

is what Scintilla expects, as detailed in Plugin Development.

The highlighter starts by unpacking the various word lists, the name of which should be rather self-explanatory:

   WordList &css1Props = *keywordlists[0];
   WordList &pseudoClasses = *keywordlists[1];
   WordList &css2Props = *keywordlists[2];
   WordList &css3Props = *keywordlists[3];
   WordList &pseudoElements = *keywordlists[4];
   WordList &exProps = *keywordlists[5];
   WordList &exPseudoClasses = *keywordlists[6];
   WordList &exPseudoElements = *keywordlists[7];

Typically, one defines a StyleContext to facilitate basic tasks,

   StyleContext sc(startPos, length, initStyle, styler);

and some control variables to reflect current state or similar notions. What they are and how they are used in the highlighter is specific to each language. Commenting them is good when clear names are not enough and the comment spells out the semantics of the variable. Not exactly so here.

   int lastState = -1; // before operator
   int lastStateC = -1; // before comment
   int lastStateS = -1; // before single-quoted/double-quoted string
   int op = ' '; // last operator
   int opPrev = ' '; // last operator

Now on to the main loop that processes characters in turn. Note that you can feed StyleContext::Forward() a nonnegative integer, which defaults to 1. Backtracking attempts are met with a Scintilla assertion failure.

   for (; sc.More(); sc.Forward()) {

In layman's terms: as long as there is an available character, process it, then advance the styler position by 1.

The lexer states

Since this lexer is state based, one has to know which states are supported:

  • SCE_CSS_IDENTIFIER=6 (CSS properties)
  • SCE_CSS_ID=10
  • SCE_CSS_IDENTIFIER2=15 (CSS2 properties)
  • SCE_CSS_IDENTIFIER3=17 (CSS3 properties)

The exact meaning of these depends on the knowledge of the CSS3 language, which it is not the aim of this document to describe.

Special states

The list below is pretty much common: comments, delimited strings and often operators. Note that CSS only has /* ... */ stream comments, but separate single quote- and double quote-delimited strings.

Handling comments
       if (sc.state == SCE_CSS_COMMENT && sc.Match('*', '/')) {

If arriving at the end of a comment block, there is a need to restore the previous state. That is the purpose of the lastStateC variable. But what happens if it is not defined, which happens if the first line being styled started with a comment? Some backtracking is required.

Scintilla guarantees that starting positions always start a line, and the whole document is styled when loaded - at least some sizable initial portion of it covering a couple display pages. This makes the backtracking, or state stitching, simpler.

           if (lastStateC == -1) {
               // backtrack to get last state:
               // comments are like whitespace, so we must return to the previous state
               unsigned int i = startPos;
               for (; i > 0; i--) {
                   if ((lastStateC = styler.StyleAt(i-1)) != SCE_CSS_COMMENT) {

So we backtracked from the previous position and found something that is not a comment. If that something is not an operator, then we know what the previous state is and can break out. If it is an operator, we need to backtrack further so as to find what state to restore.

                       if (lastStateC == SCE_CSS_OPERATOR) {

The op and opPrev variables store what the last operator found was, so we update them here as well

                           op = styler.SafeGetCharAt(i-1);
                           opPrev = styler.SafeGetCharAt(i-2);

and backtrack beyond the operator

                           while (--i) {
                               lastState = styler.StyleAt(i-1);
                               if (lastState != SCE_CSS_OPERATOR && lastState != SCE_CSS_COMMENT)
                                   break;
                           }

We can be here for two reasons:

  • because a valid value was found for lastState, and we can stop here
  • or because backtracking has hit the start of the document. The "--i" guard ensures that positions below zero won't be polled, which would result in an assertion failure.

In the second case, the default state is the only sensible value for lastState. And now that all relevant state variables have been set, we can proceed.

                           if (i == 0)
                               lastState = SCE_CSS_DEFAULT;
                       }
                       break;
                   }
               }

This safety net catches all corner cases where start of document was hit.

               if (i == 0)
                   lastStateC = SCE_CSS_DEFAULT;
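
The backtracking above can be tried outside Scintilla with a plain vector standing in for styler.StyleAt(). This is a sketch under assumptions: the SK_* numbers are toy values, and Recovered and RecoverStates are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Toy style numbers for this sketch; the real ones live in SciLexer.h.
enum { SK_DEFAULT = 0, SK_TAG = 1, SK_OPERATOR = 5, SK_COMMENT = 9 };

struct Recovered {
    int lastStateC; // state to restore once the comment ends
    int lastState;  // state before the operator, or -1 if no operator was met
};

// 'styles' stands in for styler.StyleAt(); 'startPos' is where styling resumes.
Recovered RecoverStates(const std::vector<int> &styles, std::size_t startPos) {
    Recovered r = { SK_DEFAULT, -1 };
    std::size_t i = startPos;
    for (; i > 0; i--) {
        r.lastStateC = styles[i - 1];
        if (r.lastStateC != SK_COMMENT) {
            if (r.lastStateC == SK_OPERATOR) {
                // Skip back over operators and comments to what preceded them.
                while (--i) {
                    r.lastState = styles[i - 1];
                    if (r.lastState != SK_OPERATOR && r.lastState != SK_COMMENT)
                        break;
                }
                if (i == 0)
                    r.lastState = SK_DEFAULT;
            }
            break;
        }
    }
    if (i == 0)
        r.lastStateC = SK_DEFAULT; // start of document was hit
    return r;
}
```

For the styles { tag, operator, comment, comment } and a start position of 4, the sketch reports the operator as the state to restore after the comment, and the tag as the state before that operator.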

So now lastStateC is known and we can restore the styler state to it. The styler is pointing at the '*' in "*/", so the lastStateC state must be enforced after the next character, and the next character is part of the comment:

           sc.Forward();
           sc.ForwardSetState(lastStateC);
       }

Just a reminder: there are 4 state control features in the StyleContext class:

  • SetState(newState): colour up to the current character excluded using current state, and set state to newState from the current character on.
  • ForwardSetState(newState): colour up to the current character included using current state, and set state to newState from the next character on.
  • ChangeState(newState): plainly changes the state that will be used to colour the current run when it stops.
  • state: not a method, but an attribute. The state for the current run. The initial state is set when creating the StyleContext object - this is the point of the initStyle parameter in the ColouriseCssDoc() signature.
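
The difference between these features is easier to see on a toy model. MiniContext below is a hypothetical stand-in written for this page, not the Scintilla class; its members are named after the features they imitate, and -1 marks characters that have not been coloured yet.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-in for the styling context; not the Scintilla class.
struct MiniContext {
    std::vector<int> styles;   // one style per character, filled as runs are closed
    std::size_t runStart = 0;  // first character of the current run
    std::size_t pos = 0;       // current character
    int state;                 // style of the current run (an attribute)

    MiniContext(std::size_t length, int initStyle)
        : styles(length, -1), state(initStyle) {}

    // Colour [runStart, pos) with the current state, then start a new run.
    void SetState(int newState) {
        for (std::size_t i = runStart; i < pos && i < styles.size(); ++i)
            styles[i] = state;
        runStart = pos;
        state = newState;
    }
    // Step past the current character first, so it belongs to the old run.
    void ForwardSetState(int newState) {
        ++pos;
        SetState(newState);
    }
    // Relabel the current run without closing it.
    void ChangeState(int newState) { state = newState; }
};
```

With the current position at 2, SetState() colours characters 0 and 1 only, whereas ForwardSetState() also colours character 2 and leaves the position at 3. This is why the end of a comment or string, whose closing delimiter belongs to the run, is handled with ForwardSetState().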

At this point, the state variables, except lastStateS, are properly initialised.

If we are in comment mode and do not see an end of comment, just go ahead.

       if (sc.state == SCE_CSS_COMMENT)
           continue;

Handling strings

There is no difference in handling single and double quoted strings, apart from the delimiter.

       if (sc.state == SCE_CSS_DOUBLESTRING || sc.state == SCE_CSS_SINGLESTRING) {

Handling stops only at the appropriate quote.

           if (sc.ch != (sc.state == SCE_CSS_DOUBLESTRING ? '\"' : '\''))
               continue;

Now how about this quote? It may be escaped. An escaped quote doesn't count as a quote, and is preceded by a single backslash. But that backslash may itself be preceded by escaped backslashes, each coded "\\". So we need some backtracking again, to see how many backslashes there were:

  • If an odd number, we have an escaped quote preceded by 0 or more backslashes, and handling doesn't stop there
  • Otherwise, we have 0 or more backslashes followed by a quote that closes the string. The styler points at the quote, which must be included in string styling.

           unsigned int i = sc.currentPos;
           while (i && styler[i-1] == '\\')
               i--;
           if ((sc.currentPos - i) % 2 == 1)
               continue;
           sc.ForwardSetState(lastStateS);
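
The parity test can be isolated in a small function working on a plain string. This is only a sketch; IsEscapedAt is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// True if the character at 'pos' is preceded by an odd number of backslashes,
// i.e. if it is escaped rather than a real delimiter.
bool IsEscapedAt(const std::string &text, std::size_t pos) {
    std::size_t i = pos;
    while (i > 0 && text[i - 1] == '\\')
        --i; // walk back over the run of backslashes
    return (pos - i) % 2 == 1;
}
```

For the three characters a, backslash, quote, the quote at position 2 is escaped. For a, backslash, backslash, quote, the quote at position 3 is not: the two backslashes form one escaped backslash, and the quote closes the string.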

Wait a minute. Can't lastStateS be undefined, i.e. at -1? It is possible that initStyle was a string state after all, so the situation seems very much the same as what we have just seen for comments, right?

Wrong. Strings normally do not span across multiple lines in CSS, contrary to comments. So, a string state could not be initiated by initStyle, but by a previous quote. It is the responsibility of the corresponding code to set lastStateS, so we rely on it.

But the '\' continuation character right at the end of the line is allowed, so there can be a bug if the styling starts in the middle of a multiline string - we don't know in what state the text was before it started, beyond the styling start position backwards.

Handling operators

We need to know what the state before the operator was. If not known, backtrack so as to find it and set lastState accordingly. This is no different from the comment case, when the comment was preceded by an operator.

       if (sc.state == SCE_CSS_OPERATOR) {
           if (op == ' ') {
               unsigned int i = startPos;
               op = styler.SafeGetCharAt(i-1);
               opPrev = styler.SafeGetCharAt(i-2);
               while (--i) {
                   lastState = styler.StyleAt(i-1);
                   if (lastState != SCE_CSS_OPERATOR && lastState != SCE_CSS_COMMENT)
                       break;
               }
           }

Now that the previous state is known, we process the operator, which will often initiate a change of state. Only some state transitions are valid, hence the if statements that often protect the state change statement. These statements predictably involve what the state was before the operator was found:

           switch (op) {
           case '@':
               if (lastState == SCE_CSS_DEFAULT)
                   sc.SetState(SCE_CSS_DIRECTIVE);
               break;
           case '>':
           case '+':
               if (lastState == SCE_CSS_TAG || lastState == SCE_CSS_CLASS || lastState == SCE_CSS_ID ||
                   lastState == SCE_CSS_PSEUDOCLASS || lastState == SCE_CSS_EXTENDED_PSEUDOCLASS || lastState == SCE_CSS_UNKNOWN_PSEUDOCLASS)
                   sc.SetState(SCE_CSS_DEFAULT);
               break;
           case '[':
               if (lastState == SCE_CSS_TAG || lastState == SCE_CSS_DEFAULT || lastState == SCE_CSS_CLASS || lastState == SCE_CSS_ID ||
                   lastState == SCE_CSS_PSEUDOCLASS || lastState == SCE_CSS_EXTENDED_PSEUDOCLASS || lastState == SCE_CSS_UNKNOWN_PSEUDOCLASS)
                   sc.SetState(SCE_CSS_ATTRIBUTE);
               break;

... and so on to line 209, skipping similar code.


Words are contiguous runs of word characters. A word character is one for which IsAWordChar() returns true. Among words, tags have a special status: they may start with a '*', and their end is detected by the presence of specific operators ('>', '+' and ']').

       if (IsAWordChar(sc.ch)) {
           if (sc.state == SCE_CSS_DEFAULT)
               sc.SetState(SCE_CSS_TAG);
           continue;
       }
       if (sc.ch == '*' && sc.state == SCE_CSS_DEFAULT) {
           sc.SetState(SCE_CSS_TAG);
           continue;
       }

So the word will be recognised at its right boundary, that is, when the previous character was a word character and the current one is not. The state was set by a preceding operator. Since the exact style depends on which word list the word belongs to, its text has to be retrieved and looked up.

       if (IsAWordChar(sc.chPrev) && (
           sc.state == SCE_CSS_IDENTIFIER || sc.state == SCE_CSS_IDENTIFIER2 ||
           sc.state == SCE_CSS_IDENTIFIER3 || sc.state == SCE_CSS_EXTENDED_IDENTIFIER ||
           sc.state == SCE_CSS_UNKNOWN_IDENTIFIER ||
           sc.state == SCE_CSS_PSEUDOCLASS || sc.state == SCE_CSS_PSEUDOELEMENT ||
           sc.state == SCE_CSS_UNKNOWN_PSEUDOCLASS ||
           sc.state == SCE_CSS_IMPORTANT ||
           sc.state == SCE_CSS_DIRECTIVE
       )) {

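We copy the text of the current run, lowercased, into a buffer. This only reads what has already been scanned and does not advance the styler, as there is no way to set it back later. Then we skip any leading non-word characters to locate the start of the word: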

           char s[100];
           sc.GetCurrentLowered(s, sizeof(s));
           char *s2 = s;
           while (*s2 && !IsAWordChar(*s2))
               s2++;

Now, depending on what the state was and which word list the word belongs to, update the state:

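           switch (sc.state) {
           case SCE_CSS_IDENTIFIER:
           case SCE_CSS_IDENTIFIER2:
           case SCE_CSS_IDENTIFIER3:
               if (css1Props.InList(s2))
                   sc.ChangeState(SCE_CSS_IDENTIFIER);
               else if (css2Props.InList(s2))
                   sc.ChangeState(SCE_CSS_IDENTIFIER2);
               else if (css3Props.InList(s2))
                   sc.ChangeState(SCE_CSS_IDENTIFIER3);
               else if (exProps.InList(s2))
                   sc.ChangeState(SCE_CSS_EXTENDED_IDENTIFIER);
               else
                   sc.ChangeState(SCE_CSS_UNKNOWN_IDENTIFIER);
               break;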
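
The lookup order matters: the first list that contains the word wins. The sketch below mimics that chain with standard containers instead of WordList; PropertyKind, ClassifyPropertySketch and the sample list contents used to exercise it are all hypothetical.

```cpp
#include <set>
#include <string>

// Hypothetical result codes; the lexer itself switches between SCE_CSS_* styles.
enum PropertyKind { KIND_CSS1, KIND_CSS2, KIND_UNKNOWN };

// First matching list wins, mirroring the if / else if chain above.
PropertyKind ClassifyPropertySketch(const std::string &word,
                                    const std::set<std::string> &css1,
                                    const std::set<std::string> &css2) {
    if (css1.count(word))
        return KIND_CSS1;
    if (css2.count(word))
        return KIND_CSS2;
    return KIND_UNKNOWN; // no list knows the word
}
```

A word present in both lists is therefore reported with the kind of the earlier list, and a word absent from every list falls into the "unknown" kind, which lets the editor highlight probable typos.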

Same here, with grammatical restrictions on the operator introducing the pseudo item:

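           case SCE_CSS_PSEUDOCLASS:
           case SCE_CSS_PSEUDOELEMENT:
               if (op == ':' && opPrev != ':' && pseudoClasses.InList(s2))
                   sc.ChangeState(SCE_CSS_PSEUDOCLASS);
               else if (opPrev == ':' && pseudoElements.InList(s2))
                   sc.ChangeState(SCE_CSS_PSEUDOELEMENT);
               else if ((op == ':' || (op == '(' && lastState == SCE_CSS_EXTENDED_PSEUDOCLASS)) && opPrev != ':' && exPseudoClasses.InList(s2))
                   sc.ChangeState(SCE_CSS_EXTENDED_PSEUDOCLASS);
               else if (opPrev == ':' && exPseudoElements.InList(s2))
                   sc.ChangeState(SCE_CSS_EXTENDED_PSEUDOELEMENT);
               else
                   sc.ChangeState(SCE_CSS_UNKNOWN_PSEUDOCLASS);
               break;
           case SCE_CSS_IMPORTANT:
               if (strcmp(s2, "important") != 0)
                   sc.ChangeState(SCE_CSS_VALUE);
               break;
           case SCE_CSS_DIRECTIVE:
               if (op == '@' && strcmp(s2, "media") == 0)
                   sc.ChangeState(SCE_CSS_MEDIA);
               break;
           }
       }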

So we are at the end of some id / class name; a tag may start here.

       if (sc.ch != '.' && sc.ch != ':' && sc.ch != '#' && (
           sc.state == SCE_CSS_CLASS || sc.state == SCE_CSS_ID ||
           (sc.ch != '(' && sc.ch != ')' && ( /* This line of the condition makes it possible to extend pseudo-classes with parentheses */
               sc.state == SCE_CSS_PSEUDOCLASS || sc.state == SCE_CSS_PSEUDOELEMENT ||
               sc.state == SCE_CSS_EXTENDED_PSEUDOCLASS || sc.state == SCE_CSS_EXTENDED_PSEUDOELEMENT ||
               sc.state == SCE_CSS_UNKNOWN_PSEUDOCLASS
           ))
       ))
           sc.SetState(SCE_CSS_TAG);

As emphasised earlier, the point of this document is not to review the exact syntax of CSS. Hopefully, how the code above relates to this syntax should be reasonably straightforward. The point to drive home is that a lexer is a finite automaton transitioning from state to state, with the encounters with certain characters triggering the transitions. Once you think of the language syntax this way - using a BNF grammar obviously helps -, writing the code is nearly automatic and only requires care when chaining conditional statements, as they may pile up and tangle.

Miscellaneous state transitions:

Comment start:

       if (sc.Match('/', '*')) {
           lastStateC = sc.state;
           sc.SetState(SCE_CSS_COMMENT);
           sc.Forward();

String start:

       } else if ((sc.state == SCE_CSS_VALUE || sc.state == SCE_CSS_ATTRIBUTE)
           && (sc.ch == '\"' || sc.ch == '\'')) {
           lastStateS = sc.state;
           sc.SetState((sc.ch == '\"' ? SCE_CSS_DOUBLESTRING : SCE_CSS_SINGLESTRING));

Operator detection:

This sets lastState, which is needed to handle the SCE_CSS_OPERATOR state.

       } else if (IsCssOperator(sc.ch)
           && (sc.state != SCE_CSS_ATTRIBUTE || sc.ch == ']')
           && (sc.state != SCE_CSS_VALUE || sc.ch == ';' || sc.ch == '}' || sc.ch == '!')
           && ((sc.state != SCE_CSS_DIRECTIVE && sc.state != SCE_CSS_MEDIA) || sc.ch == ';' || sc.ch == '{')
       ) {
           if (sc.state != SCE_CSS_OPERATOR)
               lastState = sc.state;
           sc.SetState(SCE_CSS_OPERATOR);
           op = sc.ch;
           opPrev = sc.chPrev;
       }

This code appears to assume that the state on seeing an operator cannot be SCE_CSS_OPERATOR. Actually, in that case, lastState has been set on the previous character, so keeping it as-is is the right thing.

In any other case, mostly whitespace, just keep going. Line ends do not matter, except for strings, though we have seen that this part of the specification is not covered.


This housekeeping statement should not be left out.

   sc.Complete();
}




The folding routine has the same signature, except that the initStyle parameter is left unnamed, as it is irrelevant for folding. Folding is done based on the current styling. Scintilla guarantees that folding is performed on each line after it is styled, and that the starting position is the start of a line.

static void FoldCSSDoc(unsigned int startPos, int length, int, WordList *[], Accessor &styler) {

Retrieve fold properties that were set by Notepad++ for the current Scintilla:

   bool foldComment = styler.GetPropertyInt("fold.comment") != 0;
   bool foldCompact = styler.GetPropertyInt("fold.compact", 1) != 0;

Some helper variables which are seldom useless:

   unsigned int endPos = startPos + length;
   int visibleChars = 0;
   int lineCurrent = styler.GetLine(startPos);

Track folding level:

   int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
   int levelCurrent = levelPrev;

Miscellaneous. Using a lookahead character when parsing anything is unavoidable. And often it is a whole buffer.

   char chNext = styler[startPos];
   bool inComment = (styler.StyleAt(startPos-1) == SCE_CSS_COMMENT);

We now traverse the folded text without the need for a StyleContext. That was the whole point of endPos.

   for (unsigned int i = startPos; i < endPos; i++) {

Set the decision variables. They are quite basic and will be there more often than not.

       char ch = chNext;
       chNext = styler.SafeGetCharAt(i + 1);
       int style = styler.StyleAt(i);
       bool atEOL = (ch == '\r' && chNext != '\n') || (ch == '\n');
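
The end-of-line test can be checked on its own. IsAtEOL below is a hypothetical wrapper around the very same expression, added here only so that it can be tried in isolation.

```cpp
// Same expression as above, wrapped in a function so it can be tried alone.
bool IsAtEOL(char ch, char chNext) {
    return (ch == '\r' && chNext != '\n') || (ch == '\n');
}
```

In a "\r\n" pair, the '\r' does not count and the '\n' does, so the line is closed exactly once. A '\r' counts only when the next character is not '\n'.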

Note that in a "\r\n" pair only the '\n' counts as the end of line, while a lone '\r' (the old Mac format) counts on its own.

What to fold on?


Comments

This is optional, set by the editor:

       if (foldComment) {
           if (!inComment && (style == SCE_CSS_COMMENT))
               levelCurrent++;
           else if (inComment && (style != SCE_CSS_COMMENT))
               levelCurrent--;
           inComment = (style == SCE_CSS_COMMENT);
       }

Curly braces

This is not optional.

       if (style == SCE_CSS_OPERATOR) {
           if (ch == '{') {
               levelCurrent++;
           } else if (ch == '}') {
               levelCurrent--;
           }
       }
Line change

When at end of line, ship the folding level for this line out. The folding level as Scintilla understands it is made of a numeric fold level, the part selected by SC_FOLDLEVELNUMBERMASK, and two more flags:

  • The SC_FOLDLEVELWHITEFLAG tells Scintilla that the line has nothing to show. You may want such lines hidden as well. This is often referred to as compact folding.
  • The SC_FOLDLEVELHEADERFLAG tells Scintilla that this line has visible characters and starts a deeper nested block.

       if (atEOL) {
           int lev = levelPrev;
           if (visibleChars == 0 && foldCompact)
               lev |= SC_FOLDLEVELWHITEFLAG;
           if ((levelCurrent > levelPrev) && (visibleChars > 0))
               lev |= SC_FOLDLEVELHEADERFLAG;
           if (lev != styler.LevelAt(lineCurrent)) {
               styler.SetLevel(lineCurrent, lev);
           }

Update situation variables:

           lineCurrent++;
           levelPrev = levelCurrent;
           visibleChars = 0;
       }

General processing

It consists only in increasing the count of visible characters, should there be any.

       if (!isspacechar(ch))
           visibleChars++;
   }


Perhaps there was no end of line, so we need to tell Scintilla what the folding is for this last line. This is done by simply setting the numerical nesting level, not touching the other flags. So we extract the flags and OR them into the numerical level.

   // Fill in the real level of the next line, keeping the current flags as they will be filled in later
   int flagsNext = styler.LevelAt(lineCurrent) & ~SC_FOLDLEVELNUMBERMASK;
   styler.SetLevel(lineCurrent, levelPrev | flagsNext);
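
Putting the folding steps together, the sketch below computes fold levels for a plain string. It rests on several assumptions: the three constants are believed to mirror SC_FOLDLEVELBASE, SC_FOLDLEVELWHITEFLAG and SC_FOLDLEVELHEADERFLAG in Scintilla.h and should be checked against your headers; every brace is counted, whereas FoldCSSDoc() only counts braces styled as operators; comment folding and the final unterminated line are left out. FoldLevelsSketch is a hypothetical name.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Values assumed to mirror Scintilla.h; check them against your headers.
const int FOLD_BASE = 0x400;        // SC_FOLDLEVELBASE
const int FOLD_WHITEFLAG = 0x1000;  // SC_FOLDLEVELWHITEFLAG
const int FOLD_HEADERFLAG = 0x2000; // SC_FOLDLEVELHEADERFLAG

// Returns one fold level per '\n'-terminated line of 'text'.
std::vector<int> FoldLevelsSketch(const std::string &text, bool foldCompact) {
    std::vector<int> levels;
    int levelPrev = FOLD_BASE;
    int levelCurrent = levelPrev;
    int visibleChars = 0;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char ch = text[i];
        if (ch == '{')
            levelCurrent++;
        else if (ch == '}')
            levelCurrent--;
        if (ch == '\n') {
            int lev = levelPrev; // a line is reported at the level it started with
            if (visibleChars == 0 && foldCompact)
                lev |= FOLD_WHITEFLAG;
            if (levelCurrent > levelPrev && visibleChars > 0)
                lev |= FOLD_HEADERFLAG;
            levels.push_back(lev);
            levelPrev = levelCurrent;
            visibleChars = 0;
        }
        if (ch != ' ' && ch != '\t' && ch != '\r' && ch != '\n')
            visibleChars++;
    }
    return levels;
}
```

For the four lines "a {", "  b: c;", an empty line and "}", with compact folding on, the first line gets the base level plus the header flag, the second the base level plus one, the empty line the same level plus the white flag, and the closing brace line the base level plus one again, since a line is reported at the level it started with.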


Other items

Word list descriptions

Any compliant lexer should provide such descriptions. A null pointer marks the end of the list of lists.

static const char * const cssWordListDesc[] = {
   "CSS1 Properties",
   "Pseudo-classes",
   "CSS2 Properties",
   "CSS3 Properties",
   "Pseudo-elements",
   "Browser-Specific CSS Properties",
   "Browser-Specific Pseudo-classes",
   "Browser-Specific Pseudo-elements",
   0
};

Register the lexer

LexerModule lmCss(SCLEX_CSS, ColouriseCssDoc, "css", FoldCSSDoc, cssWordListDesc);

Constructing the lexer, giving it an internal ID, the highlighter routine address, a short name, the folder routine address and the list of wordlist descriptions, registers it in the list of lexers Scintilla maintains.

Final note

Please note that this is an internal lexer. Versions of Scintilla from 2.20 onwards will not accept registering the lexer the way it is if it is not included in SciLexer.dll. Instead, you need to pass the address of a function that returns an ILexer, on which the styling and folding methods will be called. This will move the code above into a class, and requires implementing a couple more inherited methods. The specifics are covered in the sections of Plugin Development devoted to lexers.

A future page may show how to rewrite this as an external lexer, taking full advantage of all the helper methods Scintilla provides.