Comments, suggestions, contributions and error reports are welcome (VBDis@aol.com).
The implementation is split into several units, which can be modified to match the needs of different applications. You may notice my very personal naming and coding conventions in many places <g>.
The core implementation consists of:
A C preprocessor scanner recognizes comments and line ends, header names, number, string and character literals, identifiers, operators and punctuators, as well as characters that don't fit into any of these categories. The basic scanner, for use by the preprocessor, performs no keyword lookup and does not interpret preprocessor number literals.
A homebrew extension: the scanner can deliver tokens for comments and (continued) lines, even though the (newer) standards require that comments map into whitespace.
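For illustration, here is how the whitespace rule behaves in standard C; STR and XSTR are only the usual two-level stringize idiom, not identifiers from this project:

  /* A comment maps into a single space, so the replacement list of AB is the
   * two tokens 1 and 2, not the number 12; stringizing makes that visible. */
  #define STR(x) #x
  #define XSTR(x) STR(x)
  #define AB 1/* comment */2

  const char *s = XSTR(AB);   /* s points to "1 2", not "12" */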
The most important optimization topic is memory management, which shows up both in memory consumption and in the time spent on memory management operations. The scanner uses the following optimization techniques:
The overall number of strings is minimized. Predefined strings are reused wherever possible. Files are read line by line into TStringLists, and for all non-predefined strings the tokens hold references into these lines. Exceptions to this rule are:
The scanner uses pointers into the source lines during scanning. This limits the lookahead across continuation lines and the handling of trigraphs.
Comments are retained as references into the source file lines, and are possibly broken into multiple tokens for multi-line comments. The application filter can discard these tokens.
Numbers are immediately scanned as integer or floating point numbers. This option may produce wrong results in very special situations (token pasting).
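The token pasting problem can be illustrated with standard C; PASTE is an invented name:

  #include <stdio.h>

  /* Token pasting can build one numeric literal out of pieces that were
   * scanned as separate numbers; a scanner that has already converted "12"
   * and ".5e3" into binary values cannot re-scan the pasted spelling. */
  #define PASTE(a, b) a##b

  int main(void)
  {
      int    x = PASTE(12, 34);    /* expands to the single literal 1234 */
      double y = PASTE(1, .5e3);   /* expands to the single literal 1.5e3 */
      printf("%d %g\n", x, y);     /* prints: 1234 1500 */
      return 0;
  }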
A full-blown scanner would require many more strings:
The source file lines are not normally reusable, due to character set mapping (trigraphs...), comment substitution, and line splicing. Possibly the source files should be read by characters, not by lines, so that no source line strings need to be stored. Instead, every token would contain a direct string reference to the corresponding textual representation of the token. A string table is only required for wide string literals, since WideString references are incompatible with AnsiString references, and both kinds of references are illegal in variant (token) records.
The handling of continuation lines (line splicing) is incompatible with retaining comments. When comments are requested in the full-blown scanner, they should be attached to the immediately preceding token, which increases the overall token record size. At the same time the token following a comment must be flagged as being preceded by whitespace, or the fWhiteBefore token attribute should be changed into fWhiteAfter. This handling then requires some non-trivial changes in the scanner.
Directives can be subdivided into macro definition, conditional compilation, file inclusion, and other directives.
Macros are defined with #define, and a definition can be removed with #undef. Macros can have arguments, in which case an opening parenthesis '(' must immediately follow the macro name; no whitespace is allowed in between, otherwise the '(' would become part of the macro body.
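For example, in standard C:

  /* '(' immediately after the name: SQUARE is a function-like macro. */
  #define SQUARE(x) ((x) * (x))

  /* A space before '(' makes the parenthesis part of the replacement text:
   * BAD is an object-like macro whose body is "(x) ((x) * (x))". */
  #define BAD (x) ((x) * (x))

  int nine = SQUARE(3);   /* expands to ((3) * (3)) */

  #undef SQUARE           /* the definition can be removed again */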
Conditional compilation is implemented by #if, #ifdef, #ifndef, #elif, #else and #endif. The #if(n)def directives test whether the given macro name is (not) #defined, whereas the #if and #elif directives take constant expressions as the condition. Conditions can be nested, and I have found a simple procedure to track, during file scanning, whether the tokens currently have to be skipped. Older preprocessors could simply skip whole lines, without scanning, but the handling of continuation lines and multi-line comments requires that a new preprocessor also parses the conditionally excluded parts of the source files. Directives in excluded parts must be handled as well, at least the conditional directives.
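A small example of why excluded branches still have to be scanned:

  #ifdef FEATURE_A
  int mode = 1;
  #elif FEATURE_LEVEL >= 2        /* #elif takes a constant expression */
  int mode = 2;
  #else
  /* Even an excluded branch must still be tokenized: this "#endif" is only
     text inside a multi-line comment, and a naive line-by-line skipper
     #endif
     would terminate the conditional too early. */
  int mode = 0;
  #endif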
The inclusion of related files comes in various forms. The traditional #include directive expects either an "ordinary" file, or a <system> file, or a macro that evaluates to one of these file name formats. The use of '<' and '>' as delimiters for system file names requires special scanning of the input! "Ordinary" files are searched for in the current directory (of the file containing the #include directive), then in the directories of all ancestor files of the current file, and finally in the include path, which is also the only place searched for system files. Another form is the #include_next directive, which continues the search in the search path at the component following the directory in which the currently processed file was found. This convention can result in multiple occurrences of the same directory in the search path, and none of these duplicates can be removed!
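The different forms, sketched in C (config.h and CONFIG_HEADER are placeholder names, and #include_next is an extension of e.g. GCC and Clang):

  #include "config.h"        /* searched near the including file first */
  #include <stdio.h>         /* searched in the include path only */

  #define CONFIG_HEADER "config.h"
  #include CONFIG_HEADER     /* the macro must expand to one of the two forms */

  /* Resume the search at the path component after the directory in which the
   * current file was found; typically used by wrapper headers. */
  #include_next <limits.h>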
The other directives are used to report errors (#error), to fake the file names and line numbers from which the following code was created by some other tool (#line), and to set compiler options (#pragma).
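For example:

  #if !defined(_WIN32) && !defined(__linux__)
  #error "unsupported platform"    /* abort compilation with a message */
  #endif

  #line 100 "generated.c"          /* the next line counts as line 100 of generated.c */

  #pragma pack(1)                  /* a compiler-specific option */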
'#' is the stringizer, which converts the following token into a string literal. This operator is used with macro arguments to provide the name of the argument as a string, for output purposes. All tokens of the actual argument are converted into a single string literal. According to the specs, the quotes around, and all backslashes within, string literals among the argument tokens should be escaped. This is not done in the implemented scanner, because the resulting string is stored with embedded control characters instead of escape sequences. All strings have to be converted during textual output, according to the syntax of the target language.
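A standard C illustration (SHOW and STR are invented helper names):

  #include <stdio.h>

  /* #expr turns the tokens of the actual argument into one string literal. */
  #define SHOW(expr) printf("%s = %d\n", #expr, (expr))
  #define STR(x) #x

  int main(void)
  {
      int a = 2, b = 3;
      SHOW(a + b);          /* prints: a + b = 5 */

      /* Quotes and backslashes inside string-literal arguments are escaped
       * by the standard stringizer: STR("a\n") yields "\"a\\n\"". */
      puts(STR("a\n"));     /* prints: "a\n" */
      return 0;
  }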
'#@' is the charizer, a Microsoft extension, which converts the following token into a character literal.
'##' is the token paster, which combines the token names to its left and right into a new identifier.
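For example (CHR and HANDLER are invented names; the charizer part requires the Microsoft compiler):

  /* The charizer is a Microsoft-only extension. */
  #ifdef _MSC_VER
  #define CHR(t) #@t
  char initial = CHR(x);            /* expands to 'x' */
  #endif

  /* Token pasting builds a new identifier from its left and right operands. */
  #define HANDLER(name) void name##_handler(void)
  HANDLER(timer);                   /* declares: void timer_handler(void); */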
Another special "operator" in conditional preprocessor expressions is 'defined', which evaluates to a boolean (0..1) value indicating whether the following identifier has been #defined before. This operator can come in two flavours, as a prefix operator to the following symbol, or in function-like form with the symbol enclosed in parentheses.
The scanner tokens are somewhat language independent; in particular, keywords are not recognized by the basic scanner. Instead all identifiers are mapped in a preprocessor symbol table, where every symbol can be associated with an object. One kind of symbol object is used to implement macros; other classes can be used to map keywords or hold other parser-specific information. Unfortunately even C/C++ keywords are allowed to be #defined, like "#define int int". Since the preprocessor has to find the #define first, under all circumstances, it's not easily possible to add symbols other than those for #defines to the preprocessor symbol table. One possible solution, to avoid searching for keywords in another table, is a map from symbol indices in the preprocessor symbol table to language keywords. This map can be constructed when the keyword names are added to the preprocessor symbol table, which can be done before the scanner starts to add #defined names to the symbol table. Once a symbol name is added to the preprocessor symbol table, it is never moved or removed; #define then only attaches an object to the already existing entry, and #undef only removes and destroys these objects, but leaves the symbol table entry and its name intact.
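The idea could look roughly like this in C (all names here are invented for illustration; the actual unit is written in Object Pascal):

  #include <stddef.h>

  #define MAX_SYMBOLS 4096

  enum keyword { KW_NONE = 0, KW_INT, KW_IF, KW_WHILE /* ... */ };

  typedef struct {
      const char *name;    /* symbol name, never moved or freed */
      void       *macro;   /* macro object, NULL unless currently #defined */
  } Symbol;

  static Symbol       symtab[MAX_SYMBOLS];
  static enum keyword keyword_map[MAX_SYMBOLS];  /* symbol index -> keyword */

  /* Keyword names are added to the table before any #define is processed, and
   * keyword_map is filled then. The parser maps a symbol index to its keyword
   * without a second string lookup; "#define int int" only attaches a macro
   * object to the entry and leaves keyword_map untouched. */
  static enum keyword keyword_of(size_t symbol_index)
  {
      return keyword_map[symbol_index];
  }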
Some pseudo tokens allow the creation of pretty printers and other applications, which require more than the C/C++ specific tokens. These tokens are created for:
- The preprocessor is line-oriented, except for multi-line comments, which by definition have to be interpreted as a single blank (whitespace), regardless of embedded EOLs. The tokenizer therefore should return a single token even for multi-line comments. OTOH every line should be scanned somewhat independently from continuation lines, so that the comment text can be represented by references into specific lines, and no memory allocation is required for multi-line comments. Then skipped or otherwise unwanted comments impose no runtime penalty on an application.
- Only the first non-white character in a line should be inspected for '#' to determine the presence of a preprocessor directive. The distinction between various scanner modes (also ASM...) is vital, because every language has different implications and requires different scanners.
- Most applications require indications of the file, line and possibly column of every token. The file and line number are added to BOL and NoEol tokens, together with the indentation of the source lines.
#line directives are not yet implemented.
- Column information is not normally required, except for error messages and the exact position of declarations and definitions in a file. The column information can be extracted from the top-level source file state, but #line and lookahead can make that information unreliable.
- Parsers that retain comments should attach them to another (non-pseudo) token. It's suggested that a comment is treated as a postfix of that token, so that comment-only lines can be attached to the BOL token.
- Token lookahead will not normally occur in a scanner, but the preprocessor needs some lookahead for the preprocessor operators. Since such operators can occur only inside macro definitions, the lookahead can be done on the token stream that makes up the body of a macro.
- String concatenation, in the last step of the C scanner/preprocessor, does not fit together with multi-line comments. A combined string token cannot be created when comments between the strings are to be reported as distinct tokens.
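For example, in C:

  /* Adjacent string literals are concatenated in a late translation phase: */
  const char *greeting = "Hello, "
                         /* a comment between the pieces */
                         "world";
  /* The compiler sees one string "Hello, world". If the comment is to be
   * reported as a token of its own, the scanner cannot merge the two literals
   * into a single combined string token up front. */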
The preprocessor implements a stack of token streams, where a token stream can be a source file or a macro expansion. A TTokenStream implements two functions to retrieve the next token: nextRaw() returns the next unprocessed token, whereas nextToken() returns the next "cooked" token, after any preprocessing. The preprocessing in source file streams includes directive handling and the suppression of conditionally excluded parts of the files; in macro expansion streams there is no difference between the two functions.
Two more functions can be used as token filters: nextNoEof() handles switching from a macro expansion back to the preceding token stream, or from #included files back to the preceding file, and nextNoWhite() suppresses all "white" tokens, for use e.g. in the expression evaluator. Both functions can be called in raw (preprocessor) or cooked (parser) mode.
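A rough sketch of that interface, translated into C for illustration (the real class is the Object Pascal TTokenStream; these declarations only mirror the description above):

  typedef struct Token Token;

  typedef struct TokenStream TokenStream;
  struct TokenStream {
      TokenStream *outer;                      /* stream that pushed this one */
      Token *(*nextRaw)(TokenStream *self);    /* next unprocessed token */
      Token *(*nextToken)(TokenStream *self);  /* after directive handling and
                                                  conditional exclusion */
  };

  /* Filters on top of the stream stack: nextNoEof pops back to the outer
   * stream (macro or #including file) at end of stream, nextNoWhite also
   * drops "white" tokens; both can run in raw or cooked mode. */
  Token *nextNoEof(TokenStream **top, int cooked);
  Token *nextNoWhite(TokenStream **top, int cooked);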
The evaluation of constant expressions, required for #if directives, is implemented for reuse by the application (evaluation of enum and other initializers). A "preprocessor" flag must be passed to the expression evaluator if different handling of e.g. macro substitution is required. The handling of high-level constant values, as opposed to #defines, is not yet implemented; most probably an application-supplied callback function will be used in the future.
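For example, in a conditional expression:

  /* In #if/#elif expressions, macros are substituted first, and any identifier
   * that remains after substitution evaluates to 0. */
  #define LEVEL 3

  #if LEVEL > 2 && UNKNOWN_NAME + 1 == 1   /* UNKNOWN_NAME counts as 0 */
  #define EXTENDED 1
  #endif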
The macro substitution checker sits on top of all preprocessor levels. Special conditions that prevent the expansion of a macro are handled in the macro expander itself.