C Parser

This project is based on an old C parser for header files, written 1986 in GFA-Basic. The purpose was the extraction of type information from C header files, now extended into part of an cross-compiler.

Scanner

The scanner and preprocessor is taken from the CScan project. This scanner doesn't recognize keywords, instead all keywords are recognized by a filter between the scanner and parser. The parser also maintains its own symbol tables, extensible into multiple scopes.

Parser

The basic parser understands C declarations, only the syntax of blocks (compound statements) is missing. The parser is a handcrafted top down parser, with a slightly reorganized C grammar that better fits the semantics of C declarations than the typical C standard grammars.

Declarations

C declarations are hard to parse. Every declaration consists of up to 4 parts: type-specifier, declarator prefix, declarator name, and declarator postfix. The declared name sits beween the declarator pre- and postfix, possibly in a nested declaration with separate pre- and postfix parts. All these parts must be puzzled together in the correct order. One specifier can be followed by multiple declarators, which can add their own modifiers to the common type specifier. Much information is hidden in missing declaration parts, so that the real types can be determined only very late in the parsing process.

Declaration Specifiers

Declaration specifiers describe the storage class, base type, and type attributes and modifiers, of a declaration. The only important storage class (typedef) is not a storage class in C terminology, because it definitely marks the following specification as a type declaration. A type specification is broken in the RType object into

storageClass (token)
signed (token)
sized (tokens)
spectoken
attrs (tokens)
calling convention (token) and inline flag, when procedures are involved

The storage class token can be {0..1}[auto, register, static, extern, const, typedef], with const and typedef being added willingly to this set.
The signed token can be {0..1}[signed, unsigned], and applies only to int and char types.
The sized enum of [short, long, longlong] is constructed from {0..2}[short, long] tokens, and applies to int and double types.
The spectoken describes the base type [char, wchar_t, int, __int8, __int16, __int32, __int64, float, double, struct, union, enum, void, t_sym], where t_sym represents a typename reference, and int is the default base type.
The attrs set can contain {0..2}[const, volatile], further (16 bit Windows) attributes [__near, __far, __based] are ignored in the current 32 bit implementation.
The calling convention {0..1}[ __cdecl, __fastcall, __stdcall] is an important Windows specific attribute.
The {0..1}[__inline] modifier has no really defined place in a declaration, but it is so important that it is remembered in a dedicated field.
More Windows specific attributes may be added, like [__dllimport, __dllexport] etc.
Also GNU/Java specific attributes [__ATTR, __THROW...] can be implemented later, in a dynamic attributes list.

Declarators

The specification part is finished when a declarator name, pointer symbol, or nested declarator is found. At this point a missing spectoken is substituted by int, and the specification is compiled into an base type (spec string). For type references and complex types a basetype reference is created in the basetype field. For simple types then the declarator pre- [pointer] and postfixes [array, argument-list] are collected and compiled into pre and post string fields, together with further [const, volatile] attributes. For complex types [struct, union, enum] the remaining declarator parts are parsed differently. Finally the overall type of the declaration is determined, and classified into [type, var, const, procedure] declarations, and an optionally following initializer part is parsed according to this type.

Please note that every declarator must be parsed completely, before a procedure can be distinguished from a variable declaration. In the case of a procedure the type specifier describes the result type of the procedure, as well as attributes of the procedure, not attributes of the result type. A __inline procedure also can be followed by some kind of initializer, in this case the body (compound statement) of the procedure. Old style procedure declarations only describe the types of the parameters, in the full definition a parameter declaration section can occur between the procedure declaration and procedure body.

While procedure declarations can be determined by the presence of a parameter list, variables and constants must be distinguished by the existence of a const attribute in the specification. This separation is important in so far as constants can occur in constant expressions, so that their values must be known then; initialized variables can have a similar declaration, but they can not occur in constant expressions. Enum members also can occur in constant expressions, so every enum member must be stored twice, as a member in the enum declaration and as a named constant. Simple constants can be of type [char, wchar_t, int, float, double, enum, string], where string is a pointer to char. All these constants, with the exception of strings, have an numerical value. Due to the possibly complex initializers for complex constants, only simple non-string constants are implemented in the type parser, for now.

Persistent Type Description

In the first attempt strings are used to describe all declarations. All declarations are entered into lists of types, variables, constants and procedures. A declaration consists of a reference name and definition, and an optional initialized value. Anonymous types are inlined into the definition, named types are substituted by references into the type list. A definition is an ASCII string that can be parsed in both directions; this format allows to parse from the base type to the final type, or from the topmost type specifier down to the base type. The type definition grammar is as follows:

Type = Qualifier | Array | Pointer | Procedure | Struct_Union | Enum | Typename | Bitfield | Basetype.
Qualifier = [ "!" ] ( "#" | "V" ) Type. /*!=static, #=const (locked) or V=volatile */
Array = "[" ExprList "]" Type.
Pointer = "*" Type.
Procedure = "(" { Param } [ ":~," ] ")" [CallConv] Type. /* ":~," stands for a "..." variable argument list */
    Param = [ name ] [ ":" Type ] [ "=" Value ] "," . /* either name or Type must be specified */
    CallConv = { "I" | "F" | "C" }. /* Inline, Fastcall, Cdecl, default: not inlined, Stdcall */
/* final types */
Struct_Union = ( "S" | "U" ) ( ( ":" tagname ) | ( "{" { Field "," }"}" ) ).
Field = [ name ] ":" [ Bitfield ] Type. /* only here a Bitfield type is allowed */
    Bitfield = "<" Expr ">". /* Type is assumed to be a Basetype */
Enum = "E" "{" { name [ "=" Value ] "," } "}".
Typename = '"' name '"'. /* in double quotes */
/* basic types are: void, char, short, int, long, "long long", float, double, "long double", "..." */
Basetype = [ "+" /* unsigned */ | "-" /* signed */ ] ( num | "v" | "c" | "s" | "i" | "l" | "L" | "f" | "d" | "D" | "~" ).
    /* num means: number of bytes [1,2,4,8], must be preceded by "+" or "-" for unsigned/signed integers */
Value = ( number | string | char | identifier | [ operator ] "(" { Value } ")" ) ";" .

???
Value can be any expression, up to a non-expression token: : ; , ] } ) #0.
A cast type can be appended as ":" Type?
Value = "{" Expr "}".
Expr = ( number | string | char | identifier | [ operator ] "(" [ ExprList ] ")" | Stmt ) .
ExprList = Expr { ";" Expr }
Stmt = ... .

A "!" prefix indicates a static declaration (proc, var...).
All(?) values have [ "=" Value ";" ] syntax, Value may be a symbol.
Tagged struct/unions are stored as types, all typenames refer to this tag-type.
Anonymous types have numbers as names (symbol number, struct/union only?).
Inside struct/union: inlined, no global typedef!
Procedures *must* have a parameter list, even if empty.

Symbols are stored in sections (may change!?)
VarSym = name [ ":" Type ] [ "=" Value ";" ].
TypeSym = Typename "=" Type. //"=" because TypeName can contain ":"!
TypeName = name | S:nn | U:nn | E:nn. //nn = name or type-symbol-number

Currently all list members (params...) are terminated by the list separator ",". This may change if found not required or useful.

Names of members (proc, struct, union, enum) may be enclosed in single quotes, to allow for easy bidirectional parsing, and as a distinction from type names (in double quotes).

Initializers may exist for const or variable declarations. The initializers are stored as attributes of the symbols, not as part of the (type) declaration. For the usage of enum members and const variables, as constants, a separate list of constant declarations is maintained.

Tagged definitions of structs etc. are treated as type declarations, with the type indicator prefixed to the tag name (e.g. "E:MyEnum"). The embedded punctuator makes these names distinct from regular typedef names.

The representation of constant expressions, as e.g. initializers, is not yet specified. For the use in a cross compiler the expression should be preserved, as either a token stream (language specific) or as a parse tree. For now all constant expressions in type declarations are evaluated, and the result is stored as a literal (number).

A full-fledged C parser requires a distinction between type identifiers and other symbols. For the use as a header parser, with possibly unresolved include files, heuristics can be used to classify yet undefined names as typenames, or the user can be asked for an appropriate classification.

Tokens

The scanner creates tokens for keywords, operators and punctuators, symbols, literals (numeric, char and string), as well as some special tokens for preprocessing and other purposes:

t_empty	Means: token consumed
t_error	Illegal chars in input
t_eof	No more input available (this source)
t_bol	Begin of line, for preprocessor and listing purposes
t_NoEol	Escaped end of line, for macro definitions and listing purposes
t_rem	Various tokens for comments, text as a single string

For tokens with explicit string values (symbols, string and char literals) a string is created by the scanner. The consumer can retrieve comment and BOL strings, if these are required e.g. in a pretty printer. String literals are stored in a specific string table, whereas char literals are stored in the token record.

The lookup and transformation of typenames and keywords must be done by the parser, the preprocessor does not impose any special meaning on symbols other than macro names.

Complications

The various C "standards" result in some complications, which affect the handling of macros, as well as the handling of preprocessor operators and directives. Also the handling of multi-line comments and column information requires that the files are scanned entirely, it's impossible to simply skip over excluded lines.

Constant Expressions

The evaluation of constant expressions slightly differs between preprocessor and application.
A preprocessor expression ends on EOL, which token is ignored by a higher level parser.
The preprocessor also must recognize "defined", possibly with the ugly syntax without parentheses.
In all other cases macro symbols should be expanded, but empty definitions should result in a value of zero, not in nothing!
For the retrieval of non-preprocessor constants a callback function to the parser is used by the expression evaluator.

In a future version only the preprocessor conditions will be evaluated immediately. The parser instead will create an parse tree for every expression, which then can be handled by the application (compiler...). Such trees also can be evaluated after their construction, with the advantage that unavailable values (sizeof...) can be detected and handled without the risk of breaking the parsing process on possible errors (zero divide...). These trees also allow for post-processing of initializers for complex data types.

Lookahead

Both the preprocessor and parser deserve some limited lookahead. Currently both kinds of lookahead are implemented differently, where the parser lookahead is restricted to a single token, following a symbol; when such lookahead was required, the symbol name is passed to the parser subroutines, which deal with that name before proceeding to the lookahead token. The preprocessor lookahead is required to distinguish expandable from non-expandable functional macro invocations. Here all tokens are recorded in a temporary stream, until either a "(" or a non-white token is found. A "(" indicates the begin of the list of actual macro arguments, in which case the macro has to be expanded. Otherwise the macro is not expanded, and the recorded tokens are made available to the parser.

Parser Output

The declarations of all variables, constants, types, procedures and macros can be stored in an text file. Such a file later can be loaded again, e.g. for a transformation into a different language module.