C Parser
This project is based on an old C parser for header files, written 1986
in GFA-Basic. The purpose was the extraction of type information from C
header files, now extended into part of an cross-compiler.
Scanner
The scanner and preprocessor is taken from the CScan
project. This scanner doesn't recognize keywords, instead all keywords
are recognized by a filter between the scanner and parser. The parser also
maintains its own symbol tables, extensible into multiple scopes.
Parser
The basic parser understands C declarations, only the syntax of blocks
(compound statements) is missing. The parser is a handcrafted top down
parser, with a slightly reorganized C grammar that better fits the semantics
of C declarations than the typical C standard grammars.
Declarations
C declarations are hard to parse. Every declaration consists of up to 4
parts: type-specifier, declarator prefix, declarator name, and declarator
postfix. The declared name sits beween the declarator pre- and postfix,
possibly in a nested declaration with separate pre- and postfix parts.
All these parts must be puzzled together in the correct order. One specifier
can be followed by multiple declarators, which can add their own modifiers
to the common type specifier. Much information is hidden in missing declaration
parts, so that the real types can be determined only very late in the parsing
process.
Declaration Specifiers
Declaration specifiers describe the storage class, base type, and type
attributes and modifiers, of a declaration. The only important storage
class (typedef) is not a storage class in C terminology, because it definitely
marks the following specification as a type declaration. A type specification
is broken in the RType object into
-
storageClass (token)
-
signed (token)
-
sized (tokens)
-
spectoken
-
attrs (tokens)
-
calling convention (token) and inline flag, when procedures are involved
The storage class token can be {0..1}[auto, register, static, extern, const,
typedef], with const and typedef being added willingly to this set.
The signed token can be {0..1}[signed, unsigned], and applies only
to int and char types.
The sized enum of [short, long, longlong] is constructed from {0..2}[short,
long] tokens, and applies to int and double types.
The spectoken describes the base type [char, wchar_t, int, __int8,
__int16, __int32, __int64, float, double, struct, union, enum, void, t_sym],
where t_sym represents a typename reference, and int is the default base
type.
The attrs set can contain {0..2}[const, volatile], further (16 bit
Windows) attributes [__near, __far, __based] are ignored in the current
32 bit implementation.
The calling convention {0..1}[ __cdecl, __fastcall, __stdcall] is an
important Windows specific attribute.
The {0..1}[__inline] modifier has no really defined place in a declaration,
but it is so important that it is remembered in a dedicated field.
More Windows specific attributes may be added, like [__dllimport, __dllexport]
etc.
Also GNU/Java specific attributes [__ATTR, __THROW...] can be implemented
later, in a dynamic attributes list.
Declarators
The specification part is finished when a declarator name, pointer symbol,
or nested declarator is found. At this point a missing spectoken is substituted
by int, and the specification is compiled into an base type (spec string).
For type references and complex types a basetype reference is created in
the basetype field. For simple types then the declarator pre- [pointer]
and postfixes [array, argument-list] are collected and compiled into pre
and post string fields, together with further [const, volatile] attributes.
For complex types [struct, union, enum] the remaining declarator parts
are parsed differently. Finally the overall type of the declaration is
determined, and classified into [type, var, const, procedure] declarations,
and an optionally following initializer part is parsed according to this
type.
Please note that every declarator must be parsed completely, before
a procedure can be distinguished from a variable declaration. In the case
of a procedure the type specifier describes the result type of the procedure,
as well as attributes of the procedure, not attributes of the result type.
A __inline procedure also can be followed by some kind of initializer,
in this case the body (compound statement) of the procedure. Old style
procedure declarations only describe the types of the parameters, in the
full definition a parameter declaration section can occur between the procedure
declaration and procedure body.
While procedure declarations can be determined by the presence of a
parameter list, variables and constants must be distinguished by the existence
of a const attribute in the specification. This separation is important
in so far as constants can occur in constant expressions, so that their
values must be known then; initialized variables can have a similar declaration,
but they can not occur in constant expressions. Enum members also can occur
in constant expressions, so every enum member must be stored twice, as
a member in the enum declaration and as a named constant. Simple constants
can be of type [char, wchar_t, int, float, double, enum, string], where
string is a pointer to char. All these constants, with the exception of
strings, have an numerical value. Due to the possibly complex initializers
for complex constants, only simple non-string constants are implemented
in the type parser, for now.
Persistent Type Description
In the first attempt strings are used to describe all declarations. All
declarations are entered into lists of types, variables, constants and
procedures. A declaration consists of a reference name and definition,
and an optional initialized value. Anonymous types are inlined into the
definition, named types are substituted by references into the type list.
A definition is an ASCII string that can be parsed in both directions;
this format allows to parse from the base type to the final type, or from
the topmost type specifier down to the base type. The type definition grammar
is as follows:
Type = Qualifier | Array | Pointer | Procedure | Struct_Union | Enum
| Typename | Bitfield | Basetype.
Qualifier = [ "!" ] ( "#" | "V" ) Type. /*!=static, #=const (locked)
or V=volatile */
Array = "[" ExprList "]" Type.
Pointer = "*" Type.
Procedure = "(" { Param } [ ":~," ] ")" [CallConv] Type. /* ":~," stands
for a "..." variable argument list */
Param = [ name ] [ ":" Type ] [ "=" Value ] ","
. /* either name or Type must be specified */
CallConv = { "I" | "F" | "C" }. /* Inline, Fastcall,
Cdecl, default: not inlined, Stdcall */
/* final types */
Struct_Union = ( "S" | "U" ) ( ( ":" tagname ) | ( "{" { Field ","
}"}" ) ).
Field = [ name ] ":" [ Bitfield ] Type. /* only here a
Bitfield type is allowed */
Bitfield = "<" Expr ">". /* Type is assumed to
be a Basetype */
Enum = "E" "{" { name [ "=" Value ] "," } "}".
Typename = '"' name '"'. /* in double quotes */
/* basic types are: void, char, short, int, long, "long long", float,
double, "long double", "..." */
Basetype = [ "+" /* unsigned */ | "-" /* signed */ ] ( num |
"v" | "c" | "s" | "i" | "l" | "L" | "f" | "d" | "D" | "~" ).
/* num means: number of bytes [1,2,4,8],
must be preceded by "+" or "-" for unsigned/signed integers */
Value = ( number | string | char | identifier | [ operator ] "(" {
Value } ")" ) ";" .
???
Value can be any expression, up to a non-expression token: : ; , ]
} ) #0.
A cast type can be appended as ":" Type?
Value = "{" Expr "}".
Expr = ( number | string | char | identifier | [ operator ] "(" [ ExprList
] ")" | Stmt ) .
ExprList = Expr { ";" Expr }
Stmt = ... .
A "!" prefix indicates a static declaration (proc, var...).
All(?) values have [ "=" Value ";" ] syntax, Value may be a symbol.
Tagged struct/unions are stored as types, all typenames refer to this
tag-type.
Anonymous types have numbers as names (symbol number, struct/union
only?).
Inside struct/union: inlined, no global typedef!
Procedures *must* have a parameter list, even if empty.
Symbols are stored in sections (may change!?)
VarSym = name [ ":" Type ] [ "=" Value ";" ].
TypeSym = Typename "=" Type. //"=" because TypeName can contain ":"!
TypeName = name | S:nn | U:nn | E:nn.
//nn = name or type-symbol-number
Currently all list members (params...) are terminated by the list separator
",". This may change if found not required or useful.
Names of members (proc, struct, union, enum) may be enclosed in single
quotes, to allow for easy bidirectional parsing, and as a distinction from
type names (in double quotes).
Initializers may exist for const or variable declarations. The initializers
are stored as attributes of the symbols, not as part of the (type) declaration.
For the usage of enum members and const variables, as constants, a separate
list of constant declarations is maintained.
Tagged definitions of structs etc. are treated as type declarations,
with the type indicator prefixed to the tag name (e.g. "E:MyEnum"). The
embedded punctuator makes these names distinct from regular typedef names.
The representation of constant expressions,
as e.g. initializers, is not yet specified. For the use in a cross compiler
the expression should be preserved, as either a token stream (language
specific) or as a parse tree. For now all constant expressions in type
declarations are evaluated, and the result is stored as a literal (number).
A full-fledged C parser requires a distinction between type identifiers
and other symbols. For the use as a header parser, with possibly unresolved
include files, heuristics can be used to classify yet undefined names as
typenames, or the user can be asked for an appropriate classification.
Tokens
The scanner creates tokens for keywords, operators and punctuators, symbols,
literals (numeric, char and string), as well as some special tokens for
preprocessing and other purposes:
t_empty |
Means: token consumed |
t_error |
Illegal chars in input |
t_eof |
No more input available (this source) |
t_bol |
Begin of line, for preprocessor and listing purposes |
t_NoEol |
Escaped end of line, for macro definitions and listing purposes |
t_rem |
Various tokens for comments, text as a single string |
For tokens with explicit string values (symbols, string and char literals)
a string is created by the scanner. The consumer can retrieve comment and
BOL strings, if these are required e.g. in a pretty printer. String literals
are stored in a specific string table, whereas char literals are stored
in the token record.
The lookup and transformation of typenames and keywords must be done
by the parser, the preprocessor does not impose any special meaning on
symbols other than macro names.
Complications
The various C "standards" result in some complications, which affect the
handling of macros, as well as the handling of preprocessor operators and
directives. Also the handling of multi-line comments and column information
requires that the files are scanned entirely, it's impossible to simply
skip over excluded lines.
Constant Expressions
The evaluation of constant expressions slightly differs between preprocessor
and application.
A preprocessor expression ends on EOL, which token is ignored by a
higher level parser.
The preprocessor also must recognize "defined", possibly with the ugly
syntax without parentheses.
In all other cases macro symbols should be expanded, but empty definitions
should result in a value of zero, not in nothing!
For the retrieval of non-preprocessor constants a callback function
to the parser is used by the expression evaluator.
In a future version only the preprocessor conditions will be evaluated
immediately. The parser instead will create an parse
tree for every expression, which then can be handled by the application
(compiler...). Such trees also can be evaluated after their construction,
with the advantage that unavailable values (sizeof...) can be detected
and handled without the risk of breaking the parsing process on possible
errors (zero divide...). These trees also allow for post-processing of
initializers for complex data types.
Lookahead
Both the preprocessor and parser deserve some limited lookahead. Currently
both kinds of lookahead are implemented differently, where the parser lookahead
is restricted to a single token, following a symbol; when such lookahead
was required, the symbol name is passed to the parser subroutines, which
deal with that name before proceeding to the lookahead token. The preprocessor
lookahead is required to distinguish expandable from non-expandable functional
macro invocations. Here all tokens are recorded in a temporary stream,
until either a "(" or a non-white token is found. A "(" indicates the begin
of the list of actual macro arguments, in which case the macro has to be
expanded. Otherwise the macro is not expanded, and the recorded tokens
are made available to the parser.
Parser Output
The declarations of all variables, constants, types, procedures and macros
can be stored in an text file. Such a file later can be loaded again, e.g.
for a transformation into a different language module.