info bison

3.2 Symbols, Terminal and Nonterminal

Symbols in Bison grammars represent the grammatical classifications of the language.

A terminal symbol (also known as a token type) represents a class of syntactically equivalent tokens. You use the symbol in grammar rules to mean that a token in that class is allowed. The symbol is represented in the Bison parser by a numeric code, and the yylex function returns a token type code to indicate what kind of token has been read. You don't need to know what the code value is; you can use the symbol to stand for it.

A nonterminal symbol stands for a class of syntactically equivalent groupings. The symbol name is used in writing grammar rules. By convention, it should be all lower case.

Symbol names can contain letters, digits (not at the beginning), underscores and periods. Periods make sense only in nonterminals.

There are three ways of writing terminal symbols in the grammar:

A named token type is written with an identifier, like an identifier in C. By convention, it should be all upper case. Each such name must be defined with a Bison declaration such as %token. See section Token Type Names.
A character token type (or literal character token) is written in the grammar using the same syntax used in C for character constants; for example, '+' is a character token type. A character token type doesn't need to be declared unless you need to specify its semantic value data type (see section Data Types of Semantic Values), associativity, or precedence (see section Operator Precedence).
By convention, a character token type is used only to represent a token that consists of that particular character. Thus, the token type '+' is used to represent the character ‘+’ as a token. Nothing enforces this convention, but if you depart from it, your program will confuse other readers.

All the usual escape sequences used in character literals in C can be used in Bison as well, but you must not use the null character as a character literal because its numeric code, zero, signifies end-of-input (see section Calling Convention for yylex). Also, unlike standard C, trigraphs have no special meaning in Bison character literals, nor is backslash-newline allowed.
A literal string token is written like a C string constant; for example, "<=" is a literal string token. A literal string token doesn't need to be declared unless you need to specify its semantic value data type (see section Data Types of Semantic Values), associativity, or precedence (see section Operator Precedence).
You can associate the literal string token with a symbolic name as an alias, using the %token declaration (see section Token Declarations). If you don't do that, the lexical analyzer has to retrieve the token number for the literal string token from the yytname table (see section Calling Convention for yylex).

Warning: literal string tokens do not work in Yacc.

By convention, a literal string token is used only to represent a token that consists of that particular string. Thus, you should use the token type "<=" to represent the string ‘<=’ as a token. Bison does not enforce this convention, but if you depart from it, people who read your program will be confused.

All the escape sequences used in string literals in C can be used in Bison as well, except that you must not use a null character within a string literal. Also, unlike Standard C, trigraphs have no special meaning in Bison string literals, nor is backslash-newline allowed. A literal string token must contain two or more characters; for a token containing just one character, use a character token (see above).

How you choose to write a terminal symbol has no effect on its grammatical meaning. That depends only on where it appears in rules and on when the parser function returns that symbol.

The value returned by yylex is always one of the terminal symbols, except that a zero or negative value signifies end-of-input. Whichever way you write the token type in the grammar rules, you write it the same way in the definition of yylex. The numeric code for a character token type is simply the positive numeric code of the character, so yylex can use the identical value to generate the requisite code, though you may need to convert it to unsigned char to avoid sign-extension on hosts where char is signed. Each named token type becomes a C macro in the parser file, so yylex can use the name to stand for the code. (This is why periods don't make sense in terminal symbols.) See section Calling Convention for yylex.

If yylex is defined in a separate file, you need to arrange for the token-type macro definitions to be available there. Use the ‘-d’ option when you run Bison, so that it will write these macro definitions into a separate header file ‘name.tab.h’ which you can include in the other source files that need it. See section Invoking Bison.

If you want to write a grammar that is portable to any Standard C host, you must use only nonnull character tokens taken from the basic execution character set of Standard C. This set consists of the ten digits, the 52 lower- and upper-case English letters, and the characters in the following C-language string:

"\a\b\t\n\v\f\r !\"#%&'()*+,-./:;<=>?[\\]^_{|}~"

The yylex function and Bison must use a consistent character set and encoding for character tokens. For example, if you run Bison in an ASCII environment, but then compile and run the resulting program in an environment that uses an incompatible character set like EBCDIC, the resulting program may not work because the tables generated by Bison will assume ASCII numeric values for character tokens. It is standard practice for software distributions to contain C source files that were generated by Bison in an ASCII environment, so installers on platforms that are incompatible with ASCII must rebuild those files before compiling them.

The symbol error is a terminal symbol reserved for error recovery (see section Error Recovery); you shouldn't use it for any other purpose. In particular, yylex should never return this value. The default value of the error token is 256, unless you explicitly assigned 256 to one of your tokens with a %token declaration.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]