Sunday, January 3, 2010

Parser – Token Identification

There are four major types of tokens: an identifier, a constant (numeric or string), an operator (one or two symbol characters), and a remark (comment). Tokens don't contain white space (spaces or tabs), so any white space will normally separate tokens, though white space is not required between an identifier or constant and a symbol. In obtaining a token, any white space will be skipped, then the next character will be checked.
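As a rough illustration of "skip the white space, then check the next character", here is a minimal Python sketch. The function name classify_next and the category labels are made up for this example, and the real parser need not be organized this way. Note that a REM remark starts with a letter, so in this sketch it arrives through the identifier branch.

```python
def classify_next(line, pos=0):
    """Skip white space, then report which kind of token starts at the next character."""
    # Only spaces and tabs count as white space here.
    while pos < len(line) and line[pos] in " \t":
        pos += 1
    if pos >= len(line):
        return pos, "end"
    ch = line[pos]
    if ch.isascii() and ch.isalpha():
        return pos, "identifier"   # also covers word operators and REM
    if ch in "0123456789.":
        return pos, "number"
    if ch == '"':
        return pos, "string"
    if ch == "'":
        return pos, "remark"
    return pos, "operator"         # any other character is treated as a symbol
```

For example, classify_next("  \tA = 5") skips three white space characters and reports an identifier at position 3.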

An identifier starts with a letter; contains letters, digits and/or underscores; and ends with an optional data type symbol and possibly an opening parenthesis. Some operators are also made up of letters, for example, MOD, AND, OR, etc. One wrinkle: some BASIC commands may consist of two words separated by white space, for example SELECT CASE, END IF, EXIT FOR, etc.
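The identifier rule could be sketched in Python as below. The name scan_identifier is hypothetical, and the set of data type symbols is an assumption, since the rule above only says there is an optional one. Two-word commands are not handled in this sketch.

```python
TYPE_SYMBOLS = "$%#"  # assumed set of data type symbols, for illustration only


def scan_identifier(line, pos):
    """Return (text, end) for an identifier starting at pos, or (None, pos) if there is none."""
    start = pos
    if pos >= len(line) or not (line[pos].isascii() and line[pos].isalpha()):
        return None, pos
    pos += 1
    # Letters, digits and underscores make up the body of the identifier.
    while pos < len(line) and line[pos].isascii() and (line[pos].isalnum() or line[pos] == "_"):
        pos += 1
    # Optional data type symbol.
    if pos < len(line) and line[pos] in TYPE_SYMBOLS:
        pos += 1
    # Optional opening parenthesis, kept as part of the token.
    if pos < len(line) and line[pos] == "(":
        pos += 1
    return line[start:pos], pos
```

With these assumptions, scan_identifier("Arr%(3)", 0) yields the token "Arr%(" and stops at the 3, while "MOD 3" yields "MOD", which a later step would have to recognize as an operator word.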

A numeric constant starts with a digit or decimal point; contains digits and/or a decimal point; and possibly ends with an exponent (“e” or “E”, optional sign and digits). Only one decimal point is allowed, and a second one will terminate the token, as would a second “e” or “E”. (For example, 0.010.02 would be two constant tokens, 0.010 and .02, and this will generate an error downstream in the translator – the Parser just separates the input into tokens.)
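Here is a small Python sketch of that rule, mostly to show how the second decimal point ends the token. The name scan_number is made up, and one choice is my own assumption: an "e" or "E" that is not followed by any digits is left out of the token. No validation is done (a lone "." would come back as a token), in keeping with the idea that errors are caught downstream.

```python
DIGITS = "0123456789"


def scan_number(line, pos):
    """Return (text, end) for a numeric constant starting at pos."""
    start = pos
    seen_point = False
    while pos < len(line):
        ch = line[pos]
        if ch in DIGITS:
            pos += 1
        elif ch == "." and not seen_point:
            seen_point = True
            pos += 1
        else:
            break  # a second decimal point (or anything else) ends the mantissa
    # Optional exponent: e or E, optional sign, then digits.
    if pos < len(line) and line[pos] in "eE":
        exp = pos + 1
        if exp < len(line) and line[exp] in "+-":
            exp += 1
        digits_start = exp
        while exp < len(line) and line[exp] in DIGITS:
            exp += 1
        if exp > digits_start:  # assumption: accept the exponent only if it has digits
            pos = exp
    return line[start:pos], pos
```

Calling scan_number("0.010.02", 0) gives "0.010" ending at position 5, and calling it again from position 5 gives ".02".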

A string constant starts with a double-quote and contains any characters up to the next double-quote. Two double-quotes in a row place a single double-quote in the string constant.
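The doubled double-quote handling is easy to get wrong, so here is a minimal Python sketch of it. The name scan_string is hypothetical, and what to do with a string that is never closed is not covered above; this sketch simply assumes it runs to the end of the line.

```python
def scan_string(line, pos):
    """Return (value, end) for a string constant whose opening quote is at pos."""
    if pos >= len(line) or line[pos] != '"':
        raise ValueError("scan_string must start on a double-quote")
    pos += 1
    chars = []
    while pos < len(line):
        ch = line[pos]
        if ch == '"':
            if pos + 1 < len(line) and line[pos + 1] == '"':
                chars.append('"')  # two double-quotes in a row give one in the value
                pos += 2
                continue
            return "".join(chars), pos + 1  # closing quote found
        chars.append(ch)
        pos += 1
    # Assumption: an unterminated string takes the rest of the line.
    return "".join(chars), pos
```

For the input "say ""hi""" (quotes included), the value returned is say "hi".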

An operator is any other valid symbol character (for example, +, -, etc. including comma, semicolon and colon – basically any valid BASIC character that's not part of an identifier or constant). Operators may be made up of multiple characters (for example >=, <=, <>) or a single character (=, <, >), and some single-character operators can be next to each other (for example +-, which is PLUS followed by unary MINUS).
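One simple way to handle this, sketched in Python below, is to try the two-character operators first and fall back to a single character. The name scan_operator and the table are assumptions for illustration, with the table holding only the three two-character operators mentioned above. The sketch does not check that the character is a valid BASIC symbol, and deciding that a MINUS is unary is left to a later stage.

```python
TWO_CHAR_OPERATORS = {">=", "<=", "<>"}  # only the examples named in the post


def scan_operator(line, pos):
    """Return (text, end) for the operator starting at pos (pos must be inside the line)."""
    pair = line[pos:pos + 2]
    if pair in TWO_CHAR_OPERATORS:
        return pair, pos + 2
    # Anything else is a single-character operator, so "+-" comes out as "+" then "-".
    return line[pos], pos + 1
```

So ">=" comes back as one token, while "+-" comes back as two tokens on successive calls.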

(New) A remark can start with a REM command, which must be followed by white space and, if it is not the first command on the line, must be preceded by a colon statement separator. Comments can also start with a single-quote (it appears the ANSI standard allows exclamation point for comments, as in TrueBasic and ECMA-116, but most of the other BASICs I checked use single-quote including G/W-Basic, QuickBasic, VisualBasic, FreeBasic, xBasic, PowerBasic, SmallBasic, and thinBasic; therefore I'm going with the crowd). When using a single-quote, it does not need to be followed by white space or preceded by a colon. All characters following the REM or single-quote up to the end of the line will be taken as the comment.
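The two remark forms could be sketched in Python like this. The name scan_remark and the at_statement_start flag are hypothetical: the caller is assumed to know whether the position is the first command on the line or follows a colon. Treating REM as case-insensitive, accepting a bare REM at the end of a line, and dropping the one white space character after REM are also assumptions of this sketch.

```python
def scan_remark(line, pos, at_statement_start):
    """Return the comment text if a remark starts at pos, otherwise None."""
    # A single-quote starts a comment anywhere, with no white space or colon needed.
    if pos < len(line) and line[pos] == "'":
        return line[pos + 1:]
    # REM only counts at the start of a statement and must be followed by white space.
    if at_statement_start and line[pos:pos + 3].upper() == "REM":
        rest = line[pos + 3:]
        if rest == "" or rest[0] in " \t":
            return rest[1:]  # drop the separating white space character
    return None
```

With this, "REM hello" gives the comment "hello", while "REMARK = 1" is not a remark and would be scanned as an identifier instead.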

Updated Wednesday, January 6, 2010; 11.01 pm: Added information about remarks.
