Sunday, January 3, 2010

Parsing Identifiers

Once a letter is found, the parser will continue collecting letters, digits and under-bars. When no more are found, the next character is checked for a data type symbol (“%”, “$” or “#”), and if one is found the data type (initially set to none) will be set to integer, string or double, respectively. Finally, the next character is checked for an opening parenthesis, and if found a parenthesis flag is set. If the first two characters are FN, then the token type is set to Defined Function, and the string is returned along with the data type and parenthesis flag.
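To make the scanning steps concrete, here is a minimal sketch in Python. The function names, the tuple it returns, and the choices to leave the data type symbol out of the returned name, to consume the parenthesis, and to treat FN case-insensitively are all assumptions made for illustration, not the project's actual code.

```python
# Data type symbols from the text; the type stays "none" when no symbol follows.
DATA_TYPE_SYMBOLS = {"%": "integer", "$": "string", "#": "double"}


def scan_identifier(text, pos):
    """Scan an identifier starting at text[pos], which must be a letter.

    Returns (name, data_type, has_paren, next_pos). Hypothetical helper.
    """
    start = pos
    # Collect ASCII letters, digits and under-bars.
    while pos < len(text):
        ch = text[pos]
        if not (ch.isascii() and (ch.isalnum() or ch == "_")):
            break
        pos += 1
    name = text[start:pos]

    # Optional data type symbol right after the name.
    data_type = "none"
    if pos < len(text) and text[pos] in DATA_TYPE_SYMBOLS:
        data_type = DATA_TYPE_SYMBOLS[text[pos]]
        pos += 1

    # Optional opening parenthesis sets the parenthesis flag.
    has_paren = pos < len(text) and text[pos] == "("
    if has_paren:
        pos += 1  # assumption: the parenthesis is consumed with the identifier

    return name, data_type, has_paren, pos


def is_defined_function(name):
    """True when the first two characters are FN (case-insensitive here)."""
    return name[:2].upper() == "FN"
```

For example, scan_identifier("A$(3)", 0) yields the name A, the string data type, a set parenthesis flag, and a next position of 3.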

The Operator Table will then be searched for the identifier. If the identifier is found in the table and the entry is not flagged as being part of a two-word command, then the internal code for the command, operator or internal function will be returned for the token, along with the data type (internal function only). No string from the input needs to be returned.

If the table entry is flagged as being part of a two-word command, then a second identifier after white space is read from the input. The second identifier will not have a data type symbol or parenthesis. The Operator Table is searched again for the two-word identifier (with one space between the words). If found, then the internal code for the command will be returned. If not found and the first word by itself is a valid command, then the internal code for that command is returned; otherwise a syntax error is reported at the second word.
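The table search described above could be sketched as follows. The table contents, the internal code names, the ParseError class and the case-insensitive matching are all hypothetical and stand in for whatever the real Operator Table holds.

```python
import re

# Hypothetical table: name -> (internal code, "may start a two-word command").
# A code of None means the first word is not a valid command by itself.
OPERATOR_TABLE = {
    "MOD": ("MOD_OP", False),
    "PRINT": ("PRINT_CMD", False),
    "END": ("END_CMD", True),
    "END IF": ("ENDIF_CMD", False),
    "EXIT": (None, True),
    "EXIT FOR": ("EXITFOR_CMD", False),
}

WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


class ParseError(Exception):
    """Raised for a syntax error; carries the offending column."""


def lookup_command(text, pos=0):
    """Return (code, next_pos), or (None, pos) if the word is not in the table."""
    first = WORD.match(text, pos)
    if first is None:
        return None, pos
    entry = OPERATOR_TABLE.get(first.group().upper())
    if entry is None:
        return None, pos  # not in the table: variable, array, function, ...
    code, two_word = entry
    if not two_word:
        return code, first.end()

    # Skip white space and try to read a second word.
    second_pos = first.end()
    while second_pos < len(text) and text[second_pos] in " \t":
        second_pos += 1
    second = WORD.match(text, second_pos)
    if second is not None:
        # Search again with exactly one space between the two words.
        combined = first.group().upper() + " " + second.group().upper()
        pair = OPERATOR_TABLE.get(combined)
        if pair is not None:
            return pair[0], second.end()

    # Fall back to the first word alone when it is a valid command.
    if code is not None:
        return code, first.end()
    raise ParseError(f"syntax error at column {second_pos}")
```

With this table, "END IF" resolves to the two-word command, "END x" falls back to END alone and leaves x unread, and "EXIT WHILE" raises an error at the second word.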

If the identifier was not found in the table, then it could be a Variable, Array, Generic Function or Subroutine name. At this point, the Parser has insufficient information to determine which of these types the identifier is. The global or local dictionaries would need to be checked, and this is beyond the scope of the Parser (it will be handled by the Translator or Encoder). The Parser will simply return a type of No Parenthesis or Parenthesis.

Parser – Token Identification

There are four major types of tokens: an identifier, a constant (numeric or string), a symbol character or two representing an operator, and a remark (comment). Tokens don't contain white space (spaces or tabs), so white space will normally separate tokens, though white space is not required between identifiers or constants and symbols. In obtaining a token, any white space will be skipped, then the next character will be checked.
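As a rough illustration of that first step, the sketch below skips white space and then picks a token category from the next character, using the starting characters given in the descriptions of each token type. The function name and the category labels are made up for this example.

```python
def classify_next(text, pos):
    """Skip white space, then classify the token starting at the next character.

    Returns (category, start_pos). Hypothetical helper and labels.
    """
    # Skip spaces and tabs.
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    if pos == len(text):
        return "end", pos

    ch = text[pos]
    if ch.isascii() and ch.isalpha():
        # Includes word operators such as MOD and the REM command,
        # which are only sorted out after the whole word is read.
        return "identifier", pos
    if ch in "0123456789.":
        return "number", pos
    if ch == '"':
        return "string", pos
    if ch == "'":
        return "remark", pos
    return "operator", pos
```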

An identifier starts with a letter; contains letters, digits and/or under-bars; and ends with an optional data type symbol and possibly an opening parenthesis. Some operators are also made up of letters, for example, MOD, AND, OR, etc. One wrinkle: some BASIC commands may consist of two words separated by white space, for example SELECT CASE, END IF, EXIT FOR, etc.

A numeric constant starts with a digit or decimal point; contains digits and/or a decimal point; and possibly ends with an exponent (“e” or “E”, optional sign and digits). Only one decimal point is allowed, and a second one will terminate the token, as would a second “e” or “E”. (For example, 0.010.02 would be two constant tokens, 0.010 and .02, and this will generate an error downstream in the Translator; the Parser just separates the input into tokens.)
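The rules for a numeric constant can be sketched like this. The function name is hypothetical, and accepting the exponent only when at least one digit follows it is an assumption; the text does not say what happens to an "e" with no digits.

```python
DIGITS = "0123456789"


def scan_number(text, pos):
    """Scan a numeric constant starting at text[pos] (a digit or a point).

    Returns (lexeme, next_pos). The lexeme is not validated any further.
    """
    start = pos
    seen_point = False
    # Mantissa: digits with at most one decimal point.
    while pos < len(text):
        ch = text[pos]
        if ch in DIGITS:
            pos += 1
        elif ch == "." and not seen_point:
            seen_point = True
            pos += 1
        else:
            break  # a second point (or anything else) ends the token

    # Optional exponent: e or E, optional sign, then digits.
    if pos < len(text) and text[pos] in "eE":
        exp = pos + 1
        if exp < len(text) and text[exp] in "+-":
            exp += 1
        digits_start = exp
        while exp < len(text) and text[exp] in DIGITS:
            exp += 1
        if exp > digits_start:  # assumption: exponent needs at least one digit
            pos = exp

    return text[start:pos], pos
```

Calling scan_number("0.010.02", 0) returns 0.010 with a next position of 5, and calling it again from position 5 returns .02, matching the two-token split in the example.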

A string constant starts with a double-quote and contains any characters up to the next double-quote. Two double-quotes in a row place a single double-quote in the string constant.
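A small sketch of string scanning with the doubled-quote rule follows. The function name is hypothetical, and letting an unterminated string run to the end of the line is an assumption, since that case is not covered in the text.

```python
def scan_string(text, pos):
    """Scan a string constant whose opening double-quote is at text[pos].

    Returns (value, next_pos), where value has doubled quotes collapsed.
    """
    if text[pos] != '"':
        raise ValueError("string constant must start with a double-quote")
    pos += 1
    chars = []
    while pos < len(text):
        ch = text[pos]
        if ch == '"':
            # Two double-quotes in a row stand for one quote in the string.
            if pos + 1 < len(text) and text[pos + 1] == '"':
                chars.append('"')
                pos += 2
                continue
            return "".join(chars), pos + 1  # closing quote found
        chars.append(ch)
        pos += 1
    # Assumption: with no closing quote, the string ends at the end of the line.
    return "".join(chars), pos
```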

An operator character is any other valid operator symbol (for example, +, -, etc., including comma, semicolon and colon; basically any valid BASIC character that's not part of an identifier or constant). Operators can be multiple-character operators (for example >=, <=, <>) or single-character operators (=, <, >), and some single-character operators can be next to each other (for example +-, which is PLUS followed by unary MINUS).
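One way to handle both cases is to try the two-character operators first and fall back to a single character. This is only a sketch: the function name and the operator sets are assumptions (the text names just a few of the symbols).

```python
# Assumed operator sets for illustration; the real table may differ.
TWO_CHAR_OPERATORS = {">=", "<=", "<>"}
ONE_CHAR_OPERATORS = set("+-*/^=<>(),;:")


def scan_operator(text, pos):
    """Scan a symbol operator at text[pos]; return (operator, next_pos)."""
    # Longest match first, so ">=" is not split into ">" and "=".
    two = text[pos:pos + 2]
    if two in TWO_CHAR_OPERATORS:
        return two, pos + 2
    one = text[pos:pos + 1]
    if one in ONE_CHAR_OPERATORS:
        return one, pos + 1
    raise ValueError(f"unrecognized character at column {pos}")
```

Because +- is not in the two-character set, scan_operator("+-", 0) returns + alone, and a second call from position 1 returns -, leaving it to a later stage to decide that the minus is unary.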

(New) A remark can start with a REM command, but it must be followed by white space and, if it is not the first command on the line, must be preceded by a colon statement separator. Comments can also start with a single-quote (it appears the ANSI standard allows an exclamation point for comments, as in TrueBasic and ECMA-116, but most of the other BASICs I checked use single-quote, including G/W-Basic, QuickBasic, VisualBasic, FreeBasic, xBasic, PowerBasic, SmallBasic, and thinBasic; therefore I'm going with the crowd). When using a single-quote, it does not need to be followed by white space or preceded by a colon. All characters following the REM or single-quote up to the end of the line will be taken as the comment.
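The two ways of starting a remark might be checked as in the sketch below. The function name and the at_statement_start flag are hypothetical; the caller is assumed to know whether the position is the first command on the line or follows a colon. Matching REM case-insensitively and dropping the single white-space character after it are also assumptions.

```python
def scan_remark(text, pos, at_statement_start):
    """Return the comment text if a remark starts at text[pos], else None.

    at_statement_start is True when pos is the first command on the line
    or directly follows a colon statement separator (tracked by the caller).
    """
    # A single-quote starts a comment anywhere, with no white space needed.
    if text.startswith("'", pos):
        return text[pos + 1:]
    # REM must be at a statement start and be followed by white space.
    # A bare REM at the end of the line is not accepted in this sketch.
    if (at_statement_start
            and text[pos:pos + 3].upper() == "REM"
            and text[pos + 3:pos + 4] in (" ", "\t")):
        return text[pos + 4:]  # everything up to the end of the line
    return None
```

Note that REMARK is not taken as a remark here, since the character after REM is a letter rather than white space; it would go on to be scanned as an ordinary identifier.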

Updated Wednesday, January 6, 2010; 11.01 pm: Added information about remarks.