Wednesday, December 30, 2009

Language Definition – User Defined Functions

The first type of function to be supported is the classic BASIC user-defined, or single-statement, function.  These functions are defined with a DEF statement, and their identifier names start with FN.  They may take multiple arguments or none.  The entire function must be defined in one statement (and so on one line), though multiple user functions can be defined on the same line, separated by colons.  Two examples are:

    DEF FNHypot(X,Y)=SQR(X*X+Y*Y)
    DEF FNLength=LEN(First$+Last$)

Using FN for these functions means that no other variables, arrays, functions or subroutines may have names that start with FN.  These user functions may also have a data type (the default type is double precision).  The user function arguments are local variables of the function and are not related to variables of the same name outside the function.  In the first example above, X and Y are local to FNHypot(), so any X and Y outside of the function will not be affected when FNHypot() is called.  First$ and Last$ in FNLength are not arguments and are therefore not local.  Any variables used in a function that are not listed as arguments are regular variables.  These functions may also call other functions, but there needs to be a check to make sure that a function does not call itself.
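As a rough illustration of the self-call check mentioned above, here is a minimal sketch in Python (used only for illustration, not as the implementation language).  The names FN_NAME, referenced_functions and calls_itself are hypothetical, and the sketch assumes each function body is available as text, keyed by its upper-cased name.  It also treats an indirect loop (FNA calls FNB, which calls FNA) as a function calling itself, which is an assumption.

```python
import re

# Hypothetical pattern: any name starting with FN, with an optional type symbol.
FN_NAME = re.compile(r"\bFN[A-Za-z0-9_]*[%$#]?", re.IGNORECASE)

def referenced_functions(body):
    """Return the upper-cased FN names that appear in a function body."""
    return {name.upper() for name in FN_NAME.findall(body)}

def calls_itself(name, bodies):
    """Return True if function `name` can reach itself, directly or indirectly.

    `bodies` maps upper-cased function names to the text after the '='.
    """
    start = name.upper()
    pending = list(referenced_functions(bodies[start]))
    seen = set()
    while pending:
        current = pending.pop()
        if current == start:
            return True
        if current in seen or current not in bodies:
            continue  # already checked, or not a defined user function
        seen.add(current)
        pending.extend(referenced_functions(bodies[current]))
    return False
```

With bodies such as {"FNA": "FNB(X)+1", "FNB": "FNA(X)*2"}, the check reports that FNA calls itself, while FNHypot from the example above passes.  A real check would also have to skip FN text inside string constants, which this sketch ignores.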

Language Definition – Identifiers

Before getting into how the Parser is going to identify token types, there needs to be some definition of what the identifiers will look like.  There will be no limit placed on the size of identifiers; however, there is a practical upper limit because of the program line length.  While I would like to limit line lengths to, say, 80 characters, this is probably not realistic.  Some type of line wrapping will be necessary, but that's a problem for another day.

Identifiers must start with a letter, and may then contain any number of letters, digits and the underscore character.  Identifiers will be case insensitive, but they will be saved as first entered.  In other words, if a variable name like SomeVariableName is entered, that's how it will be saved and displayed, and any other form like somevariablename, SOMEVARIABLENAME, SOMEvariableName, etc. will refer to the same variable.  (There will be an allowance to rename variables later.)
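The "saved as first entered" rule can be sketched with a small lookup table.  This is only an illustration in Python; the NameTable class and its enter method are hypothetical names, and folding to upper case for comparison is an assumption about how the matching would be done.

```python
class NameTable:
    """Case-insensitive name table that keeps the first spelling entered."""

    def __init__(self):
        self._names = {}  # upper-cased name -> spelling as first entered

    def enter(self, name):
        """Record `name` if it is new; return the spelling to display."""
        return self._names.setdefault(name.upper(), name)
```

Entering SomeVariableName and then SOMEVARIABLENAME returns SomeVariableName both times, so the program always displays the original spelling.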

Identifiers must be unique across variables, arrays, functions and subroutines, and must not be any of the reserved BASIC commands and operators (e.g. PRINT, IF, etc., in any case, so Print is also not allowed).  A reserved word can, however, be used within an identifier; for example, Print5 is acceptable.  At the end of the identifier there can be an optional symbol for the data type: “%” for integer, “$” for string, and “#” for double precision (the default).  Later perhaps single precision can be supported with a “!” character.  The data type symbol is considered part of the name, so the variable names Variable, Variable% and Variable$ all refer to different variables and may all appear in the same program.

Array, function and subroutine identifier names have an opening parenthesis at the end with no intervening white space.  Note that while the opening parenthesis is considered part of the identifier, it is not stored.  Therefore, having both Variable and Variable() in the same program is not allowed.  This will allow array names to be used without the parentheses, as when passing an entire array to a function or subroutine, or in MAT statements if implemented.  Subroutine identifier names do not have a data type symbol as they don't have a return value.
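The identifier rules above can be summarized in one pattern.  The following is a sketch only, in Python for illustration; RESERVED, IDENTIFIER and classify are hypothetical names, and the reserved word set is a tiny assumed subset.  Two further assumptions: a reserved word followed by a type symbol is rejected as well, and any non-FN name ending in a parenthesis is simply labelled an array.

```python
import re

RESERVED = {"PRINT", "IF", "DEF"}  # assumed subset of the reserved words

# Letter, then letters/digits/underscores, optional type symbol, optional "(".
IDENTIFIER = re.compile(r"([A-Za-z][A-Za-z0-9_]*)([%$#]?)(\(?)")

DATA_TYPES = {"%": "integer", "$": "string", "#": "double", "": "double"}

def classify(text):
    """Return (stored name, kind, data type), or None if not an identifier."""
    match = IDENTIFIER.fullmatch(text)
    if match is None:
        return None
    base, symbol, paren = match.groups()
    if base.upper() in RESERVED:
        return None  # reserved words are rejected regardless of case
    if base.upper().startswith("FN"):
        kind = "user function"
    elif paren:
        kind = "array"  # assumption: non-FN names with "(" are arrays
    else:
        kind = "variable"
    # The type symbol is part of the stored name; the parenthesis is not.
    return base + symbol, kind, DATA_TYPES[symbol]
```

For example, Print5 is accepted as a variable while Print is rejected, and Variable% and Variable$ are stored under different names.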

Parser

The parser needs to take an input line and separate out tokens so that the translator can begin the process of converting the line into the internal RPN format. The tokens will be one of several different types:
  1. Command Name
  2. Internal Function Name
  3. Remark (Comment)  (New)
  4. Operator
  5. Variable Name
  6. Array Name
  7. User Function Name
  8. Constant
The first four on this list are part of the BASIC language and will be listed in the Operator Table. The Operator Table will contain several pieces of information like the priority of the operator, the internal code of the operator, the string representation of the operator (used by the Parser and the Recreator), the function to call when running the program, etc. The items in the Operator Table will be expanded as the Interactive BASIC Compiler is developed.  For now the Operator Table will contain the strings and the type (command, internal function or operator). For functions, there will also be a data type (integer, double, string or print).  (New) Comments will require special handling by the parser.
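To make the initial shape of the Operator Table concrete, here is a minimal sketch in Python (for illustration only).  Entry, OPERATOR_TABLE and lookup are hypothetical names, the entries are a small assumed sample, and the data types given for SQR and LEN are illustrative guesses rather than decisions.

```python
from collections import namedtuple

# For now each entry holds only the string, the type and (for functions)
# a data type; priority, internal code, run function, etc. would come later.
Entry = namedtuple("Entry", "text kind data_type")

OPERATOR_TABLE = [
    Entry("PRINT", "command", None),
    Entry("DEF", "command", None),
    Entry("REM", "remark", None),
    Entry("SQR(", "internal function", "double"),
    Entry("LEN(", "internal function", "integer"),
    Entry("+", "operator", None),
    Entry("*", "operator", None),
    Entry("=", "operator", None),
]

# Index by upper-cased string so that lookups are case insensitive.
_BY_TEXT = {entry.text.upper(): entry for entry in OPERATOR_TABLE}

def lookup(text):
    """Return the table entry for `text`, or None if it is not in the table."""
    return _BY_TEXT.get(text.upper())
```

A result of None from lookup means the token is either one of the last four types on the list or an invalid token.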

Each of the last four on this list will also have a data type associated with it. If a token is not found in the Operator Table, then it is either one of the last four or an invalid token (for example, a symbol that is not an operator).

The Parser will return one token at a time, each with a type and a data type. Along with each token will be the column at which the token starts. This column will be used for error reporting. There's no point in converting the entire line into tokens before the tokens are processed.
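The one-token-at-a-time idea maps naturally onto a generator.  The sketch below is in Python for illustration only; tokens and the small COMMANDS, INTERNAL and OPERATORS sets are hypothetical stand-ins for the Operator Table.  It assumes zero-based columns, leaves out the data type, and does not handle string constants or remarks.

```python
import re

COMMANDS = {"PRINT", "DEF", "LET"}                    # assumed subset
INTERNAL = {"SQR(", "LEN("}                           # assumed subset
OPERATORS = {"+", "-", "*", "/", "=", "(", ")", ","}  # assumed subset

WORD = re.compile(r"[A-Za-z][A-Za-z0-9_]*[%$#]?\(?")
NUMBER = re.compile(r"\d+(\.\d*)?")

def tokens(line):
    """Yield (column, kind, text) for one token at a time."""
    pos = 0
    while pos < len(line):
        if line[pos].isspace():
            pos += 1
            continue
        word = WORD.match(line, pos)
        number = NUMBER.match(line, pos)
        if word:
            text = word.group()
            upper = text.upper()
            if upper in COMMANDS:
                kind = "command"
            elif upper in INTERNAL:
                kind = "internal function"
            elif upper.startswith("FN"):
                kind = "user function"
            elif text.endswith("("):
                kind = "array"
            else:
                kind = "variable"
        elif number:
            text, kind = number.group(), "constant"
        elif line[pos] in OPERATORS:
            text, kind = line[pos], "operator"
        else:
            text, kind = line[pos], "invalid"  # column is kept for the error
        yield pos, kind, text
        pos += len(text)
```

Because this is a generator, the caller can stop at the first invalid token and report its column without the rest of the line ever being scanned.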

Updated Wednesday, January 6, 2010; 10.55 pm: Added information about remarks.