Sunday, January 17, 2010

Calling The Parser

First, how the Parser will be used needs to be defined. Since it is a class, it needs to be instanced:

    Parser parser;

The constructor doesn't need to do anything. The input line will be inputted from the user, initially from a prompt like BASIC interpreters, and later from a screen editor. In either case it will be a standard C zero-terminated character array. This array needs to be initially fed to the Parser, something like:

    char *input;
    ...
    parser.start(input);

The start function will initialize internal variables in preparation for parsing the line into tokens. Tokens are then obtained from the parser:

    Token *token;
    ...
    while ((token = parser.get_token()) != NULL)
    {
        ... process token here ...
    }

The Parser will also detect some syntax errors. To keeps things simple, a special Error token type will be added to the token type enumeration. For these error tokens, the column will be set to where the error was detected, and the string will contain the error message.

When the end of the line is reached, a NULL token pointer will be returned. The Parser will allocate the tokens, but it will up to the caller to delete the tokens. There will be a destructor for the Token structure that will delete the String contained within if set. This will eventually be handled by the Encoder once the tokens have been converted to the internal code. The Translator will process the tokens, possibly modifying them as the line is processed (more on this modifying later when the Translator is developed).

Saturday, January 16, 2010

Parser – Design

The Parser code will be a collection of functions and variables needed by the functions. All this could be coded in its own source file and either the variables are global in the source file (static to keep functions in other files from being able to access them) or put into a structure and passed between the functions. This latter approach is better, at least when writing C code.

However, with C++, the best way will be to implement the whole Parser is as a C++ class. This allows all the functions to share the variables (which is essentially a structure) and keeps them private to the functions. All the functions can also be kept private except for the main function that will be called to parse a line. The class definition will be in the main header file (ibcp.h) and the code will be in the parser.cpp source file. Each component in the Interactive BASIC Compiler will be a class, so it's best not to have a separate include file for each one. The other definitions (like the token structure and the Operator Table) will also be put into ibcp.h. The organization of the definitions will be adjusted as the project grows. The List and String classes can remain in their own files for now.

Friday, January 15, 2010

String Class – Testing and Release

The String class is contained in the source files string.h (class definition) and string.cpp (functions). A test_string program was written to test the functions implemented so far. The ibcp_0.0.2-src.zip file has been uploaded at Sourceforge IBCP Project. Also included is a test_nums program written to test the conversion of strings to numeric values (integers and doubles) that will be needed by the Parser for dealing with numeric constants. The .zip file contains all the files so far, however, no make files have been included.

A word about the development environment, which has changed slightly. I was able to get GCC 4.4.0 (a very recent version) using MinGW to work with VIDE2, so this will be used instead of Borland's C++ 5.5 Command Line tools. I also discovered there are some differences between the two in the names of library functions. For instance, Borland uses strcmpi and strncmpi for case insensitive comparison functions, where as GCC (LIBC) uses strcasecmp and strncasecmp. Allowing compilation for both, while not impossible, will be more work.

A note about make files, when switching projects in VIDE2, it must be told to rebuild the makefile since it does not do this automatically. VIDE2 continues to use the current Makefile.v file, no matter which project is loaded. To avoid confusion in the release files, only the VIDE2 project files will be be included. The Makefile.v file will be automatically generated for the first project loaded into VIDE2 if not present. Loading other projects after this requires use of the Project/Rebuild Makefile menu command.

One last note, while VIDE2 is being used to create and build the programs, I really don't like the built-in editor. While it does contain a Vi mode (my favorite programming text editor), VIDE2's Vi mode is just subset of the real Vi. Further, Vi is a subset of Vim (Vi IMproved) that I have used for years on Linux. I got tired of VIDE2 telling that the command I was trying to use was not supported. So instead I am using gVim (available at www.vim.org), which has a Windows version (Vim can also be installed with MSYS). This is the reason there is a “vim:ts=4:sw=4” at the beginning of each source file within a comment, which tells Vim to set the tab stop length at 4 characters (my preference).

Thursday, January 14, 2010

String Class - Implementation

The String class will consist of two variables, a length (an integer, which will support any length strings) and a character pointer to the string. As mentioned, this is not a C/C++ 0 terminated character array, the length value determines the length of the string. A zero length string will have the length set to zero and the character pointer set to null (i.e. zero). There will also be allowance for a String to hold a C string constant (i.e. const char *), where the length will be set to -1.

The functions to implement now are ones that will be needed for the Parser, mainly constructors for creating String instances from a character array including one from two pointers (one pointing to the beginning of the string and one pointing to the character after the last one of the same string), one from a pointer and a length, one from a just a length (just allocates the character array), and one from a C string constant pointer (for holding error strings). There are also constructors that create a null or zero length string (length is zero, pointer is zero) and one from another string instance. There is a destructor that deallocates the character array if there is one (i.e. allocated, for which length is greater than zero). Finally there are access functions for returning the length and the character array pointer.

Wednesday, January 13, 2010

String Class

The Token structure has a string field that will hold a string constant, identifier, numeric constant or comment string. Eventually, at least for a string constant, the string will be need to be used as a BASIC style string.

A BASIC string is specific data type in the language, whereas C/C++ does not have a string type but uses character arrays for strings terminated by a null character (a character with value zero). A BASIC string has a length value associated with it and in BASIC, a 0 character, aka CHR$(0), is allowed in a string, so a character value of zero cannot be used as a terminator.

Therefore, to work with BASIC style string, a string class is needed that will contain a char array pointer (a char array will be allocated for a given length) and a length value. When breaking up an input line, each token will be put into one of these String objects. This will save the extra work of extracting strings from the input line and then adding terminators to them. These Strings will also be used throughout for identifiers, numeric constants and comment strings.

The String class will contain the necessary functions for working with BASIC strings, like concatenation, comparison, object creation, and string functions (like needed for MID$, LEFT$, RIGHT$, etc.).

Saturday, January 9, 2010

Table (Parser) – Code Field

Explanation of the code field requires its own post. There also needs to be an internal code value used within the internal code for each program element. If would be most convenient for this to be the index of the table entry so that execution of the program is as fast as possible. For readability of the C++ source, it would be nice to also have an enumeration defined for each table entry. Take the example:

    enum {CODE1, CODE2, CODE3, NCODES};
    char *table[NCODES] = {"100", "200", "300"};

Then table[CODE1] would be used for table[0], and so on. The problem comes in when new entries are added to the table. To add a new element in the table, say "150" between "100" and "200", the corresponding value must be added to the enumeration, the new value, say CODE1_5, must be added between CODE1 and CODE2. If both entries are not added in the correct positions, the program will not function correctly.

I don't like the idea of this because it makes the program harder to maintain, especially when the table grow very large. There's a number of kluge ways to use the C pre-processor to solve this, but I don't like them either. The best way I've come up with is to have an entry conversion array that is allocated and set up when the program starts running with some checks to make sure everything is defined properly. This takes the burden array from the programmer, especially when the enumeration and table get very large (and they will get large for this project). And it won't affect the execution of the program since it won't be used by the Run Time.

Friday, January 8, 2010

Table (Parser)

One last thing is needed before the Parser can be implemented is the Table, so named because it contains a list of the all operators and the precedence priority used in the conversion to the RPN format. However, it is also a good place to list all the commands and have them in one place, so there is only one table to search through or use during source recreation.

Eventually the Table will have many fields in it, but these will be added as needed. But for now, for the Parser, the fields that are needed are:

    string – the characters of the entry
    type – the type of the entry
    data type – data type of the entry
    code – a code for the entry
    two flag – command can have two words or operator can have two characters flag

Thursday, January 7, 2010

Parser - Tokens

The Parser will return tokens, which will be contained in a structure containing the fields:

    type – the type of token
    column – the column of the input that the token started
    code – the internal code for the token
    string – pointer to the string of the token from the input
    data type – the data type of the token
    value – union of an integer and double value

Wednesday, January 6, 2010

Parsing Remarks

Special Note: I realized as I was thinking about the Parser, that I forgot to consider comments, or in BASIC terminology, “remarks” (i.e. the REM command). I have updated posts Parser and Parser – Token Identification with additional information about remarks. Remark tokens will be their own tokens that will include the string of the comment. There will be two different type of for remarks – the REM command and the single-quote.

Parsing remarks will require special handling. When a single word command is found, if it is REM (REM will be a single word command), then the parser will proceed to read the rest of the line and will return the REM remark token with the string of the comment.  The Translator will check to make sure there is a colon if front of the REM (ignore white-space) if it's not at the beginning of the line.

For the single-quote form of remarks, the single-quote will be treated as an operator symbol character.  When a single character operator is found, if it is a single-quote (which will be marked as a single character operator), then the parser will proceed to read the rest of the line and will return the single-quote operator, however, it will be marked as a remark, with the string of the comment.  The Translator will treat the comment as the end of the line and will make sure the command up to that point is valid if the single quote was not at the beginning of the line.  Single-quotes in the middle of a string constant will not be treated as a comment since the Parser will be collecting characters of the string constant up to the closing double-quote, in other words, the Parser will not be looking for an operator.

Parsing Operators

When neither an identifier or constant was found then the Parser will assume that the token will be an operator. The Operator Table is search for the single character. If not found in the table, then a syntax error is reported. Each table entry will have a two character flag if the operator can also be part of a two character operator (for example, < and > will have this flag set). If the entry does not have this two character flag set, then the internal code for the operator is returned.

For table entries with the two character flag, if the next character is not white space, the Operator Table is searched for the two character operator. If found, then the internal code for the operator is returned.  Otherwise the internal code for the single character operator is returned (assuming the single character is a valid operator, otherwise a syntax error is reported).

Monday, January 4, 2010

Parsing Constants

When a digit or period is found, the parser will look for a numeric constant. Some checks are needed. A leading zero must be followed by a period or the zero is taken alone. Only one period is allowed in a constant; finding a second period will terminate the constant. Finding an “e” or “E” will cause an exponent to be scanned (an optional sign followed by digits).

Once the constant has been scanned, if the constant did not have no decimal point or exponent, then an attempt will be made to convert the string into an integer. If this fails because the value is too large, then it will be converted into a double. If the conversion to double fails (overflow or underflow) then a syntax error will reported.

When a double-quote is found, the parser will look for a string constant. Two double-quotes will be converted into a single double-quote.  The string is terminated upon reaching the ending double-quote.  A syntax error will be reported if the end of the line is reached before the ending double-quote.

Sunday, January 3, 2010

Parsing Identifiers

Once a letter is found, the parser will continue collecting letters, digits and under-bars. When no more are found, the next character is checked for a data type symbol (“%”, “$” or “#”) and if found the data type will be set to an integer, string or double (initially set to none). Finally, the next character is checked for an opening parenthesis, and if found a parenthesis flag is set. If the the first two characters are FN, then the token type is set to Defined Functioned, and the string is returned along with the data type and parenthesis flag.

The Operator Table will then be searched for the identifier. If the identifier is found in the table and the entry is not flagged that the word is part of a two word command, then the internal code for the command, operator or internal function will be returned for the token along data type (internal function only). No string from the input needs to be returned.

If the table entry is flagged that the command is part of a two word command, then a second identifier after white space is read from the input. The second identifier will not have a data type symbol or parenthesis. The Operator Table is searched again for the two word identifier (with one space between). If found then the internal code for the command will be returned. If not found and if the first word by itself is a valid command, then the internal code for the command is returned; otherwise a syntax error is reported at the second word.

If the identifier was not found in the table, then is could be a Variable, Array, Generic Function or Subroutine name. At this point, the Parser has insufficient information to determine which of these types the identifier is. The global or local dictionaries would need to be checked and this is beyond the scope for the Parser (this will be handled by the Translator or Encoder). The Parser will simply return a type of No Parenthesis or Parenthesis.

Parser – Token Identification

There are four major types of tokens, an identifier, a constant (numeric or string), a symbol character or two representing operators, and a remark (comment). Tokens don't contain white space (spaces or tabs), so any white space will normally separate tokens, though a between identifiers or constant an symbols will not require white space. In obtaining a token, any white space will be skipped, then the next character will be checked.

An identifier starts with a letter; contains letters, digits and/or under-bars; and ends with an optional data type symbol and possibly an opening parenthesis. Some operators are also made up of letters, for example, MOD, AND, OR, etc. One wrinkle, some BASIC commands may consist of two words separated with white space, for example SELECT CASE, END IF, EXIT FOR, etc.

A numeric constant starts with a digit or decimal point; contains digits and/or decimal point; and possibly ends with an exponent (“e” or “E”, optional sign and digits). Only one decimal point is allowed and a second one will terminate the token as would a second “e” or “E”. (For example, 0.010.02, would be two constant tokens 0.010 and .02, and this will generate an error downstream in the translator – the Parser just separates the input into tokens.)

A string constant starts with a double-quote; contains any characters; up to the next double-quote. Two double quotes in a row would place a single double-quote in the string constant.

An operator character is any other valid operator symbol (for example, +, -, etc. including comma, semicolon and colon – basically any valid BASIC character that's not part of an identifier or constant). Operators can also contain multiple character operators (for example >=, <=, <>), but also be single character operators (=, <, >) and some single characters operators can be next to each each (for example +-, which is PLUS followed by unary MINUS).

(New) A remark can start with a REM command, but must be followed by white-space and if not the first command on the line, must be preceded by a colon statement separator.  Comments can also start with a single-quote (it appears the ANSI standard allows exclamation point for comments, aka TrueBasic and ECMA-116, but most of the other BASICs I checked use single-quote including G/W-Basic, QuickBasic, VisualBasic, FreeBasic, xBasic, PowerBasic, SmallBasic, and thinBasic; therfore I'm going with the crowd).  When using a single-quote, it does not need to be followed by white-space or preceded by a colon.  All characters following the REM or single-quote up to the end of the line will be taken as the comment.

Updated Wednesday, January 6, 2010; 11.01 pm: Added information about remarks.

Friday, January 1, 2010

Language Definition – Constants

One final item before moving on to the implementation of the Parser. In the Parser, numeric constants are not signed. Any plus or minus sign will be treated as a unary operator as far as token parsing is concerned.  If later during translation, a constant has a unary operator, then it will be folded into the constant. The Parser and/or Translator can't assume that a plus or minus is part of the constant too soon, take the example A‑5, if the constant is ‑5, then this expression would have two tokens, A and ‑5, and would have no operator, which is an error.

Integer constants will be any numeric value that fits in an integer and does not have a decimal point. Double (floating point) constants will have a decimal point, an optional exponent (“e” or “E” followed by an optional sign and some digits), or will not fit in a integer. Integers will be 32‑bit values having a range of ‑2,147,483,648 to 2,147,483,647, however, since there is no sign, the value ‑2,147,483,648 is not possible. Double precision values have a range of approximately 1e‑308 to 1e+308.

String constants start and end with a double quote. There will also be an easy way of putting double quotes in a string constant because using "hello "+CHR$(34)+"world"+CHR$(34) is cumbersome and unintuitive. The C style like "hello \"world\"" (also used by xBasic) is not desirable either. TrueBasic's method is two double quotes like "hello ""world""" is probably the best solution, so that method will be used.

Language Definition – Subroutines

Subroutines are similar to functions except subroutines have no return value and therefore the identifier name does not have a data type symbol. The general syntax for a subroutine is:

    SUB sub_name(arguments)
   ...
    END SUB

The arguments will be optional. Arguments will be called by reference if a variable or array element is used as an argument, otherwise the arguments are by values. The arguments will be treated as local variables, except if called by reference, modifying the local variable was also modify the caller's variable or array element. The by reference call can be prevented by surrounding the argument in the function call by parentheses as in (A). Entire arrays can be passed by reference by using just the array name in the subroutine call.

Subroutines may call themselves recursively. It is up to the programmer to prevent an infinite loop, though eventually memory will run out as each subroutine call will increase the size of the stack.

The subroutine will return upon a RETURN statement or when the END SUB statement is reached. The SUB will not be executed upon reaching the SUB statement, execution will proceed to the after the END SUB statement. The subroutine will be called using a CALL statement as:

    CALL sub_name(arguments)

Updated Saturday January 2, 2010; 9:45am: Looks like most of the BASICs use just "SUB" for subroutine, so I decided to also use just SUB instead of SUBROUTINE. 

Thursday, December 31, 2009

Language Definition – Generic Functions

The second type of functions being supported are generic or multiple line functions.  These functions are defined with a FUNCTION statement and whose identifier can be any standard identifier name with an optional data type symbol.  The general syntax for a function is:

    FUNCTION function_name(arguments)
        ...
        function_name = <return value>
        ...
    END FUNCTION