Saturday, January 30, 2010

New List Class – Release

The test_stack program was modified to include a new test_list() function that tests all the new list functions. The existing stack test functions test to make sure all the existing stack functions are still working correctly. The print_stack functions were renamed to print_list along with the arguments since these functions are really printing the underlying list. (Besides, it didn't make sense calling print_stack() from the test_list() function.) The ibcp_0.0.3-src.zip file has been uploaded at Sourceforge IBCP Project. The binaries for the test programs have now also been uploaded so compilation is not necessary.

One last minor fix was needed to the List class in the destructor. Upon scanning through the ANSI-ISO_C++.pdf document, I learned there is a distinction between using new to allocate a single item and new[] to allocate an array. You must use the corresponding delete and delete[] to deallocate the memory. For the List class, "new Element" is used throughout so there are no issues with that. However, the master element is allocated using new[]:

    master = (Element *)new char[sizeof(Element) – sizeof(T)];

This is used because the master element only contains previous and next element pointers, but no value. The statement above subtracts the size of the value (template generic type T) from the size of the element. Since delete[] needs to be used with new[], in the destructor, the delete master statement was changed to:

    delete[] (char *)master;

This gives delete[] the character pointer originally obtained from the new char[...] statement before is was type cast to an Element pointer.

Now it is time to get back to the Parser implementation.

New List Class – Implementation

A problem occurred during the update of the list class. It was desired that the append, insert and remove functions to not be inline functions. Defining them in the class automatically makes them inline. Therefore, they were defined outside the main List class. They are still in the list.h include file, but because they are template functions, no code is generated until a class is instantiated and the function is used.

The problem occurred for the functions that return an element pointer. The List class contains the public Element structure (next and previous element pointers and the generic value of type T). The Element structure needs to be defined within the List class since it contains the type T of the template. Inside the class definition, the Element structure can be simply referred to as “Element” or “Element *” for pointers. Outside the function, the syntax “List<type>::Element” and “List<type>::Element *” is needed where “type” is the instantiated type. For defining template functions “List<T>::Element” and “List<T>::Element *” are used.

Therefore, to define a template function that returns and Element pointer and takes an Element pointer as an argument would appear that it should be written as:

    template <class T> List<T>::Element *List<T>::fun(List<T>::Element *element)

Unfortunately, the compiler gave the error: error: expected constructor, destructor, or type conversion before '*' token. The problem was with the return value since the argument syntax was in the List class previously causing no issues.

List Class – Updated

The List class needs to be updated to include more list like functions. The first implementation was really oriented towards the support of stack like functions (i.e. push and pop). For a list, elements need to be added to any position in the list, not just the end. This will be needed for the Translator as the RPN list is built. The following operations are therefore needed:
  • Add a value to a new element after an element (append)
  • Add a value to a new element before an element (insert)
  • Get a value and remove it's element (remove)
Several other changes will be made to the list class.

Friday, January 29, 2010

Parser – Implementation Notes

The get_token() function always allocates a new Token and sets the column to the current position (by subtracting the pointer to beginning of the line, i.e. input). Therefore a constructor was added that accepts an integer value (the column) and also initializes the string pointer to NULL. A destructor was also added to delete the string if it is not NULL.

The get_identifier() function calls scan_word to get the first word and then checks to see if the string begins with “FN” (including lower case) for a one-line user function. No reason to search the table first. Otherwise, the table is then searched. If not found than the identifier is returned including whether there was a opening parentheses and the data type if any.

If the word was found in the table and is only a single word, the command, internal function or word operator is returned. I decided to change the Boolean twoflag in the table to an Enumeration named Multiple so that three word and three character operators could be supported. For now only two word commands and two character operators are supported.

If the command can be two words, white space is skipped and scan_word() is called again to get the second word. If there is a data type or parentheses, then it is not a valid second word of a command. If the first word is valid as a single word, then it is returned as the token (the second word is held for the next token). Otherwise, the table is searched again for the two words. If not found in the table, then if the first word is valid as a single word, then it is returned as the token, otherwise an error token is returned. If found, the two word command is returned as the token. Support for three word commands was not implemented at this time.

The implementation of skip_whitespace(), scan_word(), and get_number() was straight forward. The implementation of get_operator() was similar to get_identifier() except only single characters are involved (no scanning necessary). Support for three character operators will not be implemented at this time.

I decided that searching the table should not be part of the Parser class, but should be part of another class for the Table. By the way, the Operator Table will now be known as simply the Table, since it holds more than just operators.

Friday, January 22, 2010

Parser – Support Functions

There will a few support functions that will be used by the main functions:

    skip_whitespace() - skips white space at the current position
    scan_word()  - scan for a word of an identifier
    scan_string()  - scan for string constant counting or copying the string
    search_table()  - searches Operator Table for one or two words

The skip_whitespace() function will be called before the main parsing functions are called. It will also be used by get_identifier() before scanning for the second word if the command of the identifier is part of a two word command.

The scan_word() function will be used by get_identifier() to actually check and get the identifier. This functionality is in it's own function so that it can be used twice if the identifier is a command that is part of a two word command.

The scan_string() function will be used by get_string() to first count the size of the string so that the memory  can be allocated and then called again to copy the string. The reason for the double scan is the possibility of the presence two double-quotes in the string, which are only counted and copied as one character (so a simple copy after the string length is counted won't work).

The search_table() function will be used by get_identifier() and get_operator() to search the Operator Table. There will be two string arguments for searching two word commands. The second string will be optional, passed as NULL when searching for one word or operator.

Tuesday, January 19, 2010

Parser – Main Functions

The Parser code will be broken into four main functions, one for each of the major token types:

    get_identifier()  - checks for and gets an identifier
    get_number()  - checks for and gets an constant integer or double number
    get_string()      - checks for and gets a string constant
    get_operator()    - checks for and gets an operator (symbol characters)

Each of these functions will take no arguments and return a Boolean flag if they found a token of the type they were looking. If they don't find their token type at the current position, they will return false, so the next main token function can be called. If all return false, then a syntax error will be returned.

When one of these functions find their token type, they will continue scanning the input line for end of the token. They will fill the previous allocated token before returning true. If they detect an error, they will set the token to the error token type with an appropriate error message.  They will leave the current position set to the first character that they did not process, ready for the next token to be parsed (after skipping white space).

Monday, January 18, 2010

Parser – Variables

The Parser only requires only a few variables:

    char *input  – pointer to input line being processed
    char *pos  – pointer to the current position within the input line
    Token *token  – pointer to the token that will be filled and returned

The start() function will set the input and pos variables to the input line it is given. Each call to get_token() will start by allocating a new token to which it's pointer is put into the token variable. Remember that even errors returned will require a token. Next Before beginning to see what token type is at the current position, any white space is skipped.

Sunday, January 17, 2010

Calling The Parser

First, how the Parser will be used needs to be defined. Since it is a class, it needs to be instanced:

    Parser parser;

The constructor doesn't need to do anything. The input line will be inputted from the user, initially from a prompt like BASIC interpreters, and later from a screen editor. In either case it will be a standard C zero-terminated character array. This array needs to be initially fed to the Parser, something like:

    char *input;
    ...
    parser.start(input);

The start function will initialize internal variables in preparation for parsing the line into tokens. Tokens are then obtained from the parser:

    Token *token;
    ...
    while ((token = parser.get_token()) != NULL)
    {
        ... process token here ...
    }

The Parser will also detect some syntax errors. To keeps things simple, a special Error token type will be added to the token type enumeration. For these error tokens, the column will be set to where the error was detected, and the string will contain the error message.

When the end of the line is reached, a NULL token pointer will be returned. The Parser will allocate the tokens, but it will up to the caller to delete the tokens. There will be a destructor for the Token structure that will delete the String contained within if set. This will eventually be handled by the Encoder once the tokens have been converted to the internal code. The Translator will process the tokens, possibly modifying them as the line is processed (more on this modifying later when the Translator is developed).

Saturday, January 16, 2010

Parser – Design

The Parser code will be a collection of functions and variables needed by the functions. All this could be coded in its own source file and either the variables are global in the source file (static to keep functions in other files from being able to access them) or put into a structure and passed between the functions. This latter approach is better, at least when writing C code.

However, with C++, the best way will be to implement the whole Parser is as a C++ class. This allows all the functions to share the variables (which is essentially a structure) and keeps them private to the functions. All the functions can also be kept private except for the main function that will be called to parse a line. The class definition will be in the main header file (ibcp.h) and the code will be in the parser.cpp source file. Each component in the Interactive BASIC Compiler will be a class, so it's best not to have a separate include file for each one. The other definitions (like the token structure and the Operator Table) will also be put into ibcp.h. The organization of the definitions will be adjusted as the project grows. The List and String classes can remain in their own files for now.

Friday, January 15, 2010

String Class – Testing and Release

The String class is contained in the source files string.h (class definition) and string.cpp (functions). A test_string program was written to test the functions implemented so far. The ibcp_0.0.2-src.zip file has been uploaded at Sourceforge IBCP Project. Also included is a test_nums program written to test the conversion of strings to numeric values (integers and doubles) that will be needed by the Parser for dealing with numeric constants. The .zip file contains all the files so far, however, no make files have been included.

A word about the development environment, which has changed slightly. I was able to get GCC 4.4.0 (a very recent version) using MinGW to work with VIDE2, so this will be used instead of Borland's C++ 5.5 Command Line tools. I also discovered there are some differences between the two in the names of library functions. For instance, Borland uses strcmpi and strncmpi for case insensitive comparison functions, where as GCC (LIBC) uses strcasecmp and strncasecmp. Allowing compilation for both, while not impossible, will be more work.

A note about make files, when switching projects in VIDE2, it must be told to rebuild the makefile since it does not do this automatically. VIDE2 continues to use the current Makefile.v file, no matter which project is loaded. To avoid confusion in the release files, only the VIDE2 project files will be be included. The Makefile.v file will be automatically generated for the first project loaded into VIDE2 if not present. Loading other projects after this requires use of the Project/Rebuild Makefile menu command.

One last note, while VIDE2 is being used to create and build the programs, I really don't like the built-in editor. While it does contain a Vi mode (my favorite programming text editor), VIDE2's Vi mode is just subset of the real Vi. Further, Vi is a subset of Vim (Vi IMproved) that I have used for years on Linux. I got tired of VIDE2 telling that the command I was trying to use was not supported. So instead I am using gVim (available at www.vim.org), which has a Windows version (Vim can also be installed with MSYS). This is the reason there is a “vim:ts=4:sw=4” at the beginning of each source file within a comment, which tells Vim to set the tab stop length at 4 characters (my preference).

Thursday, January 14, 2010

String Class - Implementation

The String class will consist of two variables, a length (an integer, which will support any length strings) and a character pointer to the string. As mentioned, this is not a C/C++ 0 terminated character array, the length value determines the length of the string. A zero length string will have the length set to zero and the character pointer set to null (i.e. zero). There will also be allowance for a String to hold a C string constant (i.e. const char *), where the length will be set to -1.

The functions to implement now are ones that will be needed for the Parser, mainly constructors for creating String instances from a character array including one from two pointers (one pointing to the beginning of the string and one pointing to the character after the last one of the same string), one from a pointer and a length, one from a just a length (just allocates the character array), and one from a C string constant pointer (for holding error strings). There are also constructors that create a null or zero length string (length is zero, pointer is zero) and one from another string instance. There is a destructor that deallocates the character array if there is one (i.e. allocated, for which length is greater than zero). Finally there are access functions for returning the length and the character array pointer.

Wednesday, January 13, 2010

String Class

The Token structure has a string field that will hold a string constant, identifier, numeric constant or comment string. Eventually, at least for a string constant, the string will be need to be used as a BASIC style string.

A BASIC string is specific data type in the language, whereas C/C++ does not have a string type but uses character arrays for strings terminated by a null character (a character with value zero). A BASIC string has a length value associated with it and in BASIC, a 0 character, aka CHR$(0), is allowed in a string, so a character value of zero cannot be used as a terminator.

Therefore, to work with BASIC style string, a string class is needed that will contain a char array pointer (a char array will be allocated for a given length) and a length value. When breaking up an input line, each token will be put into one of these String objects. This will save the extra work of extracting strings from the input line and then adding terminators to them. These Strings will also be used throughout for identifiers, numeric constants and comment strings.

The String class will contain the necessary functions for working with BASIC strings, like concatenation, comparison, object creation, and string functions (like needed for MID$, LEFT$, RIGHT$, etc.).

Saturday, January 9, 2010

Table (Parser) – Code Field

Explanation of the code field requires its own post. There also needs to be an internal code value used within the internal code for each program element. If would be most convenient for this to be the index of the table entry so that execution of the program is as fast as possible. For readability of the C++ source, it would be nice to also have an enumeration defined for each table entry. Take the example:

    enum {CODE1, CODE2, CODE3, NCODES};
    char *table[NCODES] = {"100", "200", "300"};

Then table[CODE1] would be used for table[0], and so on. The problem comes in when new entries are added to the table. To add a new element in the table, say "150" between "100" and "200", the corresponding value must be added to the enumeration, the new value, say CODE1_5, must be added between CODE1 and CODE2. If both entries are not added in the correct positions, the program will not function correctly.

I don't like the idea of this because it makes the program harder to maintain, especially when the table grow very large. There's a number of kluge ways to use the C pre-processor to solve this, but I don't like them either. The best way I've come up with is to have an entry conversion array that is allocated and set up when the program starts running with some checks to make sure everything is defined properly. This takes the burden array from the programmer, especially when the enumeration and table get very large (and they will get large for this project). And it won't affect the execution of the program since it won't be used by the Run Time.

Friday, January 8, 2010

Table (Parser)

One last thing is needed before the Parser can be implemented is the Table, so named because it contains a list of the all operators and the precedence priority used in the conversion to the RPN format. However, it is also a good place to list all the commands and have them in one place, so there is only one table to search through or use during source recreation.

Eventually the Table will have many fields in it, but these will be added as needed. But for now, for the Parser, the fields that are needed are:

    string – the characters of the entry
    type – the type of the entry
    data type – data type of the entry
    code – a code for the entry
    two flag – command can have two words or operator can have two characters flag

Thursday, January 7, 2010

Parser - Tokens

The Parser will return tokens, which will be contained in a structure containing the fields:

    type – the type of token
    column – the column of the input that the token started
    code – the internal code for the token
    string – pointer to the string of the token from the input
    data type – the data type of the token
    value – union of an integer and double value

Wednesday, January 6, 2010

Parsing Remarks

Special Note: I realized as I was thinking about the Parser, that I forgot to consider comments, or in BASIC terminology, “remarks” (i.e. the REM command). I have updated posts Parser and Parser – Token Identification with additional information about remarks. Remark tokens will be their own tokens that will include the string of the comment. There will be two different type of for remarks – the REM command and the single-quote.

Parsing remarks will require special handling. When a single word command is found, if it is REM (REM will be a single word command), then the parser will proceed to read the rest of the line and will return the REM remark token with the string of the comment.  The Translator will check to make sure there is a colon if front of the REM (ignore white-space) if it's not at the beginning of the line.

For the single-quote form of remarks, the single-quote will be treated as an operator symbol character.  When a single character operator is found, if it is a single-quote (which will be marked as a single character operator), then the parser will proceed to read the rest of the line and will return the single-quote operator, however, it will be marked as a remark, with the string of the comment.  The Translator will treat the comment as the end of the line and will make sure the command up to that point is valid if the single quote was not at the beginning of the line.  Single-quotes in the middle of a string constant will not be treated as a comment since the Parser will be collecting characters of the string constant up to the closing double-quote, in other words, the Parser will not be looking for an operator.

Parsing Operators

When neither an identifier or constant was found then the Parser will assume that the token will be an operator. The Operator Table is search for the single character. If not found in the table, then a syntax error is reported. Each table entry will have a two character flag if the operator can also be part of a two character operator (for example, < and > will have this flag set). If the entry does not have this two character flag set, then the internal code for the operator is returned.

For table entries with the two character flag, if the next character is not white space, the Operator Table is searched for the two character operator. If found, then the internal code for the operator is returned.  Otherwise the internal code for the single character operator is returned (assuming the single character is a valid operator, otherwise a syntax error is reported).

Monday, January 4, 2010

Parsing Constants

When a digit or period is found, the parser will look for a numeric constant. Some checks are needed. A leading zero must be followed by a period or the zero is taken alone. Only one period is allowed in a constant; finding a second period will terminate the constant. Finding an “e” or “E” will cause an exponent to be scanned (an optional sign followed by digits).

Once the constant has been scanned, if the constant did not have no decimal point or exponent, then an attempt will be made to convert the string into an integer. If this fails because the value is too large, then it will be converted into a double. If the conversion to double fails (overflow or underflow) then a syntax error will reported.

When a double-quote is found, the parser will look for a string constant. Two double-quotes will be converted into a single double-quote.  The string is terminated upon reaching the ending double-quote.  A syntax error will be reported if the end of the line is reached before the ending double-quote.

Sunday, January 3, 2010

Parsing Identifiers

Once a letter is found, the parser will continue collecting letters, digits and under-bars. When no more are found, the next character is checked for a data type symbol (“%”, “$” or “#”) and if found the data type will be set to an integer, string or double (initially set to none). Finally, the next character is checked for an opening parenthesis, and if found a parenthesis flag is set. If the the first two characters are FN, then the token type is set to Defined Functioned, and the string is returned along with the data type and parenthesis flag.

The Operator Table will then be searched for the identifier. If the identifier is found in the table and the entry is not flagged that the word is part of a two word command, then the internal code for the command, operator or internal function will be returned for the token along data type (internal function only). No string from the input needs to be returned.

If the table entry is flagged that the command is part of a two word command, then a second identifier after white space is read from the input. The second identifier will not have a data type symbol or parenthesis. The Operator Table is searched again for the two word identifier (with one space between). If found then the internal code for the command will be returned. If not found and if the first word by itself is a valid command, then the internal code for the command is returned; otherwise a syntax error is reported at the second word.

If the identifier was not found in the table, then is could be a Variable, Array, Generic Function or Subroutine name. At this point, the Parser has insufficient information to determine which of these types the identifier is. The global or local dictionaries would need to be checked and this is beyond the scope for the Parser (this will be handled by the Translator or Encoder). The Parser will simply return a type of No Parenthesis or Parenthesis.

Parser – Token Identification

There are four major types of tokens, an identifier, a constant (numeric or string), a symbol character or two representing operators, and a remark (comment). Tokens don't contain white space (spaces or tabs), so any white space will normally separate tokens, though a between identifiers or constant an symbols will not require white space. In obtaining a token, any white space will be skipped, then the next character will be checked.

An identifier starts with a letter; contains letters, digits and/or under-bars; and ends with an optional data type symbol and possibly an opening parenthesis. Some operators are also made up of letters, for example, MOD, AND, OR, etc. One wrinkle, some BASIC commands may consist of two words separated with white space, for example SELECT CASE, END IF, EXIT FOR, etc.

A numeric constant starts with a digit or decimal point; contains digits and/or decimal point; and possibly ends with an exponent (“e” or “E”, optional sign and digits). Only one decimal point is allowed and a second one will terminate the token as would a second “e” or “E”. (For example, 0.010.02, would be two constant tokens 0.010 and .02, and this will generate an error downstream in the translator – the Parser just separates the input into tokens.)

A string constant starts with a double-quote; contains any characters; up to the next double-quote. Two double quotes in a row would place a single double-quote in the string constant.

An operator character is any other valid operator symbol (for example, +, -, etc. including comma, semicolon and colon – basically any valid BASIC character that's not part of an identifier or constant). Operators can also contain multiple character operators (for example >=, <=, <>), but also be single character operators (=, <, >) and some single characters operators can be next to each each (for example +-, which is PLUS followed by unary MINUS).

(New) A remark can start with a REM command, but must be followed by white-space and if not the first command on the line, must be preceded by a colon statement separator.  Comments can also start with a single-quote (it appears the ANSI standard allows exclamation point for comments, aka TrueBasic and ECMA-116, but most of the other BASICs I checked use single-quote including G/W-Basic, QuickBasic, VisualBasic, FreeBasic, xBasic, PowerBasic, SmallBasic, and thinBasic; therfore I'm going with the crowd).  When using a single-quote, it does not need to be followed by white-space or preceded by a colon.  All characters following the REM or single-quote up to the end of the line will be taken as the comment.

Updated Wednesday, January 6, 2010; 11.01 pm: Added information about remarks.

Friday, January 1, 2010

Language Definition – Constants

One final item before moving on to the implementation of the Parser. In the Parser, numeric constants are not signed. Any plus or minus sign will be treated as a unary operator as far as token parsing is concerned.  If later during translation, a constant has a unary operator, then it will be folded into the constant. The Parser and/or Translator can't assume that a plus or minus is part of the constant too soon, take the example A‑5, if the constant is ‑5, then this expression would have two tokens, A and ‑5, and would have no operator, which is an error.

Integer constants will be any numeric value that fits in an integer and does not have a decimal point. Double (floating point) constants will have a decimal point, an optional exponent (“e” or “E” followed by an optional sign and some digits), or will not fit in a integer. Integers will be 32‑bit values having a range of ‑2,147,483,648 to 2,147,483,647, however, since there is no sign, the value ‑2,147,483,648 is not possible. Double precision values have a range of approximately 1e‑308 to 1e+308.

String constants start and end with a double quote. There will also be an easy way of putting double quotes in a string constant because using "hello "+CHR$(34)+"world"+CHR$(34) is cumbersome and unintuitive. The C style like "hello \"world\"" (also used by xBasic) is not desirable either. TrueBasic's method is two double quotes like "hello ""world""" is probably the best solution, so that method will be used.

Language Definition – Subroutines

Subroutines are similar to functions except subroutines have no return value and therefore the identifier name does not have a data type symbol. The general syntax for a subroutine is:

    SUB sub_name(arguments)
   ...
    END SUB

The arguments will be optional. Arguments will be called by reference if a variable or array element is used as an argument, otherwise the arguments are by values. The arguments will be treated as local variables, except if called by reference, modifying the local variable was also modify the caller's variable or array element. The by reference call can be prevented by surrounding the argument in the function call by parentheses as in (A). Entire arrays can be passed by reference by using just the array name in the subroutine call.

Subroutines may call themselves recursively. It is up to the programmer to prevent an infinite loop, though eventually memory will run out as each subroutine call will increase the size of the stack.

The subroutine will return upon a RETURN statement or when the END SUB statement is reached. The SUB will not be executed upon reaching the SUB statement, execution will proceed to the after the END SUB statement. The subroutine will be called using a CALL statement as:

    CALL sub_name(arguments)

Updated Saturday January 2, 2010; 9:45am: Looks like most of the BASICs use just "SUB" for subroutine, so I decided to also use just SUB instead of SUBROUTINE.