Interactive BASIC Compiler Project

Wednesday, March 17, 2010

Translator – Developmental Release

Another change is needed in the Token class - the “code” field needs to be changed to “index” because it is really an index of a Table entry and not the code enumeration. It was originally named code because it was going to be the code within the program in memory, which will make the program easy to execute (the index is still going to be used in the program memory). The table entry will have a function pointer to the routine that will execute the command or operator. If the code from the Code enumeration was used, then the run-time module would have to first index through the code-to-index conversion array to get the index of the table and then index to the table entries array.

Changes have be made to the Parser (special EOL token and changing the Token member code to index), Table (added unary_code field and entries for Null, Neg, and EOL codes) and the test code (renamed test_parser.cpp to test_ibcp.cpp because it's going to contain all of the test code; separated code that prints token contents into its own function so it can be called from other function like from the test_translator() function to be written; modified to work with the new EOL token; and was setup to allow for additional test functions where the command line arguments have been changed in that now “-p” activates the Parser test code).

The ibcp_0.1.3-src.zip file has been uploaded at Sourceforge IBCP Project along with the binary for the program. The code essentially does what it did before, but contains the recent changes mentioned. The reason for the release was to follow the spirit of open source code (release early, release often) and I wanted to verify that the changes so far compiled and worked. Also, I discovered the source for Release 0.1.2 was missing the test_parser.cpp source file.

Translator Class – Table/Parser Changes

The new unary operator Neg_Code for Sub_Code needs to be added, which will represent negation, along with it's corresponding table entry. To other two codes also need to be added, Null_Code and EOL_Code.

The Null_Code will be used for two things. One being a very low (lowest) precedence operator that will be pushed onto the operator stack before processing tokens. This will simplify the Translator when it is checking and popping off high precedence operators as it won't need to check if the stack is empty first. The second will be for the new unary_code field that needs to be added to the Table entries to indicate an operator's associated unary operator with Null_Code indicating that there is no unary operator.

The Parser needs to be modified to return a token with an EOL_Code at the end of the line instead of returning a NULL pointer. The reason for this changes is so that this EOL token can be passed into the Translator. The EOL will be treated as an operator and it's precedence will also be very low, but higher than the Null operator. The EOL operator will force the Translator to empty the operator stack of any operators except for the Null operator. Also by using the EOL operator, the Translator can make sure an end of line is allowed. In other words, if the Translator is expecting an operand (as after it has just seen an operator), but sees EOL (an operator), an error will be reported (“operand expected”).

Tuesday, March 16, 2010

Translator Class – Design

There is more that the Translator will handle like parentheses, arrays and functions, commas and semicolons, commands, data type handling and references. Before getting too involved, it's time to layout the Translator class and get basic functionality working with simple expressions and then build the rest upon it. First up is how the Translator class will be used, which will have these member functions:

   start() - initializes the RPN list, stacks and other variables for a new line
   add_token() - adds a token obtained from the Parser
   get_result() - gets the RPN list after the last token has been added

The get_token() function will return an enumeration status code on the result of the adding the token with values of Good (token accepted), Done (last token has been added) and various return error codes. If an error occurs, the caller will use the information in the current token to output the error (like was done with error tokens from the Parser).

The get_result() function will be nothing more than an access function that will return the RPN list.

Sunday, March 14, 2010

Translator – Unary Operators

Up to now, the subject of operators have referred to binary operators or operators with two operands. Unary operators only have only one operand. Only two unary operators are currently planned, - (negate) and NOT. Unary operators are found in expressions preceding an operand. Examples are -A + B, A + -B, and -(A+B). Two unary operators appearing together will be allowed, for example A + --B, but the translator will make no effort to try and optimize this (it's up to the programmer to make an effort to not do stuff like this).

When the current token contains a unary operator, it will be pushed directly onto the operator stack, since there are no operators currently planned that have the highest precedence. With the highest precedence, they will come off the stack whenever a binary operator is processed.

There will be two negate operators, one for integers (NegInt) and one for doubles (NegDbl), so the unary – operator will not need to insert any conversion operator. However, the NOT operator will only work with integers, so if the operand is a double, a hidden double to integer conversion operator (CvtInt) needs to be added to the RPN list before adding the NOT operator.

Translator – Operand Stack

When processing operators, it is necessary check the data types of it's operands already in the RPN list and it may be necessary to insert hidden conversion operators into the RPN list. The location of the first operand may not be the second to the last item added to the RPN list. Take the example A% * B% + C * D; the + operator will be the last operator to be processed, at which point the RPN list will contain A% B% MulInt C D MulDbl. For the + operator, which needs to be the add double operator also needs a CvtDbl inserted after the * operator, which has been changed to an multiple integer operator.

The Translator needs to save the location of where the previous operands are located, for both checking data types and for inserting conversion operators. This will be accomplished with another stack named the operand stack. When an operand is added to the RPN list, it's the pointer to element containing the operand in the RPN list will also be pushed onto the operand stack. When an operator is processed, it will pop it's operands off of the operand stack and use them to check data types and insert necessary conversion operator. After the operator is added to the RPN list, it's element pointer into the RPN list is pushed onto the operand stack for a future operator.

The operand stack parallels the run-time stack except instead of holding results of operands/operators, it will hold pointers to the items the will create the results during run-time.

Looking back at the example, when the + operator is processed, on the operand stack, there will be a pointer to the MulInt operator and a pointer to the MulDbl operator. It will insert a CvtDbl after the MulInt operator, add an AddDbl to the RPN list and push it's element pointer onto the operand stack. The resulting RPN list will contain A% B% AddInt CvtDbl C D MulDbl AddDbl. When this expression is executed during run-time, no time wasting decisions or checks need to be made.

Saturday, March 13, 2010

Translator – Operators

As mentioned, operator tokens are pushed onto an operator stack after pulling higher precedence operators off of the operator stack and adding them to the RPN list. However, there is some handling that needs to be performed before an operator can be added to the RPN list. For each operator, the Translator needs to check the data type of the operator's operand to be sure they are compatible with the operator and/or to add the proper operator dependent on the data types of the operands.

Take the + operator, which can have integer, double or string operands. If both operands are strings, then the concatenation operator needs to be added. For numeric values, it's a bit more involved. If both the operands are integers, then an add integer operator is added. If both the operand are doubles, then an add double operator is added. For an integer and a double, the result will be a double, but the operand that is an integer (which could be the first or the second operand) needs to be converted to a double before performing an add double operator.

The goal is for optimal execution, so having the add double operator check to see if the one of the operands is an integer that needs to be converted first during run time is not ideal. This check should be done before hand. If a conversion is needed, an internal integer to double conversion operator will be inserted into the code. So, for the expression A% + B, the internal code will be A% CvtDbl B +. During run time, the CvtDbl will operate like a function or unary operator, when A% is pushed onto the run-time stack, the CvtDbl will pop it off the stack, convert it to double and push the double value back on the stack.

The Translator will insert these hidden operators into the RPN list after the appropriate operand as needed by the operator. (This is why the List class was modified to have append and insert functions – see post List Class – Updated.) When the internal code is output by the Recreator, these hidden operators will not produce any output.

Table and Token Updates

Since the Translator needs to know if functions have parentheses (arguments) or not, the IntFunc value in the TokenType needs to be replaced with the two types IntFuncN (function with no parentheses) and IntFuncP (function with parentheses). Currently there is one function with no arguments (RND). The table entries for the internal functions were updated.

The Translator needs to determine two things about tokens, whether the token has a parentheses and whether the token is an operator. This will be accomplished with two Boolean member functions in the Token structure, is_operator() and has_paren(). To implement these, Token will have two static arrays paren[] and op[] dimension of the size of the TokenType enumeration. To initialize these arrays, Token will have a static initialize() function, which will initialize the appropriate values in the array. This function needs to be called during initialization.

New source file ibcp.cpp has been added to the project, which will hold the new Token static variables and function. This file also contains the main function for the project. To test these minor changes, the test_parser code was used. The main() function in test_parser.cpp was renamed test_parser() with a call to it from main() that is in the new ibcp.cpp. The test_parser routines were modified to output “Op” if the token is an operator (is_operator() returns true); “()” if the token has parentheses (has_paren() returns true); and “??” if both are true, which should never be the case.

There is a new project file ibcp.vpj, which will be the name of the program for now. When a test_translator routine is written, it will be called from ibcp.cpp like with the test_parser routine. The ibcp_0.1.2-src.zip file has been uploaded at Sourceforge IBCP Project along with the binary for the program, which is now ibcp.exe, but works the same as the previous test_parser.exe except for the minor output changes for the new Token functions (the parserX.txt output files were also updated).

Friday, March 12, 2010

Translator – Operands

Operands are the objects that operators act upon. Basic operands consist of variables, constants and functions with no arguments. More complex operands consist of array elements, functions with arguments, and parenthetical expressions. Technically, the results of operators can be operands to other operators, but this will not be the definition of an operand for the Translator.

Only the basic operands will be added directly to the RPN list. Basic operands are any tokens that are not an operator and do not have a parentheses. I realized that this means that the Internal Function token type needs to be split into Internal Function with and with out parentheses (i.e. IntFuncP and IntFuncN) just like for the one-line define functions (DefFuncN and DefFuncP).

In other words, basic operands are items the don't require arguments. The Translator won't be able to tell the difference between functions with no arguments and variables – both of these will be of the NoParen token type. Therefore, the basic operands are NoParen, Constant, IntFuncN, DefFuncN, ImmCmd (will be the only token on a line), and Remark.

Thursday, March 11, 2010

Translator

The job of the Translator is to convert the order of the tokens parsed from an input line into reverse polish notation (RPN), which will make for optimal execution of the program. See the post Internal Language for an explanation.

Essentially, the Translator will obtain tokens from the Parser expecting either an operand or an operator token type. Operands will be added directly to the RPN list, and operators will be added on an operator stack. Before putting an operator token on the operator stack, any operators on the operator stack that have higher precedence than the current operator are pulled from the stack and added to the RPN list. After the higher precedence operators are pulled from the stack, the current operator token is pushed on the stack. At the end, all operators from the operator stack are pulled and added to the RPN list.

Continued... »

Parser Test Revisited

I was thinking that it would be nice if the test_parser program had a mode where it would ask for input lines from the user so that additional items could be tested without having to recompile the program. Therefore a “-i” option was added to test_parser. This will tell the test_parser program to ask for input lines instead of running the canned tests. Once entered, the line will be parsed into tokens. The program will continue asking for input lines and will end when a blank line is entered (control+C also works). The skip_whitespace() function in the Parser class was modified to accept tab characters in addition to space characters.

The standard library gets() function is used for line input, which supports backspace, and simple editing including moving the cursor with the arrow keys, insert mode can be toggled with the insert key, and characters can be deleted with the delete key. None of these characters end up in the input buffer. The ibcp_0.1.1-src.zip file has been uploaded at Sourceforge IBCP Project along with the binary for the test program. Now, really moving on to the Translator...

Wednesday, March 10, 2010

Parser Class – Operators (Testing/Release)

Additional tests lines were added into the fifth test input group for testing string constants. Everything was testing good until the one error condition was tested (invalid two character operator). This error was for the case where the character is not valid as a single character operator and the two characters at the position is not found in the table.

The first problem found was that the operators <, <= and <> had their multiple field set to OneChar instead of TwoChar. Then I realized that there were no two character operators where the first character isn't also a valid single character operator. I thought I could add one to test, for example &&, but then realized there would also have to be an & single character operator in the table with a token type set to Error for the code to work right. This would be silly, so the get_operator() code needed to be rewritten.

One rewritten, the “invalid two character” operator error no longer exists. First the current character is checked for a valid one character operator by searching the table. If valid then the token is setup for the operator. If the character can only be a one character operator, the token is returned. The single-quote remark operator is checked for before returning.

If the character can also be a two character operator, the table is searched again for the two characters. If not found then if the first character was a valid operator, the token previously setup is returned. Otherwise false is returned indicating there is not a valid operator at the current position. (In other words, let the next parser routine handle it, which for now there is none, so the get_token() returns an “unrecognized character” error.) If the two characters are a valid operator, the token is setup for the operator and returned.

The temporary && operator was added to the table to test the two character operator where the first character is not a valid operator. Once tested, this test entry was removed. The ibcp_0.1.0-src.zip file has been uploaded at Sourceforge IBCP Project. The binary for the test program has also been uploaded. The Parser has been tested and is now complete. Next up is the Translator, now the fun begins...

Monday, March 8, 2010

Parser Class – Strings (Testing/Release)

Additional tests lines were added into he fourth test input group for testing string constants. There's not much to strings, but the one issue found was that the check in get_string() for a no closing quote needed to be removed because the scan_string() was modified during the testing of the immediate commands to accept a no closing quote condition. The ibcp_0.0.8-src.zip file has been uploaded at Sourceforge IBCP Project. The binary for the test program has also been uploaded.

Sunday, March 7, 2010

Parser Class – Numbers (Testing/Release)

Additional tests lines were added into the third test input group for testing numerical constants. Numbers were also added to test each of the errors. After seeing how the errors were identified, I thought it would be nice if the entire number was identified for errors like “constant is out of range” for example:

A = 3.45e343
^^^^^^^^-- constant is out of range

This required a new length value be added to the Token structure. Two new set_error() functions were implemented that also take a length argument. The existing two set_error() functions were modified to set length to 1. The length value was added to the anonymous union holding the integer and double values since none of these will be used at the same time. Having this error length value will be very useful when the GUI is implemented to the whole constant can be highlighted in red instead of just the first character. All the existing callers of set_error() were checked to see if a length would be help, but only a few in the get_immediate() for immediate commands was changed.

One error message could not be verified. Towards the end of get_number() there is a check to see if the current pointer (which will be pointing to the first character not part of the constant) matches the end pointer value returned from the strtol() or strtod() standard library conversion functions (which will point to first character not part of the conversion). This check was added to make sure the result was expected. I couldn't think of case where this would actually happen, so if it does, it would be due to a bug in the get_number() function, therefore, “BUG:” was added to the error message.

The ibcp_0.0.7-src.zip file has been uploaded at Sourceforge IBCP Project. The parser1.txt was updated for the change in the error output. The binary for the test program has also been uploaded.

Saturday, March 6, 2010

Parser Class – Identifiers (Testing/Release)

The test_parser program was modified to break the tests into groups, otherwise one long stream of output would be generated – there's no reason to see all the Immediate Commands test output while testing Identifiers. Therefore, a single digit argument was added to the test_parser program where 1 is for the Immediate Command tests, 2 is for the Identifier tests, an so on. If anything in the core Parser code is changed because of bugs found, the older tests can be rerun and compared to the previous output as a regression test.

The first problem discovered was that it is was necessary to have distinct Token Types for Defined Functions with parentheses and without. Therefore DefFunc was replaced with DefFuncP and DefFuncN. The second issue discovered was that the internal string functions (CHR$, LEFT$, etc.) had the wrong data type of None in the table, so these were changed to the correct String data type.

Two things that could not be tested are search for words with a data type (there are none of these currently in the table) and the error if found the first word of a two word command and the second word of the command is not valid (currently all two commands are valid as a single word command, e.g. END, DO, LOOP, etc.).

The ibcp_0.0.6-src.zip file has been uploaded at Sourceforge IBCP Project. The binary for the test program has also been uploaded.

Monday, March 1, 2010

Parser Class – Immediate Commands (Testing/Release)

The first major problem discovered was that Immediate Commands can either have line number arguments (all argument types including Blank but not String) or a string argument. There needed to be a way to determine if the token string is pointing to the immediate commands argument structure or to an actual string. To solve this problem, the new data type CmdArgs was added. The immediate command's token's data type will either be set to CmdArgs or String.

The next problem discovered was that the scan_command() function could not return Null_Flag and have the token type set to an Error because it may have been set to Error from a previous call since it was not initialized or reset in scan_command() - this confused the get_command() function, which incorrectly returned errors when there wasn't an error. There needed to be a way to distinguish between invalid immediate commands that could be a valid BASIC command (like L500=10 or L500,L1000=8) and commands that are not a valid BASIC command either (like A100,-5 or L100-200=4). To solve this, the Error_Flag = -1 constant definition was added, which is returned if the immediate command has an error.

Many different test commands were added to the test input array, sufficient to test all of the different forms of command arguments, plus many errors like commands with arguments forms that is not acceptable for some commands. At the end of the array, there are many invalid commands to test every single error exit point in the scan_command() function.

The ibcp_0.0.5-src.zip file has been uploaded at Sourceforge IBCP Project. The binary for the test program has also been uploaded. Note that all the previous test programs and binaries are not included in this release in order to reduce the size of the file – these files can still be obtained from the 0.0.4 release.

Sunday, February 28, 2010

Parser Class – Debugging

I've come to realize that debugging and testing the Parser code is not going to be a trivial undertaking. The Parser class routines have been implemented along with simple test_parser program to test the Parser class code. This program has an array of test input strings that will be feed to the Parser one at time. For each test input, the test input line is output and the start() function is called. It enters a loop calling get_token() until a NULL token pointer is returned. An error token returned also terminates the loop.

Inside the token loop, it simply outputs the contents of the token. At the end of the loop, it deletes the token, since for now the goal is testing if the proper token contents are returned. If the token contains an error, the error message is output with a indicator to where in the line the error is (using the column field in the token). The token is then deleted and the loop terminated.

Otherwise, the contents of the token will be output. There is a switch statement on the token type. There are arrays of strings for token type names, data type names and code names. The value in the token is used to index into these arrays so a name can be output (it would be a pain to look up numbers for each to see if the code is working correctly). Each token is output on a separate line indented under the input line.

I decided not to wait until the whole Parser is debugged and working before releasing code, since the goal of Open Source is to throw out code as soon as possible. Therefore, the code will be released in stages once each of the major Parser routines is debugged, tested and working. The first stage will be the Immediate Commands code (and of course the test_parser code). As for version numbering, the 0.0.x series will be continued through the debugging of the Parser. Once all of the Parser is working, the version number will be upped 0.1.0. The same convention will be used as the Translator is implemented, debugged and tested; which will continue through versions 0.1.x, until working with the release of version 0.2.0. An so on (the first entirely working Interactive BASIC Compiler will be 1.0.0).