Friday, August 16, 2013

New INPUT Command Design

The design of the INPUT command will be modified slightly.  The current design produces a rather complex translation, which also makes execution complicated.  Refer to the previous posts on Input Commands, particularly February 24, 2011 (showing an example translation) and February 26, 2011 (describing some of the complexity during execution).  Unfortunately, simplifying the design much is not possible because of the requirement to handle a statement like INPUT I%,A(I%), where I% must be assigned its inputted value before the reference to A(I%) can be generated.  The design can, however, be improved upon for run-time (the goal is fast execution), especially when dealing with errors entered during input.  Consider the previous and newly proposed translations for this example statement:
InputBegin InputParseInt InputParseDbl'End' I%<ref> InputAssignInt I% A(<ref> InputAssignDbl Input
InputParseDbl InputParseInt InputBegin I%<ref> InputAssignInt I% A(<ref> InputAssignDbl Input
The proposed translation is nearly identical except that InputBegin now appears after all of the InputParseType codes, which are now reversed, and an 'End' sub-code is no longer needed.  At run-time, each InputParseType pushes a data type to an input stack.  InputBegin then issues the prompt, gets the input, and processes the input stack without popping its items.  Each item on the input stack contains a data type and an input value, which is filled in as each value is successfully parsed.  If an error is detected, the error is displayed and the character at which the error occurred is highlighted, at which point the user is given the opportunity to edit the input and try again.

Upon successful input (all values parsed and saved on the input stack), execution proceeds to the next code.  Each reference gets pushed to the evaluation stack and each InputAssignType pops the reference, pops a value from the input stack, and assigns the value to the reference.
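
A minimal C++ sketch of this run-time flow, using standard containers in place of the project's Qt classes (the names InputItem and inputBegin are hypothetical, not the project's actual identifiers):

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch of the proposed INPUT run-time (not project code).
enum class DataType { Integer, Double };

struct InputItem {        // one entry on the input stack
    DataType type;
    double value = 0.0;   // parsed value (integers stored whole)
};

// InputBegin: parse comma-separated input against the pushed types,
// filling in each item's value without popping; returns the index of
// the item that failed to parse, or -1 on success.
int inputBegin(const std::string &line, std::vector<InputItem> &stack)
{
    std::istringstream in(line);
    std::string field;
    for (size_t i = 0; i < stack.size(); ++i) {
        if (!std::getline(in, field, ','))
            return static_cast<int>(i);       // not enough values entered
        try {
            stack[i].value = stack[i].type == DataType::Integer
                ? static_cast<double>(std::stoi(field))
                : std::stod(field);
        } catch (...) {
            return static_cast<int>(i);       // parse error: user re-edits
        }
    }
    return -1;                                // success: all values saved
}
```

On a failure the returned index would let the real code highlight the offending character and re-prompt; on success the InputAssignType codes then pop the saved values and assign them to the references.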

There is not much left for Input to perform, but it does check for the 'Keep' sub-code (a semicolon at the end of the statement to keep the cursor on the same line), and if not set, moves the cursor to the beginning of the next line.

For the INPUT PROMPT command, the beginning of the translation varies slightly to allow for the prompt string expression.  The beginning of the line contains the tokens for the string expression (before the first InputParseType).  There is an InputBeginStr instead of InputBegin, which can have an optional 'Question' sub-code depending on whether the prompt expression was followed by a semicolon (no question) or a comma (question).  At run-time, this code pops the result of the string expression from the evaluation stack and proceeds the same as InputBegin except with the specified prompt.

Thursday, August 15, 2013

New Translator – PRINT Statements (Tagged)

A minor parser issue was discovered with how the length of double-word command tokens was being set, which was noticed when the "not yet implemented" errors did not point at the entire INPUT PROMPT command.  The token length was only being set to the length of the second word plus one for the space.  This was corrected by setting the token length to the current position (the first character after the second word) minus the starting column of the first word.  This also takes into account more than one space entered between the two words.
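
The corrected computation amounts to the following (a sketch with a hypothetical function name; the actual parser tracks column positions while scanning rather than re-searching the line):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: the token length of a two-word command is the
// current position (first character after the second word) minus the
// starting column of the first word, so extra spaces are included.
int twoWordTokenLength(const std::string &line, int firstWordColumn)
{
    size_t pos = line.find(' ', firstWordColumn);    // end of first word
    pos = line.find_first_not_of(' ', pos);          // start of second word
    pos = line.find(' ', pos);                       // end of second word
    if (pos == std::string::npos)
        pos = line.length();
    return static_cast<int>(pos) - firstWordColumn;
}
```

With a single space the result matches the full "INPUT PROMPT" span; with two spaces the extra space is counted too, unlike the old second-word-plus-one computation.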

The PRINT statements implementation in the new translator is now complete and version v0.4.3 has been tagged.  All tests pass with the old translator routines (after some corrections were made).  All translator tests pass with the new translator routines with the exception of the INPUT and REM statements as these have not been implemented yet.  Some minor cleanup was done with the latest commit along with updating the files for v0.4.3 (see the commit log for details).  Implementation of INPUT statements can now commence in the new translator.

[commit bd608f5cf3] [commit 0e316f71ce]

Wednesday, August 14, 2013

New Translator – PRINT Statements

The implementation of the PRINT translate routine using the new translator routines was mostly straightforward, but some time was spent getting the right errors reported at the correct tokens (the previous post dealt with some of these issues in the translator routines).  Three local variables were needed: a pointer to the last semicolon token (the last one is used instead of the command token if at the end of the statement), a separator flag (indicating when a semicolon is allowed), and a print function flag (only used for error reporting).  The routine then enters a loop until the end of the statement.

The loop begins by calling the get expression routine for a None data type to allow for print functions or an empty expression.  If a non-numeric parser error is returned, then the appropriate error is returned.  This and other errors cause the loop to exit.  If the done stack is not empty and the data type of the top item is None, then the print function flag is set to true.  Otherwise, a new print code token is created, the process final operand routine is called to get the correct associated code for the top item, and the print function flag is cleared.  The top item (either a print function or the new print code token) is dropped from the done stack (both have already been added to the RPN output list).  The separator flag is set, and the last semicolon token is deleted and its pointer set to null.

If the next token (the terminating token from the get expression routine) is a comma and the last semicolon token pointer is set, then a comma is not allowed, so the appropriate error is set and the loop exits.  Otherwise, the comma token is appended to the RPN output list, the last semicolon token is deleted and its pointer set to null, and the separator flag is cleared (no semicolon is allowed).  The loop continues with the next expression.

If the next token is a semicolon and the separator flag is not set, then the semicolon is not allowed, so the appropriate error is set (depending on whether the last semicolon token pointer is set) and the loop exits.  Otherwise, the last semicolon token is deleted and the pointer is set to the current semicolon token.  The loop continues with the next expression.

For any other terminating token, the loop exits; otherwise, the separator flag is cleared for the next expression and the loop continues.  If the loop exited with an error, the last semicolon token and command token are deleted and the error is returned.  If the last semicolon token pointer is set, the command token is deleted and set to the last semicolon token.  The command token is appended to the RPN output list.
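
The separator rules described above reduce to a small state machine.  This is a simplified sketch with hypothetical names that tracks only items, commas, and semicolons (the real routine works on tokens and expressions, not characters):

```cpp
#include <cassert>
#include <string>

// Sketch of the PRINT separator rules: 'e' = expression item,
// ',' = comma, ';' = semicolon.
bool validPrintSeparators(const std::string &items)
{
    bool semicolonAllowed = false;   // the "separator flag"
    bool lastWasSemicolon = false;   // the "last semicolon token" pointer
    for (char c : items) {
        switch (c) {
        case 'e':                    // an expression was translated
            semicolonAllowed = true;
            lastWasSemicolon = false;
            break;
        case ',':
            if (lastWasSemicolon)    // comma may not follow a semicolon
                return false;
            semicolonAllowed = false;
            break;
        case ';':
            if (!semicolonAllowed)   // no leading or doubled semicolon
                return false;
            semicolonAllowed = false;
            lastWasSemicolon = true;
            break;
        }
    }
    return true;
}
```

So PRINT A;B and PRINT A,,B are accepted, while a doubled semicolon, a comma right after a semicolon, or a leading semicolon are rejected.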

If the terminating token is not an end-of-statement token, then the appropriate error is returned depending on the print function flag.  Otherwise, a done status is returned.  The PRINT translate function was implemented in the new basic/print.cpp source file, which was added to the CMake build file, and was added to the table for the PRINT command.

[commit 3fc94d4b18]

Tuesday, August 13, 2013

Parser Error And None Data Type Handling

The PRINT statement translation implementation is almost complete, but a few more issues were discovered with how parser errors and the None data type are handled.  Support for the None data type was added for the print functions (TAB and SPC), but also so that the get expression routine can be called and allow for no expression, just a terminating token.  This is needed for PRINT statements that have commas with no expressions in between them (for instance, to skip over two columns).

The get token routine was modified to accept a new No data type value, which will now be the default value.  This data type will only be used to call the get token routine when a non-operand token is requested.  This was necessary since the None type is now being used for operands.

The get operand routine was modified so that command and operator tokens, which are normally invalid for an operand (an "expected XX expression or variable" error is returned), now return a Done status if the data type passed is None.  This allows no expression to be returned, with the command or operator token becoming the terminating token from the get expression routine (it is up to the PRINT translate routine to determine the validity of the token).

The get expression routine was modified to carry the None data type into the loop in the expected data type variable.  This allows the None data type to be passed to the get operand routine upon the first call (to allow for print functions or no expression) so the secondary operand data type variable is no longer needed.  After the first operator, the expected data type variable is set accordingly for the operator.

However, for parentheses expressions, the data type needs to be changed to Any before recursively calling the get expression routine since neither a print function nor no expression is valid inside the parentheses.  Also, if a parser error is returned, the status is only set to the "expected operator or closing parentheses" error if the token does not have a data type, because a numeric parser error needs to be reported as such.

Finally, the scheme of leaving an internal function or parentheses token on the hold stack when an error is detected, so that the caller could check whether the top of the hold stack was not the null token, was flawed.  Without going into details, it turned out this scheme was not necessary anyway and was removed.  Quite a few more parser error tests were added to translator test #14 for testing additional error cases.

[commit 86292a56c0]

Sunday, August 11, 2013

Print Function Support

The PRINT statement consists of expressions separated by semicolons or commas.  Commas can also be used with no expressions between them.  The expressions may also contain the print functions TAB and SPC.  These functions are distinguished from other functions in that they have no return data type.  The PRINT translate function could handle these print functions specifically since they are only used in a PRINT statement.

However, the get expression routine also needs to be able to handle these print functions, at least to report errors when they are used incorrectly in expressions.  It currently does this simply by the fact that the None data type is not acceptable to any operator or data type request (an expression with a data type of None is not currently requested).

Instead of having the PRINT translate routine handle the print functions, the get expression routine was modified to accept the None data type, which affected two things.  This allows print functions to be returned.  It also allows no expression to be returned (just the next token as the terminating token), which is needed in the PRINT statement for commas.  This will be detected by the done stack being empty.  The PRINT translate routine will be the only caller with the None data type.

The new translator was changed back to reporting only the TAB( or SPC( token when used incorrectly in expressions.  This was necessary because, for parser and other errors inside the parentheses of the print function, the error was reported at the wrong place.  Consider the expression:
A = TAB(B$)
An "expected numeric expression" error was reported at the B$ token, but it should point to the TAB( token, as it doesn't belong in the expression in the first place.  Several more statements were added to translator test #14 that include various parser errors in print functions used incorrectly.  See the commit log for more details of the changes made.

[commit 9502d5b183]

Saturday, August 10, 2013

Old Translator Issues With Token Caching

While debugging the new PRINT translate routine, I realized that the old translator routines had never been tested with the new token caching and error reporting code, because when they were, crashes and errors occurred.  Two small problems were found and corrected.

The first issue occurred when an error was detected and the error token was not the current token obtained from the parser (meaning the error token had already been added to the output and could not be deleted).  The old translator routine obtained a new token and copied the error token into it before calling the set error token routine, which called the output list set error routine and then deleted the token.  When the error token was copied into the newly obtained token, the index of the original token was copied over the new token's index, confusing the used token handling.  This was corrected by calling the output list set error routine directly (which only grabs the column and length of the error token) so that no temporary token is needed.

The second issue occurred when there was a REMARK operator (') at the beginning of a line.  The Null token put on the bottom of the hold stack was not removed.  This was detected as a token leak, and the detection code reported and deleted the token.  Unfortunately, the translator still had the token on the hold stack.  This did not cause a problem until the next command with an error was processed.  When an error occurs, the hold stack is emptied and its tokens are deleted.  When it got to the deleted Null token, the index was garbage (because the memory for the token had been reused) and a segmentation fault occurred.

The REMARK operator token handler was modified to pop and delete the Null token from the hold stack if currently in command mode.  If not in command mode, then another part of the translator correctly handles the Null token.  Curiously, in this situation the EOL token at the end of the line is never obtained and processed.  Therefore, this fix is a hack, but since the old translator routines will be abandoned, there is no reason for a correct fix.

One last minor issue was that when token leaks or extra deletes were reported, if the token was a Null or End-of-line token, a blank token was output with just an index.  The "NULL" and "EOL" strings were added to the debug name entries in the table to identify these blank tokens.

[commit cf93288fcb]

Thursday, August 8, 2013

New PRINT Implementation (Revised)

Once implementation of the new PRINT translation was started, I realized that the design at run-time was more complicated than it needed to be.  Most of the original design appeared to be best after all, with a few minor changes.  When it comes to run-time, the original design would be more efficient and would not need the extra complexity of a command stack.  With print codes immediately after each print item, the print code would simply pop the value from the evaluation stack and print it.

The changes to the original design concern how semicolons and the final PRINT command code are handled.  Previously, semicolon sub-codes were added to print codes except when they appeared at the end of a statement.  Multiple semicolons were allowed, marked with semicolon sub-codes, or with extra semicolon tokens when there was more than one extra semicolon.  A PRINT command code would only be added to the end of the statement if there was no print token (semicolon, comma, TAB or SPC) at the end.

Semicolon sub-codes will no longer be needed.  Semicolons will be assumed between each print item unless there is a comma between them.  Multiple unnecessary semicolons will no longer be permitted, as they add nothing.  If there is a semicolon at the end of the line, then a semicolon code will be added to the end of the line instead of a PRINT code.  At run-time, the semicolon will do nothing, but when the line is recreated, the semicolon will produce the PRINT keyword in addition to the semicolon at the end of the line.

Wednesday, August 7, 2013

New PRINT At Run-Time (Retracted)

After consideration, this design was decided to be more complicated than it needed to be, so this post has been retracted; see the revised post.

Consider the previous example statement and its new translation:
PRINT A,B$;TAB(20);C%
A B$ 20 C% PrintDbl , PrintStr TAB( PrintInt PRINT
At run-time, in addition to the evaluation stack used to hold values of operands and results, there will also be a command stack (not to be confused with the current command stack used by the old translator routines, which is not needed by the new translator routines).  The special print codes (specific print type codes, comma, and print functions) will only push themselves to the command stack.

When the PRINT (or semicolon) code is executed, it will process the command stack, except it will start at the bottom of the stack instead of popping items from the top.  This is possible because the QStack class is derived from the QVector class, which allows its elements to be accessed by index.  The values on the evaluation stack will also be accessed the same way.

When the PRINT (or semicolon) begins, it will start with indexes (one for each stack) set to zero and will process each item on the command stack.  For the specific print type code, it will access the value at the evaluation stack index, output it, and increment the index.  For a comma code, it will output the appropriate number of spaces (leaving the evaluation stack index as is).  For a print function (TAB or SPC), it will access the value at the evaluation stack index, output the appropriate number of spaces, and increment the index.

At the end (top) of the command stack, it will clear both stacks, return the indexes to zero (for the next command), and for the PRINT code only, will advance to the next line.  This scheme works because both the evaluation and command stacks will be empty at the beginning of each command.

New PRINT Implementation (Retracted)

After consideration, this design was decided to be more complicated than it needed to be, so this post has been retracted; see the revised post.

The run-time design of the PRINT statement was reconsidered.  The old scheme inserted specific type print codes into the translation with an optional PRINT code at the end depending if the line ended with a semicolon, comma or print function (TAB or SPC).  Consider this example statement and its old translation:
PRINT A,B$;TAB(20);C%
A PrintDbl , B$ PrintStr 20 TAB(';' C% PrintInt PRINT
The specific type print codes pop a value from the evaluation stack and output it.  The comma token outputs spaces to advance the cursor to the next column (traditionally groups of 8 characters).  The TAB print function outputs spaces to advance to the specified column.  And the final PRINT advances to the next line.  If the line ended with a semicolon, comma or print function, the final PRINT code is not present.  The new translation will look like this (details to follow):
A B$ 20 C% PrintDbl , PrintStr TAB( PrintInt PRINT
The final PRINT code is always present, and at run-time will know when to advance to the next line (depending on the code immediately preceding it).  Semicolon tokens (or a sub-code) are not necessary; a semicolon is implied between the items to print.  Multiple semicolons, though allowed by the old translator (the semicolon sub-code and any semicolon tokens would do nothing at run-time), will not be permitted, as there is no reason to allow them (unlike multiple commas).

A semicolon at the end of the line needs special consideration.  If there is a semicolon at the end of the line, it will take the place of the PRINT code.  At run-time, both perform the same actions except that only PRINT advances to the next line.  A semicolon can be used this way since no other statement needs a semicolon token.  A 'Keep' sub-code or a new PrintKeep code could have been used, but why have a sub-code check at run-time or create a new code when a semicolon token can be used?

The next post will contain the details of how this new PRINT translation will work at run-time...

Tuesday, August 6, 2013

Extra Token Delete Detection

The other type of token allocation error that can occur is when a token that has already been deleted is deleted again.  This is a less common problem, but it can still occur nonetheless.  Extra token delete detection was also previously implemented, but removed.  The original implementation contained a list of tokens deleted extra times, to which a copy of the token was appended.

This time around, the text of the token (obtained using the text access function), along with its index (as part of the string), is saved in a list of strings of deleted tokens.  When it comes time to report these errors, each string in the list is output.  In the reimplemented delete operator function, before the token is marked as unused in the used token vector (its pointer set to null), if the token is already marked as unused, its index and text are appended to the deleted list instead and the token is not pushed onto the free token stack (it is already there).

A new private DeletedList class was implemented inside the Token class.  The destructor is called automatically upon termination of the application and any extra token deletes are reported.  This class also contains a report errors function that does the work of outputting the deleted list to the standard error output stream and clearing the list afterward.  The destructor calls this function.

Again, a separate function was implemented so that it can be called at the end of each line translated in test mode and extra token deletes for that specific line are reported with the line.  For non-test (GUI) mode, any extra token deletes are reported when the application terminates.

Token (Memory) Leak Detection

While implementing the token caching, I thought it would be extremely helpful to put back some token leak detection, a common time-consuming problem seen while debugging the new translator routines.  Leak detection was previously implemented, but removed during the Qt transition.  According to the post on October 28, 2012, this was due to obscure compiler errors.  Since valgrind was being used for memory leak detection by then, this code was removed.  The log for that commit did not provide any additional clues to the issues, other than to say that Qt does not interface well with self-implemented new and delete operators (?).

The original token leak detection consisted of a static list of used tokens, with each token containing a pointer into this list.  This was easy to implement with how the original List class was designed, but the QList class had no easy equivalent (using an iterator did not work).  I suspect this partly caused the problem, which may also have had something to do with the way members are initialized by the constructor when a new instance is created, with a Qt member complicating things.  Anyway, a new, easier method of keeping track of used tokens was implemented this time around that did not cause any problems.

A plain integer index member was added to the Token class.  Every token allocated gets a unique index number, which the token will have through the life of the application.  The index values assigned start at zero and each additional token allocated gets the next index value.  This index is used to index into a QVector of token pointers.

When a new token is allocated, the index is set to the size of this used token vector, and then the token is appended to the vector.  In other words, the index points to its corresponding element in this vector.  When a token pointer is popped from the free token stack, the element corresponding to the token is set to the token pointer.  When a token is deleted, the element corresponding to the token is set to a null pointer to indicate the token is not currently used (it is in the free token stack).

A token pointer was used for this vector instead of a simple boolean flag (for indicating a token was used), so that when it came time to see if there are any tokens used that have not been freed (the element in the vector contains a non-null token pointer), the token pointer can be used to print information about the token (type, code, string, etc.).
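
A minimal sketch of the used-token vector scheme (hypothetical names and standard containers; the project uses a QVector of Token pointers, and deleted tokens actually go to the free stack rather than being freed):

```cpp
#include <cassert>
#include <vector>

// Sketch: each token gets a permanent index; its slot in the used
// vector holds the pointer while in use and null while free.
struct Tok { int index; };

std::vector<Tok *> usedVector;

Tok *allocateTok()
{
    Tok *t = new Tok{static_cast<int>(usedVector.size())};
    usedVector.push_back(t);        // index points at its own slot
    return t;
}

void releaseTok(Tok *t)
{
    usedVector[t->index] = nullptr; // mark unused (the real code keeps
    delete t;                       // the token on the free stack instead)
}

// Leaks are slots still holding a non-null pointer at shutdown; the
// pointer lets the report print the token's type, code, string, etc.
int leakCount()
{
    int leaks = 0;
    for (Tok *t : usedVector)
        if (t != nullptr)
            ++leaks;
    return leaks;
}
```

Any token never released shows up as a non-null slot, and the surviving pointer is what allows the leak report to describe the token.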

A new private UsedVector class was implemented inside the Token class, similar to the FreeStack private class.  The destructor is called automatically upon termination of the application and any token leaks are reported and the tokens deleted so that valgrind does not also report memory leaks.  This class also contains a report errors function that does the work of scanning the vector and reporting any token leak errors found to the standard error output stream and deletes the tokens.  The destructor calls this function.

A separate function was implemented so that it can be called at the end of each line translated in test mode and token leaks for that specific line are reported with the line.  This does not interfere with the test scripts since the errors are output to the standard error stream, which is not captured in the output that gets compared with the expected results files.  Once a test is seen to contain errors, it can be rerun from the command line to the standard output to see which test statements caused the errors.  Having the errors output with the test lines saves debugging time in having to identify which test statement caused the error.

For non-test (GUI) mode, any token leaks are reported when the application terminates, though if not run from the command line, these errors will not be seen.  This should not be an issue because any token leaks should have been eliminated by this time.

Monday, August 5, 2013

Token Caching (With Error Checking)

The PRINT translation will involve many token deletes for unneeded tokens (like semicolons) but also many new tokens (to hold the specific data type print codes).  More on the redesigned PRINT translation in an upcoming post, but it is desirable to reuse tokens as much as possible instead of repeatedly using the free memory heap, which can get fragmented by many allocations and deletions, requiring time-consuming garbage collection.  Other commands (like INPUT) will also require this reuse capability.

Instead of some type of token saving code in each command translation routine as needed, a global method of caching tokens was implemented.  This was accomplished by reimplementing the new and delete operators of the Token class.  For the delete operator, instead of simply freeing the memory of the token, the pointer to the token is saved on a free token stack.  For the new operator, if the free token stack is not empty, a pointer to a token is popped from this stack, otherwise a new token is allocated using the global new operator.

The free token stack was implemented as a private class based on the QStack class with a destructor that deletes all the tokens on the free stack.  The destructor is called when the free stack goes out of scope, which occurs when the application terminates.
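
A minimal sketch of the caching scheme, assuming a standard container in place of QStack and omitting the leak tracking (simplified; not the project's actual Token class):

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Sketch: Token's new/delete reuse memory via a free stack instead of
// returning it to the heap on every delete.
class Token {
public:
    void *operator new(std::size_t size)
    {
        if (!freeStack.empty()) {
            void *p = freeStack.back();   // reuse a cached token
            freeStack.pop_back();
            return p;
        }
        return ::operator new(size);      // none cached: allocate fresh
    }
    void operator delete(void *p)
    {
        freeStack.push_back(p);           // cache instead of freeing
    }
    static std::vector<void *> freeStack;
    int code = 0;
};

std::vector<void *> Token::freeStack;
```

Deleting a token and then allocating a new one returns the same storage, so a translation routine that churns through semicolon and print code tokens never touches the heap after warm-up; the destructor of the real free stack class frees whatever is still cached at shutdown.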

During the implementation and debugging of LET statements, one area that was time consuming was tracking down memory leak errors, mostly related to tokens.  (Other leaks were easy to identify and correct.)  First the test statement causing the problem had to be identified (time consuming), then the debugger had to be run to identify what was in the token that was not deleted so that the problem could be corrected (also time consuming).  Missed through all of this was the token leak detection that was removed a while ago due to issues with the Qt libraries during the Qt transition.

A new token leak detection scheme was implemented, along with detection of tokens that are deleted more than once.  More details on these schemes in the following posts.  Any token leaks and extra token deletes are reported after each translator test line is translated, which saves the time of identifying which statement caused the errors.  The text of the token is output for each error along with its index (to be detailed in the next post).  Any token errors are also reported at application termination for non-test mode.

[commit 5fa1780cda]

Thursday, August 1, 2013

New Translator – LET Statements (Tagged)

The implementation of LET statements, including multiple assignments and sub-string assignments, in the new translator is now complete and version v0.4.2 has been tagged.  All tests pass with the old translator routines (after some corrections were made).  All expression tests and all assignment statements in the translator tests pass with the new translator routines.  Some minor cleanup was performed with the latest commit along with updating the files for v0.4.2 (see the commit log for details).  Implementation of PRINT statements can now commence in the new translator.

[commit a5d3434fbe]

Wednesday, July 31, 2013

Memory Testing / Minor Memory Leak

Since all of the tests now work with the new translator (excluding the commands not yet implemented), it seemed appropriate to change the memtestn script to run all of the tests.  After changing this script, two memory leaks were discovered in translator tests #7 (Errors) and #9 (Semicolon Errors).  The memory leak was determined to be occurring with sub-string assignment statements that contained an error.

The memory leak occurred because an RPN item was allocated for the sub-string assignment token, which is not appended immediately to the RPN output list.  The RPN item is left on the done stack, from which the LET translate routine pops it, pushes its token to the LET stack, and then deletes the RPN item.  However, if the next token is not the expected comma or equal token, or a parser error occurs, this does not happen.  The error cleanup code assumes that all RPN items on the done stack have been added to the RPN output list, so only the items in the output list are deleted.

This problem was corrected by slightly rearranging the code in the LET translate routine where if there is an error with the comma or equal token, and the top of the done stack contains a sub-string assignment token (that has not been added to the RPN output list), then the done item on top of the done stack is popped and deleted, which deletes the RPN item and its token(s).  With this change, all of the tests with the new translator have no memory errors.

[commit 229af22a78]

Parser And Unary Operator Errors

While looking at the results of all the translator tests to make sure at least the assignment tests were working (the PRINT, INPUT, and REM statements currently report "not yet implemented" errors), it was noticed that many of the parser errors in translator test #14 (Parser Errors) were not being reported correctly.

While working on this issue, another problem was found with unary operators occurring where a binary operator was expected.  The error should include the word "binary" in front of the word "operator" to indicate that a binary operator, not the unary operator, was expected.  This avoids confusion, since a unary operator is itself an operator.

See the commit log for details of the changes, but basically the get expression routine needs to return a parser error and the caller needs to determine the appropriate error.  The caller also needs to check if the token causing the error is a unary operator and use the appropriate error with the additional "binary" word.  The get token routine was modified as to which errors are reported by also checking the expected data type; specifically, if the current data type is string, then a numeric parser error should not be returned.

Previously, the only error with the word "binary" was the "expected binary operator or end-of-statement" error, which is not appropriate for a number of cases, such as when a comma or closing parentheses is expected rather than an end-of-statement (for instance, inside the parentheses of a parenthetical expression, an internal function, or a parentheses token).  Therefore, several new errors were added.

Because of the unary operator issues, a number of new test statements with unary operator errors were added to translator test #14.  Many of these do not pass with the old translator routines.  All of them pass with the new translator routines (excluding those with commands not yet implemented).

[commit 87e4072ba7]

Sunday, July 28, 2013

New Translator – Data Type Checking

Up until now, the data type passed to the new get expression and get operand routines was only used to return the appropriate error when a problem was detected.  The intention was that this data type be used to check the resulting expression before returning to the caller of the get expression routine.  This simplifies the design in that the callers don't have to check whether the expression is the correct type - the checking is done in one place.  As part of this, any hidden conversion codes are added (to convert from integer to double or double to integer).

The get operand routine cannot check the data type of the operand obtained against the data type passed.  Consider the valid expression A%+(B$=C$).  After the open parentheses, the expected data type for the second operand of the add operator would be a number.  An error can't be reported against the B$ operand.  This checking will be handled when the operator is processed.  However, when the caller of the get operand routine requests a reference, the data type of the reference operand can be checked.

These routines were modified to check the data type.  The get expression routine required a new level argument, which is incremented for each level of parentheses that recursively calls the get expression routine.  A simple flag could have been used instead of a level value, but having an actual level value could be useful for debugging.  Only at the end of the expression at the first level can the data type be checked.  For the Any data type, no checking is done, and for the Number data type, the data type can be either Double or Integer (no conversion code is added).  Otherwise, hidden conversion codes are added to the RPN output list as needed or an error is reported.
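
The first-level check can be sketched as follows (hypothetical names and a string result standing in for appending a hidden conversion code to the RPN output list):

```cpp
#include <cassert>
#include <string>

// Sketch of the end-of-expression data type check: returns the hidden
// conversion code to append ("" for none), or "error" on a mismatch.
enum class DataType { Double, Integer, String, Number, Any };

std::string checkExpressionType(DataType expected, DataType actual)
{
    if (expected == DataType::Any)
        return "";                          // no checking for Any
    if (expected == DataType::Number)       // Double or Integer, no code
        return actual == DataType::String ? "error" : "";
    if (expected == actual)
        return "";                          // exact match, no code
    if (expected == DataType::Double && actual == DataType::Integer)
        return "CvtDbl";                    // hidden conversion code
    if (expected == DataType::Integer && actual == DataType::Double)
        return "CvtInt";
    return "error";                         // e.g. string vs. numeric
}
```

This is the one place where callers' type requirements are enforced, which is why the find code and process final operand calls for arguments and sub-scripts could be dropped.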

Callers of these routines no longer need to check the resulting expression or reference operand.  Previously, for internal functions and tokens with parentheses, the find code routine was called for each argument (for internal functions, at each comma) and the process final operand routine was called for the final argument or sub-script (at the closing parentheses).  These routines no longer need to be called (but are still used for processing operators).

All of the translator tests that only contain assignment statements (though three have a single PRINT statement) still pass successfully.  One statement in test #8 contained an array assignment with a string sub-script.  The old translator did not catch this error (the capability was not implemented).  The new translator, without this latest data type checking, also did not: even though a numeric expression was requested, the data type was not actually checked.  The expected results for test #8 were updated for this (and the old results saved).  For more details of all the changes needed, see the commit log.

[commit b4a908b60d]