Interactive BASIC Compiler Project

Sunday, March 27, 2011

Translator – Unary Operator Problem

While testing the negative constant changes, a new problem was discovered with unary operators, specifically this statement:

A = ---B

This produced a “done stack empty” bug error at the first negation token. The problem occurred because the second negation operator forced the first negation operator from the hold stack, since the first was of greater or equal precedence, and when the operand of the first negation operator was checked, there was nothing on the done stack. Here are some additional examples:

A = B*-C
A = B^-C

The first statement translated correctly because negation has higher precedence than multiplication, which left multiplication on the hold stack. However, the second statement failed because negation has lower precedence than exponentiation, which forced exponentiation from the hold stack with only one operand on the done stack, generating the “done stack empty” bug error. A new rule was needed for unary operators.

Basically, unary operators should not force any tokens (unary operators, binary operators, arrays, or functions) from the hold stack regardless of their precedence, because those tokens have not yet received all of their operands (the negate and its operand will be their operand, and it has not been fully received yet). As currently implemented, other non-unary operators should still force unary operators from the hold stack if the unary operator has higher precedence.

The check to force tokens from the hold stack was changed to require both that the precedence of the operator on the hold stack is higher than that of the current operator and that the current token is not a unary operator. Unary operators will now not force other tokens from the hold stack, but other tokens will still force unary operators from the hold stack if the unary operator is higher in precedence. While testing this change, a curious result was produced from one of the test statements...
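The revised rule can be sketched as a small operator-precedence translator. This is a minimal Python sketch, not the project's code: the to_rpn name, the precedence values and the code names are assumptions for illustration, and operators of equal precedence are assumed to force tokens as well (the "greater or equal" behavior described above).

```python
# Hypothetical precedence values; only their relative order matters here.
PRECEDENCE = {"Exp": 4, "Neg": 3, "Mul": 2, "Add": 1, "Sub": 1}
BINARY = {"^": "Exp", "*": "Mul", "+": "Add", "-": "Sub"}


def to_rpn(tokens):
    """Translate a flat list of operand/operator tokens to RPN."""
    output = []            # translated output
    hold = []              # hold stack of pending operators
    operand_state = True   # True while an operand is expected
    for tok in tokens:
        if tok in BINARY:
            if operand_state:
                if tok != "-":
                    raise ValueError("expected operand")
                # Unary operator: push without forcing anything from the
                # hold stack, regardless of precedence.
                hold.append("Neg")
                continue
            code = BINARY[tok]
            # Non-unary operator: force higher (or equal) precedence
            # tokens, including unary operators, from the hold stack.
            while hold and PRECEDENCE[hold[-1]] >= PRECEDENCE[code]:
                output.append(hold.pop())
            hold.append(code)
            operand_state = True
        else:
            if not operand_state:
                raise ValueError("expected operator")
            output.append(tok)
            operand_state = False
    while hold:
        output.append(hold.pop())
    return output


print(to_rpn(["-", "-", "-", "B"]))
print(to_rpn(["B", "^", "-", "C"]))
```

With this rule the exponentiation stays on the hold stack until the negate and its operand have both been received.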

Parser – Negative Constants

Negative constants were previously not considered by the Parser, which interpreted a minus as the subtract operator. The Translator then changed it to a negate operator when it appeared in the operand state. Consider these two examples (along with their current translations):

A = B-1.5       A B 1.5 Sub Assign
A = -1.5+B      A 1.5 Neg B Add Assign

The reason for the Parser to not look for signs on numerical constants can be seen in the first example. If the Parser produced the four tokens A = B -1.5, the Translator would generate an “expected operator” error at the -1.5 token since a second operand token was received when it was expecting a binary operator. The second example produces an unnecessary negate token after the constant. While perfectly valid, this is not desirable.

In order for the Parser to correctly interpret negative signs on numerical constants, it needs to be aware of whether the Translator is in operand state or not. If in operand state, the Parser can look for a negative sign in front of a number constant, otherwise a minus should be interpreted as an operator.

A new operand state flag was added to the Parser with an access function to set its value (which is initialized to off). The Parser get number routine was modified to use a new sign flag that records whether a negative sign was found first. This flag also prevents multiple negative signs. However, the routine only checks for a negative sign if no digits or decimal point have been seen and if the new operand state flag is on.
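The sign handling can be sketched as follows. This is a minimal Python sketch under assumptions: the get_number name and signature are hypothetical, and exponents and other number syntax are left out.

```python
def get_number(text, pos, operand_state):
    """Scan a number at text[pos:].

    Returns (number_text, new_pos), or (None, pos) if no number starts
    at pos. A leading minus is accepted only when operand_state is on.
    """
    start = pos
    sign = False      # a negative sign has been seen
    digits = False    # at least one digit has been seen
    decimal = False   # a decimal point has been seen
    while pos < len(text):
        ch = text[pos]
        if (ch == "-" and operand_state
                and not sign and not digits and not decimal):
            sign = True           # only one sign, and only at the front
        elif ch.isdigit():
            digits = True
        elif ch == "." and not decimal:
            decimal = True
        else:
            break
        pos += 1
    if not digits:
        return None, start        # not a number; caller tries operators
    return text[start:pos], pos


print(get_number("-1.5+B", 0, True))   # operand state on
print(get_number("-1.5", 0, False))    # operand state off
```

When the flag is off, the minus is left for the operator routine, so B-1.5 still produces a subtract operator followed by the constant 1.5.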

An access function was added to the Translator to get the current operand state (either operand or operand-or-end state). Before calling the Parser get token routine, the Parser operand state is set from the Translator's current operand state. While testing, a problem was discovered with unary operators...

Saturday, March 26, 2011

Parser – Tokens With Parentheses

While correcting issues with define function tokens, it was noticed that it is not necessary to also store the opening parenthesis in the string field of the token. This also applies to the generic tokens with parentheses. The parenthesis is not necessary because there are separate token types to identify tokens with and without parentheses.

It is also advantageous to not store the parentheses so that an array or define function name can be found in the dictionary. For define functions, take the code snippet:

DEF FNHypot(X,Y)
FNHypot=SQR(X*X+Y*Y)
END DEF
Z = FNHypot(3,4)

Both the FNHypot( and FNHypot tokens appear, and they represent the same function. If the parenthesis were stored in the dictionary with the function name, it would require complicated string comparisons to figure out that the FNHypot token is the same function. This same issue applies to regular function names.

A similar issue also applies to arrays. When functions and subroutines are implemented, there will be a feature to allow an entire array to be passed to a function or subroutine. Exactly how this will work has not been defined yet, but this code snippet shows how it would look:

DIM Array(10)
CALL subroutine(Array)

The changes for removing the parenthesis from these tokens were very simple. In the Parser get identifier routine, when creating the string for these tokens, the length provided to the string constructor was changed to be one less than the actual length so the parenthesis would not be included. In the Translator, when reporting an error at the open parenthesis of a define function with parentheses token, the minus one was removed since the length is now one less. Finally, in the token test output routines, an open parenthesis was added to the output of these tokens.

Project – Tokens Status Enumeration

Each time a new error (token status) was added, renamed or deleted, two changes were needed: both the token status enumeration (include file) and the message array (source file) had to be changed. If the changes did not match, the two files would be out of sync. To check for problems, code had been added to initialization to check for duplicates and missing entries.

Similar to the automatic generation of the code enumeration from the table entries, the token status enumeration is now also generated automatically. Each message array element was a structure containing a token status value and a pointer to a message string. The elements were changed to just the message string with the name of the token status in a comment at the end of the line.

The codes awk script was renamed to enums and was modified to also read the token status message array and generate the token status enumeration automatically by reading the name in the comment after the message string. The name of the output file was changed from codes.h to autoenums.h to be more generic and allow for additional automatic enumerations.

During initialization, in addition to checking for duplicates and missing entries, a translation index array was built to translate from token status value to index. Both the checking and the translation array were removed since they are no longer necessary. The awk script will check for duplicates and the token status is now the same as the index into the message array.
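The generation step can be sketched as below. The project uses an awk script for this; the sketch is in Python purely for illustration, and the message array layout, the status names in the comments, and the generate_enum helper are assumptions.

```python
import re

# A C string literal followed by a trailing // comment holding the name.
ENTRY = re.compile(
    r'^\s*"(?P<msg>(?:[^"\\]|\\.)*)"\s*,?\s*//\s*(?P<name>\w+)\s*$')

# Hypothetical excerpt of a message array source file.
SAMPLE = [
    'const char *message[] = {',
    '\t"expected operator or end-of-statement",  // ExpOpOrEnd',
    '\t"expected end-of-statement",  // ExpEndStmt',
    '};',
]


def generate_enum(lines, enum_name="TokenStatus"):
    """Return (enum_text, names) built from the commented message lines."""
    names = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue  # not a message line
        name = match.group("name")
        if name in names:
            raise ValueError("duplicate token status: " + name)
        names.append(name)
    body = "".join("\t%s,\n" % name for name in names)
    return "enum %s {\n%s};\n" % (enum_name, body), names


text, names = generate_enum(SAMPLE)
print(text)
```

Because the enumeration is emitted in message order, each status value is also the index of its message, which is why no translation array is needed.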

The error type template class is no longer needed for token status errors (but is still needed for the table entry errors). Also, the duplicate and missing errors were no longer needed and were removed. Some problems were found in this template class for table entries where the range error was not working because the wrong constructor was being called. Instead of storing indexes to the errors, the variables were changed to be the type of the template. This made the range error constructor unique.

Friday, March 25, 2011

Translation – INPUT Command (Release)

The remaining problem was due to the input command handler deleting the token passed in (the token terminating the invalid string prompt expression). However, the caller (the call command handler routine) was also deleting the token, since the command handler had changed the token to point to the error token. The extra token delete was removed from the input command handler.

The INPUT command is fully working and ibcp_0.1.15-src.zip has been uploaded at the Sourceforge IBCP Project along with the binary for the program. To support the INPUT command, several other changes were needed including making the token codes and table entry indexes one and the same, correcting print function issues, correcting sub-string assignment issues, handling assignment token mode differently, handling define function tokens correctly, implementing an end statement Translator state, and implementing the reference token mode.

Next up, a slight change to the direction of this project to make it a little more interesting. Instead of plodding along with the translation of more commands, work will begin on the other components including the encoder, dictionary, recreator, program (maintaining the internal program and the program editor), and the run-time module.

The goal is to get this BASIC working as there are (almost) enough commands to make a very simple BASIC program (input, assignments and output). Once this is working, more commands can be implemented. But first a couple of minor things will be implemented before proceeding with this new direction.

Thursday, March 24, 2011

Translation – INPUT Error Debugging

The main problem with the wrong tokens being reported for errors in the INPUT command was that the token with the error was not put into the command item structure passed to the INPUT command, which is a requirement for command handlers reporting an error. Once this was added, most of the errors pointed to the correct token.

A new “expected operator, semicolon or comma” error was added for when an end statement token (for example the EOL in the incomplete statement INPUT PROMPT A$) is received after a valid string expression because this is a little more accurate than just the “expected semicolon or comma” error.

The two remaining errors were “invalid mode” bug errors that were occurring in the end expression error routine that is called when an end statement token is received during operand state. Support for the reference mode needed to be added to this routine, which needed to return the “expected variable” error.

One problem remains. The error test statement INPUT PROMPT A+B*C is causing extra token deletes, which was detected by the memory leak detection mechanism that was implemented a little while ago. To be continued...

Wednesday, March 23, 2011

Translation – INPUT PROMPT Debugging

The first problem found with the INPUT PROMPT command was that reference mode was not being set after the string prompt expression was processed. The next problem was that the Translator was still in binary operator state following the comma or semicolon after the string prompt expression was processed, so this required setting the state back to operand state, which led to the discovery of some other minor state issues.

The first operand state is very similar to the operand state except that end expression tokens (like comma, semicolon, and EOL) are also acceptable (normally considered operators). This state was implemented for the PRINT command since these tokens are allowed when an operand is expected (for example the PRINT,,A statement). This state is also set after a command is received, and it is up to the command handler to decide if an immediate end expression is allowed (which it currently is only for the PRINT command).

The equal token handler was found to be incorrectly setting the first operand state in an assignment statement. This did not cause a problem because end expression operators appearing incorrectly in expressions were being caught elsewhere (by their respective token handlers). In any case, calling it the first operand state was a little confusing, so it was renamed more appropriately to the operand or end state. The equal token handler was corrected to set only operand state.

The valid INPUT PROMPT test statements are now working. Now on to the invalid INPUT statements, which for the most part are either not reporting the correct token where an error is detected or not reporting the correct error, including some bug errors...

Monday, March 21, 2011

Translation – INPUT Debugging

Several minor issues were found and corrected. When the end of the INPUT command occurs, the last input parse code has to be marked with the end sub-code. Support for reference mode also needed to be added to the comma token handler.

When the EOL token was received by the INPUT command handler, it reused the token for the input assign code for the variable. Upon return from the command handler, the EOL token handler proceeded to delete the EOL token (which was not the EOL token anymore). To prevent this, a check was added that if the token no longer contains an EOL code, it is assumed to have been used by the command handler and will not be deleted.

The InputBegin code was just being appended to the output, but the first variable had already been added to the output, so it was after the variable instead of at the start of the statement. This code could be inserted at the beginning of the output list. However, this would only work if the INPUT command was at the beginning of the line, which may not be the case (multiple statements per line will be supported).

So instead, the element pointer in the command item will be set to the current last item in the output list when a command is pushed to the command stack. Since this pointer can no longer be checked for null to determine if an input begin code has been added, a new input begin command flag was added, which is set once an input begin code has been added to the output.
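The placement can be sketched with a plain list standing in for the output list. This is a simplified Python sketch under assumptions: translate_input is a hypothetical helper, a list index stands in for the element pointer, and the input parse codes and sub-codes are left out.

```python
def translate_input(output, variables):
    """Append a simplified INPUT translation to an existing output list."""
    # Saved when the command is pushed: the position just past the current
    # last item, standing in for the pointer to that item.
    last = len(output)
    input_begin = False   # the input begin command flag
    for var in variables:
        output.append(var)               # variable reaches the output first
        if not input_begin:
            # Insert after the saved position, not at the start of the
            # list, so earlier statements on the line are left alone.
            output.insert(last, "InputBegin")
            input_begin = True
        output.append("InputAssign")
    output.append("Input")
    return output


# Output already holds an earlier statement from the same line.
print(translate_input(["A", "1", "Assign"], ["B", "C"]))
```

The flag is needed because the saved position is always set, so it can no longer double as the indicator that the begin code has been added.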

Another problem found was that none of the input variables had their reference flag set once added to the output. The find code routine was not checking for a reference (not important) but was clearing the reference flag (a problem) since the token being checked (the input assign code) did not have its reference flag set. The reference flag of the input assign token was set before calling the process final operand routine to make the find code routine work as desired, and then cleared upon return.

The valid INPUT test statements are now working. Now on to the INPUT PROMPT statements, which are not working...

Sunday, March 20, 2011

INPUT Translation – Variable Handling and Ending

To look up the appropriate input assign code, the process final operand routine is used, which calls the find code routine that checks the token on the done stack to see if the reference flag is set since the input assign codes have the reference flag (this should not be necessary since the INPUT command uses the recently implemented reference mode). The reference variable will be popped from the done stack and the input assign code will be appended to the output before returning.

The InputAssignInt and InputAssignStr are associated codes for the InputAssign code. Once the input assign code has been appended to the output, the appropriate input parse code needs to be inserted after the input begin code or last input parse code. The easiest way to get the appropriate input parse code was to have each (InputParse, InputParseInt, and InputParseStr) be associated codes to the input assign codes. The second associated code will be used for these.

When a final semicolon token is received, the stay on line command flag is set (the same flag used for the PRINT command) and the state is set to end statement. When an end statement token is received, the INPUT command handler checks whether the stay flag is set and, if so, sets the keep sub-code of the INPUT command token, which is then appended to the output.

If an end statement token is received with no semicolon, the INPUT command token is immediately appended to the output without the keep sub-code. The INPUT command handler has now been implemented and the code compiles, so debugging can begin...

Saturday, March 19, 2011

Translator – End Statement State

Once a semicolon is received at the end of an INPUT statement, no more tokens should be received except for an end-of-statement token. If another token is received, then an error should be reported. To accomplish this, a new end statement state similar to the end expression state is needed. While the end expression state only needs to be checked when operators are expected, the end statement state needs to be checked for all tokens.

The end statement state check was added just before the check for operand or first operand state in the main translator add token routine. When in end statement state, if the token does not have the end statement flag set in its table entry, then an “expected end-of-statement” error is reported against the token.

The access function for getting the table entry flags for a token already checks if the token has a table entry, and if there is no table entry, then zero (no flags) is returned. Currently only the EOL code has the end statement flag set, but eventually the Colon, ELSE and ENDIF tokens, and possibly other codes, will have it set as well.
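The end statement check can be sketched as below. This is a minimal Python sketch: the flag value, the table contents, the state names, and the add_token signature are assumptions for illustration.

```python
END_STMT_FLAG = 0x01  # hypothetical flag value

# Hypothetical table: code -> flags. Tokens such as variables have no entry.
TABLE_FLAGS = {"EOL": END_STMT_FLAG, "Comma": 0, "SemiColon": 0}


def table_flags(code):
    """Return the table entry flags, or zero if there is no table entry."""
    return TABLE_FLAGS.get(code, 0)


def add_token(state, code):
    """Return an error message, or None if the token passes this check."""
    # Checked for every token, before the operand state checks.
    if state == "EndStmt" and not (table_flags(code) & END_STMT_FLAG):
        return "expected end-of-statement"
    return None  # the operand and operator checks would follow here


print(add_token("EndStmt", "EOL"))
print(add_token("EndStmt", "A"))
```

Supporting Colon, ELSE or ENDIF later would then only require setting the flag in their table entries.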

Friday, March 18, 2011

Translation – Define Function Token Issues

For now in the process operand routine, define function with parentheses tokens will not be allowed in command or assignment mode. This will need to be changed later when the DEF command is implemented. This check was also made in the check assignment list item routine.

However, since a define function without parentheses token is allowed in assignments, the error was set to point to the open parenthesis as an “expected equal or comma for assignment” error. The open parenthesis is at the end of the token, so to get the error to point to it, the column of the token was incremented by the length of the token minus one.

Previously in the close parenthesis token handler, the reference flag was being set for both tokens with parentheses and define function with parentheses tokens. This check was modified to only set the reference flag for tokens with parentheses.

The reference flag for define function without parentheses tokens was already being set, so no change was needed for these tokens.

Thursday, March 17, 2011

Language – Define Functions

Before defining what needs to be done with define function tokens in the Translator, a quick review of their syntax is required. There will be two forms of define functions that will be supported, a single-line form and a multiple-line form. An example of a single-line define function would be:

DEF FNHypot(X,Y)=SQR(X*X+Y*Y)

Notice that this form has the same format as an assignment except for the DEF command at the beginning. The define function token in this statement is FNHypot( - in other words, a define function with parentheses. This implies that a define function with parentheses token could appear in assignment mode (assuming the DEF command sets this mode), but only for the DEF command. An example of the multiple line form of the same function would be:

DEF FNHypot(X,Y)
FNHypot=SQR(X*X+Y*Y)
END DEF

The assignment of the define function name (a define function without parentheses) returns the value for the function. Here a define function without parentheses can appear in an assignment statement, but only inside a DEF/END DEF block. Since the Translator is not aware of blocks, it will permit an assignment of a define function without parentheses token. It will be the Encoder's job to verify if the assignment is valid.

Wednesday, March 16, 2011

Translation – Assignment Token Issues

Finally, some issues were discovered when making the change from command mode to assignment after processing an operand token. There are two main types of operand tokens, ones with parentheses and ones without parentheses.

The operand tokens with parentheses include internal functions, defined user functions (DEF FN) and generic tokens with parentheses (which can be arrays or user functions). Internal functions were already invalid for command or assignment mode except for sub-string functions. A check was added for defined functions with parentheses.

The operand tokens with no parentheses include internal functions with no arguments (currently only RND), constants, define user functions with no arguments, and generic tokens with no parentheses (which can be variables or user functions with no arguments). Internal functions and constants were already invalid because they didn't have the reference flag set when the comma or equal token looked for the reference flag. That left define function tokens...

Tuesday, March 15, 2011

Translation – Command Token Issues

The last problem statements related to unexpected command tokens were:

A PRINT B
MID$(A$ PRINT,4)=""

The first statement gave an “expected operator or end-of-statement” error at the B token. The second statement was actually accepted, but with a strange translation. The problems were caused because when the PRINT command token was received, it was immediately pushed to the command stack because the mode was still set to command.

Both statements start as assignment statements, but assignment mode was not being set (unless preceded by the LET keyword). When an equal token is received, expression mode is set, or when a comma token is received, assignment list mode is set. The “unexpected command” error was only occurring when the mode was not set to command, which didn't occur with the statements above. Also, this message again does not fit with the “expected ...” type of message.

Once a command token is received in command mode, the mode is set according to the token mode in the command's table entry. A change was made to the process operand routine that once an operand token is processed, if in command mode, the operand is assumed the beginning of an assignment statement and so the mode is changed to assignment.
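The mode changes described above can be sketched as a small transition function. This is a Python sketch under assumptions: the mode names, the token kinds, the next_mode and is_pushed_as_command helpers, and the table contents are hypothetical; only the LET entry is taken from the text.

```python
# Hypothetical table of token modes per command.
COMMAND_MODE = {"LET": "Assignment"}


def next_mode(mode, kind, text=""):
    """Return the Translator mode after a token of the given kind."""
    if mode == "Command":
        if kind == "command":
            return COMMAND_MODE[text]     # mode from the command's entry
        if kind == "operand":
            return "Assignment"           # assume start of an assignment
    elif mode in ("Assignment", "AssignList"):
        if kind == "equal":
            return "Expression"
        if kind == "comma":
            return "AssignList"
    return mode


def is_pushed_as_command(mode, kind):
    """Commands go to the command stack only while in command mode."""
    return kind == "command" and mode == "Command"


# The statement "A PRINT B": PRINT arrives after the operand A.
mode = next_mode("Command", "operand")
print(mode, is_pushed_as_command(mode, "command"))
```

With the operand switching the mode first, the PRINT token in A PRINT B is no longer pushed to the command stack and instead passes through the rest of the Translator.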

To remove the “unexpected command” error and report a more appropriate error, the command token has to be passed through the rest of the Translator. This will occur when the Translator is not in command mode. So the main add token routine was changed to not report this error if a command token is received and the mode is not command.

Command tokens received in operand state were already being reported correctly since commands are also considered operators, which are not valid operands (unless the operator is a unary operator, which commands are not). Commands are considered operators because some commands can be found where an operator is expected, for example, THEN and ELSE.

Command tokens received in operator state are only valid if they have a token handler. In the process operator routine, when an operator token does not have a token handler, a default operator token handler is called. Before the default operator token handler is called, a check was added to return an appropriate error if the token is a command.

Sunday, March 13, 2011

Translation – Other Errors

After changing the “item cannot be assigned” to “expecting item for assignment” error, there were several other errors that didn't fit the “expecting ...” type of message. It turned out that most of these were not actually being used, so they were removed.

Another remaining message was the “missing open parentheses” error that occurs when there is a closing parenthesis with no open parenthesis, function or array. After some consideration of possibly leaving this message as is, it was decided to change this to an “expected operator or end-of-expression” error since the problem could also be a missing function or array, or even that the open parenthesis was just a mistake.

Again assuming that everything is correct up to the problem, this change seemed appropriate, and “...expression” was used instead of “...statement” because the next token could be a comma or semicolon in a PRINT statement or a THEN in an IF statement.

The last message was the “unexpected command” error that occurs when there is a command token when not in command mode. However, there were a number of additional problems with command tokens received when not expected...

Translation – Parentheses Issue

The next problem statement was:

MID$((A$),4)=""

This was reported as an “item cannot be assigned” error and then crashed. Again, this error didn't fit the “expected ...” messages. This error is also returned for statements like 3=A and 1,A=B and was renamed to the “expecting item for assignment” error. For the statement above, the “expecting string variable” error should be returned.

The crash occurred because the open parenthesis token was returned for the error with its range extended to the closing parenthesis to report the entire (A$). The caller deleted the error token, since it was an open parenthesis, to prevent a memory leak. However, in this case, the A$ token was still on top of the done stack with the open parenthesis attached as the first operand. When the Translator clean up routine (called upon an error) was emptying the done stack, it deleted each item's first and last operand, so the open parenthesis was getting deleted twice, causing the crash.

Initially to fix this problem, when an error occurs and the first through last operand is returned, the first operand pointer for the item on the done stack was set to null to prevent it from being deleted a second time. While this fix was sufficient for the statement above, this statement was still not being reported correctly:

MID$(-A$,4)=""

The error was “expected numeric expression” pointing to the A$. Both the open parenthesis and the minus are initially processed in the process unary operator routine. So a check was added to this routine to return an error if there is a sub-string function on top of the hold stack with its reference flag set (sub-string assignment) and it is at the first operand. Both of these statements then correctly reported “expected string variable” at the open parenthesis and minus. The initial fix was not necessary since the error was being caught sooner.