Interactive BASIC Compiler Project

Saturday, April 10, 2010

Translator – Comma Token Processing

Some comma token processing has already been implemented – commas separating subscripts of arrays or arguments of functions. Commas outside of arrays or functions caused an “unexpected comma” error. This processing needs to be extended to include comma separated multiple assignment statements. Again the current mode will determine how the Translator processes comma operator tokens:

Command: The beginning of a comma separated multiple assignment. The mode will be changed to Comma indicating a comma separated multiple assignment. The comma token is not needed, so it can be deleted. Eventually an assignment operator will be pushed on the hold stack.

Equal: Continuation of a multiple equal assignment statement, so a comma token at this point causes an “unexpected comma in assignment” error.

Comma: Continuation of a comma separated list of a multiple assignment. No further action is needed so the comma token can be deleted.

Expression: Within an expression, so processing proceeds with the previously implemented comma processing. When the counter stack is empty or the top counter has a zero value, an “unexpected comma in expression” error occurs (changed from “unexpected comma” to differentiate it from the “unexpected comma in assignment” error).

The comma token has the same low precedence as closing parentheses and assignment operators so most tokens will be removed from the hold stack. Comma tokens are not pushed to the hold stack so they won't be removed. Closing parentheses are also not pushed to the hold stack. And there will never be an assignment operator pushed to the stack before a comma token is processed.
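The four cases above can be sketched as a single dispatch on the mode. This is only a minimal sketch: the names Mode, CommaResult and process_comma() are hypothetical, and a std::vector stands in for the counter stack.

```cpp
#include <string>
#include <vector>

// Hypothetical names; the four modes come from the description above.
enum class Mode { Command, Equal, Comma, Expression };

struct CommaResult {
    Mode mode;          // mode after the comma token has been processed
    std::string error;  // empty when the comma is accepted
};

// Process one comma token; counters stands in for the counter stack.
CommaResult process_comma(Mode mode, std::vector<int> &counters)
{
    switch (mode) {
    case Mode::Command:
        // start of a comma separated multiple assignment
        return {Mode::Comma, ""};
    case Mode::Equal:
        // the two forms of multiple assignment cannot be mixed
        return {mode, "unexpected comma in assignment"};
    case Mode::Comma:
        // continuation of the list, nothing more to do
        return {mode, ""};
    case Mode::Expression:
        // previously implemented handling: count the comma if inside
        // an array or function, otherwise report an error
        if (counters.empty() || counters.back() == 0)
            return {mode, "unexpected comma in expression"};
        ++counters.back();
        return {mode, ""};
    }
    return {mode, "BUG: invalid mode"};
}
```

In every non-error case the comma token itself would be deleted by the caller, since it is never pushed onto the hold stack.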

Translator – Equal Token Processing

The Parser returns an equality operator token for an equal. The current mode will determine how the Translator processes equality operator tokens:

Command: The token will be changed to an assignment operator and the mode will be changed to Equal indicating a possible multiple equal assignment and to prevent a comma separated assignment. The assignment operator will be pushed onto the hold stack.

Equal: Continuation of a multiple equal assignment statement. There is already an assignment operator on the hold stack, so this token can be deleted. No further action is needed. During run-time, the single assignment operator will handle multiple variable references.

Comma: The end of a comma separated list of a multiple assignment. The token will be changed to an assignment operator and the mode will be changed to Expression indicating the start of the expression being assigned. The assignment operator will be pushed onto the hold stack.

Expression: Within an expression, equals are equality operators, so the token (already an equality operator) will be processed as a regular binary operator.

The assignment operator will have a very low precedence, which will keep it on the hold stack until the end of the statement. Its precedence can be the same as the closing parentheses. The assignment token will be first on the hold stack, so there won't be a closing parentheses token to remove it. The closing parentheses only empties the hold stack to an opening parentheses or internal function token (with a lower precedence), so it won't see an assignment token.
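The equal token cases can be sketched the same way. The names Mode, EqualAction and process_equal() are hypothetical. The sketch only covers the equal token itself; the switch from Equal mode to Expression mode caused by other operator tokens is not shown.

```cpp
// Hypothetical names; the modes and actions follow the description above.
enum class Mode { Command, Equal, Comma, Expression };

enum class EqualAction {
    PushAssign,  // change token to an assignment operator, push on hold stack
    Delete,      // an assignment operator is already on the hold stack
    Equality     // process as a regular binary (equality) operator
};

struct EqualResult {
    Mode mode;           // mode after the equal token has been processed
    EqualAction action;  // what to do with the token
};

EqualResult process_equal(Mode mode)
{
    switch (mode) {
    case Mode::Command:
        return {Mode::Equal, EqualAction::PushAssign};
    case Mode::Equal:
        return {Mode::Equal, EqualAction::Delete};
    case Mode::Comma:
        return {Mode::Expression, EqualAction::PushAssign};
    case Mode::Expression:
        return {Mode::Expression, EqualAction::Equality};
    }
    return {mode, EqualAction::Equality};  // not reached for valid modes
}
```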

Friday, April 9, 2010

Translator – Multiple Assignment Vs. Equality

There are two types of multiple assignment statements being supported – multiple equals and comma separated. Having a two value “in command” flag is sufficient for determining whether an equal is an assignment or equality operator. However, it's not sufficient for knowing which type of multiple assignment is present. For this, two more values are needed, whether in a multiple equal assignment or in a comma separated assignment.

Instead of calling this the “in command” flag, it will be called mode. There will be four modes: in command, in multiple equal assignment, in comma separated assignments, or in expression.

In addition to knowing if the Translator is in command mode or expression mode, the Translator also needs to know if it is in a multiple equal assignment or a comma separated assignment. At the beginning of the line, the mode will be set to command.

For equal tokens, comma tokens and other operator tokens, this mode variable will be used to determine what action to take or what error needs to occur. One of the error conditions to detect is if the two types of multiple assignment statements are mixed, something that is not allowed.

Thursday, April 8, 2010

Translator – Assignment Vs. Equality

The Translator needs to decide when to add an equality operator to the output list and when to add an assignment operator. For now, commands won't be considered since support for commands has not been added to the Translator yet. Consider these assignment statements:

A = X * 5
A = B = C + D
A = B + (C = 5) * D
A = B = C + (D = 5)

There needs to be a flag for when an equal should be an assignment operator and when it should be the equality operator. At the beginning of a line, a command is expected, so when there is an equal, it should be interpreted as an assignment operator (which is technically an assignment command). Once the expression starts, any equals should be interpreted as equality operators.

In the second example statement, the first and second equals are assignment operators since the expression to be assigned hasn't started yet. In the third example, the plus begins the expression, so the second equal is an equality operator. In the last example, the plus again begins the expression, so the third equal is an equality operator.

The flag will indicate whether the Translator is within a command and will be initialized to on at the beginning of the line. If a non-assignment operator is added to the output list this “in command” flag will be turned off. At each equal, if the flag is on then an assignment operator will be put on the hold stack; and if the flag is off, then an equality operator will be put on the hold stack.
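A small sketch of the flag applied to the example statements follows. The function classify_equals() is hypothetical. It assumes tokens are separated by spaces and, as a simplification, turns the flag off as soon as any operator or parentheses token is seen rather than when the operator reaches the output list.

```cpp
#include <sstream>
#include <string>

// Returns one letter for each "=" in the statement, in order:
// 'A' for an assignment operator, 'E' for an equality operator.
std::string classify_equals(const std::string &statement)
{
    static const std::string operators = "+-*/()";
    bool in_command = true;  // on at the beginning of the line
    std::string result;
    std::istringstream tokens(statement);
    std::string token;
    while (tokens >> token) {
        if (token == "=") {
            result += in_command ? 'A' : 'E';
        } else if (token.size() == 1
                   && operators.find(token) != std::string::npos) {
            in_command = false;  // the expression has started
        }
        // operands do not affect the flag
    }
    return result;
}
```

For the third example statement, written with spaces as "A = B + ( C = 5 ) * D", this returns "AE": the first equal is an assignment and the second is an equality operator.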

Wednesday, April 7, 2010

Translator – References

When processing operands, the Translator needs to distinguish between the values for and references to variables and arrays. This will be accomplished with a flag in Token to indicate whether it contains a reference or not. When a variable or array token is added to the output list, its reference flag will be set. When the Encoder processes tokens from the output list to generate the internal code, it will generate a push reference instruction if the reference flag is set and a push value instruction if the reference flag is not set.

In the Translator, when a non-assignment operator is processed and it pops its operands off of the done stack, it will clear the reference flag in the token of the operand since it needs values and not references. For an assignment operator, the reference flag of only the second operand will be cleared; however, the reference flag of the first operand needs to be checked to make sure that it is set, so that expressions like “5 = A” or “A+B = 4” will cause a “non-reference cannot be assigned” error.
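The flag handling can be sketched as follows. The Token structure here is a stripped down stand-in, and apply_binary_operator() and apply_assignment() are hypothetical names for the operand handling of the two kinds of operators.

```cpp
#include <string>

// Stand-in for the real Token; only the parts needed for the sketch.
struct Token {
    std::string text;
    bool reference;  // set for variable and array tokens on the output list
};

// Non-assignment binary operator: both operands are needed as values.
void apply_binary_operator(Token &first, Token &second)
{
    first.reference = false;
    second.reference = false;
}

// Assignment operator: the second operand is a value, the first must be
// a reference. Returns an empty string when the assignment is valid.
std::string apply_assignment(Token &first, Token &second)
{
    second.reference = false;
    if (!first.reference)
        return "non-reference cannot be assigned";
    return "";
}
```

For “A+B = 4”, the first operand of the assignment is the result of the add operator, which is not a reference, so the error is reported.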

As previously mentioned, the Translator does not know the difference between a variable and function with no arguments or an array and a function with arguments. Therefore, it could be setting the reference flag for a token that refers to a function and not a variable or array. This will not cause a problem however. Only tokens on the left side of an assignment operator will have the reference flag set and there should be no functions on the left side (except in the case of setting the return value of a function with its own function name, where setting the reference flag makes sense, but functions are much later).

Tuesday, April 6, 2010

Assignments and References

Assignment statements will be processed like any expression with binary operators; however, the first operand of an assignment is handled differently than the second operand (which is handled the same as the operands of other binary operators). Consider this expression with its RPN translation:

A + B  →  A B +

During run-time, the value for the variable A is pushed onto the evaluation stack followed by the value of variable B. When the add operator is executed, it pops the two values off of the stack, adds them, and then pushes the resulting value back onto the stack. Now consider an assignment expression with its RPN translation:

A = B  →  A B =

During run-time, the assignment operator needs to know where to store the value being assigned. So, for the A operand, a reference to the variable value needs to be pushed onto the evaluation stack instead of its value. The assignment operator will not be pushing anything back onto the evaluation stack.

The assignment operator will always be the last operator to be processed, so to handle multiple assignments, the multiple references will be pushed onto the evaluation stack. The assignment operator will simply keep popping references and assigning the value until the stack is empty.
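A minimal sketch of how the run-time assignment operator could work is shown below. StackItem and run_assign() are hypothetical names, only double values are handled, and a std::vector stands in for the evaluation stack.

```cpp
#include <vector>

// One item on the evaluation stack: either a reference or a value.
struct StackItem {
    double *reference;  // non-null when a reference was pushed
    double value;       // used when reference is null
};

// Assignment operator: pops the value being assigned, then keeps popping
// references and storing the value until the stack is empty.
void run_assign(std::vector<StackItem> &stack)
{
    double value = stack.back().value;
    stack.pop_back();
    while (!stack.empty()) {
        *stack.back().reference = value;
        stack.pop_back();
    }
    // nothing is pushed back onto the stack
}
```

For A = B = 5 the stack would hold a reference to A, a reference to B and the value 5 when the assignment operator runs, so both variables receive 5.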

Monday, April 5, 2010

Assignment Statements

In BASIC, the “=” character represents both the equality operator and the assignment operator – it all depends on where the “=” is found. This is different from C, which has unique operators for equality (“==”) and assignment (“=”). Consider these examples:

IF A = 5 THEN
A = 5
A = B = 5

The first example contains the equality operator and the second example has the assignment operator. In the third example, does the statement contain two assignment operators (like C) or does it contain an assignment operator and an equality operator? In GW-Basic, SmallBASIC, FreeBASIC and QuickBASIC, it is the latter, assigning A a true or false value depending on whether B is equal to 5 or not.

Multiple assignments can be convenient so they will be supported like in the third example above. But it can also be convenient to be able to use the equality operator (or any relational operator) in expressions beyond conditional expressions (in IF, WHILE, UNTIL, etc. statements). Two forms of multiple assignments will be supported:

Variable1 = Variable2 = Variable3 = Expression
Variable1 , Variable2 , Variable3 = Expression

For the first form, the “=” characters will be interpreted as assignment operators until the expression starts, in other words, once there is another operator, including an expression in parentheses:

Variable1 = Variable2 = (Variable3 = Expression)
Variable1 = Variable2 = Variable3 + Expression1 = Expression2

Here Variable1 and Variable2 will be assigned a true or false value depending on whether Variable3 is equal to Expression in the first example, or whether Variable3 + Expression1 is equal to Expression2 in the second example (the third “=” in both examples is the equality operator).

Sunday, April 4, 2010

Translator – Internal Functions (Release)

During testing of the “wrong number of arguments” error, it was noticed that the error was reported on the closing parentheses. While this is acceptable, it would be better if the error pointed to the actual function. However, the caller always pointed to the token that caused the error. Therefore, in order to report an error at any token, the token pointer argument of the add_token() function was changed to a reference, so that it can be set to any token and the caller will point to that token instead when the error is reported.

The name2 member is being used for the name to output during testing, so for example, for Mid2_Code and Mid3_Code, name2 is set to “MID2$(” and “MID3$(” so that it can easily be seen which code is in the token. This was needed when outputting internal function tokens in addition to operator tokens. So instead of copying these lines of code, a new access function debug_name() was added to get the appropriate name for debug test output.

I also realized there were some memory leaks in the code - the EOL token was not deleted; when an error occurs, the token of the error was not deleted; and when the token is replaced (see above), the original token was not deleted. Therefore, the code was cleaned up to prevent these memory leaks. Nothing was done when the “BUGs” occur since these should not happen once the code is debugged.

To make the code clearer, “Error_” was added to all the error and “BUG_” was added to all the diagnostic error Translator status enumeration names. By the way, each of these codes is only used in one place so that it is easy to figure out where an error occurs (if one “stack empty” was used instead of five unique codes, then if one occurs, there would be confusion about which place in the code it came from).

The code now checks the number of arguments for internal functions and ibcp_0.1.7-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Next handling assignment statements and the concept of references...

Translator – Internal Functions (Implementation)

The number of arguments member needs to be added to the TableEntry structure and values to the table entries along with an access function to this member in the Table class. New entries are needed for the MID$, ASC and INSTR functions with the appropriate number of arguments. A new multiple entry flag needs to be added to the first entry for each of these functions to indicate that more entries exist with a different number of arguments.

The check for the number of arguments for internal functions will be added to the closing parentheses token processing in the array or function section. If the token popped off of the hold stack is an internal function and the number of operands doesn't match the number of arguments in the table, then if the multiple entry flag is set it needs to search for another entry of the function that matches the number of operands. If found, the token's index will be changed to the new entry's index. Otherwise a “wrong number of arguments” error will occur.

A new table search function is needed that will search for a function with a specific number of arguments. The index of the first entry for the function will be specified assuming that it has already been checked for the number of arguments (this is the index the Parser originally returned). The search will proceed with the entry after the index and continue until the end of the internal functions (when an entry with a NULL name is found). The additional entries must be in the same internal functions section (words with parentheses).
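The check and the search can be sketched as follows. The table contents, the argument counts, Multiple_Flag, search() and check_arguments() are all hypothetical and only illustrate the approach; the real table has many more members and entries.

```cpp
#include <cstddef>
#include <cstring>

enum { Multiple_Flag = 1 };  // hypothetical flag: more entries exist

struct TableEntry {
    const char *name;  // NULL marks the end of the internal functions
    int nargs;         // number of arguments for this form
    int flags;
};

// Illustrative entries only; the argument counts are assumptions.
static const TableEntry table[] = {
    {"ASC(", 1, Multiple_Flag},    // index 0
    {"ASC(", 2, 0},                // index 1
    {"INSTR(", 2, Multiple_Flag},  // index 2
    {"INSTR(", 3, 0},              // index 3
    {"MID$(", 2, Multiple_Flag},   // index 4
    {"MID$(", 3, 0},               // index 5
    {"INT(", 1, 0},                // index 6
    {NULL, 0, 0}
};

// Search the entries after index for the same function name with the
// requested number of arguments. Returns the new index, or -1.
int search(int index, int nargs)
{
    for (int i = index + 1; table[i].name != NULL; i++) {
        if (std::strcmp(table[i].name, table[index].name) == 0
            && table[i].nargs == nargs)
            return i;
    }
    return -1;
}

// Check made at the closing parentheses. Returns the index the token
// should have, or -1 for a "wrong number of arguments" error.
int check_arguments(int index, int noperands)
{
    if (table[index].nargs == noperands)
        return index;
    if (table[index].flags & Multiple_Flag)
        return search(index, noperands);
    return -1;
}
```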

Saturday, April 3, 2010

Translator – Internal Functions

The Translator is unable to check the number of operands for arrays or functions without more information – information that will be contained in the Dictionary, so this checking will be left for the Encoder. However, the number of arguments for internal functions is fixed and can be checked within the Translator.

The number of arguments cannot be checked without knowing the number, therefore a new number of arguments member needs to be added to the Table entries. However, some internal functions may have more than one form. Several of these types of functions are currently planned, which include MID$, ASC and INSTR.

This means that there will need to be multiple Table entries, like for the existing Sub_Code and Neg_Code for the minus operator. For example with MID$, there will be Mid2_Code and Mid3_Code. Initially the Parser will set the code to Mid2_Code since that will be the first entry found in the table.

When the MID$ token is about to be added to the output list, the Translator will have the number of operands that needs to be checked. If the number matches what's in the table, nothing further needs to be done. If the number doesn't match, the Translator either needs to return an error or go looking for another Table entry with the correct number of operands. Therefore, there needs to be a flag in the Table that indicates that there are more entries. If this flag is set then the Translator needs to look for the other entry.

Friday, April 2, 2010

Translator – Arrays and Functions (Release)

The SimpleStack class is in the new source file stack.h. A simple stack test program was written to test the SimpleStack class, especially the automatic expanding feature of SimpleStack. There is no VIDE2 project file for this program; it was compiled with the command line “g++ test/test_stack2.cpp -o teststack2.exe” via MSYS.

Several test expressions were added for testing arrays and functions including multiple levels of identifiers with parentheses, defined functions and internal functions. At this point there is no difference between arrays and user functions – both are identifiers with parentheses. The last several expressions are ones that were used to test error conditions (these actually caused problems during testing).

A change was made to how the test output is generated specifically related to internal operator codes. Right now there is only one of these, the Negate code. Originally the name string was set to “Neg” to distinguish between “-” (binary operator subtraction) and “-” (unary operator negation) in the test output. However, the name will eventually need to be “-” for the Recreator. There will be many more of these operators. Therefore, name2 will now be used for the test output and name will be for the actual output. For test output, if name2 is not set (NULL) like it is for all operators, name will be used for the test output, otherwise name2 will be used.

The code now handles arrays and functions and ibcp_0.1.6-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program (the test stack program is not included). Next the checking for the correct number of arguments for internal functions will be added to the Translator...

Translator – Arrays and Functions (Implementation)

For the new counter stack in the Translator class, the SimpleStack class was implemented. This class is a very simple stripped down version of the List class implementation that uses a simple allocated array (expanded as needed) instead of using a doubly linked list. The template contains arguments for the initial size of the array and the size the array is increased by when it gets filled (both arguments default to 10). A key feature of this class is that the top() function can be used to manipulate the item on top of the stack (in this case a comma counter) directly.
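The following is a sketch of a stack along these lines, not the code in stack.h. The member names and the exact interface are assumptions; only the two template size arguments and the top() behavior come from the description above.

```cpp
// Simple array based stack that expands as needed (a sketch only).
template <typename T, int initial_size = 10, int increase_size = 10>
class SimpleStack {
    T *stack_;  // allocated array holding the items
    int top_;   // index of the top item, -1 when empty
    int size_;  // current allocated size of the array
public:
    SimpleStack()
        : stack_(new T[initial_size]), top_(-1), size_(initial_size) {}
    ~SimpleStack() { delete[] stack_; }
    SimpleStack(const SimpleStack &) = delete;
    SimpleStack &operator=(const SimpleStack &) = delete;

    bool empty() const { return top_ < 0; }

    void push(const T &item)
    {
        if (top_ + 1 == size_) {  // array is full, expand it
            T *bigger = new T[size_ + increase_size];
            for (int i = 0; i < size_; i++)
                bigger[i] = stack_[i];
            delete[] stack_;
            stack_ = bigger;
            size_ += increase_size;
        }
        stack_[++top_] = item;
    }

    // The caller must check empty() before calling pop() or top().
    T pop() { return stack_[top_--]; }

    // Returns a reference so the top item can be changed in place.
    T &top() { return stack_[top_]; }
};
```

With a counter stack declared as SimpleStack<int>, a comma can then be counted with a statement like ++counter_stack.top() (the variable name is hypothetical).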

When an array or function token is encountered (any token that has a parentheses), it is pushed onto the hold stack. The Translator state is left set to Operand since the next token must be an operand (or optional unary operator). The pending parentheses check needs to be made before the token that has a parentheses is handled.

For comma handling, a 0 is pushed onto the counter stack for an open parentheses token. If a comma token is encountered and the counter stack is either empty or the top of the counter stack is 0, then an “unexpected comma” error occurs. For a token that has a parentheses, a 1 is pushed onto the counter stack. For each comma token encountered, the counter on top of the stack is incremented.

Now when a closing parentheses token is encountered, if the counter stack is empty, then a “missing opening parentheses” error occurs. The top counter is popped off the stack. If this counter value is zero, then there was an opening parentheses, so an open parentheses token is expected on top of the hold stack. The pending parentheses token pointer only needs to be set for closing parentheses tokens.

When the value popped from the top of the counter stack is not zero, it contains the number of operands for an array or function, which is expected on top of the hold stack (specifically a token that has a parentheses). For now, the operands are just popped from the done stack. The array or function token is popped from the hold stack, added to the output list and pushed on the done stack.

Other changes needed were: making the precedence check when emptying the hold stack work with tokens that don't have table entries (array and function tokens); at the EOL processing, adding a token that has parentheses to the missing closing parentheses check; and emptying the counter stack in the clean up routine.
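To show how the hold stack and the counter stack work together, here is a much reduced translator sketch. It is not the Translator code: tokens are plain strings separated by spaces, std::vector stands in for the stacks, the done stack and pending parentheses handling are left out, and the precedence values other than 2 are made up for the example.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

static bool is_operator(const std::string &t)
{
    return t == "+" || t == "-" || t == "*" || t == "/";
}

// Array and function tokens end with an open parentheses, like "MID$(".
static bool has_paren(const std::string &t)
{
    return t.size() > 1 && t.back() == '(';
}

static int precedence(const std::string &t)
{
    if (t == "*" || t == "/") return 20;  // made up values
    if (t == "+" || t == "-") return 10;
    return 2;  // "(" and tokens with parentheses stay on the hold stack
}

// Translates space separated tokens to RPN, or returns "error: ...".
std::string translate(const std::string &input)
{
    std::vector<std::string> hold, output;
    std::vector<int> counters;
    std::istringstream in(input);
    std::string token;
    while (in >> token) {
        if (token == "(" || has_paren(token)) {
            hold.push_back(token);
            counters.push_back(token == "(" ? 0 : 1);
        } else if (is_operator(token)) {
            while (!hold.empty()
                   && precedence(hold.back()) >= precedence(token)) {
                output.push_back(hold.back());
                hold.pop_back();
            }
            hold.push_back(token);
        } else if (token == "," || token == ")") {
            // empty the hold stack down to the parentheses token
            while (!hold.empty() && precedence(hold.back()) > 2) {
                output.push_back(hold.back());
                hold.pop_back();
            }
            if (token == ",") {
                if (counters.empty() || counters.back() == 0)
                    return "error: unexpected comma";
                ++counters.back();
            } else {
                if (counters.empty())
                    return "error: missing opening parentheses";
                int count = counters.back();
                counters.pop_back();
                if (count > 0)  // array or function with count operands
                    output.push_back(hold.back());
                hold.pop_back();
            }
        } else {
            output.push_back(token);  // operand
        }
    }
    if (!counters.empty())
        return "error: missing closing parentheses";
    while (!hold.empty()) {
        output.push_back(hold.back());
        hold.pop_back();
    }
    std::string result;
    for (std::size_t i = 0; i < output.size(); i++)
        result += (i ? " " : "") + output[i];
    return result;
}
```

For example, translate("( A + B ) * C") gives "A B + C *", and translate("( A , B )") gives the unexpected comma error because the top counter is 0.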

Sunday, March 28, 2010

Translator – Arrays and Functions (Design Issues)

The first issue is that the number of operands counter cannot be a simple variable because operands of arrays and functions can be other arrays and functions, so there needs to be a counter variable for each nested array or function. Therefore a counter stack is needed for multiple nested levels of arrays or functions. Using the existing List class for this simple counter stack would be overkill, so a new simple stack class will be developed with a similar interface (so it would be easy to switch from one to the other for stacks).

The second issue is when an array or function token is pushed on the stack. The Paren and DefFuncP token types do not have a table index and therefore a precedence can't be obtained from a table entry. To resolve this, a new precedence() function is needed in the Token class, which will work the same as the is_operator() and has_paren() functions using a static precedence array. A new precedence() function will be added to the Table class with a Token pointer argument that will use the new Token precedence() function.

The values in the static precedence array within the Token class will be initialized to -1 if the token type has an associated table entry, otherwise the value is set to the desired precedence. For the Command, Operator and IntFuncP token types that have table entries, the token precedence will be set to -1 and for the DefFuncP and Paren token types, the precedence will be set to 2 (the same as the open parentheses precedence). The other token types won't be pushed to the hold stack so no value is needed at the moment.
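A sketch of the two precedence() functions is below. The enumeration names, the reduced Token and Table definitions and the table precedence values are hypothetical; only the -1 and 2 values in the static array come from the description above.

```cpp
// Hypothetical token type names; only five of the types are shown.
enum TokenType {
    Command_TokenType,
    Operator_TokenType,
    IntFuncP_TokenType,
    DefFuncP_TokenType,
    Paren_TokenType,
    sizeof_TokenType
};

struct Token {
    TokenType type;
    int index;  // table index, only valid for types with table entries

    static const int prec[sizeof_TokenType];

    // -1 means the precedence must come from the table entry.
    int precedence() const { return prec[type]; }
};

const int Token::prec[sizeof_TokenType] = {
    -1,  // Command: has a table entry
    -1,  // Operator: has a table entry
    -1,  // IntFuncP: has a table entry
    2,   // DefFuncP: same as the open parentheses
    2    // Paren: same as the open parentheses
};

struct Table {
    // Made up precedence values standing in for the real table.
    static int precedence(int index)
    {
        static const int table_precedence[] = {40, 46};
        return table_precedence[index];
    }

    // Uses the token type value unless it is -1, then uses the table.
    static int precedence(const Token *token)
    {
        int value = token->precedence();
        return value != -1 ? value : precedence(token->index);
    }
};
```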

Saturday, March 27, 2010

Translator – Commas

The commas that separate the subscripts or arguments will be processed as operators. The precedence of the comma operator will be the same as the closing parentheses, which will empty the hold stack up to but not including the array or function token. The comma will be processed as a token requiring special processing, where the comma will simply be counted and the comma token will not be pushed onto the hold stack.

This will require a number of operands counter. This counter is initialized to 0. When a parentheses token is pushed onto the hold stack, this variable is set to 1 indicating there is an array or function token on the stack. Each comma operator processed increments this counter. If no comma tokens are seen, the counter remains set to 1. If the counter is 0 when a comma token is seen, then an “unexpected comma” error will be reported.

So now, upon reaching the closing parentheses, if the counter is 0, then an open parentheses token is expected to be on top of the hold stack as previously implemented. If the counter is 1 or more, then the counter contains the number of operands added to the output list and an array or function token is expected to be on top of the hold stack. The array or function token is then popped off of the hold stack and added to the output list.

Translator – Arrays and Functions

As mentioned, the Translator will not know the number of subscripts that arrays should have. It will also not know (except for internal functions) the number of arguments that functions should have. Worse, the Translator will not even know if an identifier with a parentheses refers to an array or a function. (The Translator will be able to identify a defined function since the Parser has already identified these by the “FN” with their own token type.)

As it turns out, translation wise, arrays and functions can be handled identically – one has subscripts and the other has arguments. But these can also be thought of as operands just like for operators except where operators have one (unary) or two (binary) operands, arrays and functions can have one or more operands. Functions with no arguments (including the IntFuncN and DefFuncN token types) are already being handled as simple operands – the Translator didn't need to distinguish between variables and functions with no arguments.

To translate arrays and functions, a token that has a parentheses will be pushed onto the hold stack. The precedence of these tokens will be the same as the open parentheses (between Null and close parentheses for the same reason) to keep them on the hold stack. Upon reaching the matching closing parentheses token, the number of operands will be counted (and validated for an internal function). The array or function token will then be popped off of the hold stack and added to the output list. But exactly how will the operands be counted...

Translator – Functions

Now that the translation of arrays has been defined, it's time to define the translation of functions. Functions (whether Internal, one-line Defined or User) have arguments separated by commas. Take the example MID$(A$+B$,INT(A+0.5),I+5) containing two internal function calls where the tokens will be parsed as:

MID$( A$ + B$ , INT( A + 0.5 ) , I + 5 )

Again there are tokens for commas and closing parentheses. This expression will be translated to RPN as:

A$ B$ + A 0.5 + INT( I 5 + MID$(

The functions are found after their associated arguments. Functions will pop their already evaluated arguments off of the run-time stack (which will be in reverse order), perform their operations, and then push their result back onto the run-time stack for the next part of the expression.

Though this example contains only internal functions, one-line defined functions and user functions work the same way. The big difference is that the Translator (using Table) will know the number of arguments that internal functions have and can perform error checking for the number of arguments. However, this can't be done for one-line defined (FN) functions and user functions. This job will be left for the Encoder (which will access the Dictionary for this information).