Interactive BASIC Compiler Project: July 2010

Saturday, July 31, 2010

Translator – New Find Code (Multiple Flag)

The next issue for the new find code implementation is with internal functions that have different numbers of arguments (MID$, INSTR and ASC). Currently the number of arguments is checked at the closing parentheses, before the call to find code that checks the data types of the arguments. Before calling find code, if the number of arguments was not correct but the Multiple Flag was set, a search was made for another code where the number of arguments did match.

For the new implementation, the data types of the internal function arguments will be checked as each argument is processed – in other words, at the comma token (or close parentheses token for the last argument). The checking for an alternate form of the internal function will now take place in the comma token handler.

Previously, to support checking if a comma was valid in an internal function call (as opposed to a closing parentheses), the internal function table entry with the largest number of arguments had to be first in the table. Otherwise an error would occur, since a closing parentheses would be expected if the table entry with the smaller number of arguments was listed first. The number of arguments would be checked, and the code changed if necessary, in the closing parentheses token handler.

Now with the comma token handler taking care of changing the code when needed, the table entry with the smaller number of arguments needs to be first. If a comma is received where a parentheses is expected (at the last argument), a check will be made, and if the Multiple Flag is set, it will move to the next table entry with the next number of arguments (otherwise an “expected closing parentheses” error occurs).
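The comma check described above can be sketched in Python as follows. This is only an illustration: the table layout, the field names (`nargs`, `multiple`), the code names (`Mid2`, `Mid3`, `Len`) and the function `on_comma` are hypothetical; only the ordering rule and the error text come from the post.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str       # hypothetical code name
    nargs: int      # number of arguments this form takes
    multiple: bool  # Multiple Flag: a form with more arguments follows


# Hypothetical table: forms of one function are adjacent, fewest arguments first.
TABLE = [
    Entry("Mid2", 2, True),
    Entry("Mid3", 3, False),
    Entry("Len", 1, False),
]


def on_comma(index, args_seen):
    """Return the table index to use after a comma follows argument args_seen."""
    entry = TABLE[index]
    if args_seen < entry.nargs:
        return index      # more arguments are expected, so the comma is valid
    if entry.multiple:
        return index + 1  # comma at the last argument: move to the next form
    raise SyntaxError("expected closing parentheses")
```

With this table, a comma after the second argument of the two-argument form switches to the three-argument form, while a comma after the third argument is an error.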

This method implies that the table entries for an internal function must be in order by number of arguments. If there are three forms of an internal function (there are currently none), then the Multiple Flag should be in each table entry except the last. To ensure the table entries are placed in the table correctly, a check will be added to the table initialization along with the other current table checks.
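A table initialization check along these lines might look like the Python sketch below. The tuple layout `(name, nargs, multiple_flag)`, the function name `check_multiple_order` and the error wording are assumptions made for illustration; the rule checked is the one stated above, that an entry with the Multiple Flag must be followed by an entry with more arguments.

```python
def check_multiple_order(table):
    """Return a list of error strings for misplaced Multiple Flag entries.

    Each table entry is a hypothetical (name, nargs, multiple_flag) tuple.
    """
    errors = []
    for i, (name, nargs, multiple) in enumerate(table):
        if not multiple:
            continue
        if i + 1 >= len(table):
            errors.append(f"{name}: Multiple Flag set on last table entry")
            continue
        next_name, next_nargs, _ = table[i + 1]
        if next_nargs <= nargs:
            errors.append(
                f"{name}: next entry {next_name} does not take more arguments"
            )
    return errors
```

A function with three forms passes the check when the first two entries carry the flag and the argument counts increase.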

Translator – New Find Code (Reference Flag)

There has been a struggle with a number of issues that have turned up with the new find code implementation, including reference checking, internal functions with multiple forms (different numbers of arguments), and multiple assignments (specifically multiple equal assignments, which may need to be removed from the language).

The current find code routine checks if the first argument is a reference if the table entry for the token has the Reference Flag. This was used for the assignment operators when called from the set assign command routine, which is called from the comma and equal token handlers (assignment operators have the Reference Flag). If this would be the only location that needs to check for a reference flag, the checking could be moved from the find code routine to the set assign command routine.

However, there is another situation that needs to check for a reference flag - the first argument to a sub-string assignment. This is currently handled by an additional check in the find code routine. With the new find code, this will occur at the first comma for the sub-string function (from the comma token handler). So the reference check should still be done in the new find code routine.

The reference flag can't be added to the sub-string table entries since not all instances of the sub-string functions are assignments. The token being checked (internal function, operator, assignment code or print code) is passed to the find code routine. A simple method would be to set the reference flag in this token argument if its operand needs to be a reference. The find code would check for a reference if the token's reference flag is set.

Therefore, in the set assign command, the reference flag of the assignment token will be set. Also, when internal functions are first pushed to the hold stack, a check is made to see if the mode is currently one of the assignment modes and only a sub-string function is permitted. At this time, the reference flag of the sub-string function can be set.

There will be no problem leaving the reference flag set in the internal function or operator token since nothing downstream will be checking for a reference flag on these types of tokens.
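As a rough Python illustration of carrying the reference requirement on the token rather than in the table entry: the `Token` class, its `reference` field and the error text are all hypothetical, and the real find code routine does much more than this single check.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    # For an operand: True if it can be assigned to.
    # For a code token: True if its operand must be a reference.
    reference: bool = False


def find_code(code_token, operand):
    """Check one operand against the reference flag set on the code token."""
    if code_token.reference and not operand.reference:
        raise SyntaxError(f"expected reference at {operand.text}")
    return code_token
```

Here the caller decides whether the flag is set: a sub-string function pushed while in an assignment mode gets `reference=True`, while the same function inside an expression does not, so the table entry itself stays unchanged.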

Saturday, July 24, 2010

Translator – Find Code (New Design)

The current find code routine pops all of the operands for an operator or internal function from the done stack, checks the data type of each changing the code to an associated code as needed, and inserts hidden conversion codes as needed. The find code routine is currently called from these locations:

add operator – to get the appropriate code for unary and binary operators
add print code – to get the data type specific print code
set assign command – to get the data type specific assignment code
close parentheses handler – to check the arguments for internal functions

The Translator will be modified to find the appropriate binary operator code at the first operand, and find the appropriate binary operator code again at the second operand. Also for internal functions, each argument will be checked as it is processed, in other words, in the comma handler.

The new find code will only check one operand where the index of the operand to check will be passed as an argument. The callers of find code will be modified:

operator handler – call for first operand of a binary operator
add operator – call for second operand of a binary (or operand of a unary) operator
comma handler – call for current argument of an internal function
close parentheses handler – call for current argument of an internal function
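The idea of checking a single operand by index can be sketched in Python. The operand table, the code names `Mid3` and `Sqr`, and the function signature are hypothetical; `CvtInt` and `CvtDbl` are the hidden conversion codes mentioned elsewhere in these posts.

```python
# Hypothetical expected operand data types for two internal function codes.
OPERANDS = {
    "Mid3": ["Str", "Int", "Int"],
    "Sqr": ["Dbl"],
}

# Hidden conversion codes, keyed by (actual type, expected type).
CONVERT = {("Int", "Dbl"): "CvtDbl", ("Dbl", "Int"): "CvtInt"}


def find_code(code, index, actual):
    """Check operand number index of code; return a hidden conversion or None."""
    expected = OPERANDS[code][index]
    if actual == expected:
        return None
    conversion = CONVERT.get((actual, expected))
    if conversion is None:
        kind = "string" if expected == "Str" else "numeric"
        raise TypeError(f"expected {kind} expression")
    return conversion
```

Each caller passes only the operand it has just received, for example `find_code("Mid3", 1, "Dbl")` from the comma handler after the second argument.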

Tuesday, July 20, 2010

Translator – Operators (New Codes)

Table entries for the new codes for the Double/Integer (I2) and Integer/Double (I1) were added for the operators. It is not necessary to have all the operand combinations for the logical operators (AND, OR, etc.) that produce an integer result for integer operands. These should be used with integers, but double operands will be allowed and the Translator will continue to insert hidden CvtInt codes for these.

Likewise for the integer division operator (\) that's meant to take double operands and produce an integer result. There's no reason to have separate versions for integer operands as the regular division operator can be used for these. So hidden CvtDbl codes will be inserted for integer operands.

Sunday, July 18, 2010

Translator – Operators (New Design)

The new design for operators will be that each data type combination will have its own code. There will be associated codes for the first operand and associated codes for the second operand. For efficiency, there will be just one associated codes array with the second operand associated codes after the first operand associated codes. In addition to the number of associated codes value, there will be a new secondary associated codes index that will point to the first of the second operand associated codes, which can also be used to determine the end of the first operand associated codes.

Again using plus as an example, the main codes along with associated codes are listed below. The convention will still be that the default operator has double arguments. The “I1” and “I2” codes represent the first or second operand being an integer, and the “Int” code is for two integer operands. Similarly for strings, the “T1” and “T2” codes represent the first or second operand being a temporary string, and the “TT” code is for two temporary string operands.

Add (4, 3) AddI1, CatStr, CatStrT1, AddI2
AddI1 (1, 0) AddInt
CatStr (1, 0) CatStrT2
CatStrT1 (1, 0) CatStrTT

The second operand associated codes are the ones from the start index onward (AddI2, AddInt, CatStrT2 and CatStrTT). The first number in parentheses is the number of associated codes and the second number is the start index of the second operand associated codes. The decision for this design was arrived at by looking at the assembly language generated for approximately what the run-time will look like. Information about the assembly language research is in the continuation of this post.
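To make the table above concrete, here is a small Python sketch of a two-step lookup over it. The dictionary layout and the function names are assumptions; the codes, the associated code lists and the start indexes are taken from the table, and the operand data types follow the naming convention described above.

```python
# code: (first operand type, second operand type,
#        associated codes, start index of second operand associated codes)
TABLE = {
    "Add":      ("Dbl", "Dbl", ["AddI1", "CatStr", "CatStrT1", "AddI2"], 3),
    "AddI1":    ("Int", "Dbl", ["AddInt"], 0),
    "AddI2":    ("Dbl", "Int", [], 0),
    "AddInt":   ("Int", "Int", [], 0),
    "CatStr":   ("Str", "Str", ["CatStrT2"], 0),
    "CatStrT1": ("Tmp", "Str", ["CatStrTT"], 0),
    "CatStrT2": ("Str", "Tmp", [], 0),
    "CatStrTT": ("Tmp", "Tmp", [], 0),
}


def find_code(code, operand, data_type):
    """Return the code matching data_type for operand 0 (first) or 1 (second)."""
    first, second, assoc, start = TABLE[code]
    if (first, second)[operand] == data_type:
        return code
    # First operand codes come before the start index, second operand codes after.
    candidates = assoc[:start] if operand == 0 else assoc[start:]
    for other in candidates:
        if TABLE[other][operand] == data_type:
            return other
    raise TypeError(f"no code for operand {operand + 1} of type {data_type}")


def plus(first_type, second_type):
    """Resolve the plus operator one operand at a time."""
    return find_code(find_code("Add", 0, first_type), 1, second_type)
```

For example, `plus("Int", "Int")` first selects AddI1 from the first operand and then AddInt from the second, without any conversion code being needed.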


Friday, July 16, 2010

Translator – Operators (Change Needed)

A problem has surfaced as the new table entries were being added for temporary strings. The problem is unrelated to strings, but is related to integers. The plan was to make use of associated codes as each operand is processed. Using the plus operator as an example, there would have been these codes (with their return and operand data types):

Add (Dbl) Dbl Dbl (also handles Int Dbl and Dbl Int with CvtDbl)
AddInt (Int) Int Int
CatStr (Tmp) Str Str
CatStrT1 (Tmp) Tmp Str
CatStrT2 (Tmp) Str Tmp
CatStrTT (Tmp) Tmp Tmp

The Add code would have the associated codes of AddInt, CatStr and CatStrT1. In other words, one code for each of the four data types of the first operand. For handling the second operand for strings, the CatStr code would have an associated code of CatStrT2, and the CatStrT1 code would have an associated code of CatStrTT. The main code would be selected if the second operand was a string (i.e. could not be determined to be a temporary string), and the associated code would be selected if the second operand was a temporary string.

With this scheme, a problem occurs when the first operand is an integer, where the AddInt code would be selected. But what happens when the second operand is a double? The code needs to revert to the Add code since integers need to be promoted to doubles (doubles do not demote to integers).

This could be handled by making the Add code an associated code of the AddInt code. However, a CvtDbl needs to be added for the first operand, and the first operand is no longer being saved (except for strings that are not temporary), which means that integer first operands would need to be saved. Anyway, this all started to get rather complicated. I have some ideas on how to resolve this, but some test code needs to be tried to see which idea is better...

Tuesday, July 13, 2010

Translator – String Processing (Change)

The planned changes for operators and internal functions, where their operands (arguments) will no longer be kept on the done stack, conflict with the way strings are handled.

If an operator or internal function has string operands, all the operands are attached to the operator or internal function. This is necessary for string operands because the Translator has insufficient information to determine if generic string tokens (with and without a parentheses) are temporary or not. Temporary strings need different codes at run-time because the temporary strings are deleted when no longer needed. The Encoder will determine if these tokens refer to variables and arrays (not temporary) or functions (temporary).

As mentioned, all operands are attached, which is unnecessary because only the string operands are actually needed. Further, the Translator can determine if some strings are temporary, like the result of the concatenate operator and most of the internal functions that return a string. The sub-string functions only return a temporary string if their argument is a temporary string.

To resolve this conflict between saving the string operands and the new operand processing, if a string operand (or argument) cannot be determined to be a temporary string, it will be left on the done stack. When the operator or internal function is appended to the output, its specific code for the data type has already been determined. If the code has one or more string data type operands (not temporary string operands), the string operands will be counted to determine how many strings were left on the done stack. These strings will be popped and attached to the operator or internal function token.
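The counting step can be sketched in Python as below. The operand table and the function name `attach_strings` are hypothetical, and the done stack is simplified to a plain list of token names; the operand types for the concatenate codes follow the list given in the July 16 post.

```python
# Operand data types per code ("Str" = possibly not temporary, "Tmp" = temporary).
OPERANDS = {
    "CatStr": ["Str", "Str"],
    "CatStrT1": ["Tmp", "Str"],
    "CatStrT2": ["Str", "Tmp"],
    "CatStrTT": ["Tmp", "Tmp"],
}


def attach_strings(code, done_stack):
    """Pop the string operands left on the done stack and return them in order."""
    count = OPERANDS[code].count("Str")
    keep = len(done_stack) - count
    attached = done_stack[keep:]  # empty when the code has no "Str" operands
    del done_stack[keep:]
    return attached
```

Only operands typed "Str" were left on the stack, so a code such as CatStrTT pops nothing and CatStrT1 pops exactly one token.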

Monday, July 12, 2010

Translator – Operand Processing (Change)

Previously, operands were processed (checking types and inserting conversion codes) after all of the operands were received. The operands are pushed to the done stack. Operators are temporarily pushed to the hold stack and are removed before a lower precedence operator is pushed to the hold stack (the method of rearranging operators according to their precedence).

Similarly for functions and arrays that are not processed until the matching closing parentheses token is received, at which time all the arguments or sub-scripts are on the done stack. No type checking can be performed for non-internal functions and arrays, so their operands are saved with (attached to) the array or function token.

A change has already been implemented for multiple list assignments where each assignment item (variable or sub-string) is checked as received, at either comma or equal tokens. Something similar can also be performed for internal functions, in fact, some logic has already been added that determines if a comma or closing parentheses token is valid after each argument. Logic can be added to also check the data type of the argument, adding conversion operators when necessary.

It has also been determined that for a binary operator, its data type specific code is needed on the hold stack, based on the first operand of the operator (for possible error reporting). If necessary, a conversion code will be added at the same time. And it is now not necessary to keep the first operand on the done stack, so it can be popped. When the operator is emptied from the hold stack, only the second operand needs to be checked.

It is possible that the operator code may need to be changed from the code found based on the first operand. Currently there is only one operator that has more than one code with the same first operand data type but different data types for the second operand (Power and PowerMul, discussed here). The multiple table flag can be used to identify these types of operators (this is the same flag used to identify internal functions with different codes for the number of arguments). When set, the Translator will look for the correct code instead of inserting a conversion code.

Sunday, July 11, 2010

Translator – Expressions Type (Implementation)

This whole expression type subject is getting rather involved and I am not convinced that all situations will be handled correctly. Consider one more example:

Z$ = A$ + B$ < C$ + D$ = E < F

The expression is a perfectly valid integer expression; however, it is the wrong type for a string assignment. Previously, the error would have been “expected string expression” pointing at the = token (confusing). With the first operand implementation, the error will be pointing to the A$ token (good), in other words, the beginning of the numeric expression. Conceivably a better error might be “expected string operator or end-of-statement” pointing at the < token. But going with the “what is entered is what was intended” approach, the first operand implementation may be good enough.

The best course of action is to proceed with the first operand implementation and see what the results are.

One last action needs to be performed when operator tokens are received. In order to produce the correct error when a statement prematurely ends after an operator (when another operand is expected), the data type of a binary operator's second operand needs to be known. Therefore, before pushing an operator to the hold stack, the specific code of the operator will be determined based on the first operand, which will be on top of the done stack, and this code will be pushed to the hold stack. If an error occurs, the top of the hold stack can be checked to see what type of expression is expected to follow. (The find code routine with respect to operators, may need to be rewritten as a result of this change.)
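A minimal Python sketch of that check, assuming the hold stack holds specific operator codes: the table of second operand types, the code name `LtStr` and the function name are hypothetical, while the two error messages are the ones used in these posts.

```python
# Hypothetical second operand data type for a few specific operator codes.
SECOND_OPERAND = {
    "Add": "Dbl",
    "AddI1": "Dbl",
    "CatStr": "Str",
    "LtStr": "Str",  # hypothetical code for < with string operands
}


def premature_end_error(hold_stack):
    """Pick the error for a statement that ends right after an operator."""
    code = hold_stack[-1]  # the operator still waiting for its second operand
    if SECOND_OPERAND[code] in ("Str", "Tmp"):
        return "expected string expression"
    return "expected numeric expression"
```

Because the specific code was chosen from the first operand before the operator was pushed, the top of the hold stack is enough to tell which kind of expression was expected next.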

Translator – Prematurely Ended Expressions

A solution has been defined for the issue of identifying the correct location of a data type error for properly completed expressions, that is, expressions that end with an operand (when an operator is expected next). But there is still an issue when an expression ends prematurely – when an operand is expected after an operator. Consider these examples:

1. Z$=A$+B$+
2. Z%=A$+B$<
3. Z$=A$+B$<
4. Z%=A$+B$+

Each of these could be either an “expected numeric expression” or “expected string expression” error pointing to the end of the expression (the token or EOL after the last operator). But could better errors be output? For examples 1 and 2, an “expected string expression” error pointing to the end of the statement is appropriate because a string expression would make these statements correct.

The situation is a little murky with example 3, but it appears that the final expression should be a string, whereas the < operator takes string operands to produce an integer. A string expression is expected after the < token, but this would make the final expression the wrong type. It would appear that the correct error should be “expected string operator or end-of-statement” pointing to the < token.

The situation is very murky with example 4. The expression could be correctly finished with say a “C$<D$” so an “expected string expression” could be appropriate. However, the final + could also be replaced with a “<C$” so an “expected numeric operator” could be appropriate also (but without the words “or end-of-statement” since that would make the wrong final expression type). This may be unclear since not any numeric operator would work (it must be an operator that takes string expressions and produces a numeric value, specifically the relational and equality operators – the error message could be reworded to indicate this).

It is best to assume that what has been entered up to the error is what was intended. So for example 4, the error should be “expected string expression” pointing to the end-of-line. However, for example 3, even with a string expression, the final expression would be the wrong type. While there are operators that take string operands and produce numeric results, there are currently no planned operators that take numeric operands and produce string results. This assumption could be used to define the proper error, but these types of operators could be added later. There is one such operator in FreeBASIC - the & concatenate operator can take any type argument including numeric operands and produce a string result. So it will be assumed that these types of operators could exist.

Saturday, July 10, 2010

Translator – Expression Types and Unary Operators

Unary operators only have one operand, so do not have a first operand. In fact, any errors reported at the unary operator sub-expression should point to the unary operator, not its operand (unless the operand has an error). Therefore, the first operand is only attached to a binary operator pushed to the done stack. The first operand token will be set to the default value of NULL for a unary operator.

Translator – Expression Types and Parentheses

Normally during the translation process, parentheses tokens are removed. However, for reporting errors with the data type of an expression, if an expression starts with an opening parentheses and is the wrong data type, then the error needs to point to the opening parentheses, not to a token inside the parentheses.

When a closing parentheses is received and processed, after emptying the hold stack of all operators, the opening parentheses will be popped from the hold stack. Previously, both the open and closing parentheses tokens were deleted. If the parentheses were unnecessary, the parentheses sub-code was set in the last operator appended to the output. The last operator will also be on top of the done stack with its first operand. Consider this invalid statement:

Z$ = A$ + B$ + (C + D * E)

When the closing parentheses is processed, there will be a + token on top of the done stack (the * is processed first, then the + inside the parentheses) and its first operand will be set to the C token. The “expected string expression” should point to the open parentheses. Therefore, the closing parentheses needs to change the first operand token to the open parentheses token. This also means that the open parentheses token can't be deleted.

Since the open parentheses token can't be deleted, it must be marked as a temporary token (using a Temporary sub-code). When an operator token is popped from the done stack (either by another operator or at the end of the expression when a command is processed), if it contains a temporary first operand token, the token must be deleted.

When an error occurs, or an expression is prematurely ended (also an error), the data type of the first operand will be checked to determine which error will be reported (based on what type of expression is expected). Open parentheses tokens do not have a data type. Therefore, when an operator's first operand is set to an open parentheses token, it must inherit the data type of the first operand that it is replacing.

Parentheses may also be nested, so another open parentheses token may replace a first operand that is already set to an open parentheses. No extra checking is necessary since the previous open parentheses will have a data type – the new open parentheses token will inherit the same data type.
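The replacement and inheritance rules above can be sketched in Python. The `Tok` class and the function `close_paren` are hypothetical, and the sketch only covers the case described here, where an operator with a first operand is on top of the done stack.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    text: str
    dtype: Optional[str] = None    # open parentheses start without a data type
    first: Optional["Tok"] = None  # first operand token (operators only)
    temporary: bool = False        # stands in for the Temporary sub-code


def close_paren(done_stack, open_paren):
    """Make the open parentheses the first operand of the top operator."""
    top = done_stack[-1]
    if top.first is not None:
        open_paren.dtype = top.first.dtype  # inherit from the replaced operand
        open_paren.temporary = True         # must be deleted when popped later
        top.first = open_paren
```

Nested parentheses need no special case: the outer open parentheses simply inherits the data type that the inner one inherited earlier.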

Friday, July 9, 2010

Translator – Expression Type (Procedure)

To show how keeping the first operand will aid in reporting errors at the correct token, consider these examples again:

Z% = A$ + B$ + C$
Z% = A$ + B$ > C$

The first statement is processed in this sequence:

The Z% token is appended to the output and pushed to the done stack.
Since the mode is Command, the = token is interpreted as an assignment, so an AssignInt command is pushed to the command stack, the Z% token is popped from the done stack and the mode is set to EqualAssigment.
The A$ token is appended to the output and pushed to the done stack.
The first + is pushed to the hold stack, and being an operator, the mode is changed to Expression (further equal tokens will be interpreted as an equality operator).
The B$ is appended to the output next and pushed to the done stack.
When the second + is received, it empties the first + from the hold stack (being the same or higher precedence).
The first + pops the A$ and B$ from the done stack, and a +$ is appended to the output. The first operand, A$, does not contain a first operand (it's not an operator), so the +$ is pushed to the done stack with A$ as its first operand.
The second + is pushed to the hold stack.
The C$ is appended to the output.
The end of statement empties the second + from the hold stack.
The second + pops the +$(A$) and the C$ from the done stack, and a +$ is appended to the output. The first operand, the +$(A$), has a first operand, A$, so the second +$ is pushed to the done stack with the A$ as its first operand.
The assign command handler will be called since there is an assignment on the command stack, which will pop the value being assigned, the second +$, and will see that it is the wrong data type (an integer is expected), so an “expected numeric expression” is reported. But instead of pointing to the second +$, its first operand, the A$ token, is returned.

For the second example, the >$ would be on top of the done stack when the assign command handler is called. Its data type is an integer, which is correct, so no error occurs. However, sub-expressions in parentheses need additional handling...
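The two walk-throughs above can be reproduced with a small Python sketch. Everything in it is simplified and hypothetical (the `Tok` class, the function names, and the typing rule that only knows + and > on strings); it only shows how the first operand is inherited and then used as the error location.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    text: str
    dtype: str
    first: Optional["Tok"] = None  # first operand token, set for operators


def push_operator(done_stack, op_text):
    """Replace the two operands on the done stack with the operator token."""
    second = done_stack.pop()
    first = done_stack.pop()
    # Simplified rule: + on strings gives a string, > on strings gives an integer.
    dtype = "Str" if op_text == "+" else "Int"
    # Inherit the first operand's own first operand if it has one.
    origin = first.first if first.first is not None else first
    done_stack.append(Tok(op_text + "$", dtype, origin))


def check_assign(done_stack, expected):
    """Return the text of the token to report, or None if the type is correct."""
    value = done_stack.pop()
    if value.dtype == expected:
        return None
    return (value.first if value.first is not None else value).text


def translate(first_op, second_op):
    """Process Z% = A$ <first_op> B$ <second_op> C$ and return the error token."""
    done = [Tok("A$", "Str"), Tok("B$", "Str")]
    push_operator(done, first_op)
    done.append(Tok("C$", "Str"))
    push_operator(done, second_op)
    return check_assign(done, "Int")
```

Calling `translate("+", "+")` returns `"A$"`, the token the error should point at, while `translate("+", ">")` returns `None` because the final expression is an integer.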

Thursday, July 8, 2010

Translator – Expression Type (New Design)

The new design for the rest of the data type error detection consists of remembering the token of the first operand of each sub-expression within an expression, so that an error can be reported against this token when a data type error is detected. Consider this invalid assignment statement:

Z% = A$ + B$ + C$

Currently this reports an “expected numeric value” at the second + operator. It should report “expected numeric expression” at the A$ token. However, the detection can't occur at the A$ token before the entire expression is processed. Consider this valid assignment statement:

Z% = A$ + B$ > C$

The expression becomes an integer at the > operator, therefore an “expected numeric expression” can't be reported at the A$ token.

Each binary operator token appended to the output is also pushed onto the done stack, replacing its operands. The token of the first operand of the operator will be attached to the operator when it is pushed to the done stack. If the first operand is another operator, then this operator's first operand is attached (in other words, the operator will inherit the first operand's first operand if there is one).

Wednesday, July 7, 2010

Translator – Assignments (Development)

The last several days were spent implementing the new design for handling assignments. The assignment operators are no longer handled by the operator routines, but by the comma and equal token handlers and the assign command handler (which previously didn't do anything). Two support functions were also implemented, one to put the appropriate assignment command on the command stack based on the first (perhaps only) item being assigned, and the other to check each assignment item for the correct data type (allowing for mixed string and sub-strings).

At the end of the statement, the value (expression) being assigned is checked for the correct type in the assign command handler, adding a hidden conversion as needed for the numeric data types. Because the assignment operators are no longer handled as binary operators, the table entries for the assignment operators were modified where each only has one operand (for the value being assigned).

Also, tokens with parentheses being assigned can be assumed to be arrays since a function with arguments cannot be assigned (only the function name alone, without parentheses, can be assigned). Therefore, unlike tokens with parentheses in expressions that can be either an array or a function call, the values in parentheses of an array being assigned can be assumed to be subscripts, which must be integers (or doubles with conversion). If it turns out that the name is not an array when encoded, the Encoder will report the error.

Several other data type error reporting issues were also corrected, but without adding any special expression type handling as was planned in the failed design concept. A new concept was developed, which probably won't require the Translator to keep track of the expression type while translating. More details to follow...

Saturday, July 3, 2010

Translator – Expression Type (Failed Design)

The first idea on how to implement expression type checking into the Translator failed. It's not necessary to go into the details of the design, but basically the idea was to set the expression type at the start of the expression. For assignments, the expression type would be the same as the variable(s) being assigned. For PRINT statements, any expression type would be allowed, but would be set based on the first operand. For INPUT PROMPT, the expression type would be string. However, certain expressions would trip this up. First consider this statement:

Z = A$ + B$ + C$

Currently this would report an “expected numeric expression” at the second plus. It should report the error at the A$. The idea was at the equal, the expression type would be set to numeric, then upon seeing A$, which is a string, an error would be reported. However, consider this valid statement:

Z = A$ + B$ < C$

The Translator can't set the expression type to numeric and then report the error at the A$ because the expression eventually becomes numeric at the less than.

The bottom line is that the Translator can't determine if the expression is the correct type until the entire expression is translated, but it still needs to report at the first instance where the data type error occurs, not at the last operator processed. The goal now is to get the new assignment implementation working (needed first anyway), and then get the expression type handling implemented...

Friday, July 2, 2010

Translator – Assignments (New Implementation)

The new implementation of assignment statements consists of removing the existing list assignment processing code from the add operator routine and adding processing to the equal token handler, comma token handler and the assign command handler (which currently does nothing).

Comma: When the first comma token is received, the data type of the assignment is set to the first assignment item and an assign list code appropriate for the item's data type is pushed to the command stack instead of pushing the main assign list code to the hold stack. If there is a LET command already on the command stack, it will be replaced with the assign list code.

For each additional comma token, the data type of the assignment item will be checked to make sure it matches the current data type. If the data type is a string or sub-string and the new item's data type is also a string or sub-string, but not the same, the assign list code will be changed to the AssignListMixStr code. When an equal token is received in the comma assignment, the mode is set to expression after the last list assignment item is checked and the expression type is changed to Numeric or String.

Equal: When an equal token is received first, the data type of the assignment is set to the assignment item and an assign code appropriate for the item's data type is pushed to the command stack instead of the hold stack. If there is a LET command already on the command stack, it will be replaced with the assign code. The expression type is changed to Numeric or String in case the expression starts after the equal.

If a second equal token is received, the assign token on top of the command stack is changed to the appropriate assign list code. For each additional equal token, the data type of the assignment item will be checked to make sure it matches the current data type. The same check for strings made for each additional comma token is also made for each additional equal.
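The per-item data type check made at each additional comma or equal token can be sketched in Python. The type names, the function names and the error text are hypothetical; "MixStr" stands for the case where the code is changed to AssignListMixStr.

```python
STRING_TYPES = {"Str", "SubStr"}


def next_list_type(current, item):
    """Return the assignment data type after checking one more item."""
    if item == current:
        return current
    # Strings and sub-strings may be mixed; once mixed, the list stays mixed.
    if item in STRING_TYPES and current in STRING_TYPES | {"MixStr"}:
        return "MixStr"
    raise TypeError(f"item of type {item} does not match {current} assignment")


def list_type(items):
    """Fold the check over all items; the first item sets the data type."""
    current = items[0]
    for item in items[1:]:
        current = next_list_type(current, item)
    return current
```

For example, `list_type(["Str", "SubStr", "Str"])` gives `"MixStr"`, while a list that mixes integer and double items is rejected at the second item.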

I realized that it is not necessary to save string variables in assignment and list assignment statements – it is already known these operands will not be temporary strings. Only a string value needs to be saved for string assignments as it may be a temporary string. Also, the Translator can identify some strings as temporaries (the result of the concatenate operator and all but the sub-string internal functions that return a string). Modifying the Translator for these temporary strings will be done after the INPUT command is implemented.

Thursday, July 1, 2010

Translator – Assignments (Change)

The way assignment and list assignment statements are processed needs to be changed to work with the expression type handling. Currently the assignment tokens are processed as operators, pushed to the hold stack when received and processed when emptied by a lower or same precedence token (an EOL token). The processing is performed by the find code routine, activated by seeing the reference table flag set, with the additional list assignment processing in the add operator routine after find code returns.

A better method is to process assignments and list assignments as the statement is processed. This means that as each comma and equal token is received, the data types of items being assigned are checked immediately. The first item received (a token with no parentheses, a token with parentheses or a sub-string function) determines the type of the assignment. For each additional item received in the assignment list, the data type of the item must match exactly, except for the string data type where strings and sub-strings may be mixed in the same list assignment statement.

Once the expression starts, the expression type is set to either Numeric (for the double and integer data type since both can accept either for an assignment value) or String. At the end of the statement, the data type of the expression will be checked, a hidden conversion operator will be added if necessary, followed by the assign or assign list token.

This new method is simpler than the current method, which contained involved code for detecting errors in the list assignment – the first error in the statement needed to be reported. Complicating matters was the fact that the assignment list items are processed backwards because the items are popped from the done stack in the reverse order from how they were received.
