Interactive BASIC Compiler Project: May 2010

Monday, May 31, 2010

Translator – PRINT Command (Translating)

Currently, command tokens are pushed onto the command stack, so will a PRINT command token when it is received. The PRINT's next token mode value will be set to Expression. The Semicolon, Comma and End-of-statement will trigger the insertion of the value specific print code.

The existing find code function can be used to determine which specific print value code to append to the output. The PrintDbl code will serve as the main code, which will have PrintInt, PrintStr and PrintTmpStr as associated codes, each with one operand of the appropriate data type. A new token will be created with the PrintDbl code and passed to the find code function, which will return the correct code for the expression. The token is then be appended to the output.

At the end of the statement, the PRINT command token will be on top of the command stack. The end of statement processing needs to pop this token off from the stack and perform whatever handling is necessary. There needs to be a command handler that performs a command's specific needs. For reasons that will be clear later, the token handler function cannot be used for this purpose, so a new command handler function pointer is required in the table entries.

The PRINT command handler needs to determine when the PRINT token should be appended to the output (to go to a new line), or not (stay on the same line). If the done stack is not empty, then there is an unprocessed expression that needs a specific print value code. This means there was no semicolon or comma at the end of the statement, so the PRINT token can be appended. However, an empty done stack does not indicate no PRINT token is needed, consider the PRINT command by itself.

Therefore, a flag is needed, to be set when an print-only internal function, semicolon or comma is processed. If the done stack was empty, the command handler will check if this flag is set, to determine not to append the PRINT command token (in which case the PRINT token is deleted). This flag will be added to the command stack's item structure item, which is cleared when the command is initially pushed onto the stack.

Translator – PRINT Command (Translation)

The translated statement PRINT will consist of a number of different printing codes, which include PrintDbl, PrintInt, PrintStr, PrintTmpStr, Comma, Semicolon, and Print. The print only Spc and Tab internal function codes will also print. Before explaining further, here are some examples with their translations:

PRINT Print
PRINT A;B% A PrintDbl B PrintInt Print
PRINT A$+B$,C$; A$ B$ + PrintTmpStr Comma C$ PrintStr
PRINT TAB(A+1);"Hello" A 1 + Tab "Hello" PrintStr Print
PRINT "Start";SPC(10) "Start" PrintStr 10 Spc
PRINT ,,A Comma Comma A PrintDbl Print
PRINT ; Semicolon

Note that when the cursor is to go to a new line, there is a Print code at the end of the translation, otherwise it is absent. The Print code at run-time preforms the new line action, so in it's absence, the cursor remains on the same line. The PrintDbl, PrintInt, PrintStr and PrintTmpStr will output the value that is on top of the evaluation stack, with PrintTmpStr deleting the temporary string when done.

At run-time, the Comma code will advance the cursor to the next column by outputting spaces. The Tab and Spc codes perform there action by outputting spaces. The Semicolon code does not perform any action during run-time, it is there only so that it can't reproduced during recreation of the source. Next, how the PRINT statement will be translated...

Translator – PRINT Command

It is necessary to look at how PRINT statements will be executed to determine how they need to be encoded and thus translated. But first, here is the general syntax of the PRINT statement:

PRINT [<expression>][;|, [<expression>]...]

Zero or more expressions may be included. The expressions can result with a double, integer or string value. Numbers are output with a trailing space. The expression may be one of print only functions SPC and TAB, both taking an integer expression. After the expressions are output, the PRINT will move to a new line unless the line ends with a semicolon, comma, SPC or TAB. (More options will be added later like PRINT USING and PRINT to a file.)

The SPC function outputs the number of spaces of the integer expression (less than 1 outputs no spaces). The TAB function advances the cursor to the column of the integer expression, where the first column is defined as column 1 (less than 1 is taken as column 1). If the cursor is already beyond the column specified, then PRINT goes to the next line, and then to the specified column. Using TAB(1) in the middle of a PRINT statement has the effect of doing a new line (an '\n' in C).

A semicolon is used to separate expressions, but does not affect what is output. The syntax above implies multiple semicolons may be entered. This is allowed, but will cause dummy semicolon tokens to be inserted into the code. A comma advances to the next column. A column will initially be defined as one-fifth of the screen width (at 80 columns, this is 5 columns of 16 characters). Eventually there will be a way for this width to be defined to any value.

Sunday, May 30, 2010

Translator – LET Command (Release)

The LET command is implemented and working. A new set of translator test inputs were added for commands, with LET statements for each possible assignment (each data type, non-list and list) with the LET keyword were included. Three error test inputs were added also, one with a two LET keywords, one with a LET keyword in the middle, and a PRINT (to test the “not yet implemented” error).

The plan was to release a whole series of development releases (0.1.12-dev-X) starting with the LET command, but some many changes were made to the code (including the parser correction, parser test updates, token handlers, etc.) from the 0.1.11 release, that this release will be an official Translator development release (0.1.12). This will probably continue for each command added to the Translator. Sub-developmental releases may still be used for the more complex commands.

This release also contains regression test scripts that were used during the latest round of changes. The scripts saved a lot of command line typing due to all the reorganization of the code (token handlers, token status and token mode changes). A Windows batch file equivalent is also included, but does not work near as nice because of the limitations of the Window compare utility. Both scripts delete any current output files, runs the tests and then compares to the files in the test directory.

The Translator now has initial support for commands with just the LET/Assignment statement supported and ibcp_0.1.12-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Next, implementing the PRINT command...

Translator – LET Command (Implementation)

The SimpleStack class is used for the new command stack, which will hold a CmdItem structure consisting of a token pointer and a code. The code will initially be set from the token's index, but may be changed to other associated codes as commands are processed by the Translator. Since the hold and done stacks are also simple stacks, these were changed from the List class to the SimpleStack class. This only required minor changes to push and pop calls for these stacks.

As previously hinted, the Translator status enumeration needed to be moved outside the Translator class before the TableEntry structure. The enumeration was renamed TokenStatus as that seemed appropriate for the return value of the Translator's add token function. Each of the enumeration values were also renamed except for the BUG statuses, which were left alone.

A previously mentioned, the mode must be changed from Command to Assignment upon receiving the LET command token. Some commands will need to change the mode from Command to Expression (PRINT, IF, WHILE, etc.). Some commands will need to change the mode from Command to a new End-of-Statement mode, since nothing is expected after command keyword (END IF, DO, LOOP, etc.).

Having an every growing switch statement of the command code is not efficient. Therefore, a next token mode value was added to the TableEntry structure. When a command token is received, if the mode is currently Command, then the mode will be changed to the command's table entry next token mode value. The command will be pushed onto the command stack along with it's current code. No further action needs to be taken until the rest of the statement is processed.

Since there will be mode values in the TableEntry structure, the Translator mode enumeration was also moved before TableEntry and renamed TokenMode (again an appropriate name). The enumeration values were also renamed to reflect this. For now, if the next token mode is not set for a command (a new Null token mode value), then a “not yet implemented” bug error occurs. If the current mode is not Command, then a new “unexpected command” message occurs.

For the LET command, the next token mode value is Assignment. When an assignment operator is added to the output list, if there is a LET command on the command stack, it is popped and the assignment operator's token's LET sub-code flag is set. For now, since the command stack should be empty, the end-of-line processing only needs to check if the command stack is empty. Later it will need to process commands on the stack.

Saturday, May 29, 2010

Parser – Bug / Test Updates

During the testing of the LET command implementation, some problems were discovered. The first problem was in the Parser's get number token function. The simple single 0 digit caused an “invalid leading zero in number constant” error. This error check was to prevent a constant like 01 from being accepted, but it did allow a leading zero when followed by a decimal point like in 0.1. This fix was to check if the next character is a digit, and if it's not, then it terminates looking for more characters, and then otherwise cause the error.

Due to this error, it was a good idea to re-run all of the parser tests since it has been a while since these were checked. Low and behold, they all failed or miscompared. The print token function used by these tests hadn't been updated for the changes to the code and data type enumerations, so these were updated. Once corrected, there were still miscompares, but these were due to either the decimal code values changing (because of all the new codes that have been added) or because the data type of many operators and internal functions were changed from None to their appropriate data type. Additional 0 constant test inputs were added to the Parser number test inputs.

The regression test scripts were updated to also include the parser tests. The Windows batch file uses the comp command, which has a nice feature that a wild card can be used for both files names and it is able to compare all sets of files with one command (like all 8 translator test output files). However, it also has an irritating feature where after it's done comparing files, it asks if there are more files to compare. No has to be entered to continue. There is not option to prevent this. Since there are two compares (one for the parser files, one for the translator files), no has to be entered after the comparing the parser file before the translator files are compared.

One last problem was discovered that affects both the print token and print small token functions. Numeric constants were not being output as intended. The code was outputting an integer using the C %d format specifier and doubles using the %g specifier. The problem is that if the constant 1.0 (a double) was entered, it would be output as 1 making it impossible to know if it was an integer or a double. The raw strings entered for the numbers were suppose to be output - the reason these strings were saved in the first place, to preserve the original string for later output by the Recreator.

Translator – Token Sub-Codes (Implementation)

The sub-code flag was implemented, which consisted of adding the sub-code memory to the Token class, adding the sub-code flag value definitions and modifying the print token test routine to output the flags.

The setting of the parentheses sub-code flag was handled in the do pending parentheses function, which checks if unnecessary parentheses were entered and appended a dummy parentheses token. The code was changed to set the parentheses sub-code flag of the last token appended to the output. Two issues were discovered.

The first issue found was if two sets of unnecessary parentheses are entered, for example, A=((B)), then the parentheses sub-code flag can only be set once. Upon reproducing the original source, the Recreator will not know that there were two sets of unnecessary parentheses. So for this case, a dummy parentheses token will still be appending for each additional set of unnecessary parentheses entered.

Curiously, with this change, if three (or and odd number of) unnecessary sets of parentheses are entered, for the third (fifth, etc.) set, the parentheses sub-code flag gets set in the second (fourth, etc.) set's dummy parentheses token. Something the Recreator will need to handle.

The second issue found was if the last token appended was a hidden conversion code, the conversion code's token parentheses sub-code flag gets set. It is anticipated that this would cause a problem for the Recreator. The Recreator should be able to safely ignore the conversion codes, but if it needs to look for sub-code flags, these can't be ignored. To avoid this, the code was modified that if token's table entry has the new Hidden flag set, then the item previous to the conversion code has its parentheses sub-code set.

The setting of the comma sub-code flag was handled in the new equal token handler function. When an equal token is received, if the mode was a multiple comma assignment, then the comma sub-code flag is set when the token's code is set to the assign list operator. The comma sub-code flag will not be set of the mode was a multiple equal assignment.

Translator – Token Handlers (Implementation)

A TokenHandler typedef was needed to define the pointer to token handler function. I was not able to define the function pointer directly on the variables. This always proves difficult with more complex types, especially involving pointers. Fortunately, using typedef simplifies the issue. Here is the definition that was inserted before the TableEntry class:

class Translator; // forward reference to Translator class
typedef TokenStatus (*TokenHandler)(Translator &p, Token *&token);

The TokenHandler type definition is then used for defining an token handler function pointers. Each of the token handlers were created using the existing code in the switch statement, where the code was modified to add the Translator pointer in front of all the Translator variables and the Translator scope (Translator::) was added to the Translator enumeration values.

The code that handles operators (which was after the switch statement for the special codes) was also put into a handler function. This greatly simplified the code at the end of the add token function. For processing the token, the temporary token handler function pointer is set to the code's table entry value. If the value is not set (is NULL), then the temporary pointer is set to the default operand token handler function. The function is called using the temporary pointer and it's status return value is immediately returned.

The program was compiled several times during the making of the changes. I got tired of running of the test cases (to output is redirected to a file in the base directory) and comparing to the official test output files (in the test directory), so I wrote a MSYS (bash) script to do it automatically and check all the test cases. An equivalent Windows batch file was also written, but does not work near as nice. Both will be included in the next release.

Friday, May 28, 2010

Translator – Token Handlers

The Translator's add token function is becoming quite large and at the current rate will become much larger as the different commands are implemented. It's never a good idea to have giant functions. Therefore, this routine needs to be broken up into separate functions. The code that contains the special token code handling (for equal, comma, closing parentheses, and end-of-line) will be separate functions. These functions will be called token handlers. Most of the commands will also have token handlers.

Having a switch statement on the token code where each case calls a token handler function is not the most efficient implementation. The concern is not execution time, but the amount of code lines required, in other words, a large switch statement. A better implementation is to store a function pointer in each table entry where a code requires a token handler. The Translator will check if there is a function pointer and then call the function to process the token.

And here is where it gets complicated. Pointers to member functions are not allowed. So the token handlers can't be Translator member function. But they need access to all of the Translator data members. The solution was to make them C++ friend functions of the Translator class with an reference argument to the Translator instance (a Translator function will pass *this).

The token handlers functions will also require a reference to the current token pointer (a reference so that it can be changed for an error) and will return the status of the operation. The Status enumeration needs to be moved out of the Translator because in order to define the return value for the token handler function pointer, it needs to be defined before TableEntry (and Table). But Table needs to be defined before TableEntry. Catch 22.

A forward reference for the Translator class can be added before TableEntry (needed for the Translator reference argument, which is simply class followed by the class name and a semicolon), but not for enumerations inside the Translator. Therefore the status enumeration will be moved out of Translator and renamed to TokenStatus (appropriate for a token handler, and the word Token is shorter than Translator).

Thursday, May 27, 2010

Translator – LET Command

There is nothing special to say about the LET command – the format of assignment statements, less the optional LET keyword, has already been defined and implemented. When the Translator receives the LET command token, it will be pushed on the new command stack. The mode will be Command when the LET is received and, as the Translator is currently implemented, needs to left set to Command mode for the assignment statement to be processed as currently implemented.

When the assignment token is added to the output list, it will first check if there is a LET command token on top of the command stack, and if there is, then the LET sub-code flag will be set in the assignment token, the LET command will be popped from the command stack, and the LET token will be deleted.

There is a problem. The mode must be set to Command for the LET token to be accepted. The mode must be set to Command for the assignment statement. Some detection is necessary to prevent double LET keywords. To solve this problem, a new mode is needed, an Assignment mode. For most of the Translator, both Command and Assignment mode will be equivalent except for the processing of command tokens.

Wednesday, May 26, 2010

Translator – Command Processing

When processing tokens in a statement, it is necessary to know what command is currently being processed. For example, the SPC and TAB functions are only valid in the PRINT statement. If the command is pushed onto the hold stack, there will be other tokens on top of the command token. Therefore the current command can't be determined by looking at the top of the hold stack. This means another variable would be needed to store the current command.

Commands can also be nested within a line (for example, an assignment or PRINT inside an IF-THEN-ELSE statement), so a stack is still the ideal solution. The command tokens need to be pushed onto a separate stack from the hold stack, a command stack. The precedence of a command token will still be used to empty the hold stack, but the command token will be pushed onto the command stack instead. The current command can be accessed by looking at the top of the command stack.

Before getting started with designing and implementing the commands, it's important to mention that there is no attempt being made to follow any particular standard or existing BASIC for this project. Being a heavy GW-Basic user in the 80's and early 90's, a lot of the inspiration for this project comes from that experience.

Multiple statements will be supported – as way to fit more code on the screen. In GW-Basic, the is a general lack of any rigid multiple line control structures. There are limited ones like FOR-NEXT, WHILE-WEND, DEF-ENDDEF, but these are easily abused (mainly due to the interpreted nature of GW-Basic). Factor in the GOTO, ON-GOTO GOSUB and ON-GOSUB commands (not being supported here), and GW-Basic code could look very ugly.

For this project, there will be new modern control structures including some structures that don't exist in other BASIC or other languages. It is time to get started on the commands, which will be defined before implemented. One command has already been implemented, the assignment statement, less the optional LET keyword, so it's now time to implement this optional keyword...

Tuesday, May 25, 2010

Translator – Commands and Precedence

In the translated RPN list, the command token will generally end up at the end of the statement. To accomplish this, the command tokens need to be saved until the rest of the statement is translated. The best place to save the command tokens is on the hold stack with a very low precedence to keep them on the stack until the end of statement.

The appropriate precedence of commands would appear to be the same as the EOL token, so that when the EOL token is processed, the command tokens will be emptied from the stack. This implies that commands could empty other commands, but generally, this will not occur except for multiple keyword commands (like the IF-THEN-ELSE and FOR-TO-STEP commands).

The EOL token currently has the same precedence as assignment, closing parentheses, and comma tokens. It makes sense for assignment and commands to be the same precedence because assignment is technically a command. An assignment will never empty a command from the hold stack because the Translator never receives an assignment operator. An equals token is received as an equality operator, which has a higher precedence, so it won't empty any lower precedence commands. It's only after the stack emptying that an equals token may be changed to an assignment token.

Closing parentheses and comma tokens, being the same precedence, will empty all tokens with higher or same precedence, which is a problem, since command tokens will be the same precedence. This is currently not an issue because:

An EOL token is not pushed onto the hold stack and therefore can't be emptied by a closing parentheses or comma.
An assignment operator is not on the hold stack when a multiple assignment comma is processed, so the comma won't empty an assignment token.
An assignment token is on the hold stack when a comma is processed within an array or function – the lower precedence array or function token will be on the stack before the assignment token, so it won't empty the assignment token.
An assignment token is on the hold stack when a comma is processed not within an array or function (though possibly after an opening parentheses) – this is an unexpected comma and an error occurs.
A closing parentheses may empty an assignment token from the stack, but since no opening parentheses, array or function token is found on the stack, an error occurs.

This means the precedence of closing parentheses and comma tokens need to be increased above assignment and command tokens so that they do not empty them from the hold stack, but below any other operator. Closing parentheses and comma tokens are never pushed onto the hold stack, so there is no worry about them being emptied from the hold stack by an EOL or command token (an assignment token never empties the hold stack).

Monday, May 24, 2010

Translator – Token Sub-Code

The internal code will contain sub-codes that will be used by the Recreator, thus eliminating the need to put dummy codes into the program. The Translator knows when information will be needed by the Recreator in reproducing the original source. A sub-code member will be added to the token class so that the Translator can pass this information to the Encoder. The Encoder will set the sub-codes in the internal code words.

Instead of adding the dummy parentheses token to the output list, the Translator will now set the parentheses sub-code in the last token appended to the output list. The closing parentheses token used for the dummy token is no longer needed, and will be deleted. The token output routine in the test code will output a ')' at the end of a token when the parentheses sub-code is present.

Similarly, if the LET keyword is present in front of an assignment, then the LET sub-code will be set in the assignment operator token. The token output routine in the test code will output a 'LET' at the end of a token when the LET sub-code is present. Before knowing exactly how this will be accomplished, it is necessary to defined how commands will be processed...

Sunday, May 23, 2010

Internal Code Format

The encoded BASIC program will be stored in memory in a very compact form. It will not be any kind of linked list of tokens like come out of the Translator. A linked list would have way too much overhead during execution and would waste a lot of memory. Instead, the program will be very simple format where each code will take up a single 16-bit word.

A 16-bit word has plenty of space to support all the codes, which currently stands at 137 (there will be lot more, but far less than the 65,536 numbers possible in a 16‑bit word). This handles the actual codes (operators, internal functions, and commands), but what about entered identifiers like variables, arrays, define and user functions, constants, and so on?

Entered identifiers will have an index value into a table, known as the Dictionary, which will contain the information about each identifier (like name, data type, number of array dimensions, size of array dimensions, number of arguments, argument data types, constant values, etc.). The index to the Dictionary entry will be stored in the program. During run-time, much of the information in the Dictionary will not be used.

These indexes will be preceded by a code. For example, one of these codes will push a double variable value onto the evaluation stack. The routine for this code will know to get the next 16-bit word that will contain the Dictionary index. Using a 16-bit word for the index implies a maximum of 65,536 Dictionary entries. This should be sufficient since subroutines and functions will each have their own Dictionary, with each having a maximum 65,536 Dictionary entries.

The 16-bit code word has more bits than are needed for the code value. The extra space can be user for other things. One use would be for information like the presence of unnecessary parentheses or the optional LET keyword. This information will be called a sub-code and ignored by run-time module, but would be used by the Recreator. The detail of the internal program doesn't need to defined right now, just the knowledge that there will be a sub-code present in the internal code. Next, what this means for the Translator...

Translator – Commands (Introduction)

Now that expressions have been fully implemented in the Translator (at least until more language features are added), it's time to start implementing the BASIC commands. In fact, one command has already been implemented, the assignment statement without the optional LET keyword.

It was mentioned some time ago that there was a way to eliminate the need for the dummy close parentheses codes. These dummy codes are put into the translated output where unnecessary parentheses were entered into an expression, so that the Recreator would know to reproduce them. This will prevent confusion from having entered parentheses just disappear. During run-time, these dummy codes would be skipped.

It turns the same method to get rid of these dummy parentheses codes will also be used for the LET command, first command to be implemented. There are already 8 assignment codes, and there will be 4 more once temporary strings are fully implemented in the Encoder. All of these assignment codes are for without the optional LET. Either another dummy code is needed for the LET or each of the assignment operator needs to duplicated when the optional LET is included – that's 12 more codes.

Neither of these two alternatives is desirable. There is a third alternative that will help eliminate the need for these dummy codes, but some look ahead planning is required into the design of the format for how the BASIC program code will be stored internally. In fact, as each BASIC command is implemented, so forward planning is necessary into how the command will be stored in memory and executed during run-time. Next a preliminary design of the internal code...

Saturday, May 22, 2010

Translator – Multiple String Assignments (Release)

Upon contemplating the string flags in the table entries, I realized that care must be taken to make sure the string flag is set in the table entries when an operator or function code has a string operand. A better design is to set the string flag automatically during table initialization if any of the operands of the string data type. Therefore, the string flags were removed from the table entries and the Table constructor was modified set the flags in the table entries automatically.

To correct the issue of saving all the operands for string list assignments, a temporary simple stack is used to save the operands as they are popped off of the done stack and processed. Since the last two operands (the value being assigned and the last item in list) have already been popped and processed by the find code routine, these are pushed to the simple stack before processing the rest of the operands. Each additional operand processed is also pushed to this stack.

After the list operands are processed for a string list assignment a new array needs is allocated and filled from the temporary simple stack. In order to determine the size of this array, either the operands need to be counted as they are processed, or the number of items in the stack needs to be accessed. Since the simple stack already knows how many items it has, a new access function was added to the SimpleStack class to return the number of items.

It had been decided previously that mixed strings and sub-strings would be allowed in a multiple list assignment statement. Having a mix string list assignment replaces the need for a separate sub-string list assignment code, the mix-string list assignment will handle this case. So, a new AssignListMixStr associated code was added to AssignList (this new code has a sub-string as the first operand, the value being assigned). The list assignment handling code was modified to detect if the list contains both reference strings and sub-strings. If it does, then the token is changed to this new AssignListMixStr code.

Several new sub-string and mix-string assignments were added to test inputs. This completes string handling in the Translator. The code now handles expressions and assignment statements and ibcp_0.1.11-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Next the real meat of the Translator begins, translating actual BASIC commands...

Friday, May 21, 2010

Translator – Sub-String Assignments (Release)

After testing the Table initialization code that checks maximum number of operands and maximum number of associated codes, several sub-string assignments statements were added and tested. Then some sub-string assignments with error were added.

One of the types of errors being tested were the assignment of temporary strings like MID$(A$+B$,2)=C$. There error was correctly detected, but the error was pointing to the MID$, which could be confusing. Therefore, a change was made that for an assignment, if the expected reference flag is not set, then if the token without the reference flag is a sub-string function (its data type is sub-string), then the error token is set to the token if the first operand of the sub-string function. So for this statement, the error will be pointing to the + of the A$+B$.

During testing of the sub-string assignments, one of tests tried was the mixing of regular strings and sub-string, for example A$,LEFT$(B$,1),C$=D$. This did not work correctly (this type of statement will be allowed). Also noticed that for lists, the assignment operator output list item only had two operands (the last variable being assigned and the operand being assigned). It should contain all of the operands being assigned. So these are two more issues that need to corrected.

In any case, sub-string assignments are working so another developmental release is being made and ibcp_0.1.11-dev-3-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now to fix assignment list operands and mixing of string types...

Thursday, May 20, 2010

Translator – Sub-String Assignments (Testing)

The changes to support sub-string assignment were implemented, which included adding the SubStr data type entries to the conversion code array in the match code routine; changing the data type to SubStr for the LEFT$, MID$ and RIGHT$ table entries; and added the sub-string reference checking. Initial testing started with the existing test inputs – no problems were discovered.

Upon trying a sub-string assignment, discovered that it didn't work because there was no AssignSubStr associated code, so this code was added along with its table entry. Now, many of the existing test inputs were failing. Next discovered that the maximum associated codes needed to be changed from 2 to 3 because Assign now had three associated codes. The wrong value was causing the find code routine to malfunction.

I thought it would be best to calculate both the maximum operands and maximum associated codes automatically during the Table initialization, so moving these constants to members of the Table class seemed to be the best solution. Unfortunately, these values need to be constants because they are used to define the sizes of several arrays.

It would still be prudent to have the the Table initialization at least check to make sure these constants agree with what was in the table entries. So code was added to the Table initialization to find the maximum operands and associated codes as it was scanning the entries for code checking. Two new table error types were added for these errors, which are reported by exceptions from the Table constructor. Testing continues...

Wednesday, May 19, 2010

Translator – Sub-String Assignments

There will be several associated codes for assigning strings from a reference string or temporary string to a reference string or to a sub-string. The Encoder will determine which code to use once the string type for each operand is determined. For the Encoder to know if a sub-string is present, a new SubStr data type is needed, which only the LEFT$, MID$ and RIGHT$ functions return.

The Translator will check to make sure that sub-string assignments are valid by making sure the string argument is a reference string, at least if it is a Paren or NoParen token type. If it turns out to be a user function, then the Encoder will report the error. The Translator will also make sure that a compound sub-string assignment like LEFT$(RIGHT$(B$,3),2)="AB" is not entered.

The find code routine needs to handle sub-strings where the operands are popped off of the done stack and the reference flag is checked. For the first operand (last operand popped), if the data type of the token is SubStr (indicating a sub-string function) and the operand is a String, then the token's reference flag is set to operand's reference flag (the reference flag is transferred). The operand's reference flag is then cleared as it is currently implemented. This assumes that the first argument of the sub-string function is the string being assigned.

If the operand's data type is anything but String (SubStr or TmpStr), then the reference flag is not transferred. For the invalid compound sub-string assignment, since the data type of the operand will be SubStr, the reference flag of the function's token will not be set. When the reference flag is checked for an assignment operator, the reference flag will only be set if the string operand was reference, otherwise if won't be set and an error will be returned.

Tuesday, May 18, 2010

Sub-String Assignments

Many BASICs support sub-string assignments, the syntax that will be supported is the same used as in GW-Basic, QuickBASIC, FreeBASIC, etc.. Here is an example sub-string assignment along with it's translation:

MID$(A$,5,2) = "AB" A$ 5 2 MID3$ "AB" AssignSubStr

Notice the new code AssignSubStr and the lack of <ref> on A$. When the MID$ is processed by the Translator, the reference flags of its operands are cleared, hence no A$<ref>. However, the MID3$ needs to have its token reference flag set when the assign operator checks for a reference, since that MID3$ token will be on the done stack. Consider how this statement is processed at run-time.

At the A$, a reference string is pushed on the evaluation stack, that is, its pointer and length are copied to the stack. When the MID3$ is processed, it will pop the 2 and 5 off of the stack. The pointer on top of the stack will be changed to point to the fifth character in A$ and the length is set to 2 (assuming that A$ is at least 6 characters long).

When the AssignSubStr is processed, the "AB" string constant will be popped off of the stack. Two characters of this value are copied directly to the pointer that is on top of the stack (which is pointing to the fifth character in A$). A different AssignSubStr vs. AssignStr code is required because no allocation occurs, only a copy to an existing character array. An AssignSubStrTmp code is also required for the case that the value being assigned is a temporary string, which needs to be deleted after the characters are copied.

Since LEFT$ and RIGHT$ are also sub-string functions, there is no reason these functions can't also be used to assign sub-strings – and as it turns out, no extra code is required. Next, how sub-string assignments will be handled by the Translator...

Monday, May 17, 2010

Translator – Sub-Strings

The processing of operators and internal functions with string operands needs to be delayed until encoding because in some cases, the Translator does not know if a string is a reference (from a variable or array element) or a temporary (from a user string function) because the Translator does not know if identifier is a variable, array element or a user function.

At run-time, if the string argument to a sub-string function is a reference string, then it's result is also a reference string in that the string does not need to be deleted. If the string argument is a temporary string, then it's result is also a temporary string in that the string needs to be deleted.

Each sub-string function will have an associated code. The main code will have a reference string operand and will return a reference string. The associated code will have a temporary string operand and will return a temporary string.

It would appear that sub-strings have no impact on the Translator. However, there is one more capability that needs to be considered, and that is sub-string assignments where a portion of a string is assigned without affecting the rest of the string...

Translator – Saved Operand Correction (Release)

The problem where the saved operand pointers were not pointing to a conversion code that was inserted (was pointer to the original operand), was a simply correction. All that was needed was to set the operand array element to the output of the output list append call where the conversion code token was inserted. The ibcp_0.1.11-dev-2-src.zip file, another developmental release, has been uploaded at Sourceforge IBCP Project along with the binary for the program.

Sunday, May 16, 2010

Sub-Strings – Details

The results of sub-string functions (LEFT$, MID$ and RIGHT$) will use the same character array as the argument string. Exactly how this is accomplished depends on whether the argument string is a reference string or a temporary string.

For a reference string, the pointer and length of the character array are copied to the evaluation stack. This means that the pointer and length on the stack can be modified without affecting the actual pointer and length of the variable or array element. For LEFT$, only the length on the stack needs to be changes. For MID$ and RIGHT$, both the pointer and length need to changed depending on the integer arguments.

For a temporary string, the sub-string operation is a little more involved because the pointer to the character array on the stack cannot be modified – it is needed to delete the character array when the temporary string is no longer needed. The length is not needed to delete the character array, so it can be modified. Since the pointer cannot be modified, the portion of the sub-string needs to be moved to the beginning of the character array. Again for LEFT$, only the length needs to be changed, the sub-string is already at the beginning of the array. But for MID$ and RIGHT$, the sub-string result needs to be moved to the beginning of the array.

For reference strings, there is no allocation, copying or deletion required for the sub-string operation. For temporary strings, there is still a moving of characters, but there is no allocation of a new character array or deletion of the old character array. After the expression is evaluated, the temporary string will be deleted. There could be some unused space in the character array from a sub-string operation when the temporary string is assigned and used as it. Enough of discussing how sub-strings will work at run-time, next, what impact sub-strings have on the Translator...

Sub-Strings

Because strings are variable length, they are dynamically allocated as needed during run-time. It is therefore beneficial to reduce as much as possible the amount of allocating, copying and deleting of the character arrays. This is why reference strings, the values of string variables and array elements, are used as-is so a new character array does not need to be allocated and copied to in order to put the reference string on the evaluation stack. But this will require extra code to know when to delete a temporary string and not to delete a reference string on the stack, which will be accomplished with additional associated codes.

There is another way to reduce some allocation and deleting of temporary strings for the sub-string internal functions, aka LEFT$, MID$, and RIGHT$. These functions can have a reference string or a temporary string as an argument. The obvious way to implement these functions is to create a new temporary character array of the appropriate size for the resulting string, copying the characters from the argument string to the new array and if the argument string is a temporary string, to delete it.

There is a simpler way to handle sub-strings that will eliminate the allocation of a new character array, deleting the temporary string argument if present and the copying for a reference string. Consider how a string is stored, there is a character array, there is a pointer to the character array and there is the length of the character array. The pointer and length make up the members of the String class along with the allocated array. A resulting sub-string will never be larger than the string argument. so why not use the same character array, since it has already been allocated. Next, details on how this will work with reference strings and temporary strings...

Saturday, May 15, 2010

Translator – Temporary Strings (Release)

Before testing the changes, the test code was modified to output the operands that were saved by the Translator within square brackets separated by commas. This change was made to see if operands were being saved correctly for the correct tokens (array, user functions, operators with string operands and internal functions with string operands). Only the primary operand token is output, so it the operand is an operator, that is what is output.

The code appears to be working, at least for all the current test inputs with one exception (no new test inputs were added at this time). If a conversion operator was inserted for an operand, the operand is not pointing to the conversion operator. For example, for the statement Z$=MID$(A$,B+C,D), the MID$ token is output as MID3$([A$,+,C], but should be output as MID3$([A$,CvtInt,CvtInt] since conversion operators were inserted for the + and D operands.

Other than this problem the code appears to be working. Since there is more work is needed to complete string data type handling in the Translator and not much was changed (temporary strings, which had a minor affect on the Translator; and operands are saved for later processing as needed for the Encoder), the code is being released only as a developmental release and ibcp_0.1.11-dev-1-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program.

The rest of the string data type handling has to do with the sub-string functions (aka LEFT$, MID$ and RIGHT$), which will handle strings slightly differently during run-time...

Translator – Temporary Strings (Implementation)

Upon making the changes to add temporary strings, I realized that there are more string operators than just the CatStr operator – there are the equality, relational, assign, and assign list operators, all of which can have string operands (either reference or temporary). For the equality and relational operators, the character arrays of any temporary strings need to be deleted after the comparison is made.

For the assign operator, if the operand is a reference string, then a new character array needs to be allocated and the reference string copied. However, for a temporary string, the string variable's character array can be set to the temporary character array (making it a reference string) and the current character array is deleted, eliminating the need to allocate and copy. For assigning a list, the temporary string can only be used for one of the string variables in the list, the rest need new arrays allocated.

There is already a String flag used for immediate commands, and since it's value doesn't conflict with any of the existing flags (immediate command flags are separate), it can also be used for operators and internal function codes that have string operands. A note was added to the code to make sure it's value is not used for another flag value.

A constructor was created for the new RpnItem structure that takes a token pointer, number of operands and a pointer to an operand array as arguments (the later two default to 0 and NULL). If the number of operands and operand array pointer is supplied, an operand array will be allocated for a non-zero number of operands and the operand pointers will be copied from the supplied array. A destructor was also created to delete the token and the operand array.

Translator – Temporary Strings

The impact of temporary strings on the Translator is not much since the work of finding associated codes will be left for the Encoder. However, there are a few changes that are required – the main one is passing the operands to the Encoder so that it has easy access to them.

A new TmpStr data type will be added. The data type of the string operator (CatStr) and the internal functions that return a string will be changed to TmpStr. This includes the internal functions CHR$, REPEAT$, SPACE$ and STR$. Something else is needed for the functions LEFT$, MID$ and RIGHT$, which will be revealed shortly. Any string DefFunP and DefFunN token types will also be changed to TmpStr since defined (one-line) string functions will return a temporary string during run-time.

The conversion code table used by the match code function needs to be expanded to include the new TmpStr data type. As far as the Translator is concerned, the String and TmpStr data types are the same, therefore the conversion code table entries for String to TmpStr and TmpStr to String will be set to the Null code.

For the string operator and internal string functions that have string operands, the operands need to be attached to the operator/function token within the output list. To accomplish this, the output list will be changed from a token pointer to a pointer to a new RpnItem structure that will contain the token pointer, the number of operands (0 if not applicable) and a pointer to an array of output list item (RpnItem) pointers.

There will be a new String flag added for these codes so that the Translator can easily identify them. When this flag is set, the number of operands will be set and the output list pointer array will be allocated. This array will be filled from the operands popped from the done stack. The done stack will also be changed to a stack of RpnItem structure pointers.

This saving of operands functionality is also needed for parentheses tokens, which can be arrays or user functions. The processing of array subscripts and user function arguments also needs to be delayed until the Encoder since the Translator doesn't know the difference between an array (whose subscripts all need to be integers) or a user function (whose argument types will be contained in the dictionary and defined in a function definition statement).

Friday, May 14, 2010

Temporary Strings – Design

During run-time the string operator and internal string functions need to delete the character array of a temporary string operand, but not of a reference string. Figuring out the design for temporary strings has proven difficult. The ultimate goal is fast execution at run-time. Two solutions were considered.

The first solution (easy to implement), would be to add a temporary flag to the String class. After an operation was finished with an operand, it would be deleted. If the temporary flag is set, then the character array would be deleted, otherwise, no action is taken. This was not desirable because the run-time would need to check for the temporary string flag, adding to execution time.

The other solution (more involved to implement) would not require checks for temporary strings during run-time. This solution is desirable, and will require separate codes for whether the operands are reference strings or a temporary strings.

For instance, the CatStr code would handle the case where both operands are reference strings. There would be three associated codes to the CatStr main code, one for when the first operand is a temporary string and second operand is a reference string (CatStrT1), one for when the first operand is a reference string and the second is a temporary (CatStrT2), and one for when both operands are temporary strings (CatStrTT). These associated codes would delete the temporary operand(s) after the concatenation.

Unfortunately, the presence of temporary strings is not necessary known during translation. The reason being that identifiers could be variables, array elements, or user functions. Variables and array elements produce reference strings. However, user functions like internal functions produce temporary strings except within a user function where the function name will be used as a variable holding the value that will be returned by the function.

Remember that the Translator does not determine what the identifier is – this job is left to the Encoder, which has access to the Dictionary. The Dictionary will contain all the identifiers and what they are. Next, what changes are required in the Translator for temporary strings...

Wednesday, May 12, 2010

Reference Vs. Temporary Strings

During run-time as expressions are executed, the result from each operator or function is pushed back onto the evaluation stack. This result is considered a temporary value, which will be used by the next operation (operator, function or command). The evaluation stack holds these temporary values. However, the actual character arrays for strings, including temporary string values, will not be contained in the stack elements, only a pointer (and length) of the character array. Consider this string expression and it's translation:

A$+B$+C$ A$ B$ +$ C$ +$

As this expression is executed, the value of A$ is pushed onto the stack. For double and integer variables, values are simply copied to the stack. However, for strings, only it's character array pointer and length will be copied to the top item on the stack. This prevents having to allocate a new array and to copy the actual string value of A$ to the array. This item on the stack is a reference string. Similarly, the value B$ is pushed onto of the stack.

At the first +$ operator, a string needs to be created to hold the concatenation of A$ and B$. First the A$ and B$ operands are popped from the stack. The size of the result character array is allocated for the total length of A$ and B$. The contents of A$ and B$ are copied to this character array. The resulting string's length and character array pointer are then pushed on to the stack. This item on the stack is a temporary string.

The value of C$ is pushed on to the stack. At the second +$ operator, another temporary string needs to be created to hold the concatenation of A$+B$ (now a temporary string on the stack) and C$ (a reference string). Both operands are popped off of the stack and a temporary string is created as before. However, for this operator, after the result has been pushed on to the stack, the character array of the temporary string holding A$+B$ needs to be deleted. Next, what implications this has on the Translator...

Monday, May 10, 2010

String Data Type

It now time in the Translator development that consideration needs to be given to how the program will be executed at run-time, which will determine what the Translator needs to do - and this applies to strings. Keeping in mind that the goal is to do as much work and decision making during translation so that it does not have to be done at run-time, which will aid in fast execution.

During execution, operands are pushed onto the evaluation stack, popped off by operators and functions, evaluated and the result pushed back onto the stack. Each evaluation stack element can hold a double or an integer. Strings are a little more involved. Strings are variable length and can be quite large, therefore a stack element cannot hold a string (or the stack would be very large).

C++, the underlying language, does not have a string type. Strings are handled as character arrays. These arrays need to be allocated for the appropriate size as needed and deallocated when no longer needed. A stack element will be a union of the different types that can be pushed onto the stack (double, integer, reference, etc.). For a string, the item in the union will be the previously implemented String class, which contains the length and pointer to the string's character array. Next, the difference between a reference strings and temporary strings...

Sunday, May 9, 2010

Translator – Data Types and Assignments (Release)

Testing started by using the previous six translator test input sets. First discovered that a check was needed for a NULL expression information structure pointer in the data type and unary code access functions, specifically for the Null and EOL codes. Also discovered that any unary operators needed their own expression information structures since all that standard ones set the unary code to the Null code. Many other minor bugs discovered with these test inputs were corrected.

There were no differences with the first four test input sets, since these contain only expressions with no assignment operators. The fifth and sixth test input sets had differences because either the correct data type assignment operator is now being added or a hidden conversion operator is now being inserted. Even though care was taken to make sure all the assignments in the test inputs were valid, one statement was trying to assign an integer to a string, so this expression was corrected.

A seventh test input set was added. One of the errors tested pointed to a token that may not be appropriate. The statement “A=B$+5” produces a data type error as expected, but the error is “expected string” pointing to the 5. This is not wrong, the plus is processed first and with the first operand a string, the Translator expects a string for the second operand, hence the error. But this may be confusing to the user, the error should probably be “expected double” pointing to the B$. This will be left as is for now to be revisited later, since there is no easy fix for this.

The code now handles data types for assignment operators and ibcp_0.1.10-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Next the string handling needs to be expanded...

Translator – Data Types and Assignments (Implementation)

The only issue that occurred as support for data types were being added to assignment operators was developing code to return the proper error and the proper token for assignment lists. The issue was that the error should point to the first token in the line that has in error, which is complicated by the fact the the items in the list are processed in reverse order as that is how the tokens are pulled from the done stack.

This section proved to be difficult to implement, so the gory details won't be included here. An added complication was that a not a reference error should be reported on a token that is both not a reference and is not the correct data type. Basically the code needs to keep track of the last operand processed. A reference error is reported on the current operand being processed, but a data type error is reported on the last operand processed, but only if it is a reference (otherwise a reference error has already been set for the token).

Now that these changes have been implemented and compile successfully, debugging and testing can begin...

Saturday, May 8, 2010

Translator – Expression Information (Revisited)

Sometimes programing is an evolutionary process. The final solution is not arrived at initially, so an intermediate solution is made. Once made, another better solution is made. And so on until arriving at a final solution. And this seems to be the case with the expression information, which has taken a while to arrive at a good solution. The changes described in the last post were completed, but they didn't look very good (very cluttered) – maybe there was better way, perhaps using new operator as originally planned.

It was also observed that a lot of the operators had the same operands, for example one double, two doubles, two integers, etc. There was no reason to have an operand array with two double data types for every operator that requires this – in other words, the same array could be used for all the operators (or internal functions) that take two double operands. There only needs to be one data type operand array for each unique operand set – so there will be one DblDbl_OperandArray that will be used for all operators or internal functions that take two double operands, an so on for the other possible operand sets.

All the possible unique operand arrays will be defined using the convention Dbl, Int, and Str for each operand data type (e.g. StrInt for two operands, a string and an integer operand). All the associated codes arrays will defined next, where each code has it's own array. There will be two macros (defines) that will produce the arguments for the expression information constructor in the table entries, which will look like this:

new ExprInfo(Double_DataType, Null_Code, Operands(DblDbl), AssocCodes(Add))

In this example, the Operands macro will produce two arguments, a 2 for the size of the operand array and a pointer to DblDbl_OperandArray. The size of the array will be calculated within the macro. The AssocCodes macro will be similarly defined. The constructor will have default argument values so if there are no associated codes, the AssocCodes() argument would not be included and the number of associate codes would be set to zero and the array pointer set to a NULL. Similarly if there are no operands.

As the table entries were being updated, it was observed that many of the entries had identical expression information constructor calls – for example, the math functions generally return a double, have no unary code, have one double operand and no associated codes. Again there is no reason to have a separate expression information structure for each one. So there will be comman expression information structures defined for these in the form of Dbl_Dbl_ExprInfo (returns double and has one double operand) and Int_IntInt_ExprInfo (returns double and has two integer operands), an so on. The address of these will be put into the table entries.

The table entries for the main codes are unique because each has it's own associated codes array, so the new ExprInfo syntax listed above will be used for these. This may still not be the best solution, but it's good enough for now, and I need to stop fiddling with this and get data type handling for assignments implemented...

Monday, May 3, 2010

Translator – Table Entry Expression Information

There will be an expression information structure for holding the expression related data for operators and internal functions (data type, unary code, number of operands, operand data types, number of associated codes and associated codes). The plan was to use the new operator along with the constructor of the expression information structure within the table entries like this:

{value, value, new ExprInfo(value, value, …), value, …},

The constructor would have a variable number of argument depending on the requirements of the operator or internal function. However, in practice this did not work. While variable arguments can be used in constructors, the actual arguments in the “...” can not be enumerations. So instead, the expression information structure along with the operand data type and associated code arrays will be defined outside of the table entries like this:

DataType Op_OperandDataType[2] = {Double_DataType, Double_DataType};
Code Op_AssocCode[1] = {OpInt_Code};
ExprInfo Op_ExprInfo(Double_DataType, Null_Code, 2, Op_OperandDataType,
1, Op_AssocCode);

If there are no operands (no argument internal functions only) or associated codes, then there will be no need to define these arrays. The constructor will be defined with default arguments that will set the pointers to the arrays to NULL. For the expression information above, the table entry will look like this:

{value, value, &Op_ExprInfo, value, …},

The table access functions will be modified to get the data using the expression information structure pointer of the table entry instead of direct members of the table entry. Because of the nature of C++, none of the code that uses the access functions need to be modified, with one caveat. The match code function assumed there were two associated codes and stopped its loop when a Null code was found. This function will be changed to get the number of associated codes (a new access function) and use that value for the loop. Fortunately this is a very minor change.

Now there is a lot of editing to modify the table entries for the new expression information structures and then on to implementing data handling for the assignment operators...

Sunday, May 2, 2010

Translator – Table Entry Restructuring

In looking ahead in the design of the assignment operators, I realized the size of the associated code array will need to increase from 2 to 5. This is due (spoiler alert) because there will be three types of strings internally – more on this when the design of strings begins shortly. Having an associated code array size of 5 for only two operators is a waste. There's needs to be a little more efficient storage system that doesn't penalize all the table entries for only two entries. Similarly for the operand data type array (now set at a size of 3).

In thinking about this further, only the operators and internal functions require the data type, unary code, number of operands, operand data types, associated codes and perhaps even the precedence. Currently the precedence for commands is obtained from the table entry, but no actual values have been entered yet. The design and implementation of commands is further done the line, so for now, precedence will remain implemented as is.

Anyway, for operators and internal functions, the information mentioned can be put into an expression information structure, for which a pointer to this structure will be put into the table entry. The expression information structure (class) will have a constructor to initialize the members from the constructors arguments and the new operator will be used to allocate the structure. Likewise for the operand data type and associated code arrays – the arrays will be allocated for only the size needed.