Interactive BASIC Compiler Project: September 2013

Thursday, September 26, 2013

REM Command Correction

The operands of program instructions will hold indexes to dictionary entries that will contain the text of the instruction (for example, variable names, the original strings of constants, the strings of remarks, etc.). The remark dictionary will be implemented first since it will just hold the strings of the comments. The other dictionaries will require looking up strings to find if a variable or constant already exists.

But first, a problem was discovered with how REM statements are parsed. The REM command should be recognized regardless of what characters follow the command. The parser for the most part did this already, except that a space was required after the command. The parser should not require the space. Consider these examples:

REMARK this should be a valid commented
REM any number of spaces should be allowed

The first statement was rejected because it was assumed to be an assignment of the REMARK variable, so the parser expected an equals sign instead of the word "this". The second statement was valid, but all the spaces were removed from the comment string.

The parser get identifier routine was modified to first look for a statement starting with the three characters R-E-M and store all the characters after this before scanning for a word (valid identifier characters up to an invalid identifier character). Some minor code simplification was also done in replacing the sequence of setting the token code, type and data type with a call to the existing set token table routine that performed these steps. Two additional statements were added to translator test #15 (remark tests) similar to the two examples above.
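The prefix check described above might look roughly like the following. This is only a sketch: the parseRemark name, its signature, and the case-insensitive comparison are assumptions for illustration, not the project's parser code, and the caller is assumed to have skipped any leading white space already.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not from the project). If the statement starts with
// R-E-M (any letter case, an assumption), store every character after those
// three letters unchanged, spaces included, and return true. Otherwise
// return false so the caller can go on to scan for an ordinary identifier.
bool parseRemark(const std::string &statement, std::string &comment)
{
    static const char keyword[] = "REM";
    if (statement.size() < 3) {
        return false;
    }
    for (std::size_t i = 0; i < 3; ++i) {
        unsigned char ch = static_cast<unsigned char>(statement[i]);
        if (std::toupper(ch) != keyword[i]) {
            return false;
        }
    }
    comment = statement.substr(3);
    return true;
}
```

With this check done first, REMARK is no longer mistaken for a variable name, and the comment text keeps whatever spacing was typed.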

[commit c52d65a479]

Tuesday, September 24, 2013

Encoding Program Lines (Begin)

The initial version of the encode routine was implemented that takes the token list generated by the translator and converts it to a vector of program words, the contents of which will be inserted into the program. The size for the program word vector is obtained from the token list, which was calculated at the end of a successful translation. For now, the operand words are set to zero for instructions with an operand. Dictionary look ups will be required to generate the indexes for the operands.

The ProgramWord class was implemented containing a single unsigned short (16-bit) word variable with instruction and operand access functions. This class does not determine whether the word is an instruction or an operand, which is up to the user of this class. There are access functions for getting the code from an instruction word, checking if an instruction word has a sub-code set, setting an instruction to a code and sub-code, getting the operand word, and setting the operand word.
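A rough sketch of such a class is shown here. The member names are assumptions, and the bit layout (code in the lower 10 bits, sub-codes in the upper 6 bits) is taken from the "Condensing Sub-Code Bits" entry of September 22; the real class may differ.

```cpp
#include <cassert>
#include <cstdint>

// Layout from the September 22 entry: 10 bits of code, 6 bits of sub-codes.
constexpr std::uint16_t CodeMask = 0x03FF;
constexpr std::uint16_t SubCodeMask = 0xFC00;

// Sketch only (names are assumptions). The class holds one 16-bit word and
// does not know whether it is an instruction or an operand; the caller
// picks the matching access functions.
class ProgramWord
{
public:
    // instruction word access
    int instructionCode() const { return m_word & CodeMask; }
    bool instructionHasSubCode(std::uint16_t subCode) const
    {
        return (m_word & subCode) != 0;
    }
    void setInstruction(int code, std::uint16_t subCode)
    {
        assert(code >= 0 && code <= CodeMask);
        m_word = static_cast<std::uint16_t>(code | (subCode & SubCodeMask));
    }

    // operand word access
    std::uint16_t operand() const { return m_word; }
    void setOperand(std::uint16_t operand) { m_word = operand; }

private:
    std::uint16_t m_word = 0;
};
```

Keeping the instruction and operand views in one small class means a program line can be a plain vector of these words, with the interpretation decided by position.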

The ProgramLine class was implemented to contain a vector of program words and is based on the QVector class of the ProgramWord class. A class was implemented instead of just using a vector so that a text routine could be added to convert the program line into text for testing. The instruction text and operand text routines were implemented in the ProgramWord class. A single text routine could not be implemented since, by itself, this class does not know whether the word is used as an instruction or an operand.

The test encode input routine was modified to call the encode routine upon a successful translation and output the encoded program line (a vector of program words) using the program line text routine. Since the operand words are set to zero, encoder test #1 currently fails since the text for operands does not match. This will be handled as the dictionaries are implemented.

[commit 60add1604a]

Sunday, September 22, 2013

Condensing Sub-Code Bits

Program words will be either instruction words or operand words. The instruction word will consist of the instruction code and sub-codes. The sub-codes will generally only be used to recreate the original program and will not be used during execution (though there will be a few exceptions). The code will reside in the lower 10 bits of the instruction word, which will allow for 1,024 different codes (which should be sufficient). This leaves the upper 6 bits for the sub-codes.

There are already five sub-codes that need to be stored in the instruction word, including the Parentheses, Colon, LET, Keep and Question sub-codes. This only leaves one bit for another sub-code and there are a lot more commands to implement. The sub-code bit usage needed to be reduced. As it so happens, the LET, Keep and Question sub-codes will never be on the same code, so only one bit is really required for these sub-codes.

To accomplish this, these three sub-codes were replaced with the new Option sub-code. The LET and INPUT translate routines now set this sub-code instead of the individual sub-codes that were removed. This new sub-code also can be reused for other commands requiring an individual sub-code.

The bits of the sub-codes were also rearranged with the sub-codes that will be used in the instruction words (Parentheses, Colon, and Option) in the same bit positions. This will prevent having two sets of sub-code definitions. A new Program Mask sub-code definition was added that will be used to mask the program sub-codes from the other token sub-codes when the instruction is created.
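To illustrate the masking, here is a small sketch. The specific bit values, the TokenOnlySubCode name, and the makeInstruction helper are all assumptions made for the example; only the idea of shared bit positions and a program mask comes from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Bit assignments below are assumptions for illustration only.
enum SubCode : std::uint16_t {
    TokenOnlySubCode = 0x0001,  // hypothetical sub-code used only in tokens
    ParenSubCode     = 0x0400,  // stored in the instruction word
    ColonSubCode     = 0x0800,  // stored in the instruction word
    OptionSubCode    = 0x1000,  // shared by LET, Keep and Question
    ProgramMask      = 0xFC00   // upper 6 bits reserved for program sub-codes
};

// Hypothetical helper: build an instruction word from a 10-bit code and a
// token's sub-codes, keeping only the sub-codes that belong in the program.
std::uint16_t makeInstruction(std::uint16_t code, std::uint16_t tokenSubCodes)
{
    assert(code <= 0x03FF);
    return static_cast<std::uint16_t>(code | (tokenSubCodes & ProgramMask));
}
```

Because the program sub-codes sit in the same bit positions in both the token and the instruction word, a single AND with the mask is enough and no bit shuffling is needed.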

So that the proper sub-code can be output during testing, an option name variable was added to the table entries. Commands that use the Option sub-code will have the text name of the sub-code in the option name. In the token text routine, if a token has the Option sub-code, the option name is output (or the string "BUG" if the code does not have an option name).

[commit e721a3e3ae]

Encoding – First Phase – Revisited (Tagged)

The tokenCodes0.5 branch of development was successful so it was merged back into branch0.5 (a fast forward merge meaning the branch0.5 pointer was simply moved).

Since the commands.h header file will contain definitions for other things (like routines for constants), it was renamed to the more generic basic.h, which was also added as a dependency to the application binary in the CMake build file so that it appears in the Project source file list within QtCreator.

In the table entry initialization, it turns out that using "" (blank string) and NULL (pointer) for an initializer to a constant QString variable has the same effect. Therefore, all "" were replaced with NULL in the table entries.

The first phase required for encoding is complete (again), except it is now performed by the translator. This is a good point to tag version v0.5.1. Work will now begin again in encoding the tokens into the internal program code format to be stored in the program model.

[commit b461229bcd] [commit f64d4401d2]

Translator – Determine Code Size

The output assign codes routine contained two loops, the first to assign codes to tokens without codes, and the second to assign program word index values and determine the encoded size of the tokens. Since the token codes are now assigned within the translator routines, the first loop was left just looking for unimplemented token types. This can now be accomplished in the second loop by simply checking if the token does not have a valid code.

When this routine was being modified, I realized that this routine along with the output append and output insert routines should be members of the RPN List class and not the Translator class since they only deal with the RPN list and do not use any of the translator member variables (with one exception). Therefore, these functions were moved to the RPN List class. The append and insert routines remain in the translator class as output list access functions for the command translate routines; however, the translator routines were changed to access the output list functions directly.

The output assign codes routine was renamed to the set code size routine since that is basically what it does besides setting program word indexes (related) and checking for token types not yet supported (needed only during development). The reset code size and increment code size access functions were removed since the code size member variable can be accessed directly. This routine now returns a boolean value (false meaning an unimplemented token was found, which is returned). An argument for the reference to the table instance to access the flags for codes was also required.

[commit d0e39b01f2]

Saturday, September 21, 2013

Translator – Assigning Variable Codes

In the get operand routine, the cases that fell through to the next case were replaced with the actual code needed for each case (in some cases, checks were performed that had already been done or were not necessary). For now, the identifier with no parentheses token type is assumed to be a variable. Eventually, the function dictionary will be checked to see if the identifier is a function before assuming it is a variable. The token is assigned a variable code, or a variable reference code if a reference was requested.

Previously, when a reference was requested, the reference flag of the token was set. This is no longer necessary since nothing downstream needs to check the reference flag (except the debug output code). The reference flag was added to the variable reference code table entries so that the debug output code knows to output the reference indicator for this token type.

To validate that identifiers with no parentheses are being assigned a code by the end of translation, the token text routine was modified to output a question mark in front of these tokens if no code has been assigned. When the with index flag argument is set (during encoder test output), these tokens output the code and operand (the string of the token) instead of just the string since these tokens will be encoded as two program words (the variable instruction and its index). Also, variable tokens are no longer assigned a code in the output assign codes routine called at the end of translation.

To make the variable output work correctly for both translator and encoder test output, the token type in the table entries for the six variable codes was changed from the internal function with no parentheses type to the no parentheses type. This will make these tokens go through the no parentheses token type case in the token text routine. This is inconsequential since the token type disappears once the tokens are encoded.

[commit db6795346f] [commit a1f413b2e1]

Translator – Assigning Constant Codes

In the get operand routine, if the token is a constant and the data type requested is a specific data type (double, integer or string), the token can be assigned one of the constant codes. For the any, none or number data types, it is not yet known what type the constant needs to be, so the code assignment will be delayed until it is known.

This will generally take place when the find code routine is called, as the type of constant needed will be determined by the operator being processed. A constant token will be assigned a code before returning unless an error is detected (wrong data type or a double constant that cannot be converted to an integer).

For most functions, constant arguments will be assigned a code by the get operand routine since the arguments are for a specific data type. A few functions have multiple forms for double and integer arguments (ABS, SGN and STR$). For these functions, a constant argument is assigned a code since no code has been assigned yet. Integer constants containing a decimal point or an exponent (the double sub-code is set) are first changed to the double data type.

Previously, at the end of the get expression routine, the data type of a constant was forced to the expected data type passed. This would be another place where a constant token needs to be assigned a code. However, it turned out that this code was never executed because the type of a constant had already been determined and a code assigned by the end of this routine, so this code was removed.

To validate that constants are being assigned a code by the end of translation, the token text routine was modified to output a question mark in front of a constant token if no code has been assigned. When the with index flag argument is set (during encoder test output), constant tokens output the code and operand (the string of the constant) instead of just the constant since these tokens will be encoded as two program words (the constant instruction and its index). Also, constant tokens are no longer assigned a code in the output assign codes routine called at the end of translation.

To make the constant output work correctly for both translator and encoder test output, the token type in the table entries for the three constant codes was changed from the internal function with no parentheses type to the constant type. This will make constant tokens go through the constant case in the token text routine. This is inconsequential since the token type disappears once the tokens are encoded.

[commit fa0916f67b]

Table – Finding and Setting Token Codes

At the moment, the token codes are assigned to the constant and identifiers with no parentheses token types after the translation is complete. This was initially the first step of encoding, but was moved into the translator. Before moving this code assignment earlier in the translation, a simplified routine was needed that, given a token with a data type and a base code, finds a code for that data type (which will be either the base code or one of its associated codes).

The table find code routine was being called to perform this action, but this routine does a lot more, including converting a constant to the expected data type (so no hidden conversion code needs to be added), finding a conversion code if there was no associated code with the desired data type, and returning the expected data type when the data type cannot be converted (an error).

The find code routine contained the part for finding an associated code, so this part was moved into the new set token code routine. This new routine just looks up an associated code if the expected data type of the base code does not match a specified data type for a given operand of the code. If the base code or an associated code matches, then the token is set to the code and its type and data type are set to those of the code. Otherwise it returns false.

A secondary simplified set token code routine was added for setting the code of a token for a base code for the data type already in the token. The base codes that are used for this routine have only one operand, which currently only includes the Const, Var and VarRef codes. This secondary routine just calls the main set token code with the data type of the token and for the first (and only) operand of the base code.
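The two routines could be sketched along these lines. The data structures here are simplified stand-ins and are assumptions: the real table entries hold much more, the VarRef codes are left out for brevity, and the function signatures are guesses rather than the project's actual interface.

```cpp
#include <cassert>
#include <vector>

enum DataType { Double, Integer, String };
enum Code { Const, ConstInt, ConstStr, Var, VarInt, VarStr, Invalid };

// Simplified stand-in for a table entry (assumption).
struct TableEntry {
    DataType operandDataType;      // data type of the code's only operand
    std::vector<Code> associated;  // codes to try for other data types
};

struct Token {
    Code code = Invalid;
    DataType dataType = Double;
};

// Hypothetical table lookup, indexed by code in enumeration order.
const TableEntry &entry(Code code)
{
    static const std::vector<TableEntry> table = {
        {Double, {ConstInt, ConstStr}},  // Const
        {Integer, {}},                   // ConstInt
        {String, {}},                    // ConstStr
        {Double, {VarInt, VarStr}},      // Var
        {Integer, {}},                   // VarInt
        {String, {}}                     // VarStr
    };
    assert(code != Invalid);
    return table[code];
}

// Main routine: use the base code if its operand data type matches,
// otherwise look for a matching associated code; false if none matches.
bool setTokenCode(Token &token, Code baseCode, DataType dataType)
{
    Code found = Invalid;
    if (entry(baseCode).operandDataType == dataType) {
        found = baseCode;
    } else {
        for (Code code : entry(baseCode).associated) {
            if (entry(code).operandDataType == dataType) {
                found = code;
                break;
            }
        }
    }
    if (found == Invalid) {
        return false;
    }
    token.code = found;
    token.dataType = entry(found).operandDataType;
    return true;
}

// Secondary routine: use the data type already in the token.
bool setTokenCode(Token &token, Code baseCode)
{
    return setTokenCode(token, baseCode, token.dataType);
}
```

For example, an integer constant token passed with the Const base code would end up with the ConstInt code in this sketch, with no conversion logic involved.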

[commit 32eec6f3c0]

Saturday, September 14, 2013

Translator Token Code Transition (Begin)

The translator will be transitioned to assign token codes to token types without a code as soon as possible. This includes the constant, identifiers with and without parentheses, and defined functions with and without parentheses token types. The handling of the defined function types will be delayed if possible until defined functions are implemented (much later).

Development of this transition will be done on the tokenCode0.5 sub-branch in case the changes do not work out and need to be abandoned. Otherwise, when complete, the main branch0.5 branch will be fast-forwarded to this sub-branch (which can then be deleted).

The first change was to give all tokens a code when they are created by the parser. Since the mentioned token types do not have a code, these will be assigned to the Invalid code value. This was accomplished by initializing the code to Invalid in the token constructor. As currently implemented, when the parser finds a command, operator, or internal function, the token is then assigned a valid code.

Since all tokens now have a code (some with the Invalid code), the token code can now be used to determine if the token has a table entry (it doesn't if the code is invalid). The has table entry access function is no longer needed and was removed along with the static table array indexed by token type that was used to determine if the token had a table entry. A new has valid code access function was added in its place. Users of the has table entry access function were updated to use the token code (see the commit for the specifics).
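In outline, the change amounts to something like the following sketch. The code names and the shape of the Token class are assumptions made for the example; only the idea of defaulting to an Invalid code and testing for it comes from the description above.

```cpp
#include <cassert>

// Code names are assumptions for illustration.
enum Code { Invalid_Code = -1, Rem_Code, Let_Code, Print_Code };

class Token
{
public:
    // Every new token starts with the Invalid code; the parser replaces it
    // when it recognizes a command, operator or internal function.
    Token() : m_code(Invalid_Code) {}

    // A token has a table entry only if it has a valid code.
    bool hasValidCode() const { return m_code != Invalid_Code; }

    Code code() const { return m_code; }
    void setCode(Code code) { m_code = code; }

private:
    Code m_code;
};
```

The check needs no lookup by token type, which is why the separate static array could be dropped.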

[commit 57d65135de]

Friday, September 13, 2013

Encoder/Translator – Design Considerations

In preparing for implementation of the next Encoder step of actually encoding the RPN list of tokens into program code, I realized that there were some design issues. The first minor issue was that assigning codes and assigning program code word indexes with word counting can't be in the same loop because later for arrays, integer conversion codes may need to be inserted after subscripts, which will trip up the indexes already set and the words counted. These indexes and word count could be adjusted as items are inserted, but it would be more efficient to have two loops.

In further consideration of the design, I realized that these first steps should really be in the translator and not the encoder. The code for these steps was moved to the translator (executed after successfully getting the commands for the line). A new test mode option was needed so that this call would not be made when just testing the translator so the expected translator test results would not change.

However, I realized that the translator could assign the codes without waiting until the end of the translation. The codes could be assigned when the tokens are first processed, specifically in the get operand routine. Also, the translator cannot work in a vacuum, meaning it will need access to the existing program, specifically the various dictionaries so that the types of tokens can be determined (variables, arrays, or functions).

Therefore, the translator will be changed to this new scheme of assigning codes to tokens in the get operand routine, but over the next several commits, with the exception that the dictionaries are not implemented yet. The first change made was to move the first two encoder steps into the translator and into two separate loops, though this is a temporary step (and the encoder class is now empty, but this is also temporary).

The second change in this transition was to add a temporary check to determine if an identifier with parentheses is an array or function. Eventually this will be accomplished with the array and function dictionaries. To make this determination simple, if the identifier starts with the 'F' letter, then it is considered a function, otherwise it is considered an array. This affected the expected results because array subscripts are checked for integers with integer conversions added as needed.

[commit f9f0cd3de7] [commit 9111d8e0f8]

Saturday, September 7, 2013

Encoder – First Phase Complete (Tagged)

The initial implementation of the encoder is complete, which includes preparing the tokens for encoding. Version v0.5.0 has been tagged. Work will now begin in encoding the tokens into the internal program code format.

[commit 6ed93f833c]

Encoder Test Mode – Blank Lines

When the new line was added to encoder test #1, a blank line was added before the new line, but the line was ignored. The test run routine was modified to allow a blank line only when testing the encoder. The expected results were updated accordingly.

[commit c84fbfa08b]

Encoder – Assigning Positional Indexes

The next step in preparing the tokens in the RPN list for encoding is assigning a position index to each token. This is the position, or offset, that the token will have in the encoded program code within the line. This index will be used later for calculating offsets in single-line statements, such as from the IF token to the ELSE, END IF or last token on a line.

For now however, this index is not needed, but after assigning the indexes to all the tokens, the final count or size of the encoded program code for the line will be known. This value will be used for allocating a program word array that will be filled in during the next step of encoding.

It was more efficient to assign the position index right after assigning codes to tokens instead of having a separate routine with another loop, so the assign codes routine was renamed to the prepare tokens routine. A count variable was added to the routine, which is initialized to zero. After the token type switch, the index of the token is set to the count and the count is incremented. If the token will have an operand (the token code has the has operand flag), the count is incremented again for the operand word.
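The counting can be sketched as follows. The Token structure is a simplified stand-in (an assumption), the token type switch is reduced to a single flag, and the -1 error return anticipates the return value change described in the next paragraph.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for a token (assumption).
struct Token {
    bool supported;   // false for a token type that is not yet implemented
    bool hasOperand;  // the token's code has the has operand flag
    int index;        // word position within the encoded line
};

// Assign each token its position and return the number of program words
// needed for the line, or -1 if an unsupported token is found.
int prepareTokens(std::vector<Token> &tokens)
{
    int count = 0;
    for (Token &token : tokens) {
        if (!token.supported) {
            return -1;
        }
        token.index = count++;  // position of the instruction word
        if (token.hasOperand) {
            ++count;            // one more word for the operand
        }
    }
    return count;
}
```

For a line with two tokens that have operands followed by one that does not, the indexes in this sketch come out as 0, 2 and 4, and the size as 5.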

The prepare tokens routine was also changed to return the size required for the encoder program line instead of a success/fail boolean status. For errors, a -1 value is returned. The calling encode routine was updated accordingly for the new return value.

The RPN list, item and token text routines were modified to optionally output the index of each token. The token text routine was also modified to output an index for the operand word, to make sure that the sub-codes are output after the code word (not after the operand word), and to treat the comment of a remark as the operand word (when index output is selected). Another test line with multiple statements was added to encoder test #1 and the expected results were updated for the indexes now output on each token.

[commit 9e0e442171]

Encoder – Variable Reference Fix

In testing the next step, a small problem was discovered with variable references where the table entry for the main variable reference had the constant associated code array. This was corrected and the expected results for encoder test #1 were updated.

[commit 88913ed121]

Initial Encoder – Assigning Codes

The new Encoder class was implemented initially with just the first step where tokens without codes are assigned codes. For now, only variables (identifier with no parentheses token type) and constants are handled. The other token types will be implemented once the recreator and run-time modules are implemented for the initial set of commands (LET, PRINT, INPUT and REM). The first phase of the Encoder class implementation (steps 1 and 2 described on August 30) only prepares the RPN output list from the translator (will be referred to as the RPN input list for the encoder). The second phase (step 3) involves the generation of the program code.

The initial Encoder class contains two member variables, a reference to the table instance and a pointer to the RPN input list. The class includes a single public encode routine and a private assign codes routine that the encode routine calls after saving the pointer to the input list. For now, both routines simply return a success or fail status as a boolean value. The assign codes routine loops through the input list and does a switch on the token type.

For a constant token type, the appropriate code is found for the data type of the constant from the new base code Const (with associated codes ConstInt and ConstStr). An identifier with no parentheses token type is assumed to be a variable (until functions are implemented later). If the reference flag is set, then the appropriate code is found for the data type from the new base code VarRef (with associated codes VarRefInt and VarRefStr). For non-reference tokens, the appropriate code is found from the new base code Var (with associated codes VarInt and VarStr).

For the command, operator, and internal function (with or without parentheses) token types, no action is performed since these token types already have a code. For the other token types (identifiers with parentheses, and defined functions with and without parentheses), the token is set as an error in the RPN input list with a "not yet implemented" error, the input list is cleared and false is returned.

A pass through find code routine was added to the Table class that takes a single token and a base code. The token is set to the base code and its type is set to the type for the base code. This type replaces the token type present (constant or identifier with no parentheses). The full find code routine is called with the token passed as both the main token and operand token, since the token is the token to be modified (for an associated code as needed) and contains the information (data type) needed to set the code.

Table entries for the new codes were added, each given the internal function with no parentheses token type (same as for the hidden convert codes). Each of these entries was also given the new has operand flag. This flag will be used to determine if the code has a second operand program word. The token text routine was modified for internal function token types, where if the has operand flag is set, then the string of the token is output like a separate token representing a separate program word.

An initial encoder test file was created containing an assignment statement for a variable of each data type assigned to a constant along with a PRINT statement for each variable. This contains each of the nine tokens that need a code assigned (constant, variable, and variable reference for each data type). Encoder command line options were added to the Tester class with an encode input routine that first translates the line and, if there is no error, encodes the line. For now, the tokens of the RPN list are output like with translator testing. The test scripts and batch file were updated to handle the encoder tests.

[commit 94ca2692d0]

Pre-Encoder Issues – More Refactoring

As the initial encoder class was being implemented, the need for some more minor refactoring was noticed. The first was that the token mode enumeration, which was needed by the old translator routines, had not been removed. The second was that the member variable and access function for the double value of a constant token were renamed from valueDbl to just value for consistency. Since double is the default data type, generally the 'Dbl' part is not included (just 'Int' and 'Str') in the name.

[commit 0ec09e02e4] [commit 98e22bda3f]

Friday, September 6, 2013

Pre-Encoder Issues – Blank Lines

The final issue is the ability to handle (allow) blank lines. When using test mode, lines that are blank or begin with a '#' character are ignored and never passed to the translator. However, blank lines are allowed in the GUI and the translator returned an "expected command" error for blank lines.

It was necessary to prevent these errors for the GUI and it is desirable for the encoder test mode to allow blank lines since the end result of the encoder will be to insert actual code into the program and blank lines need to be allowed.

A simple change was made to the get commands routine after getting the first token of a command. If this token is an end-of-line, and only if the RPN output list is empty, the done status is returned. If the output list is not empty, then there was a preceding colon statement separator and an end-of-line is not allowed (a command is expected).
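The decision described above reduces to a small check, sketched here. The Status names and the checkFirstToken function are assumptions for illustration and do not reflect the project's actual status codes.

```cpp
#include <cassert>

// Hypothetical status values (assumption).
enum class Status { Done, ExpectedCommand, ParseCommand };

// Decide what to do with the first token of a command.
Status checkFirstToken(bool isEndOfLine, bool outputListEmpty)
{
    if (!isEndOfLine) {
        return Status::ParseCommand;  // go on to translate the command
    }
    // An end-of-line is fine on a blank line, but not after a colon.
    return outputListEmpty ? Status::Done : Status::ExpectedCommand;
}
```

The empty output list is what distinguishes a truly blank line from a line that ends right after a colon separator.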

The text output routine for the RPN list class used by the test code already allowed for a blank line (an empty output list), as the for loop that gets the text for each RPN item in the list simply terminates and an empty string is returned.

This change currently can't be tested using the command line test modes since blank lines are ignored with test input files, and a blank line entered in input entry test mode causes the application to exit. However, the proper handling of blank lines can be seen using the GUI.

[commit 637eb58ea5]

Thursday, September 5, 2013

Pre-Encoder Issues – Refactoring

Some minor refactoring (in this case, a fancy term for renaming) was performed. When adding the new index member to the RPN item class, I decided that using the term operands for the items that are attached to an item was not appropriate. Operands really belong to operators. For arrays, attached items are subscripts, and for functions, they are arguments. And when implemented, for single line statements like IF, these will simply be attached items. Therefore, the operand term was changed to attached.

Another thing that was bothersome was how variables holding the number of something were named. When programming in C, one convention was to simply prefix the variable with an 'n' character, for example nitems. But when converting to camel casing, this becomes nItems and m_nItems for a class member. This looks kind of ugly (an opinion). Looking at how Qt does naming for inspiration, the convention itemCount is used. Therefore, all uses of 'n' were replaced with the Count convention.

In the RPN item class, there were a number of access functions for the number of operands and the operands themselves (now called attached) that were not actually being used anywhere, so these were removed. (Lesson: Don't add functions until actually needed.)

Finally, table functions and members related to associated codes used the abbreviations assoc and assoc2 (for the second operand associated codes). These were changed to associated and secondAssociated to make the code a little more readable.

[commit e161cf87e0]

Tuesday, September 3, 2013

Pre-Encoder Issues – Output List Indexes

Tokens that are arguments or subscripts are attached to identifiers with parentheses and defined functions with parentheses tokens. These attached tokens will be used by the Encoder. For tokens that are determined to be an array, the subscripts are checked with hidden integer convert codes added for double subscripts and errors reported for string subscripts. For tokens that are function calls, the arguments are checked with conversion codes added or errors reported as needed.

Later, this mechanism will also be used to attach tokens to certain command tokens. For instance, in a single line IF statement, the ELSE, END IF or last token on the line will be attached to the IF command token. The Encoder will use this attached token to calculate the offset from the IF command to the token, so that during execution, the number of words to skip will be known for when the expression is false.

Currently it is the RPN Items that contain the tokens and it is the RPN Items that are actually attached (the RPN Item contains an array of attached RPN Item pointers). However, there is a problem as there is no way to determine where in the list the attached RPN Item is located. This will be needed to know where to insert conversion code or to calculate an offset.

This was corrected by adding an index member to the RPN Item. When an item is appended to the output list, this index is assigned the value where the item will be located (which is the size of the list just before appending the item). When an item is inserted into the list, the indexes of the items after the insertion point are incremented to reflect each item's new location.
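A minimal sketch of this index bookkeeping follows. The class and member names are assumptions, and std::shared_ptr stands in for whatever pointer type the project actually uses, so that an item held elsewhere (as an attached item would be) can be seen to keep a correct index.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Simplified stand-in for an RPN item (assumption).
struct RpnItem {
    std::string text;
    int index;  // location of this item within the list
};
using RpnItemPtr = std::shared_ptr<RpnItem>;

class RpnList
{
public:
    // The new item's index is the size of the list before appending.
    RpnItemPtr append(const std::string &text)
    {
        int index = static_cast<int>(m_items.size());
        m_items.push_back(std::make_shared<RpnItem>(RpnItem{text, index}));
        return m_items.back();
    }

    // Insert at a position and bump the index of every item after it.
    RpnItemPtr insert(int at, const std::string &text)
    {
        assert(at >= 0 && at <= static_cast<int>(m_items.size()));
        RpnItemPtr item = std::make_shared<RpnItem>(RpnItem{text, at});
        m_items.insert(m_items.begin() + at, item);
        for (std::size_t i = static_cast<std::size_t>(at) + 1;
             i < m_items.size(); ++i) {
            ++m_items[i]->index;
        }
        return item;
    }

private:
    std::vector<RpnItemPtr> m_items;
};
```

If a conversion code is inserted between a subscript and its array token, the array token's index moves from 1 to 2 in this sketch, so an offset calculated from it later remains correct.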

For testing purposes, the text output for an RPN Item was modified to include the index of attached items. The affected results files were updated. See the commit log for more details of the changes.

[commit 5ef00faa25]

Monday, September 2, 2013

Pre-Encoder Issues – Constants

While thinking about the design of the first step of the encoder (assigning codes to all tokens), I realized that there is no reason to add hidden conversion codes after numeric constants. The constants can be converted once during translation instead of every time the constant is executed at run-time. However, not all constants can be converted. All integer constants can be converted to double, but not all double constants can be converted to integer.

The union for the constant in a token for either the double or integer value was removed so that both values are available and the one needed is used. A constant token needs to be able to represent three different kinds of constants: an integer, a double that can be used as an integer, and a double that can't be used as an integer (outside the range of an integer).

To accomplish this, if a double constant (one that has a decimal point or exponent) is within the integer range, its data type is set to integer and a new Double sub-code is set. A double constant that can't be converted is set to the double data type. When the constant is parsed and the integer data type is set, both the integer and double values in the token are set (in other words, the parser does the conversion).
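The classification might be sketched like this. The ConstToken structure and parseConstant function are assumptions, as are two details the description leaves open: the double value is truncated when converted to an integer, and an integer-looking constant that is too large is treated as a double.

```cpp
#include <cassert>
#include <climits>
#include <cstdlib>
#include <string>

enum DataType { Double, Integer };

// Simplified stand-in for a constant token (assumption). Both values are
// kept, as described above, instead of sharing a union.
struct ConstToken {
    DataType dataType;
    bool doubleSubCode;  // written with a decimal point or an exponent
    double value;
    int valueInt;
};

// Hypothetical parser helper for a numeric constant string.
ConstToken parseConstant(const std::string &text)
{
    ConstToken token{};
    bool looksDouble = text.find_first_of(".eE") != std::string::npos;
    token.value = std::strtod(text.c_str(), nullptr);
    if (token.value >= INT_MIN && token.value <= INT_MAX) {
        // Usable as an integer: set both values now.
        token.dataType = Integer;
        token.doubleSubCode = looksDouble;
        token.valueInt = static_cast<int>(token.value);  // truncates (assumption)
    } else {
        // Outside the integer range: can only be used as a double.
        token.dataType = Double;
        token.doubleSubCode = false;
        token.valueInt = 0;
    }
    return token;
}
```

Doing the conversion once in the parser means the translator only has to pick which of the two stored values the expression calls for.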

For integer constants, the translator changes the data type of a constant token to the data type needed in the expression and the Double sub-code is removed. A hidden conversion code no longer needs to be added after constants. However, for double constants (those outside the range of an integer), if an integer is called for, the new "expected valid integer constant" error is returned.

For debugging, a "%" character is output after a constant if its data type was set to integer by the translator (even though it may be a floating point value). A constant that was set to the double data type lacks this indicator character. The expected results files were updated for the new indicator character and lack of convert codes after constants. New translator test #17 was added to test many aspects of each constant data type (integer, integer with double sub-code, and double). See the commit log for the full details of the changes. A commit was also made renaming the token sub-code access functions for better readability (and functions not being used were removed).

[commit f104bdf37e] [commit 0d3151f5f9]
