The final issue is the ability to handle (allow) blank lines. When using test mode, lines that are blank or begin with a '#' character are ignored and never passed to the translator. However, blank lines are allowed in the GUI, and the translator returned an "expected command" error for them.
It was necessary to prevent these errors for the GUI, and it is also desirable for the encoder test mode to allow blank lines, since the end result of the encoder will be to insert actual code into the program, where blank lines need to be allowed.
A simple change was made to the get commands routine after getting the first token of a command. If this token is an end-of-line, and only if the RPN output list is empty, the done status is returned. If the output list is not empty, then a colon statement separator preceded the end-of-line, and an end-of-line is not allowed (a command is expected).
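As a rough sketch (using hypothetical names, not the project's actual identifiers), the check behaves like this:

```cpp
#include <cstddef>

// Hypothetical status values; the project's actual names may differ.
enum class Status { Good, Done, ExpectedCommand };

// Sketch of the check added to the get commands routine: an end-of-line
// as the first token of a command ends translation only when the RPN
// output list is still empty (a truly blank line); a non-empty list
// means a colon separator preceded it, so a command is still expected.
Status checkFirstToken(bool isEndOfLine, std::size_t rpnOutputSize)
{
    if (!isEndOfLine)
        return Status::Good;  // continue translating the command
    return rpnOutputSize == 0 ? Status::Done : Status::ExpectedCommand;
}
```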
The text output routine of the RPN list class used by the test code already allowed for a blank line (an empty output list): the for loop that gets the text for each RPN item in the list simply terminates, and an empty string is returned.
This change currently can't be tested using the command line test modes since blank lines are ignored with test input files, and a blank line entered in input entry test mode causes the application to exit. However, the proper handling of blank lines can be seen using the GUI.
[commit 637eb58ea5]
Friday, September 6, 2013
Thursday, September 5, 2013
Pre-Encoder Issues – Refactoring
Some minor refactoring (in this case, a fancy term for renaming) was performed. When adding the new index member to the RPN item class, I decided that using the term operands for the items attached to an item was not appropriate. Operands really belong to operators. For arrays, the attached items are subscripts; for functions, they are arguments; and when single line statements like IF are implemented, these will simply be attached items. Therefore, the term operand was changed to attached.
Another bothersome thing was how variables holding counts were named. When programming in C, one convention is simply to prefix the variable with an 'n' character, for example nitems. But when converting to camel case, this becomes nItems, and m_nItems for a class member, which looks kind of ugly (an opinion). Looking at how Qt does naming for inspiration, the convention itemCount is used. Therefore, all uses of the 'n' prefix were replaced with the Count convention.
In the RPN item class, there were a number of access functions for the number of operands and for the operands themselves (now called attached) that were not actually being used anywhere, so these were removed. (Lesson: don't add functions until they are actually needed.)
Finally, table functions and members related to associated codes used the abbreviations assoc and assoc2 (for the second operand associated codes). These were changed to associated and secondAssociated to make the code a little more readable.
[commit e161cf87e0]
Tuesday, September 3, 2013
Pre-Encoder Issues – Output List Indexes
Tokens that are arguments or subscripts are attached to identifier with parentheses tokens and define function with parentheses tokens. These attached tokens will be used by the Encoder. For tokens determined to be arrays, the subscripts are checked, with hidden integer conversion codes added for double subscripts and errors reported for string subscripts. For tokens that are function calls, the arguments are checked, with conversion codes added or errors reported as needed.
Later, this mechanism will also be used to attach tokens to certain command tokens. For instance, in a single line IF statement, the ELSE, END IF or last token on the line will be attached to the IF command token. The Encoder will use this attached token to calculate the offset from the IF command to the token, so that during execution, the number of words to skip will be known for when the expression is false.
Currently it is the RPN items that contain the tokens, and it is the RPN items that are actually attached (an RPN item contains an array of attached RPN item pointers). However, there is a problem: there is no way to determine where in the list an attached RPN item is located. This is needed to know where to insert a conversion code or how to calculate an offset.
This was corrected by adding an index member to the RPN item. When an item is appended to the output list, this index is assigned the position where the item will be located (the size of the list just before appending the item). When an item is inserted into the list, the indexes of the items after the insertion point are incremented to reflect each item's new location.
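A minimal sketch of this index bookkeeping, with an assumed RpnItem structure standing in for the real class:

```cpp
#include <vector>

// Illustrative RPN item holding the index of its list position
// (names here are assumptions, not the project's actual identifiers).
struct RpnItem {
    int index{-1};
};

// Appending: the index is the size of the list just before the append.
void appendItem(std::vector<RpnItem>& list, RpnItem item)
{
    item.index = static_cast<int>(list.size());
    list.push_back(item);
}

// Inserting: items after the insertion point shift up one position,
// so each of their stored indexes is incremented to match.
void insertItem(std::vector<RpnItem>& list, int at, RpnItem item)
{
    item.index = at;
    list.insert(list.begin() + at, item);
    for (std::size_t i = at + 1; i < list.size(); ++i)
        ++list[i].index;
}
```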
For testing purposes, the text output for an RPN Item was modified to include the index of attached items. The affected results files were updated. See the commit log for more details of the changes.
[commit 5ef00faa25]
Monday, September 2, 2013
Pre-Encoder Issues – Constants
While thinking about the design of the first step of the encoder (assigning codes to all tokens), I realized that there is no reason to add hidden conversion codes after numeric constants - the constants can be converted during translation instead of every time the constant is executed at run-time. However, not all constants can be converted: all integer constants can be converted to double, but not all double constants can be converted to integer.
The union in a token that held either the double or the integer value of a constant was removed so that both values are available and the one needed is used. A constant token needs to be able to hold three different data types: one for an integer, one for a double that can be used as an integer, and one for a double that can't be used as an integer (outside the range of an integer).
To accomplish this, if a double constant (one that has a decimal point or exponent) is within the integer range, its data type is set to integer and a new Double sub-code is set. A double constant that can't be converted is set to the double data type. When the constant is parsed and the integer data type is set, both the integer and double values in the token are set (in other words, the parser does the conversion).
For integer constants, the translator changes the data type of a constant token to the data type needed in the expression, and the Double sub-code is removed. A hidden conversion code no longer needs to be added after constants. However, for double constants (those outside the range of an integer), if an integer is called for, the new "expected valid integer constant" error is returned.
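The classification logic can be sketched as follows; the structure and names here are assumptions for illustration, not the actual token class:

```cpp
#include <limits>

enum class DataType { Double, Integer };

// Illustrative constant token: the union was replaced by both values,
// plus a flag standing in for the Double sub-code (names assumed).
struct Constant {
    DataType type;
    bool doubleSubCode;
    int intValue;
    double dblValue;
};

// A parsed double constant (one with a decimal point or exponent) that
// fits the integer range gets the integer data type, the Double
// sub-code, and both values set; otherwise it stays a plain double.
Constant classifyDoubleConstant(double value)
{
    Constant c{DataType::Double, false, 0, value};
    if (value >= std::numeric_limits<int>::min()
        && value <= std::numeric_limits<int>::max()) {
        c.type = DataType::Integer;
        c.doubleSubCode = true;
        c.intValue = static_cast<int>(value);  // the parser does the conversion
    }
    return c;
}
```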
For debugging, a "%" character is output after a constant if its data type was set to integer by the translator (even though it may be a floating point value). A constant that was set to the double data type lacks this indicator character. The expected results files were updated for the new indicator character and lack of convert codes after constants. New translator test #17 was added to test many aspects of each constant data type (integer, integer with double sub-code, and double). See the commit log for the full details of the changes. A commit was also made renaming the token sub-code access functions for better readability (and functions not being used were removed).
[commit f104bdf37e] [commit 0d3151f5f9]
Saturday, August 31, 2013
Pre-Encoder Issues – Array References
Before creating the new encoder class, some issues needed to be taken care of in preparation for encoding. The first was the index member variable that was added to the token class for the token caching feature; there also needs to be an index member for use by the encoder. In addition, the token caching index did not have the "m_" prefix that member variables should have. Both of these issues were handled by renaming this member variable to "m_id".
The second issue was that array references were not being translated correctly. When the get operand routine was called to get a reference, the process parentheses token routine called for the identifier with parentheses token was only asking for numeric expressions, and if an expression contained only an identifier (with or without parentheses) that did not have the parentheses sub-code set, the reference flag of that token was set. Consider this sample statement and its incorrect translation:
A(B,C)=D
B<ref> C<Ref> A(<ref>[B<ref>,C<ref> D Assign
The reference flags on the sub-scripts should not have been set, there should be integer conversion codes inserted after the B and C variables, and there is no reason to attach the sub-scripts to the A( token since this token is known to be an array (functions can not be used in this way). This also applies to INPUT statements with array elements. Here is the correct translation:
B CvtInt C CvtInt A(<ref> D Assign
In the process parentheses token routine, if the token being processed has its reference flag set, the expected data type of the expressions is set to integer for array subscripts. After getting each expression, for identifier with parentheses tokens being processed (this routine also handles defined functions with parentheses), if the reference flag is set, the resulting item on top of the done stack is dropped (it has already been checked for an integer). Otherwise, the done stack top item's reference flag is set conditionally as before, for the possibility that the token being processed is a function call, which the encoder will handle.
At the end of the process parentheses token routine, when a close parenthesis is obtained, a check was added: if the token being processed has its reference flag set, then no tokens are attached, since it is an array reference and its sub-scripts have already been checked for integers.
The expected results for several translator tests (#1, #4, #7, #8 and #12) were updated for the proper handling of array reference sub-scripts.
[commit 34a796984a] [commit 5d3b918e67e]
Friday, August 30, 2013
The Encoding Procedure
There are several steps for encoding a translated RPN token list for a program line into an internal program code line.
Step 1: Each token type that does not have a code needs to be assigned a code. These token types include identifiers with and without parentheses, constants, and define functions with and without parentheses. For identifier tokens, what the identifier refers to (variable, array or user function) needs to be determined before a code can be assigned.
Initially only variables will be supported, so identifiers without parentheses will be assumed to be variables, and except for the constants, the other token types will report a "not yet implemented" error. For a variable, there will be a total of six program codes including Double Variable, Integer Variable, String Variable, Double Reference Variable, Integer Reference Variable and String Reference Variable. The specific code is selected based on the data type and reference flag of the token.
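The selection might be sketched like this (the enumerator names follow the post, but the function itself is purely illustrative):

```cpp
// Illustrative selection of one of the six variable program codes
// from a token's data type and reference flag.
enum class DataType { Double, Integer, String };
enum class Code {
    DoubleVariable, IntegerVariable, StringVariable,
    DoubleRefVariable, IntegerRefVariable, StringRefVariable
};

Code variableCode(DataType type, bool reference)
{
    switch (type) {
    case DataType::Double:
        return reference ? Code::DoubleRefVariable : Code::DoubleVariable;
    case DataType::Integer:
        return reference ? Code::IntegerRefVariable : Code::IntegerVariable;
    case DataType::String:
        return reference ? Code::StringRefVariable : Code::StringVariable;
    }
    return Code::DoubleVariable;  // unreachable; silences compiler warnings
}
```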
Later when support for arrays and functions is implemented, the dictionaries will be used to determine the type of the identifier. For arrays, the attached arguments are integer subscripts, so each needs to be checked for an integer value. For double type subscripts, a hidden integer conversion code will be inserted. An error will be reported for string subscripts. The number of subscripts will also be validated. Similarly for function arguments, the data type of each argument will be checked adding numeric conversion codes or reporting errors as needed.
Step 2: The instruction size of each token will be determined. Instructions are either one or two program words. The token type can be used for this determination. The operator and internal function (with and without parentheses) token types are one word, the command token type can be either and the others are two words. For commands, a new table entry flag is needed for the size of each command.
The translated token list will be scanned while maintaining the total encoded size of the line. For each token, an index (a new token member) will be set to the current total size, and the total size is then incremented by the encoded size of the token. This index will be used later for calculating the offsets of single line structure statements (like a single line IF statement).
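A sketch of this scan, under the assumption that each token already knows its one- or two-word size:

```cpp
#include <vector>

// Illustrative step 2 scan: each token is one or two program words;
// each token's index records the offset of its instruction within
// the encoded line, and the running total gives the line size.
struct Token {
    int size;   // 1 or 2 words, determined from the token type
    int index;  // offset within the encoded line (set by the scan)
};

int assignOffsets(std::vector<Token>& tokens)
{
    int total = 0;
    for (Token& token : tokens) {
        token.index = total;  // offset used later for jump calculations
        total += token.size;
    }
    return total;             // total encoded size of the line
}
```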
Step 3: The encoded line is generated (its total size is now known). For each token, the first instruction word is created from the code and sub-code of the token. For two word instructions, the second operand word is determined. For index values, the identifier is looked up in a dictionary and the second operand word is set to the index of the dictionary entry. For offset values, the offset is calculated from the attached token. Block numbers will probably work similar to index values with an associated block dictionary (this mechanism is not defined yet).
Once the line has been encoded, it can be inserted into the program. At this point things get complicated. Some dictionary entries will refer to specific locations in the code (consider the IF and END IF example from the previous post). For any line inserted, replaced or deleted, all these references to program locations need to be adjusted if located after the point of change. However, this is not a worry initially since only dictionaries for variables, constants and remarks are needed and these will not contain program locations.
Thursday, August 29, 2013
Program Code – Internal Format
The internal program code of a BASIC program will consist of 16-bit instruction words. Each instruction word will consist of two parts: the instruction code (command, operator, internal function, etc.) to perform, and the sub-code information that will only be used to recreate the original program text (with the Parentheses, Colon and Let sub-codes). The sub-code information will not be used by the run-time module, with a few exceptions (the Question and Keep sub-codes on the various INPUT statement codes).
Some instruction words will have a second 16-bit operand word, which could contain one of three types of information depending on the instruction code. For instructions that are variables, arrays, constants, remarks, define functions, user functions, etc., this second word will be an index into one of the dictionaries. For single line structure statements (an IF statement for example), the second word will contain an offset to where to jump to. For example, in an IF statement followed by a set of commands to execute upon a true expression, the offset will tell the IF command how many words to skip when the expression is false.
The final type of information in the operand word is a block number, which will be used on multiple line structure statements. For example, in an IF/END IF structure spanning several lines, both the IF and END IF commands will have the same block number. Structured blocks will probably also have a dictionary, so technically this operand type is also an index. The dictionary entry for a block will contain the locations of the IF and END IF statements. When running, if the IF expression is false, the run-time module will go to the dictionary entry for the block number to find out where the associated END IF is located and jump to the instruction after it.
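Since the post does not give the exact bit layout of an instruction word, here is a sketch assuming a hypothetical split of the low ten bits for the instruction code and the high six bits for sub-code flags:

```cpp
#include <cstdint>

// Hypothetical layout of a 16-bit instruction word: low 10 bits hold
// the instruction code, high 6 bits hold sub-code flags that are only
// used when recreating the original program text.
constexpr std::uint16_t codeMask    = 0x03FF;
constexpr std::uint16_t subCodeMask = 0xFC00;

std::uint16_t packInstruction(std::uint16_t code, std::uint16_t subCode)
{
    return (code & codeMask) | (subCode & subCodeMask);
}

std::uint16_t instructionCode(std::uint16_t word)
{
    return word & codeMask;  // the run-time module mostly needs only this part
}
```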
Encoder – Introduction
As mentioned a while ago (March 25, 2011), the translation of more BASIC commands is being postponed so that the other modules can be developed. The translation of enough commands with expressions has been implemented (INPUT, LET, and PRINT) to make a useful, though very limited, BASIC program (limited by the lack of conditionals and loops).
These modules include the encoder to convert a translated program line into the internal program code, the recreator to convert the internal program code back to program text, and the run-time module to execute the internal program code. Once initial versions of these three modules are complete and connected to the GUI, additional commands will be implemented one at a time for each of the four modules.
Only certain elements of the BASIC language will be implemented initially to simplify development of the remaining modules. This includes just simple variables and constants, with arrays, defined functions and user functions implemented later. Variables come from the identifier with no parentheses token type, which could also be user functions (either a call to a function with no arguments, or an assignment of the function return value inside the function). For now this token type will be assumed to be a variable until functions are implemented.
A major part of encoder development is to define the internal program code format, the code that will be stored in the program and executed by the run-time module. The other major part is the dictionaries that will hold the information about variables, arrays, constants, remarks, functions, etc., which will be referenced from the program code. For example, the actual names (of variables, functions, etc.) are not stored in the internal program but in a dictionary, and the program code contains references to these dictionary entries.
A minor part of encoder development (not needed by the final application), is the conversion of individual instructions of the program code into text. This is similar to conversion of the translated tokens into text for output by the command line test mode, or in the GUI program view. The GUI program view translator output will be replaced by the encoder output, which will have the same reverse polish notation layout as the translator output.
Tuesday, August 27, 2013
New Translator – Release
The implementation of the new translator routines is now complete, at least with all the commands that were previously implemented in the old translator routines, with the addition of support for multiple statements per line (separated by colons). Version 0.4.6 has been released (branch 0.4 was merged to the master branch and tagged v0.4.6).
While preparing the archives for uploading to SourceForge, I noticed that the test scripts included in the binary archives were not correct and the scripts might not run properly. The previous uploads have been corrected. To prevent this problem in the future, the CMake build file was modified to copy the test files into the build directory, and the scripts were modified to use these copies instead of the ones from the source directory. Also to prevent problems, a check was added to make sure the build directory is not the same as the source directory.
The CMake build was also modified to build a script that creates the binary release archive with the needed files for the current platform (Linux or Windows). This means a .zip file on Windows and a .tar.gz file on Linux. The files included are the application executable, read me, license, release notes, test files and regression test script. On Windows, the regression test batch file is also included. On Linux, the memory test script and memory error suppression file are included.
Archive files containing the source and binary files (with test files) have been uploaded to SourceForge. For Windows, there is also the ibcp-libs.zip file that contains all the dynamically linked libraries required to run the program if the MinGW and QtSDK packages have not been installed (extract into the ibcp directory). Linux should already have the required libraries installed. This concludes the 0.4 development series. Implementation of the next step (the encoder) will now begin with the 0.5 development series.
Sunday, August 25, 2013
New Translator – Code Reorganization
Now that the old translator routines have been completely removed, it is time to reorganize the main translator source file to put the functions in a more logical top to bottom calling order. But first, several of the routines were renamed:
translate2() → translate()
processOperator2() → processOperator()
getInternalFunction() → processInternalFunction()
getParenToken() → processParenToken()
The last two functions were renamed because they are support functions to the getOperand() function, just as the processCommand() function is a support function to getCommands(). Also, these functions were made private since they will not be called from outside the translator class (from the command translate functions). Comments on some of the functions were also reworded, corrected and cleaned up. This concludes the implementation of the new translator scheme.
Looking at some code statistics: about 1330 lines of code were added during the implementation of the new translator routines, but about 257 of those lines were due to the additional token leak and extra delete detection routines, so the new translator routines account for a net of about 1073 lines. After the old translator routines were removed, the code was about 2509 lines smaller, so the new translator is about 43% the size of the old translator. The simpler design will be easier to maintain and utilize, so the change was worthwhile.
Saturday, August 24, 2013
Table Initialization – Expected Data Type
While reviewing the various To-Do entries marked in the code (the words NOTE, TODO, and FIXME in comments, which QtCreator highlights when the Todo plugin is enabled under Help/Plugins, in the Utilities section), there was a FIXME "remove" comment on a check in the table setup and check routine called during initialization. This check was for an unset expression information structure for an associated code.
This check was in a loop that scans all the table entries and, for each entry that contains operands (operators or internal functions), sets the expected data type for the last operand of an operator or the first operand of a non-operator. It does this by scanning the main code and all of its associated codes, recording the data type expected for each. After recording all the data types, if both double and integer were found, the expected data type is set to number; if all data types (double, integer, and string) were found, the expected data type is set to any.
However, the associated codes for the sub-string functions, which are set to the related assign sub-string function codes, should not be searched and therefore have the second associated code index set to -1. The scan loop was not checking for a -1 index and proceeded into the loop when it should not have, so a check was added. The check with the FIXME for an unset expression information structure was in fact necessary, because some associated codes do not have this structure, specifically the input parse type codes.
While studying this code, it was discovered that the AssignList code contained associated codes for AssignListInt and AssignListStr. This was used by the old translator routines, but not for the new translator routines because the AssignListType codes are now associated codes of the AssignType codes. The unnecessary associated codes were removed.
[commit 07ec1e4c4f]
Old Translator Removal – Reference Flag Cleanup
The process done stack top routine (formerly the find code routine, see post from August 17) contained a section for the first operand of a sub-string function used in an assignment or for an assignment internal code token (determined by whether the token had the reference flag set): if the item on top of the done stack did not have its reference flag set, an “expected assignment item” error was reported. If the reference flag of the token was not set, then the reference flag of the item on top of the done stack was cleared (a reference is not needed). At the end of the routine, if the data type of the done stack top item was not correct or could not be converted, the reference flag state was used to determine which error to return.
This reference flag functionality is no longer needed in this routine since the checking of references is handled elsewhere in the new translator routines (specifically, by using the reference argument of the get operand routine and by the get internal function routine for sub-string assignments). This code was removed, and since it was removed, the INPUT and LET translate routines no longer need to set the reference flag before calling this routine (indirectly via the process final operand routine from the INPUT translate routine) and clear it afterward.
To simplify the code a bit more with respect to the reference flag, specifically pertaining to sub-string assignments, where the reference flag is set for a sub-string function token in the get operand routine (which the get internal function routine uses to determine whether a string reference should be requested for the first operand), the reference flag is cleared upon return since the reference status is no longer needed (a reference was already obtained).
Old Translator Removal – Process Final Operand
For the old translator, the process final operand routine handled the processing of the final operand of operators, internal functions, tokens with parentheses (arrays or functions), and internal codes (for example, the assign type, print type, and input assign type codes). Internal functions and tokens with parentheses are now handled elsewhere by their respective get routines, so for the new translator, this routine is only called for operators and internal codes.
The code for handling tokens with parentheses, which included attaching the operands from the done stack, was removed. It turned out that it was no longer necessary to check for a reference token, which was used to determine whether the item to be added to the RPN output list should also be pushed to the done stack. Since no internal codes need to be pushed to the done stack, the routine now only pushes operator tokens to the done stack. The PRINT translate routine was modified to only drop the done stack top item for the print-only functions (TAB and SPC).
The process final operand routine was simplified after removing the unused code. The process done stack top routine (formerly the find code routine, see post from August 17) is still called; it returns the first and last operands of the item that was on the done stack top (the item is popped before returning). Afterward, the first operand is deleted if it is an open parenthesis. For an operator token, the first operand is set to the operator token for a unary operator or to the first operand for a binary operator, and the last operand is set from the last operand of the item that was on the done stack top. For an internal code, the last operand from the done stack top item is deleted if it is a closing parenthesis.
[commit ca3b513ac9]
Old Translator – Removal (Unused Definitions)
There are two sub-code definitions that were only used by the old translator routines. The semicolon sub-code was used for unnecessary semicolons that were entered. Since this is no longer permitted, this sub-code was removed. The end sub-code was used to mark the last input parse code in an INPUT statement. Since the new INPUT translation uses the input begin code to mark the end of the input parse codes, this sub-code is not needed and was removed.
The values of the sub-codes were modified to close the bit gaps left by the semicolon and end sub-codes. Also, several sub-codes that are only used by the translator (used, last, and unused) were given higher bit values. It will be convenient and desirable if the same sub-code definitions can be used by both the translator and the program code. The program sub-code bit values must fit in the limited space of a 16-bit instruction word that will be shared with the code value. It may also be possible to share bit values between sub-codes that will never be used with the same code. For example, the question and keep sub-codes will never be used on the same code (input begin string vs. input and input-prompt), so the same bit value could be used for both.
There was also the end-expression flag on table entries that could signal the end of an expression (close parenthesis, comma, semicolon, rem-operator, and end-of-line). This flag was needed for the token-centric old translator, but is not used by the new translator, so it was removed.
[commit d356621d95]
Old Translator – Removal (Sub-String Data Type)
The sub-string data type was used by the old translator to identify the sub-string functions (LEFT$, MID$, and RIGHT$) that can be used to assign part of a string variable. The idea was more appropriate for the original String class, where it would have made handling during run-time easy. The String class has since been replaced with the Qt QString class, which has different requirements during run-time, and this has been accounted for with the new sub-string assignment translation scheme (see posts on the new design and with multiple assignments).
The sub-string data type is not needed in the new translator routines and has been removed. The return data type of the LEFT$, MID$, and RIGHT$ functions is now just String, like all of the other string functions. For assignments with these functions, the new sub-string flag is used. The AssignSubStr code was removed because it was replaced with the AssignLeft, AssignMid2, AssignMid3, and AssignRight codes, and the AssignListMix code was removed because the various AssignKeep codes replace its functionality (see posts referenced above). See the commit log for other changes made to remove the sub-string data type.
[commit 71e97bffa2]
Old Translator – Removal (Step 1)
The old translator routines will be removed in steps, the first step being the largest. All of the functions related to the old translator were removed, along with the token handler and command handler functions. Also removed were the translator variables only used by the old translator routines, including the state, mode, count stack (used for arrays and functions), and command stack (used for commands). These were only needed for the token-centric old translator.
The program model was changed temporarily to call the new main translate routine; however, this will be changed back once the new translator routines are renamed (removing the "2" in their names that was added to avoid a conflict with the old translator routines).
The temporary test command line options to activate the new translator were also removed. The original test command line options now use the new translator. The old translator expected results files that were previously saved were removed. The temporary test scripts for running the new translator were removed and the commands added to look for old expected results files were removed from the original test scripts. All tests pass with the original test scripts using the new translator routines (Windows testing was not performed yet).
[commit f150fbeb3f]