Saturday, August 31, 2013

Pre-Encoder Issues – Array References

Before creating the new encoder class, some issues needed to be taken care of in preparation for encoding.  The first was the index member variable that was added to the token class for the token caching feature; an index member is also needed for use by the encoder.  In addition, the token caching index did not have the "m_" prefix that member variables should have.  Both of these issues were handled by renaming this member variable to "m_id".

The second issue was that array references were not being translated correctly.  When the get operand routine was called to get a reference, the process parentheses token routine, called for an identifier with parentheses token, only asked for numeric expressions.  If an expression contained only an identifier (with or without parentheses) and did not have the parentheses sub-code set, the reference flag of that token was set.  Consider this sample statement and its incorrect translation:
A(B,C)=D
B<ref> C<Ref> A(<ref>[B<ref>,C<ref> D Assign
The reference flags on the subscripts should not have been set; integer conversion codes should have been inserted after the B and C variables; and there is no reason to attach the subscripts to the A( token, since this token is known to be an array (functions cannot be used in this way).  This also applies to INPUT statements with array elements.  Here is the correct translation:
B CvtInt C CvtInt A(<ref> D Assign
In the process parentheses token routine, if the token being processed has its reference flag set, then the expected data type of the expressions is set to integer for array subscripts.  After getting each expression for an identifier with parentheses token (this routine also handles defined functions with parentheses), if the token's reference flag is set, the resulting item on top of the done stack is dropped (it has already been checked for an integer).  Otherwise, the done stack top item's reference flag is set conditionally as before, for the possibility that the token being processed is a function call, which the encoder will handle.

At the end of the process parentheses token routine, when a close parentheses is obtained, a check was added: if the token being processed has its reference flag set, no tokens are attached, since it is an array reference and its subscripts have already been checked for integers.
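The corrected subscript handling can be sketched as follows; the structure and names here (appendSubscript, the literal "CvtInt" text) are simplified stand-ins for the project's actual token classes, not its real API:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical simplified model of the subscript handling described above.
enum class DataType { Double, Integer, String };

struct Token {
    std::string name;
    DataType dataType;
};

// Appends the subscript's token text as a value (not a reference), followed
// by a hidden CvtInt code when the subscript has the double data type,
// mirroring the corrected "B CvtInt C CvtInt" translation.
void appendSubscript(std::vector<std::string> &output, const Token &subscript)
{
    output.push_back(subscript.name);
    if (subscript.dataType == DataType::Double) {
        output.push_back("CvtInt");  // hidden integer conversion
    }
}
```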

The expected results for several translator tests (#1, #4, #7, #8 and #12) were updated for the proper handling of array reference sub-scripts.

[commit 34a796984a] [commit 5d3b918e67e]

Friday, August 30, 2013

The Encoding Procedure

There are several steps for encoding a translated RPN token list for a program line into an internal program code line.

Step 1: Each token type that does not have a code needs to be assigned a code.  These token types include identifiers with and without parentheses, constants, and define functions with and without parentheses.  For identifier tokens, what the identifier refers to (variable, array or user function) needs to be determined before a code can be assigned.

Initially only variables will be supported, so identifiers without parentheses will be assumed to be variables, and except for the constants, the other token types will report a "not yet implemented" error.  For a variable, there will be a total of six program codes including Double Variable, Integer Variable, String Variable, Double Reference Variable, Integer Reference Variable and String Reference Variable.  The specific code is selected based on the data type and reference flag of the token.
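The selection of the variable code might be sketched like this (the enum and code names are shortened guesses at the six codes described above, not the project's actual identifiers):

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch: pick one of six variable program codes from a token's
// data type and reference flag.
enum class DataType { Double, Integer, String };

std::string variableCode(DataType dataType, bool reference)
{
    std::string code;
    switch (dataType) {
    case DataType::Double:  code = "VarDbl"; break;
    case DataType::Integer: code = "VarInt"; break;
    case DataType::String:  code = "VarStr"; break;
    }
    // Reference variants get their own codes (e.g. Double Reference Variable).
    return reference ? code + "Ref" : code;
}
```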

Later when support for arrays and functions is implemented, the dictionaries will be used to determine the type of the identifier.  For arrays, the attached arguments are integer subscripts, so each needs to be checked for an integer value.  For double type subscripts, a hidden integer conversion code will be inserted.  An error will be reported for string subscripts.  The number of subscripts will also be validated.  Similarly for function arguments, the data type of each argument will be checked adding numeric conversion codes or reporting errors as needed.

Step 2: The instruction size of each token will be determined.  Instructions are either one or two program words.  The token type can be used for this determination.  The operator and internal function (with and without parentheses) token types are one word, the command token type can be either one or two, and the other token types are two words.  For commands, a new table entry flag is needed to give the size of each command.

The translated token list will be scanned while maintaining the total encoded size of the line.  For each token, an index (a new member for a token) will be set to the current total size.  The total size is then incremented for the encode size of the token.  This index will be used later for calculating the offset for single structure statements (like a single line IF statement).
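The size scan might look roughly like this sketch, with hypothetical member names standing in for the token class:

```cpp
#include <cassert>
#include <vector>

// Sketch of the size scan: each token records the running total as its index
// (its offset within the encoded line) before its own size is added.
struct TokenSize {
    int size;       // instruction size in words (1 or 2)
    int index = 0;  // offset of this instruction within the encoded line
};

// Assigns each token's index and returns the total encoded size of the line.
int assignOffsets(std::vector<TokenSize> &tokens)
{
    int total = 0;
    for (auto &token : tokens) {
        token.index = total;   // used later for jump offset calculations
        total += token.size;
    }
    return total;
}
```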

Step 3: The encoded line is generated (its total size is now known).  For each token, the first instruction word is created from the code and sub-code of the token.  For two word instructions, the second operand word is determined.  For index values, the identifier is looked up in a dictionary and the second operand word is set to the index of the dictionary entry.  For offset values, the offset is calculated from the attached token.  Block numbers will probably work similar to index values with an associated block dictionary (this mechanism is not defined yet).

Once the line has been encoded, it can be inserted into the program.  At this point things get complicated.  Some dictionary entries will refer to specific locations in the code (consider the IF and END IF example from the previous post).  For any line inserted, replaced or deleted, all these references to program locations need to be adjusted if located after the point of change.  However, this is not a worry initially since only dictionaries for variables, constants and remarks are needed and these will not contain program locations.

Thursday, August 29, 2013

Program Code – Internal Format

The internal program code of a BASIC program will consist of 16-bit instruction words.  Each instruction word will consist of two parts: the instruction code (command, operator, internal function, etc.) to perform, and the sub-code information that will only be used to recreate the original program text (with the Parentheses, Colon and Let sub-codes).  The sub-code information will not be used by the run-time module, but there will be a few exceptions (the Question and Keep sub-codes on the various INPUT statement codes).
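The two-part instruction word could be packed along these lines; the 10/6 bit split is purely an assumed example, since the post does not fix the actual division between code and sub-code bits:

```cpp
#include <cassert>
#include <cstdint>

// Assumed illustration only: low 10 bits hold the instruction code, high
// 6 bits hold sub-code flags.  The real bit allocation is not yet defined.
constexpr std::uint16_t CodeMask    = 0x03FF;  // instruction code bits
constexpr std::uint16_t SubCodeMask = 0xFC00;  // sub-code flag bits

constexpr std::uint16_t packWord(std::uint16_t code, std::uint16_t subCodes)
{
    return static_cast<std::uint16_t>((code & CodeMask)
        | (subCodes & SubCodeMask));
}

constexpr std::uint16_t codeOf(std::uint16_t word)
{
    return word & CodeMask;
}

constexpr std::uint16_t subCodeOf(std::uint16_t word)
{
    return word & SubCodeMask;
}
```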

Some instruction words will have a second 16-bit operand word, which could contain one of three types of information depending on the instruction code.  For instructions that are variables, arrays, constants, remarks, define functions, user functions, etc., this second word will be an index into one of the dictionaries.  For single line structure statements (an IF statement for example), the second word will contain an offset indicating where to jump.  For example, in an IF statement followed by a set of commands to execute upon a true expression, the offset will tell the IF command how many words to skip when the expression is false.

The final type of information in the operand word is a block number, which will be used on multiple line structure statements.  For example, in an IF/END IF structure over several lines, both the IF and END IF commands will have the same block number.  Structured blocks will probably also have a dictionary, so technically this operand type is also an index.  The dictionary entry for a block will contain the locations of the IF and END IF statements.  When running, if the IF expression is false, the run-time will go to the dictionary entry for the block number to find out where the associated END IF is located and jump to the instruction after it.

Encoder – Introduction

As mentioned a while ago (March 25, 2011), the translation of more BASIC commands is being postponed so that the other modules can be developed.  The translation of enough commands with expressions has been implemented (INPUT, LET, and PRINT) to make a useful, though very limited, BASIC program (limited by the lack of conditionals and loops).

These modules include the encoder to convert a translated program line into the internal program code, the recreator to convert the internal program code back to program text, and the run-time module to execute the internal program code.  Once initial versions of these three modules are complete and connected to the GUI, additional commands will be implemented one at a time for each of the four modules.

Only certain elements of the BASIC language will be implemented initially to simplify development of the remaining modules.  This includes just simple variables and constants, with arrays, defined functions and user functions implemented later.  Variables come from the identifier with no parentheses token type, which could also be user functions (either a call to a function with no arguments, or an assignment of the function return value inside the function).  For now this token type will be assumed to be a variable until functions are implemented.

A major part of encoder development is to define the internal program code format, the code that will be stored in the program and executed by the run-time module.  The other major part is the dictionaries that will hold the information about variables, arrays, constants, remarks, functions, etc., which will be referenced from the program code.  For example, the actual names (of variables, functions, etc.) are not stored in the internal program, but in a dictionary, and the program code contains references to these dictionary entries.
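A dictionary of this kind can be sketched minimally as follows (the class and method names are hypothetical, not the project's API):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Minimal dictionary sketch: each name is stored once; the program code
// would hold only the index of the entry in its operand word.
class Dictionary {
public:
    // Returns the index for a name, reusing an existing entry if present.
    int add(const std::string &name)
    {
        auto it = m_index.find(name);
        if (it != m_index.end()) {
            return it->second;
        }
        int index = static_cast<int>(m_names.size());
        m_names.push_back(name);
        m_index[name] = index;
        return index;
    }

    // Recovers the name from an index (used when recreating program text).
    const std::string &name(int index) const { return m_names.at(index); }

private:
    std::vector<std::string> m_names;
    std::map<std::string, int> m_index;
};
```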

A minor part of encoder development (not needed by the final application) is the conversion of individual instructions of the program code into text.  This is similar to the conversion of the translated tokens into text for output by the command line test mode, or in the GUI program view.  The GUI program view translator output will be replaced by the encoder output, which will have the same reverse polish notation layout as the translator output.

Tuesday, August 27, 2013

New Translator – Release


The implementation of new translator routines is now complete, at least with all the commands that were previously implemented in the old translator routines, with the addition of support for multiple statements per line (separated by colons).   Version 0.4.6 has been released (branch0.4 was merged into the master branch and tagged v0.4.6).

While preparing the archives for uploading to SourceForge, I noticed that the test scripts included in the binary archives were not correct and may not have run properly.  The previous uploads have been corrected.   To prevent this problem in the future, the CMake build file was modified to copy the test files into the build directory, and the scripts were modified to use these copies instead of the ones from the source directory.  Also, a check was added to make sure the build directory is not the same as the source directory.

The CMake build was also modified to build a script that creates the binary release archive with the needed files for the current platform (Linux or Windows): a .zip file on Windows and a .tar.gz file on Linux. The files included are the application executable, read me, license, release notes, test files and regression test script. On Windows, the regression test batch file is also included. On Linux, the memory test script and memory error suppression file are included.

Archive files containing the source and binary files (with test files) have been uploaded to SourceForge. For Windows, there is also the ibcp-libs.zip file that contains all the dynamically linked libraries required to run the program if the MinGW and QtSDK packages have not been installed (extract into the ibcp directory). Linux should already have the required libraries installed. This concludes the 0.4 development series. Implementation of the next step (encoder) will now begin with the 0.5 development series.

Sunday, August 25, 2013

New Translator – Code Reorganization

Now that the old translator routines have been completely removed, it is time to reorganize the main translator source file to put the functions in a more logical top to bottom calling order. But first, several of the routines were renamed:

translate2()          → translate()
processOperator2()    → processOperator()
getInternalFunction() → processInternalFunction()
getParenToken()       → processParenToken()

The last two functions were renamed because they are support functions to the getOperand() function, like the processCommand() function is a support function to getCommands(). Also, these functions were made private since they will not be called from outside the translator class (they are only called from the command translate functions). Comments on some of the functions were also reworded, corrected and cleaned up. This concludes the implementation of the new translator scheme.

Looking at some code statistics, about 1330 lines of code were added during the implementation of the new translator routines, but about 257 of these lines were due to the additional token leak and extra delete detection routines, so the new translator routines account for a net of about 1073 lines. After the old translator routines were removed, the code was about 2509 lines smaller. So the new translator is about 43% the size of the old translator. The simpler design will be easier to maintain and utilize, so the change was worthwhile.

Saturday, August 24, 2013

Table Initialization – Expected Data Type

While reviewing the various To-Do entries marked in the code - the words NOTE, TODO, and FIXME in comments that QtCreator highlights when the Todo plugin is enabled (Help/Plugins under Utilities) - there was a FIXME "remove" comment on a check in the table setup and check routine called during initialization.  This check was for an unset expression information structure on an associated code.

This check was in a loop that scans all the table entries and, for each entry that contains operands (operators or internal functions), sets the expected data type for the last operand of an operator or the first operand of a non-operator.  It does this by scanning the main code and all its associated codes, recording the data type expected by each.  After recording all the data types, if both double and integer were found, the expected data type is set to number, or if all data types (double, integer and string) were found, the expected data type is set to any.
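The data type combination rule can be sketched as follows (the enum names are illustrative, not the project's actual identifiers):

```cpp
#include <cassert>
#include <set>

// Hypothetical sketch of combining the operand data types recorded across a
// code's associated codes into one expected data type, per the rule above.
enum class DataType { Double, Integer, String, Number, Any };

DataType expectedDataType(const std::set<DataType> &found)
{
    bool hasDouble  = found.count(DataType::Double) != 0;
    bool hasInteger = found.count(DataType::Integer) != 0;
    bool hasString  = found.count(DataType::String) != 0;
    if (hasDouble && hasInteger && hasString) {
        return DataType::Any;     // all three data types found
    }
    if (hasDouble && hasInteger) {
        return DataType::Number;  // both numeric data types found
    }
    return *found.begin();        // only one data type was recorded
}
```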

However, the associated codes for the sub-string functions, which are set to the related assign sub-string function codes, should not be searched and therefore have the second associated code index set to -1.  The loop was not checking for a -1 index and proceeded when it should not have, so this check was added.  The check with the FIXME for an unset expression information structure was in fact necessary, because some associated codes do not have this structure, specifically the input parse type codes.

While studying this code, it was discovered that the AssignList code contained associated codes for AssignListInt and AssignListStr.  These were used by the old translator routines, but not by the new translator routines, because the AssignListType codes are now associated codes of the AssignType codes.  The unnecessary associated codes were removed.

[commit 07ec1e4c4f]

Old Translator Removal – Reference Flag Cleanup

The process done stack top routine (formerly the find code routine, see post from August 17) contained a section for the first operand of a sub-string function used in an assignment or of an assignment internal code token (determined by whether the sub-string token had the reference flag set).  If the item on top of the done stack did not have its reference flag set, an "expected assignment item" error was reported.  If the reference flag of the token was not set, then the reference flag of the item on top of the done stack was cleared (a reference is not needed).  At the end of the routine, if the data type of the done stack top item was not correct or could not be converted, the reference flag state was used to determine which error to return.

This reference flag functionality is no longer needed in this routine, since the checking of references is handled elsewhere in the new translator routines (specifically by using the reference argument of the get operand routine and by the get internal function routine for sub-string assignments).  This code was removed, and with it gone, the INPUT and LET translate routines no longer need to set the reference flag before calling this routine (indirectly via the process final operand routine from the INPUT translate routine) and clear it afterward.

To simplify the code a bit more with respect to the reference flag, specifically for sub-string assignments where the reference flag is set on a sub-string function token in the get operand routine (which the get internal function routine uses to determine whether a string reference should be requested for the first operand), the reference flag is now cleared upon return, since the reference status is no longer needed (a reference was already obtained).

Old Translator Removal – Process Final Operand

For the old translator, the process final operand routine handled the processing of the final operand of operators, internal functions, tokens with parentheses (arrays or functions) and internal codes (for example assign type, print type, and input assign type).  Internal functions and tokens with parentheses are now handled elsewhere by their respective get routines.  So for the new translator, this routine is only called for operators and internal codes.

The code for handling tokens with parentheses, which included the attaching of the operands from the done stack, was removed.  It turned out that it was no longer necessary to check for a reference token, which was used to determine whether the item to be added to the RPN output list should also be pushed to the done stack.  Since no internal codes need to be pushed to the done stack, the routine now only pushes operator tokens to the done stack.  The PRINT translate routine was modified to only drop the done stack top item for the print only functions (TAB and SPC).

The process final operand routine was simplified after removing the unused code.  The process done stack top routine (formerly the find code routine, see post from August 17) is still called, and it returns the first and last operands of the item that was on the done stack top (the item is popped before returning).  Afterward, the first operand is deleted if it is an open parentheses.  For an operator token, the first operand is set to the operator token for a unary operator, or to the first operand of a binary operator, and the last operand is set from the last operand of the item that was on the done stack top.  For an internal code, the last operand from the done stack top item is deleted if it is a closing parentheses.

[commit ca3b513ac9]

Old Translator – Removal (Unused Definitions)

There are two sub-code definitions that were only used by the old translator routines.  The semicolon sub-code was used for unnecessary semicolons that were entered.  Since this is no longer permitted, this sub-code was removed.  The end sub-code was used to mark the last input parse code in an INPUT statement.  Since the new INPUT translation uses the input begin code to mark the end of the input parse codes, this sub-code is not needed and was removed.

The values of the sub-codes were modified to close the bit gaps left by the semicolon and end sub-codes.  Also, several sub-codes that are only used by the translator (used, last, and unused) were given higher bit values.  It will be convenient and desirable if the same sub-code definitions can be used by the translator and the program code.  The program sub-code bit values must fit in the limited space of a 16-bit instruction word that will be shared with the code value.  It may also be possible to share bit values between sub-codes that will never be used with the same code.  For example, the question and keep sub-codes will never be used on the same code (input begin string vs. input and input-prompt), so the same bit values could be used.

There was also the end-expression flag on table entries that could signal the end of an expression (close parentheses, comma, semicolon, rem-operator, and end-of-line).  This flag was needed for the token centric old translator, but not used for the new translator, so it was removed.

[commit d356621d95]

Old Translator – Removal (Sub-String Data Type)

The sub-string data type was used by the old translator to identify the sub-string functions (LEFT$, MID$, and RIGHT$) that can be used to assign part of a string variable.  The idea was more appropriate for the original String class, which would have made handling during run-time easy.  The String class has since been replaced with the Qt QString class, which has different requirements during run-time, and this has been accounted for with the new sub-string assignment translation scheme (see posts on the new design and with multiple assignments).

The sub-string data type is not needed in the new translator routines and has been removed.  The return data type of the LEFT$, MID$ and RIGHT$ functions is now just String, like all of the other string functions.  For assignments with these functions, the new sub-string flag is used.  The AssignSubStr code was removed because it was replaced with the AssignLeft, AssignMid2, AssignMid3 and AssignRight codes, and the AssignListMix code was removed because the various AssignKeep codes replace its functionality (see posts referenced above).  See the commit log for other changes made to remove the sub-string data type.

[commit 71e97bffa2]

Old Translator – Removal (Step 1)

The old translator routines will be removed in steps, the first step being the largest.  All of the functions related to the old translator were removed along with the token handler and command handler functions.  Also removed were the translator variables used only by the old translator routines, including the state, mode, count stack (used for arrays and functions) and command stack (used for commands).  These were only needed for the token centric old translator.

The program model was changed temporarily to call the new main translate routine; however, this will be changed back once the new translator routines are renamed (removing the "2" that was added to their names to avoid a conflict with the old translator routines).

The temporary test command line options to activate the new translator were also removed.  The original test command line options now use the new translator.  The old translator expected results files that were previously saved were removed.  The temporary test scripts for running the new translator were removed and the commands added to look for old expected results files were removed from the original test scripts.  All tests pass with the original test scripts using the new translator routines (Windows testing was not performed yet).

[commit f150fbeb3f]

Friday, August 23, 2013

New Translator – Complete (Tagged)

The implementation of the new translator routines is now complete for all the items previously completed in the old translator routines (LET, PRINT, INPUT and REM, with the additional item of multiple statement support).  Before removing the old translator routines, this is a good place to set a tag, even though it has been only three commits since the last tag.  This will be the last commit that contains the old translator routines.

Version v0.4.5 has been tagged.  All tests pass with the new translator routines, including the new multiple statement test (#16).  Test #16 does not pass with the old translator routines, since multiple statement support was never added (it would have required a new colon token handler).  Some comments were added indicating items that need to be removed with the old translator routines; that removal will now commence.

[commit 51ef63448b]

New Translator – Colons

Multiple BASIC statements per line will be supported where statements are separated by a colon.  As mentioned in the last post, the newly implemented get commands routine, as indicated by its name, was intended to be able to process multiple statements per line, though the code to do that was not implemented.

For multiple statements, colons will not be stored in the program.  Instead, a colon sub-code will be set on the last token of the statement, which is usually the command token, but not always (for example, the assign token of a LET or a semicolon of a PRINT).  Multiple consecutive colons or a lone trailing colon will not be permitted as they add nothing.

To add support for multiple statements, the get commands routine was modified by adding a check for a colon token after the check for a RemOp token.  For a colon token, the token is not needed and so it is deleted; the colon sub-code is instead set on the last token added to the RPN output list.  The loop then continues looking for another command.  Other changes required were to add the end-statement flag to the colon table entry, to add the colon sub-code definition, and to modify the token text routine to detect and output the colon sub-code.
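The colon handling can be sketched as follows, with a hypothetical RpnItem structure standing in for the real RPN list items:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the colon handling described above: the colon token itself is
// deleted and a colon sub-code is set on the last token already appended to
// the RPN output list.
struct RpnItem {
    std::string text;
    bool colonSubCode = false;
};

// Returns false when there is no preceding statement token to mark (a lone
// or consecutive colon, which is not permitted).
bool markColonSubCode(std::vector<RpnItem> &output)
{
    if (output.empty()) {
        return false;
    }
    output.back().colonSubCode = true;
    return true;
}
```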

A new translator test (#16) was added for testing multiple statements with various commands before colons, and with several statements with errors.  The old translator routines crash on this test, as expected, since multiple statements are not supported.  This will not be fixed: now that the new translator supports everything the old translator supported, the old translator routines can be removed.

[commit 734521d17f]

Thursday, August 22, 2013

New Translator – Remarks

There are two forms of remarks (comments): the REM command and the remark operator (a single quote).  The remark operator can be placed anywhere an end-of-statement token can appear, including at the beginning of the line where a command would be.  There are no commands after a remark, since all characters up to the end of the line are part of the comment.  Remarks were implemented a couple of different ways, and while these all worked, the solutions were not very clean (there were multiple checks for the Rem and RemOp tokens in several routines).

The final solution was to implement a new get commands routine, which, as the name implies, will be able to handle multiple commands on the line separated by colons (though colon support wasn't implemented yet).  The new routine contains a loop, which begins by getting a token.  If the token is a Rem or RemOp, the loop is exited.  Otherwise, the token is processed by calling the process command routine (see below).  If the terminating token of the command is a RemOp, the loop is exited.  Otherwise, for now, the routine returns the terminating token from the command translate routine and the done status.

When the loop is exited in the new get commands routine, which will be due to a Rem or RemOp token, the token is appended to the RPN output list, and the next token is obtained, which should be an end-of-line token unless there is a bug in the parser.  The end-of-line token is returned as the terminating token with the done status.

The original get command routine was renamed to the process command routine, which makes more sense, and this routine won't be public since the get commands routine will be its only caller.  It was modified to receive the first token from a token argument instead of getting the token itself.

The expected results for translator test #15 (Remark tests) needed to be updated due to the change in translation of the PRINT and INPUT commands (the old results were saved).  Now all the translator tests pass with the new translator routines. 

[commit a3a71526f9]

Sunday, August 18, 2013

New Translator – INPUT Statements (Tagged)

The implementation of INPUT statement translation went fairly quickly using the new translation routines.  Only a couple of days were required using the new command centric translator routines, compared to well over a month using the old token centric translator routines.  This should be an indicator of how implementation will go for the rest of the BASIC commands.

The INPUT statements implementation in the new translator is now complete and version v0.4.4 has been tagged.  All tests still pass with the old translator routines.  All translator tests pass with the new translator routines with the exception of the REM statements (test #15) as this has not been implemented yet.  Some slight cleanup was done with the latest commit along with updating the files for v0.4.4.  Implementation of REM statements and operators can now commence in the new translator.

[commit 00815b4bac]

New INPUT Translation

The new INPUT translate routine required several local variables besides a status variable: an index into the RPN output list of the InputBegin or InputBeginStr code, where the input parse codes will be inserted (see below); a done flag; and an input token.  This routine will be used for both the INPUT and INPUT PROMPT commands.

For the INPUT command, a new token is created for the InputBegin code.  For the INPUT PROMPT command, the get expression routine is called to get a string expression.  The done stack top item is dropped and the terminating token is checked.  If the terminating token is a comma, the 'Question' sub-code is set in the token; otherwise, if the token is not a semicolon, the "expecting operator, semicolon or comma" error is returned.  The token is set to the InputBeginStr code.

The index where the input parse codes will later be inserted, which is the position of the input begin token, is obtained from the current count of items in the list.  The token is appended to the output and a loop is entered to get each of the input variables.

The get operand routine is called to get any type of variable reference.  The next token is obtained.  If the token is a comma, the done flag is set to false and the input token is set to the token (to be reused for the input assign token).  If the token is a semicolon, the 'Keep' sub-code is set in the command token, the done flag is set to true, the input token is set to the token (for reuse), and the next token is obtained.  For all other tokens, the done flag is set to true, and the input token is set to a new token.

The input token is set to the InputAssign code and the process final operand routine is called to process the reference on top of the done stack and set the input token code to the appropriate input assign code for the data type of the reference.  A new token is created with the second associated code of the input assign code, which will be the corresponding input parse code.  The input parse token is then inserted into the output list at the begin index.  The input parse token for each subsequent reference is inserted in front of the last input parse token, and therefore all the input parse tokens end up in the desired reverse order (making execution easier).  The loop continues while the done flag is not set.
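The reverse ordering falls out of inserting at a fixed index, as this small sketch shows (the token names are illustrative, not the project's actual codes):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each input parse token is inserted at the same fixed index (the position
// of the input begin token), so each new parse token lands in front of the
// previously inserted one, producing the desired reverse order.
void insertParseToken(std::vector<std::string> &output, int beginIndex,
                      const std::string &parseToken)
{
    output.insert(output.begin() + beginIndex, parseToken);
}
```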

Upon exit from the loop, if the status is set to an error, the error is returned.  Otherwise, the command token is appended to the output, and if the terminating token is not an end-of-statement, the "expecting comma, semicolon or end-of-statement" error is returned.  (Note the reversal of the "comma" and "semicolon" words relative to the error message used for the PRINT statement, which makes more sense for the INPUT statement.)  Finally, the done status is returned.  See the commit log for the other minor changes required to implement the INPUT translate routine.

[commit 8b079faf61]

Saturday, August 17, 2013

New INPUT Translation – Preparation

In preparation for the new INPUT translate routine, some minor changes were needed.  The first change was to add default values to the process final operand routine for the second token and operand index arguments.  The second token argument is used when processing operators, where it is given the first token of the operator being processed; for binary operators, this second token is used to set the first token of the item when it is pushed on the done stack.

The old translator routines also set the second token to the closing parentheses token for tokens with parentheses, but the new translator routines do not call the process final operand routine for these tokens.  Also for the new translator, the operand index will only be used for operators.  In addition to operators, the new translator will only use the process final operand routine for special codes like PrintDbl and InputAssignDbl, and these will not require the second token or operand index arguments.

The other change involved adding a new reference type in addition to None, Variable and All.  The new VarDefFn type allows only variable and defined function references, not sub-string references (as in sub-string assignments).  The Variable reference type was modified to only allow variables and not defined function assignments.  The get operand routine was modified to support these changes.  The INPUT command will use the Variable reference type since defined functions cannot be input.  The get internal function routine was modified to use the new reference type for sub-string assignments (because defined functions of type string will support sub-string assignments, at least under the current plan).
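
As a rough sketch of the reference type distinction, the type names mirror the post, but the enum layout and the allowsDefFn helper are illustrative assumptions, not the project's code:

```cpp
#include <cassert>

// Reference types as described above (layout is an assumption):
enum class Reference {
    None,      // no reference required (any expression)
    Variable,  // variables only (e.g. INPUT: defined functions cannot be input)
    VarDefFn,  // variables or defined functions, but not sub-string references
    All        // any reference, including sub-string references
};

// Hypothetical check: is a defined function token acceptable for a
// given reference type?
bool allowsDefFn(Reference reference)
{
    return reference == Reference::VarDefFn || reference == Reference::All;
}
```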

[commit b670ab38f6] [commit a79f117c52]

Associated Code Processing Reorganization

During the design of the INPUT statement translation, some code reorganization was needed, specifically in the Translator find code routine.  This routine processes the done stack top item (the item last added to the RPN output list), checking whether its data type matches the expected operand data type of the operator or internal function being processed; if not, it looks for an associated code with that expected data type.  If no associated code is found, it checks whether the data type of the operand can be converted (a hidden convert code is added to the output if it can), or else determines the appropriate error.  Upon success, the done stack top item is dropped.

The find code routine had evolved into a somewhat complicated function.  It is also used to take a base code (for example PrintDbl, and it will also be used for InputAssignType) and change it to an associated code for the data type of the operand on top of the done stack.  The functionality of looking up an associated code from a base code for a given data type will also be needed in the Encoder (to be implemented soon), for example, taking the VarDbl code (which pushes a double variable value to the evaluation stack at run-time) and changing it as needed to VarInt or VarStr for the data type of a token with no parentheses.

Since the future Encoder class should not need to call a Translator class function for this, and since this is really the responsibility of the Table class (all the associated code and data type information is in the Table), the part of the find code routine that does this was moved into a new find code routine in the Table class.  Some unused code was discovered (part of it even containing a bug, which was not causing a problem since the code was not used) and was removed when the code was moved.

The Translator find code routine was renamed to the process done stack top routine to more appropriately describe its functionality.  The table of needed convert codes was also moved from the translator to the table source file.  As a result, a new Table convert code access function was needed, which takes a token and data type and returns the convert code needed (including a null code when no convert code is needed, or an invalid code if conversion is not possible).  Some comments were also added documenting code that needs to be removed when the old translator routines are finally removed.
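
A minimal sketch of the kind of lookup moved into the Table class follows.  The code and data type names mirror the post, but the map-based layout and the findCode signature are assumptions for illustration; the project's actual table structure differs:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

enum class DataType { Double, Integer, String };

// Sketch: base code plus operand data type yields the associated code
// (e.g. Var + Integer -> VarInt), as the Table find code routine does.
std::string findCode(const std::string& baseCode, DataType operand)
{
    static const std::map<std::pair<std::string, DataType>, std::string> assoc = {
        {{"Var",   DataType::Double},  "VarDbl"},
        {{"Var",   DataType::Integer}, "VarInt"},
        {{"Var",   DataType::String},  "VarStr"},
        {{"Print", DataType::Double},  "PrintDbl"},
        {{"Print", DataType::Integer}, "PrintInt"},
        {{"Print", DataType::String},  "PrintStr"},
    };
    return assoc.at({baseCode, operand});
}
```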

[commit b5fa514ca9]

Friday, August 16, 2013

New INPUT Command Design

The design of the INPUT command will be modified slightly.  The current design produces a rather complex translation, which also makes execution complicated.  Refer to the previous posts on Input Commands, particularly February 24, 2011 (showing an example translation) and February 26, 2011 (describing some of the complexity during execution).  Unfortunately, simplifying the design much is not possible because of the requirement to handle a statement like INPUT I%,A(I%), where the inputted value must be assigned to I% before the reference to A(I%) can be generated.  The design can, however, be improved upon for run-time (the goal is fast execution), especially when dealing with errors in the entered input.  Consider the previous and new proposed translations for this example statement:
InputBegin InputParseInt InputParseDbl'End' I%<ref> InputAssignInt I% A(<ref> InputAssignDbl Input
InputParseDbl InputParseInt InputBegin I%<ref> InputAssignInt I% A(<ref> InputAssignDbl Input
The proposed translation is nearly identical except that InputBegin now appears after all the InputParseType codes, which are now reversed, and an 'End' sub-code is no longer needed.  At run-time, each InputParseType pushes a data type to an input stack.  InputBegin starts by issuing the prompt, gets the input, and processes the input stack without popping the items.  Each item on the input stack contains a data type and an input value, which is filled in as each value is successfully parsed.  If an error is detected, the error is displayed, the character at which the error occurred is highlighted, and the user is given the opportunity to edit the input and try again.
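
The parse step might look roughly like this sketch, which checks each entered field against the expected data type and reports the index of the failing field so it can be highlighted.  The parseInput name and the simplified comma splitting are illustrative assumptions, not the project's code:

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

enum class DataType { Double, Integer };

// Parse a comma-separated input line against the expected data types.
// Returns -1 on success, otherwise the index of the field that failed
// (so the run-time could highlight it for re-editing).
int parseInput(const std::string& line, const std::vector<DataType>& types,
               std::vector<double>& values)
{
    values.clear();
    std::string::size_type pos = 0;
    for (std::size_t i = 0; i < types.size(); i++)
    {
        std::string::size_type comma = line.find(',', pos);
        std::string field = (comma == std::string::npos)
            ? line.substr(pos) : line.substr(pos, comma - pos);
        char* end = nullptr;
        double value = std::strtod(field.c_str(), &end);
        if (end == field.c_str() || *end != '\0')
            return static_cast<int>(i);   // parse error at this field
        if (types[i] == DataType::Integer && value != static_cast<long>(value))
            return static_cast<int>(i);   // expected a whole number
        values.push_back(value);
        pos = (comma == std::string::npos) ? line.size() : comma + 1;
    }
    return -1;
}
```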

Upon successful input (all values parsed and saved on the input stack), execution proceeds to the next code.  Each reference gets pushed to the evaluation stack and each InputAssignType pops the reference, pops a value from the input stack, and assigns the value to the reference.

There is not much left for Input to perform, but it does check for the 'Keep' sub-code (semicolon at the end of the statement to keep the cursor on the same line), and if not set, moves the cursor to the beginning of the next line.

For the INPUT PROMPT command, the beginning of the translation is varied slightly to allow for the prompt string expression.  The beginning of the line contains the tokens for the string expression (before the first InputParseType).  There is an InputBeginStr instead of InputBegin, which could have an optional 'Question' sub-code depending on whether the prompt expression was followed by a semicolon (no question) or a comma (question).  At run-time, this code will pop the result of the string expression from the evaluation stack and proceed the same as InputBegin except with the specified prompt.

Thursday, August 15, 2013

New Translator – PRINT Statements (Tagged)

A minor parser issue was discovered with how the length of double word command tokens was being set, which was noticed when the "not yet implemented" errors did not point at the entire INPUT PROMPT command.  The token length was only being set to the length of the second word plus one for the space.  This was corrected by setting the token length to the current position (first character after the second word) minus the starting column of the first word.  This also takes into account more than one space entered between the two words.
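
The corrected computation is simple arithmetic, sketched here (the function name and parameters are illustrative, not the project's):

```cpp
#include <cassert>

// Length of a two-word command token: from the starting column of the
// first word up to the current position (the first character after the
// second word), so any extra spaces between the words are included.
int commandTokenLength(int firstWordColumn, int positionAfterSecondWord)
{
    return positionAfterSecondWord - firstWordColumn;
}
```

For example, "INPUT  PROMPT" (two spaces) starting at column 0 ends at position 13, giving a length of 13 rather than the 7 the old computation would produce.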

The PRINT statements implementation in the new translator is now complete and version v0.4.3 has been tagged.  All tests pass with the old translator routines (after some corrections were made).  All translator tests pass with the new translator routines with the exception of the INPUT and REM statements as these have not been implemented yet.  Some minor cleanup was done with the latest commit along with updating the files for v0.4.3 (see the commit log for details).  Implementation of INPUT statements can now commence in the new translator.

[commit bd608f5cf3] [commit 0e316f71ce]

Wednesday, August 14, 2013

New Translator – PRINT Statements

The implementation of the PRINT translate routine using the new translator routines was mostly straightforward, but some time was spent getting the right errors reported at the correct tokens (the previous post dealt with some of these issues in the translator routines).  Three local variables were needed to hold the pointer to the last semicolon token (the last one is used instead of the command token if at the end of the statement), a separator flag (indicating when a semicolon is allowed) and a print function flag (only used for error reporting).  The routine then enters a loop until the end of the statement.

The loop begins by calling the get expression routine for a None data type to allow for print functions or an empty expression.  If a non-numeric parser error is returned, then the appropriate error is returned.  This and other errors cause the loop to exit.  If the done stack is not empty and the data type of the top item is None, then the print function flag is set to true.  Otherwise, a new print code token is created, the process final operand routine is called to get the correct associated code for the top item, and the print function flag is cleared.  The top item (either a print function or the new print code token) is dropped from the done stack (both have already been added to the RPN output list).  The separator flag is set, the last semicolon token is deleted and its pointer is set to null.

If the next token (the terminating token from the get expression routine) is a comma and the last semicolon token pointer is set, then a comma is not allowed, the appropriate error is set and the loop exits.  Otherwise, the comma token is appended to the RPN output list.  The last semicolon token is deleted and its pointer is set to null.  The separator flag is cleared (no semicolon is allowed).  The loop continues with the next expression.

If the next token is a semicolon and the separator flag is not set, then the semicolon is not allowed, the appropriate error is set (depending on whether the last semicolon token pointer is set) and the loop exits.  Otherwise, the last semicolon token is deleted and its pointer is set to the current semicolon token.  The loop continues with the next expression.

For any other terminating token, the loop exits; otherwise (after a comma or semicolon), the separator flag is cleared for the next expression and the loop continues.  If the loop exited with an error, the last semicolon token and command token pointers are deleted and the error is returned.  If the last semicolon token pointer is set, the command token is deleted and replaced by the last semicolon token.  The command token is then appended to the RPN output list.

If the terminating token is not an end-of-statement token, then the appropriate error is returned depending on the print function flag.  Otherwise, a done status is returned.  The PRINT translate function was implemented in the new basic/print.cpp source file, which was added to the CMake build file, and the function was added to the table for the PRINT command.

[commit 3fc94d4b18]

Tuesday, August 13, 2013

Parser Error And None Data Type Handling

The PRINT statement translation implementation is almost complete, but a few more issues were discovered with how parser errors and the None data type are handled.  Support for the None data type was added for the print functions (TAB and SPC), but also to support being able to call the get expression routine and have it allow for no expression, just a terminating token.  This is needed for PRINT statements that have commas with no expressions in between them (for instance, to skip over two columns).

The get token routine was modified to accept a new No data type value, which will now be the default value.  This data type will only be used to call the get token routine when a non-operand token is requested.  This was necessary since the None type is now being used for operands.

The get operand routine was modified for command and operator tokens, which are normally invalid for an operand (an "expected XX expression or variable" error is returned); it now returns a Done status instead if the data type passed is None.  This allows no expression to be returned, and the command or operator token becomes the terminating token from the get expression routine (it is up to the PRINT translate routine to determine the validity of the token).

The get expression routine was modified to carry the None data type into the loop in the expected data type variable.  This allows the None data type to be passed to the get operand routine upon the first call (to allow for print functions or no expression) so the secondary operand data type variable is no longer needed.  After the first operator, the expected data type variable is set accordingly for the operator.

However, for parentheses expressions, the data type needs to be changed to Any before recursively calling the get expression routine since neither a print function nor no expression is valid inside the parentheses.  Also, if a parser error is returned, the status is only set to the "expected operator or closing parentheses" error if the token does not have a data type, because a numeric parser error needs to be reported as such.

Finally, the scheme of leaving an internal function or parentheses token on the hold stack when an error is detected, so that the caller could check whether the top of the hold stack was not the null token, turned out to be flawed.  Without going into details, this scheme was not necessary anyway and was removed.  Quite a few more parser error tests were added to translator test #14 for testing additional error cases.

[commit 86292a56c0]

Sunday, August 11, 2013

Print Function Support

The PRINT statement consists of expressions separated by semicolons or commas.  Commas can also be used with no expressions between them.  The expressions may also contain the print functions TAB and SPC.  These functions are distinguished from other functions in that they have no return data type.  The PRINT translate function could handle these print functions specifically since they are only used in a PRINT statement.

However, the get expression routine also needs to be able to handle these print functions, at least to report errors when they are used incorrectly in expressions.  It currently does this simply by the fact that the None data type is not acceptable to any operator or data type request (an expression with a data type of None is not currently requested).

Instead of having the PRINT translate routine handle the print functions, the get expression routine was modified to accept the None data type, which affected two things.  This allows print functions to be returned.  It also allows no expression to be returned (just the next token as the terminating token), which is needed in the PRINT statement for commas.  This will be detected by the done stack being empty.  The PRINT translate routine will be the only caller with the None data type.

The new translator was changed back to only reporting the TAB( or SPC( token when used incorrectly in expressions.  This was necessary because, with parser and other errors inside the parentheses of the print function, the error was reported at the wrong place.  Consider the expression:
A = TAB(B$)
An "expected numeric expression" error was reported at the B$ token, but should point to the TAB( token as it doesn't belong in the expression in the first place.  Several more statements were added to translator test #14 that included various parser errors in print functions used incorrectly.  See the commit log for more details of the changes made.

[commit 9502d5b183]

Saturday, August 10, 2013

Old Translator Issues With Token Caching

While debugging the new PRINT translate routine, I realized that the old translator routines had never been tested with the new token caching and error reporting code, because when they finally were, crashes and errors occurred.  Two small problems were found and corrected.

The first issue occurred when an error is detected and the error token was not the current token obtained from the parser (meaning the error token had already been added to the output and cannot be deleted).  The old translator routine obtained a new token and copied the error token into it before calling the set error token routine, which called the output list set error routine and then deleted the token.  When the token was copied into the new token obtained, the index of the original token was copied over the new token's index confusing the used token handling.  This was corrected by calling the output list set error routine directly (which only grabs the column and length of the error token) so no temporary token is needed.

The second issue occurred when there was a REMARK operator (') at the beginning of a line.  The Null token put on the bottom of the hold stack was not removed.  This was detected as a token leak and the detection code reported and deleted the token.  Unfortunately, the translator still had the token on the hold stack.  This did not cause a problem until the next command with an error was processed.  When an error occurs, the hold stack is emptied and its tokens are deleted.  When it got to the deleted Null token, the index was garbage (because the memory for the token had been reused) and a segmentation fault occurred.

The REMARK operator token handler was modified to pop and delete the Null token from the hold stack if currently in command mode.  If not in command mode, then another part of the translator correctly handled the Null token.  Curiously for this situation, the EOL token at the end of the line is never obtained and processed.  Therefore, this fix is a hack, but since the old translator routines will be abandoned, there is no reason for a correct fix.

One last minor issue was that when token leaks or extra deletes were reported, if the token was a Null or End-of-line token, a blank token was output with just an index.  The "NULL" and "EOL" strings were added to the debug name entry in the table to identify these blank tokens.

[commit cf93288fcb]

Thursday, August 8, 2013

New PRINT Implementation (Revised)

Once implementation of the new PRINT translation was started, I realized that the design at run-time was more complicated than it needed to be.  Most of the original design appeared to be best after all, with a few minor changes.  When it comes to run-time, the original design would be more efficient and would not need the extra complexity of a command stack.  With print codes immediately after each print item, the print code would simply pop the value from the evaluation stack and print it.

The changes to the original design are in how semicolons and the final PRINT command code are handled.  Semicolon sub-codes were added to print function codes, except when the semicolon appeared at the end of a statement.  Multiple semicolons were allowed, marked with semicolon sub-codes plus extra semicolon tokens when there was more than one extra semicolon.  A PRINT command code would only be added to the end of the statement if there was no print token (semicolon, comma, TAB or SPC) at the end.

Semicolon sub-codes will no longer be needed.  Semicolons will be assumed between each print item unless there is a comma between them.  Multiple unnecessary semicolons will no longer be permitted as these add nothing.  If there is a semicolon at the end of the line, then a semicolon code will be added to the end of the line instead of a PRINT code.  At run-time, the semicolon will do nothing, but when the line is recreated, the semicolon will produce the PRINT keyword in addition to the semicolon at the end of the line.

Wednesday, August 7, 2013

New PRINT At Run-Time (Retracted)

After considering this design, it was decided to be more complicated than it needed to be, so this post has been retracted, see revised post.

Consider the previous example statement and its new translation:
PRINT A,B$;TAB(20);C%
A B$ 20 C% PrintDbl , PrintStr TAB( PrintInt PRINT
At run-time, in addition to the evaluation stack used to hold values of operands and results, there will also be a command stack (not to be confused with the current command stack used by the old translator routines, which is not needed by the new translator routines).  The special print codes (specific print type codes, comma, and print functions) will only push themselves to the command stack.

When the PRINT (or semicolon) code is executed, it will process the command stack, except it will start at the bottom of the stack instead of popping items from the top.  This is possible because the QStack class is derived from the QVector class, which allows its elements to be accessed by index.  The values of the evaluation stack will also be accessed the same way.

When the PRINT (or semicolon) begins, it will start with indexes (one for each stack) set to zero and will process each item on the command stack.  For the specific print type code, it will access the value at the evaluation stack index, output it, and increment the index.  For a comma code, it will output the appropriate number of spaces (leaving the evaluation stack index as is).  For a print function (TAB or SPC), it will access the value at the evaluation stack index, output the appropriate number of spaces, and increment the index.

At the end (top) of the command stack, it will clear both stacks, return the indexes to zero (for the next command), and for the PRINT code only, will advance to the next line.  This scheme works because both the evaluation and command stacks will be empty at the beginning of each command.

New PRINT Implementation (Retracted)

After considering this design, it was decided to be more complicated than it needed to be, so this post has been retracted, see revised post.

The run-time design of the PRINT statement was reconsidered.  The old scheme inserted specific type print codes into the translation with an optional PRINT code at the end depending if the line ended with a semicolon, comma or print function (TAB or SPC).  Consider this example statement and its old translation:
PRINT A,B$;TAB(20);C%
A PrintDbl , B$ PrintStr 20 TAB(';' C% PrintInt PRINT
The specific type print codes pop a value from the evaluation stack and output it.  The comma token outputs spaces to advance the cursor to the next column (traditionally groups of 8 characters).  The TAB print function outputs spaces to advance to the specified column.  And the final PRINT advances to the next line.  If the line ended with a semicolon, comma or print function, the final PRINT code is not present.  The new translation will look like this (details to follow):
A B$ 20 C% PrintDbl , PrintStr TAB( PrintInt PRINT
The final PRINT code is always present, and at run-time will know when to advance to the next line (depending on the code immediately preceding it).  Semicolon tokens (or a sub-code) are not necessary and are implied between the items to print.  Multiple semicolons, though allowed in the old translator (the semicolon sub-code and any semicolon tokens would do nothing at run-time), will not be permitted as there is no reason to allow them (unlike multiple commas).

A semicolon at the end of the line needs special consideration.  If there is a semicolon at the end of the line, it will take the place of the PRINT code.  At run-time, both perform the same actions except only PRINT advances to the next line.  A semicolon can be used this way since no other statement needs a semicolon token.  A 'Keep' sub-code or a new PrintKeep code could have been used, but why have a sub-code check at run-time or create a new code when a semicolon token can be used?

The next post will contain the details of how this new PRINT translation will work at run-time...

Tuesday, August 6, 2013

Extra Token Delete Detection

The other type of token allocation error that can occur is when a token that has already been deleted is deleted again.  This is a less common problem, but could still occur nonetheless.  Extra token delete detection was also previously implemented, but removed.  The original implementation contained a list of tokens deleted extra times, to which a copy of the token was appended.

This time around, the text of the token (obtained using the text access function) along with its index (as part of the string) is saved in a list of strings of deleted tokens.  When it comes time to report these errors, each string in the list is output.  In the reimplemented delete operator function, before the token is marked as unused in the used token vector (its pointer set to null), if the token is already marked as unused, its index and text are appended to the deleted list instead and the token is not pushed onto the free token stack (it is already there).

A new private DeletedList class was implemented inside the Token class.  The destructor is called automatically upon termination of the application and any extra token deletes are reported.  This class also contains a report errors function that does the work of outputting the deleted list to the standard error output stream and clearing the list afterward.  The destructor calls this function.

Again, a separate function was implemented so that it can be called at the end of each line translated in test mode and extra token deletes for that specific line are reported with the line.  For non-test (GUI) mode, any extra token deletes are reported when the application terminates.

Token (Memory) Leak Detection

While implementing the token caching, I thought it would be extremely helpful to put back some token leak detection, a common time-consuming problem seen during debugging of the new translator routines.  Leak detection was previously implemented, but removed during the Qt transition.  According to the post on October 28, 2012, this was due to obscure compiler errors.  Since valgrind was being used for memory leak detection by then, this code was removed.  The log for this commit did not provide any additional clues to the issues other than to say Qt does not interface well with self-implemented new and delete operators (?).

The original token leak detection consisted of a static list of used tokens, and each token contained a pointer into this list.  This was easy to implement with how the original List class was designed, but the QList class had no easy equivalent (using an iterator did not work).  I suspect this partly caused the problem, which may also have had something to do with the way members are initialized by the constructor when a new instance is created; having a Qt member complicated things.  Anyway, a new and easier method of keeping track of used tokens was implemented this time around that did not cause any problems.

A plain integer index member was added to the Token class.  Every token allocated gets a unique index number, which the token keeps throughout the life of the application.  The index values assigned start at zero and each additional token allocated gets the next index value.  This index is used to index into a QVector of token pointers.

When a new token is allocated, the index is set to the size of this used token vector, and then the token is appended to the vector.  In other words, the index points to its corresponding element in this vector.  When a token pointer is popped from the free token stack, the element corresponding to the token is set to the token pointer.  When a token is deleted, the element corresponding to the token is set to a null pointer to indicate the token is not currently used (it is in the free token stack).
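
A simplified sketch of this scheme, with a stand-in Token struct, std::vector in place of QVector, and free functions instead of the project's reimplemented operators (the free stack side is omitted here):

```cpp
#include <cassert>
#include <vector>

// Stand-in for the project's Token class; only the index matters here.
struct Token { int id; };

// Used token vector: element holds the token's pointer while in use,
// null while the token is deleted (on the free stack in the real scheme).
std::vector<Token*> usedVector;

Token* allocateToken()
{
    Token* token = new Token;
    token->id = static_cast<int>(usedVector.size());  // permanent unique index
    usedVector.push_back(token);   // index == position in the vector
    return token;
}

void deleteToken(Token* token)
{
    usedVector[token->id] = nullptr;   // mark unused
    delete token;
}

// A leak is any element still holding a non-null token pointer; keeping
// the pointer (instead of a boolean) lets the report print token details.
int countLeaks()
{
    int leaks = 0;
    for (Token* token : usedVector)
        if (token != nullptr)
            leaks++;
    return leaks;
}
```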

A token pointer was used for this vector instead of a simple boolean flag (for indicating a token was used), so that when it came time to see if there are any tokens used that have not been freed (the element in the vector contains a non-null token pointer), the token pointer can be used to print information about the token (type, code, string, etc.).

A new private UsedVector class was implemented inside the Token class, similar to the FreeStack private class.  The destructor is called automatically upon termination of the application and any token leaks are reported and the tokens deleted so that valgrind does not also report memory leaks.  This class also contains a report errors function that does the work of scanning the vector and reporting any token leak errors found to the standard error output stream and deletes the tokens.  The destructor calls this function.

A separate function was implemented so that it can be called at the end of each line translated in test mode and token leaks for that specific line are reported with the line.  This does not interfere with the test scripts since the errors are output to the standard error stream, which is not captured in the output that gets compared with the expected results files.  Once a test is seen to contain errors, it can be rerun from the command line to the standard output to see which test statements caused the errors.  Having the errors output with the test lines saves debugging time in having to identify which test statement caused the error.

For non-test (GUI) mode, any token leaks are reported when the application terminates, though if not run from the command line, these errors will not be seen.  This should not be an issue because any token leaks should have been eliminated by this time.

Monday, August 5, 2013

Token Caching (With Error Checking)

The PRINT translation will involve many token deletes for unneeded tokens (like semicolons) but also many new tokens (to hold the specific data type print codes).  More on the redesigned PRINT translation in an upcoming post, but it is desirable to reuse tokens as much as possible instead of repeatedly using the free memory heap, which can become fragmented with many allocations and deletions, making heap management time consuming.  Other commands (like INPUT) will also require this reuse capability.

Instead of some type of token saving code in each command translation routine as needed, a global method of caching tokens was implemented.  This was accomplished by reimplementing the new and delete operators of the Token class.  For the delete operator, instead of simply freeing the memory of the token, the pointer to the token is saved on a free token stack.  For the new operator, if the free token stack is not empty, a pointer to a token is popped from this stack, otherwise a new token is allocated using the global new operator.

The free token stack was implemented as a private class based on the QStack class with a destructor that deletes all the tokens on the free stack.  The destructor is called when the free stack goes out of scope, which occurs when the application terminates.
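
A minimal sketch of the caching scheme, using std::stack in place of QStack (simplified and illustrative; the project's version lives inside the Token class and its free stack destructor cleans up at termination):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stack>

class Token {
public:
    static std::stack<void*> freeStack;

    // Reuse a cached token's memory if one is available.
    static void* operator new(std::size_t size)
    {
        if (!freeStack.empty())
        {
            void* memory = freeStack.top();
            freeStack.pop();
            return memory;
        }
        return std::malloc(size);
    }

    // Cache the token's memory instead of returning it to the heap.
    static void operator delete(void* memory)
    {
        freeStack.push(memory);
    }

    int code = 0;
};
std::stack<void*> Token::freeStack;
```

Deleting a token pushes its memory onto the free stack, and the next `new Token` pops and reuses that same memory rather than hitting the heap.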

During the implementation and debugging of LET statements, one area that was time consuming was tracking down memory leak errors, mostly related to tokens.  (Other leaks were easy to identify and correct.)  First the test statement causing the problem had to be identified (time consuming), then the debugger had to be run to identify what was in the token that was not deleted so that the problem could be corrected (also time consuming).  Missed through all of this was the token leak detection that was removed a while ago due to issues with the Qt libraries during the Qt transition.

A new token leak detection scheme was implemented along with detection of tokens that are deleted more than once.  More details of these schemes follow in subsequent posts.  Any token leaks and extra token deletes are reported after each translator test line is translated.  This saves the time needed to identify which statement caused the errors.  The text of the token is output for each error along with its index (as will be detailed in the next post).  Any token errors are also reported at application termination for non-test mode.

[commit 5fa1780cda]

Thursday, August 1, 2013

New Translator – LET Statements (Tagged)

The implementation of LET statements, including multiple assignments and sub-string assignments, in the new translator is now complete and version v0.4.2 has been tagged.  All tests pass with the old translator routines (after some corrections were made).  All expression tests and all assignment statements in the translator tests pass in the new translator routines.  Some minor cleanup was performed with the latest commit along with updating the files for v0.4.2 (see the commit log for details).  Implementation of PRINT statements can now commence in the new translator.

[commit a5d3434fbe]