Sunday, June 30, 2013

New Translator – Simple Expressions (Tagged)

The implementation for simple expressions is now complete in the new translator and version v0.4.0 has been tagged.  Note that expression test #1 currently fails with the regression and memory test scripts because of problems with the expression mode in the old translator routines.  The new translator routines run this test successfully, but should not yet be used on any of the other expression tests.  Also, parser errors are not yet handled correctly by the new translator routines, but this is going to require a major change to how token errors are handled.

[commit 552aa6176a]

New Translator – Memory Leaks

Several of the error conditions caused a memory leak because the token at which the error occurred was not deleted.  The error token can't just be deleted because it may be in the RPN output list.  Tokens in the output list are deleted when the list is cleared after an error is detected.  A method was needed to determine when an error token should be deleted.

This was accomplished by adding a new UnUsed sub-code.  At each location where an error is detected, the routine setting the error sets this sub-code if the token has not been added to the output list.  The sub-code is currently set in two places: in the get operand routine when a command or operator token is found, and in the translate routine (which calls the get expression routine) when the terminating token is not the end-of-line token.
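A minimal sketch of the idea (the Token interface and list type here are stand-ins, not the project's actual classes): the error token is deleted only when its UnUsed sub-code shows it never reached the output list.

    #include <QList>
    #include <QtAlgorithms>

    enum SubCode { UnUsed_SubCode = 0x01 };

    struct Token
    {
        int subCodes;
        bool hasSubCode(int subCode) const { return (subCodes & subCode) != 0; }
    };

    // clean up after an error: tokens in the output list are deleted with
    // the list; the error token only if it was never added to that list
    void cleanUpError(Token *errorToken, QList<Token *> &outputList)
    {
        qDeleteAll(outputList);
        outputList.clear();
        if (errorToken->hasSubCode(UnUsed_SubCode))
        {
            delete errorToken;   // never made it into the output list
        }
    }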

One other problem, which caused an uninitialized variable error from valgrind, was also in the translate routine where the terminating token is checked for the end-of-line token.  The check first needed to test whether the token has a table entry before checking it for the end-of-line code (tokens without a table entry don't have a code).

[commit a9b45327d6]

New Translator – Expression Error Checks

There were several error conditions that were not being caught correctly by the new translator routines.  The get expression routine was also rewritten with a loop instead of recursively calling itself, which should be slightly more efficient and use slightly less stack space for complex expressions.

The first error condition occurred when a command or operator token is found while getting an operand.  When this occurs, the proper "expected XX expression" error needs to be returned, which is why the get operand routine takes a data type argument.

The second error condition occurred with the token after an operand, in the end-of-expression token check.  This check tested whether the token was an operator; however, both operator and command type tokens are reported as operators by that test.  The test was changed to check specifically for the operator token type.

The third error condition occurred when the hold stack is cleared of higher precedence operators.  The check included obtaining the precedence of the current token and whether the token was a unary operator.  However, both of these assumed the token had a valid code (was in the table), yet the token could be of any type.  A different precedence function was used (one that takes a token instead of a code argument and handles any token type), and a new is-unary-operator function was implemented that also takes a token argument and handles any token type.  Precedences also needed to be assigned to all of the token types (previously only some of them had assigned precedences).
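A small sketch of the idea (the token fields and precedence values below are stand-ins, not the project's actual table): when a token has no table entry, a precedence assigned to its token type is used instead.

    enum TokenType {Command_TokenType, Operator_TokenType, Constant_TokenType,
        NoParen_TokenType, sizeof_TokenType};

    struct Token
    {
        TokenType type;
        bool hasTableEntry;   // true when the token has a code in the table
        int tablePrecedence;  // only valid when hasTableEntry is true
    };

    // illustrative precedence values assigned to each token type
    static const int typePrecedence[sizeof_TokenType] = {4, 2, 100, 100};

    int precedence(const Token &token)
    {
        return token.hasTableEntry
            ? token.tablePrecedence : typePrecedence[token.type];
    }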

Several more tests were added to expression test #1 for testing many of the error conditions.  Note that the old translator expression mode has trouble with some of these error tests because that routine is not as robust as the new routines.  The problems only occur in expression mode.  These problems will not be corrected, so for the time being expression test #1 will fail when using the old translator (including with the regression and memory test scripts).

[commit fab9d675a6] [commit fc7398879f]

New Translator – Simple Expressions

To be able to use the new translator routines, two new temporary command line options were added, namely ‑n and ‑ne, which do the same thing as ‑t and ‑te except that the new translator routine is called.  These will be removed once the new translator has been fully implemented.

Several new expression tests were added to expression test #1 (simple expressions), including a test of the "‑E" expression (which previously reported an error) and several more unary operator tests (including multiple unary operators), many with mixed data types (double and integer) and some with invalid mixed data types (number and string).

When running the old translator, it was discovered that a negative constant at the beginning of an expression was interpreted as a negate operator and a positive constant instead of a negative constant.  This occurred because the parser was not set to the operand state when the translator was in its initial state, so a check was added to set the parser to operand state when the translator is in its initial state and expression mode is selected.

Currently for simple expressions in the new translator routines, parser errors are not being handled correctly and therefore not reported correctly.  Expression test #1 does not contain any expressions demonstrating this (some will be added once this issue is resolved).  The old translator routines crash on these when using expression mode (command mode does not have a problem).  The new translator is also not handling invalid operands and operators correctly.

[commit 58a03cb51b]

New Translator – Get Expression Design

Five new functions were implemented to support simple expressions (simple operands, unary and binary operators) in the new translator, and two existing translator functions are utilized.  The names of new translator routines that conflict with existing functions are temporarily suffixed with a '2' character.  These functions will replace the existing functions once the new translator is fully implemented and working.

The new routines consist of the top level translate routine (currently only supports expression mode); the get expression routine for getting an expression (to be utilized by any command needing an expression), which will return the token terminating the expression; the process operator routine for handling the precedence of operators; the get operand routine for getting an operand (simple operands to start); and the get token routine for getting a token from the parser.  Click Continue... for some more details of each of these new routines.
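Here is a rough structural sketch of how these routines fit together (all of the names and signatures below are assumptions, not the project's actual interfaces): get an operand, then keep processing operator tokens in a loop until a token that is not part of the expression is reached, which is returned to the caller.

    enum Status {Good_Status, Done_Status, Error_Status};

    struct Token;                                    // project token class
    Status getOperand(Token *&token, int dataType);  // operand or error
    Status getToken(Token *&token);                  // next token from parser
    Status processOperator(Token *&token);           // operator precedence
    bool isOperatorType(const Token *token);         // operator token type?

    Status getExpression(Token *&token, int dataType)
    {
        Status status;
        for (;;)
        {
            if ((status = getOperand(token, dataType)) != Good_Status)
            {
                break;               // e.g. "expected XX expression" error
            }
            if ((status = getToken(token)) != Good_Status)
            {
                break;
            }
            if (!isOperatorType(token))
            {
                status = Done_Status;  // token terminates the expression
                break;                 // and is returned to the caller
            }
            if ((status = processOperator(token)) != Good_Status)
            {
                break;
            }
        }
        return status;
    }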

New Translator – Expected Data Type

The old translator routines had a somewhat complex method for detecting and reporting data type errors.  The new translator will use a simpler method: it will have a routine for getting an expression from the input line, with an argument for the desired data type.  At the end of the expression, if the data type does not match the desired type, then either the appropriate hidden conversion code (CvtDbl or CvtInt) will be added or an "expected XX expression" error will be reported.

For example, an IF command will call get expression for an integer expression, the INPUT PROMPT command will call for a string expression and the PRINT command will call for any type of expression.  The same method will apply to the arguments of internal functions and operands of operators.  Two new data types were implemented to support this, the Number and Any data types.
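A condensed sketch of the end-of-expression decision (only the CvtDbl and CvtInt conversion codes come from the design above; the enums and return values are stand-ins):

    enum DataType {Double_DataType, Integer_DataType, String_DataType,
        Number_DataType, Any_DataType};
    enum Result {Okay, AppendCvtDbl, AppendCvtInt, ExpectedError};

    // decide what to do when an expression of 'actual' data type was
    // requested with a desired data type of 'expected'
    Result checkExpression(DataType expected, DataType actual)
    {
        if (expected == actual || expected == Any_DataType
            || (expected == Number_DataType && (actual == Double_DataType
            || actual == Integer_DataType)))
        {
            return Okay;             // types agree, nothing to add
        }
        if (expected == Double_DataType && actual == Integer_DataType)
        {
            return AppendCvtDbl;     // add hidden integer-to-double code
        }
        if (expected == Integer_DataType && actual == Double_DataType)
        {
            return AppendCvtInt;     // add hidden double-to-integer code
        }
        return ExpectedError;        // "expected XX expression" error
    }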

Since the initial new translator implementation will only handle unary and binary operators, the get expression routine needs to know the expected data type that will follow a unary operator.  For the minus (negate) operator, this is a numeric expression (either double or integer).  For the NOT operator, this is an integer expression.

For binary operators, the first operand has already been processed and the appropriate associated code of the operator will already have been found.  The get expression routine will need to know the expected data type of the second operand for the selected associated code.

Therefore, a new expected data type member was added to the expression information structure stored in the table.  Code was added to the table setup and check routine to automatically generate the values for this new member by looking at the table entry for the code and its associated codes, specifically at the operand data type of the last operand.  If multiple data types are found, then the expected data type is set to the Number or Any data type as appropriate.

[commit aa7d7bec92]

Saturday, June 29, 2013

Minor Parser Correction

Implementation of the new translator will start with simple expressions consisting of simple operands (constants and identifiers without parentheses), unary operators, binary operators and possibly parenthetical expressions.  While implementing the new routines (more details in following posts), a minor parser problem was discovered.

When in operand state (which allows a minus sign in front of a numerical constant for a negative constant), if the expression started with a "-E" string, it was interpreted as an incorrectly formed floating point constant that was missing digits before the start of the exponent.  Any other character besides an "E" was correctly interpreted as an operator followed by an identifier without parentheses.

The problem occurred in the parser get number routine when an 'E' character was not preceded by digits, which returned the "expected digits in mantissa of floating point constant" error.  For this condition, a check was added: if a sign was seen and no decimal point was seen, the routine now returns that no number was found.  The parser then proceeds to the get operator routine.
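A small sketch of the added check (the flag and status names are stand-ins): when an 'E' is found before any digits, the result now depends on whether a sign was seen and whether a decimal point was seen.

    enum NumberStatus {NoNumberFound, ExpectedMantissaDigits};

    // decision when an 'E' is encountered with no preceding digits
    NumberStatus exponentWithoutDigits(bool signSeen, bool decimalSeen)
    {
        if (signSeen && !decimalSeen)
        {
            // "-E..." case: report no number found so the parser falls
            // through to the get operator routine ('-' then an identifier)
            return NoNumberFound;
        }
        return ExpectedMantissaDigits;  // otherwise still an error
    }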

[commit bae8b420b8]

Friday, June 28, 2013

Minor Translator Improvements

Before embarking on the new translator routines, some minor improvements were made to the hold and done stacks.  The same improvements could also have been made to the count and command stacks, but (spoiler alert) these stacks won't be needed for the new translator.

For the hold stack, there were several places where the top item on the stack is popped.  The pop() function of QStack returns the item popped, but for these uses the top item is not needed, so a somewhat convoluted method of resizing the stack to one less item was used for efficiency.  It would be much better for code readability to use a simple function.  A similar method was used for pushing items, where the stack was resized (up by one) first and then the members of the item were filled in.

To make the code easier to work with and more readable, a new HoldStack class was implemented, based on the QStack class, which contains a new drop() function for popping the top item without returning it, and a push() function with specific arguments for the token and first token pointers (the first token pointer defaults to null since it is not always set for a push operation).  Both functions still use the resize method, but it is now hidden in the class definition.

Similarly, a new DoneStack class was implemented, though with only a push() function since there were no uses of an empty pop.  This push() function takes arguments for the RPN item pointer along with the first and last token pointers (both defaulting to null since these are not always set for push operations).
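A sketch of the HoldStack idea (the item members are assumptions): drop() pops the top item without returning it and push() fills in the item members directly, with the resize calls now hidden inside the class.

    #include <QStack>

    class Token;                // project class; only pointers are used here

    struct HoldStackItem
    {
        Token *token;           // operator token being held
        Token *first;           // first operand token (may be null)
    };

    class HoldStack : public QStack<HoldStackItem>
    {
    public:
        void drop(void)         // pop the top item without returning it
        {
            resize(size() - 1);
        }
        void push(Token *token, Token *first = NULL)
        {
            resize(size() + 1); // add an item and fill in its members
            top().token = token;
            top().first = first;
        }
    };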

A minor change was also made to the table initialization routine that checks the maximum number of operands and associated codes.  This routine now checks that the fixed constants for these values are set exactly to the actual values found in the table instead of just checking that each constant is at least as big as the value found in the table.  The original thought was that these values would never decrease, which was not the case with the last change, which decreased the maximum number of associated codes.  (See the post on May 20, 2010 for details of these constants.)

[commit cdfc70b43f] [commit 9951f9b9da] [commit 0f0fb97142]

Thursday, June 27, 2013

Removed Temporary Strings

The current translator routines will be left in place and kept operational as the new translator routines are implemented.  Before proceeding with the new routines, the temporary string data type was removed since it is no longer needed.  For now, the sub-string data type remains, but this too will be removed.

String operands were being attached to operators and internal functions with string operands because the translator couldn't determine whether the operands were regular strings (from variables and arrays) or temporary strings (from functions); the encoder was going to make this decision.  Since all strings will now be temporary, string operands no longer need to be attached.

All the operator and internal function codes for the various combinations of regular and temporary strings were removed from the table, along with the operand arrays, expression information structures and associated code arrays involving temporary strings.  The table also included the number of string operands along with a flag indicating whether a code contained string operands; these were only used for attaching string operands, and so were also removed.

The expected test results files were updated since there are no longer any string operands attached to tokens.  Another small change was made to the translator, where a single parser instance is now created in the constructor and deleted in the destructor instead of creating a local parser instance for every line translated.

[commit 35f201d96c] [commit 86ca56c673]

Tuesday, June 25, 2013

New Translator – Strings

How strings will be handled at run-time was the first factor considered in the Translator redesign.  Previously there was the concept of temporary strings and sub-strings.  Temporary strings were put on the evaluation stack as the result of the concatenate string operator or a string function.  Sub-strings were put on the evaluation stack as the result of the sub-string functions (LEFT$, MID$ and RIGHT$).  Regular strings were put on the evaluation stack for string variables.

The idea behind temporary strings was to prevent copying a variable's string to the evaluation stack - there is no point in creating a temporary string for a string variable that is never going to be modified.  Take the example expression A$+B$, which would be translated to A$ B$ +$.  A temporary string only needs to be created for the result; using temporary strings when A$ and B$ are pushed to the evaluation stack would entail unnecessary copies.

The idea behind sub-strings was again to prevent unnecessary copying of strings (since a sub-string is part of an existing string).  Sub-strings could refer to either regular or temporary strings.  Sub-string assignments would also be supported (for example, the statement MID$(A$,4,2)="AB").

The original String class developed supported these concepts.  However, the QString class is now being used for strings and this Qt class supports implicit sharing, meaning that when a string is copied, it is not actually copied (only a reference to the original string is copied) until the string is actually modified.  So for the sample expression, since the A$ and B$ string values are not modified, no copy would take place.
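For example, with QString's implicit sharing, the actual copy of the character data is deferred until a string is modified:

    #include <QString>

    int main()
    {
        QString a = "HELLO";
        QString b = a;      // no character data copied; b shares a's data
        b += " WORLD";      // the deep copy happens here, when b is modified
        // a is still "HELLO"; only b received its own copy of the data
        return 0;
    }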

Therefore, only temporary strings will be put on the evaluation stack.  There is no need for all the different forms of the string functions and operators.  Sub-strings will need to be handled differently than originally envisioned with the String class.  Sub-strings in expressions will now be handled the same as other string functions, using the QString member functions left(), mid() and right().  Sub-string assignments will be handled differently, using the QString member function replace().  Further details of sub-string assignments will be given when the new LET statement translation is implemented.
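For instance (purely illustrative, not the project's code), a sub-string in an expression maps to mid(), while a sub-string assignment maps to replace():

    #include <QString>

    int main()
    {
        QString a = "ABCDEF";
        QString part = a.mid(3, 2);  // MID$(A$,4,2) in an expression: "DE"
        a.replace(3, 2, "AB");       // MID$(A$,4,2)="AB" as an assignment
        // a is now "ABCABF" (QString positions are zero-based)
        return 0;
    }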

Monday, June 24, 2013

Translator – New Design

The Translator will be redesigned to change it from token centric to command centric.  This includes how expressions are processed: instead of being fed tokens, the expression processing will get tokens from the parser as needed.  This should make for a much simpler design by eliminating the various states and modes used to determine how tokens are processed, and the checks needed to determine errors when they are detected.

This will be a major undertaking, so it will be done in steps.  For the moment, the original translator code will be left in place and the new routines will be added alongside the old ones.  Some thought needs to be given to how the program will be executed at run time to determine how the program needs to be translated.  These topics will be covered in upcoming posts.

To get started, the Remark token type was removed since it was not being used.  Remarks are handled either as a command token (REM) or an operator token (').  Also, since it will be needed, the valgrind suppressed errors file was updated.  For some reason, the previously generated file was no longer sufficient.  The memory test did pass when version 0.3.5 was generated, but building this version today now generates additional memory errors.  All these errors occur within the Qt libraries and can be ignored.

[commit 61dc082425] [commit 98aee21078]

Wednesday, June 19, 2013

Command Centric Translator

The core of the current translator design has tokens fed to it one by one and uses various states and modes to determine what type of tokens are expected and how they should be processed.  This design is rather complex and will make the implementation of more complex commands (like IF-THEN-ELSE and FOR-TO-STEP) very difficult.

The new translator design will be command centric.  What this means is that instead of handling each token individually from the bottom, the commands themselves will determine how their particular syntax will be translated.  The translator will start by reading a token, determining the command, and then calling the command specific translation routine.

There will be several routines these command translation routines will use to get the parts of commands, including getting an expression, getting a variable reference, getting a constant, and getting a command.  Each of these functions exits once it reaches a token it does not handle and returns that token.  For example, a semicolon would terminate the get expression routine, which would return the semicolon token.  The caller would then decide how the semicolon should be handled (as in a PRINT statement), or whether it is an error for the command (as in an IF-THEN statement).

For example, for an assignment statement, the LET translation routine first gets a variable reference and checks the returned token for a comma or equal operator; if a comma, it repeats getting a variable reference, otherwise it finishes up by getting an expression.  An error would be reported at any step that is not valid (or if, say, the types don't agree).  Upon successful command syntax, the LET translation routine returns the token terminating the expression to the caller, as in the sketch below.
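A rough sketch of that flow (every name below is an assumption about routines that do not exist yet):

    enum Status {Good_Status, Error_Status};

    struct Token;                                   // project token class
    Status getReference(Token *&token);             // variable reference
    Status getToken(Token *&token);                 // next token from parser
    Status getExpression(Token *&token, int type);  // expression of a type
    bool isComma(const Token *token);
    bool isEqual(const Token *token);
    int referenceDataType(void);                    // type of the reference

    Status translateLet(Token *&token)
    {
        Status status;
        do
        {
            if ((status = getReference(token)) != Good_Status)
            {
                return status;       // not a valid variable reference
            }
            if ((status = getToken(token)) != Good_Status)
            {
                return status;
            }
        }
        while (isComma(token));      // comma: get another variable reference

        if (!isEqual(token))
        {
            return Error_Status;     // report the syntax error
        }
        // the token terminating the expression is returned to the caller
        return getExpression(token, referenceDataType());
    }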

Each command translation routine would call the appropriate support routines for getting the various parts of the command's syntax and would return either an error or the token terminating the command.  The main line translator would expect either a colon (in which case it would look for another command) or an end-of-line token.  For an IF-THEN, the translation routine, after receiving the THEN token, would also call the get command routine, but in addition to looking for a colon or end-of-line token, it would also look for an ELSE or ENDIF token.

Monday, June 17, 2013

Internal Program – New Design

Using a pure reverse polish notation format for the internal program, the code for the command would be at the end of the line.  When running the program, the codes before the command code leave the arguments of the command on a stack, which are then available when the command token is executed.  However, for some commands like PRINT and INPUT, there would be several intermediate codes intermixed in the line that perform part of the command (for example, print a value, input a value, etc.).

I thought that using a less than pure RPN format would be easier to translate to and recreate from, where the command code would be first followed by the other codes and ending with an end of statement code (end-of-line, colon, ELSE, etc.), which would kick off the command.  There would still have been intermediate codes in the line, specifically the commas and semicolons that separate the other arguments of the commands.

When encountering these codes during run-time, they would call back to the command, which would look at the code and its type to decide how to process the values on the stack.  These separator codes would have required type information (double, integer or string) in their sub-code field, which would have complicated the run-time code.  Note the decisions that would need to be made at run-time.  This would not be ideal, as the goal is to make decisions before run-time so the code can be executed as fast as possible.

In any case, upon further contemplation, I decided that a pure RPN format would be best, but with a slightly different format for the intermediate codes needed for some commands, which would be the result of decisions made during translation.  Plus translation will be easier with the new command centric translator, which will be described in the upcoming posts along with the specific details of the various commands.

Sunday, June 16, 2013

Translator Design – Revisited

The next step was going to be either the Recreator (recreating from the RPN lists produced by the Translator, since that is all that is available) or the Encoder (including defining the format of the internal program).  For the latter, the internal format of the program needs to be defined.  Integral to this definition is how the program will be run.

The basic format chosen is a pure reverse polish notation.  The problem with this is that it makes the translator design very complicated and somewhat convoluted (for example, the processing of a PRINT statement is spread throughout the translator code).  And so far only three fairly simple commands have been implemented.  The problem is that the design is token centric, meaning that the specific tokens are processed and they decide how they should be handled based on the current state of the translator (command, expression, etc.).

A better design, and one that should be easier to implement, especially once the more complex commands are implemented, is one that is command centric, meaning the commands decide how the program lines are processed and how the tokens are handled.  This also means a pure RPN design for the internal program is not necessary.  The design I have in mind should only impact run time slightly, but will significantly simplify the translation of more complex commands.

The new design will be explained in upcoming posts.  The goal for the 0.4 development series will be the implementation of the new translator design, which will include the currently implemented commands (LET, PRINT, INPUT, INPUT PROMPT and REM) along with handling expressions.  The size of the current Translator (header, source, token handlers and command handlers) is a total of 2,863 lines.  It will be interesting to see whether the new design is able to reduce this.

Saturday, June 15, 2013

Translator/GUI Integration - Release

The integration of the Translator and the GUI is now complete, which includes a debugging program view of the program internals containing the translated RPN lists of the program lines.  This will eventually show the internal program codes once the encoder is implemented.  Version 0.3.5 has been released (branch0.3 was merged to the master branch and tagged v0.3.5).

Archive files containing the source and binary files (with test files) have been uploaded to SourceForge.  For Windows, there is also the ibcp-libs.zip file that contains all the dynamic linked libraries required to run the program if the MinGW and QtSDK packages have not been installed (extract into the ibcp directory).  Linux should already have the required libraries installed.

This concludes the 0.3 development series.  Implementation of the next step will now begin with the 0.4 development series, which will be defined in the next post.

[commit 046c47cd9a]

New Git Repository Tagging Convention

The error highlighting implementation is now complete and it is time for a development release.  I made the decision to change the naming convention of repository tags from releaseX.X.X to vX.X.X, which is the convention used by many other projects under git revision control (like the git source repository itself).  All the "release" named tags have been replaced with the new "v" named tags.  These can be removed from a local repository using the command git tag -d `git tag -l release*` (backward single quotes) or the repository can be re-cloned.

To support this change, the CMake build configuration file CMakeLists.txt was modified for this naming convention.  A change was also required in the building of the version string in the command line class, which previously added seven to the release string pointer, seven being the length of the "release" part of the version string.  This could have simply been changed to one, the length of the "v" part of the string, but instead a little code was added to look for the start of the version number part of the string (by using a regular expression) so it doesn't matter what string precedes the version number.
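A sketch of that approach (not the project's exact code): a regular expression finds where the numeric part of the tag starts, whatever prefix comes before it.

    #include <QRegExp>
    #include <QString>

    QString versionNumber(const QString &tag)
    {
        QRegExp regex("\\d+\\.\\d+");   // first "N.N" in the string
        int start = regex.indexIn(tag);
        return start == -1 ? tag : tag.mid(start);
    }
    // versionNumber("release0.3.5") returns "0.3.5"
    // versionNumber("v0.4.0") returns "0.4.0"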

The change in tag names will not cause a problem when building one of the archive files downloaded for a given tag, because CMake will generate the version string since the git repository will not be available.  However, the version announced as being built will still have the "release" string in the name.

However, when building at an older tag under the new tag naming convention, these changes are needed (this is not an issue for any of the versions prior to 0.2).  For versions 0.2-1 to 0.2-6, CMake will not complain but the version number output will be messed up; for these, the changes need to be applied manually.  From the final version 0.2.0 through the 0.3.x versions, the changes can be applied by git using the command git cherry-pick fa89, where fa89 is the abbreviated commit ID of the commit with the changes.  To preserve this commit, a new branch can be created using the command git checkout -b xxx, where xxx is the desired branch name.

Finally, previous blog entries were edited to reflect the new tag naming convention.  All posts previously given the Pre-Release blog tag were given the Tag blog tag.  Major releases still have the Release blog tag (and not the Tag blog tag, so they can be differentiated).

[commit fa89de8a0d]

Friday, June 14, 2013

Modifying A Line With An Error

When a line that contains an error is modified, the highlighted error can be shifted around or changed in size.  This can make the error message no longer applicable, so the highlighted error should be removed.  Upon the cursor leaving the line, the line will be rechecked for errors.  A check for this situation was added to the document change slot of the edit box class for when a single line is changed (previously no action was taken for single line changes).

When a single line is modified, the errors list is searched for its line number.  If the line contains an error but the modification occurred after the error, no further action is taken.  If the cursor and the error are at the end of the line, or the cursor is within the error, the error is removed from the error list.  If the cursor is before the error, the column of the error is moved by the net change (the number of characters added less the number removed).
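A condensed sketch of that decision (the names and arguments are assumptions):

    enum ErrorAction {NoAction, RemoveError, MoveErrorColumn};

    // decide what to do with a line's error when that line is modified
    ErrorAction singleLineChange(int errorColumn, int errorLength,
        int cursorColumn)
    {
        if (cursorColumn < errorColumn)
        {
            return MoveErrorColumn;  // shift the error by the net change
        }
        if (cursorColumn <= errorColumn + errorLength)
        {
            return RemoveError;      // cursor within (or at end of) the error
        }
        return NoAction;             // modification occurred after the error
    }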

The update errors slot is called to update the display.  A cursor changed signal is also emitted so that the error message on the status line is removed when the error is removed.  This didn't always occur because the cursor is not always moved when the document is changed (for example, when characters are deleted).  New move error column functions were added to the error list and error item classes to support the shifting of errors.

[commit 98d7b2c0b4]

Thursday, June 13, 2013

Error Action Enabling/Disabling

Now with actions to move the cursor to the next or previous error, these actions should only be enabled when there are actually errors.  To accomplish this, an errors available signal was added to the edit box class.  The update errors slot was modified to determine when the errors available condition changes by comparing the before and after error list emptiness status.  Several problems were found and corrected while testing the new signal.

In the update errors slot, when there were no errors in the incoming errors list, a Qt error occurred because bad indexes were used in accessing the errors list.  This was due to using the start and end change indexes when they were not valid.  A check was added for when the incoming error list has no errors to prevent this.

The second problem occurred when there were no errors and the next or previous error key shortcuts were used.  This was because the Ctrl+. (period) and Ctrl+, (comma) shortcuts were not intercepted while the error actions were disabled, causing these keys to be treated as a regular period and comma, which were inserted into the program.  Checks were added to the key press event handler to catch and ignore the period and comma keys when the Ctrl modifier is active.
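A sketch of the added check (assuming the edit box class is based on QPlainTextEdit; the base class and details are assumptions):

    #include <QKeyEvent>
    #include <QPlainTextEdit>

    class EditBox : public QPlainTextEdit
    {
    protected:
        void keyPressEvent(QKeyEvent *event)
        {
            if (event->modifiers() & Qt::ControlModifier
                && (event->key() == Qt::Key_Period
                || event->key() == Qt::Key_Comma))
            {
                return;  // ignore; don't insert '.' or ',' into the program
            }
            QPlainTextEdit::keyPressEvent(event);  // normal key handling
        }
    };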

Finally, during testing, an unrelated problem was found with the detection of changed lines when comparing RPN lists that contained errors.  Previously this comparison (the equality operator function) compared the error column and length and reported equality if they were the same.  This caused a problem where sometimes the error was not displayed correctly in the modified line (or not at all).  The equality function was modified to report no match if either RPN list has an error.

[commit 352667221f] [commit 15084235a2]