Interactive BASIC Compiler Project

Thursday, December 11, 2014

Table – Name Lookup Mechanism

In order to eliminate the auto-generated code enumeration, unnecessary code enumerators will be removed. Table bracketing entries were used to break the table into search groups. There were three of these groups, which included plain words (including two-word commands), words with parentheses, and symbols.

When the table was initialized, the start and end of each group was determined and stored in the range member indexed by the search type enumerator. This enumeration was not changed to a C++11 enumeration class because its enumerators were used as indexes. When a caller wanted to search the table for a name, it passed one of these search types and the search looked only within that range (by sequentially iterating over the entries).

The new name lookup mechanism uses a standard unordered map where the key will be the name of the code and value will be a pointer to a table entry. For two-word codes, the two words are combined with a space separator between words. The key hash and key equal function operators defined in the dictionary class were moved to the utility header file and renamed to case optional. These are used for this map, defined as a static member, so that searches are case insensitive.

The search functions were renamed find to mirror the standard library names and were made static. The primary find function calls the map find function with the string argument. If the string was found then the code index is calculated by subtracting the base table entry pointer from the table entry pointer, otherwise the Invalid code is returned. After the new table is fully implemented, the table entry pointer itself will be returned or the default table entry pointer. The find function for two words combines the words with a space separator and calls the primary find function.

The add function was added to add an entry to the table, which now just consists of adding to the name to entry pointer map. This function is called at the beginning of the entry iteration loop in the table constructor and throws an exception (a string) if an error is found (a two-word code is already in the map). Eventually this code will be put into or called from the base table constructor.

The search type enumeration was removed along with the bracketing table entries (include their code enumerators) and the range member used to hold the indexes of the bracketing entries. The End Plain Word bracketing code was used by the unary operator recreate function to determine if a space should be added after the operator it is a name and not a symbol. This check was changed to checking if the last character of the operator name is a letter. There was also a match function that was no longer used and was removed.

[branch table commit f5f563cad2]

Wednesday, December 10, 2014

Table – Initialization (Errors)

Before creating the name and alternate code maps, the table constructor was modified to use a standard vector of standard strings to record table errors. Any errors found are then output to the standard error stream. Finally, the standard abort function is called to terminate the program when errors are found. The translation call was also removed as this isn't necessary because these errors are programming bugs not requiring translation.

[branch table commit 70f9125512]

Table – Name and Alternate Code Maps

Since the table source file is going to be receiving a lot of changes, the immediate goal is eliminate the auto-generated code enumeration header file so that the entire project doesn't need to be recompiled for each change to the table source file. This will be accomplished by changing how code enumerators (indexes) are handled.

A code enumeration will still be required for a limited number of codes that are referenced throughout, for example, the special operators like Comma, Equal, Semicolon, End-of-Line, etc. So there will still be a code enumeration, but it will not be used as an index to codes. Currently however, there are many more code enumerators that are used, specifically the range code enumerators and the code enumerators used for the associated code arrays.

The range code enumerators will be removed first. To accomplish this, the search mechanism will be changed, which currently searches for a name within three different ranges (plain words, parentheses words, and symbols) of the table entry array. The new mechanism will have one-word names in a name to table entry pointer map. The two-word names will be in a separate two-word names map. These maps will be static table members. The table entry structure members will eventually be in the new table class, so these table entry pointers will become table instance pointers.

When the table consists of many code table instances, the base table class constructor will setup these maps when the constructor of the code classes calls it. For now, these maps will be setup in the current single instance table constructor when it iterates the table entry array.

Similarly, there will be maps from code table instance pointers to vectors of table instance pointers for alternate codes (the new name for associated codes). These static member maps will also be setup by the new base table class constructor, but will temporarily by setup in the current table constructor.

Tuesday, December 9, 2014

Table – New Model

One of the major goals for the new table design (missed in the December 3 post) is to eliminate all the standalone code work functions (translate, encode, recreate, etc.), which require their definitions so that pointers to them can be put into the table entries. Many of the codes do not have several of these work functions (for example, only commands have translate functions), and many codes share work functions (for example, all binary operator have the same recreate function). Once all the codes are implemented, there would have been an explosion of these work functions, especially considering that each code will need a unique run function.

Therefore, a class hierarchy will be the basis for the new table model where the base class holds the information members and virtual functions used for these functions. Using virtual functions allow defining common work functions in an upper class. Unfortunately, there will be an explosion of classes since all codes need a unique run virtual function. Fortunately, there is a way to define these classes without requiring every source file to know about them so only the base table class needs to be known globally, which will contain the interface for all of these derived classes.

There will be a table instance for each code containing the information for that code only. Instead of identifying a code by an enumerator (essentially an index), a code will be identified by a base table class pointer. Code information will be accessed by inline access functions, and the work functions accessed using virtual functions. The index of the code will only be accessed when a token is finally encoded into the program. Each code will be assigned a unique index during initialization (more on this later). The token will therefore contain a code table instance pointer instead of a code enumerator (index).

Though there will no longer be a monolithic table instance, there will be some static table class members and functions. For example, the search functions will be static since these will not require a table instance (they will return an instance for a code). The search functions will use a static map member for looking up a name to get a code table instance. Each of these static members will be described as they are created.

Table – Current Model

The model of the current table design is a monolithic single instance where the information for each code is obtained with a code enumerator that is used within the table as an index into an array of plain structures where each element contains the information for a single code. There is a series of access functions taking a code enumerator as an argument. There is a similar series of access functions taking a token as an argument, where the code enumerator stored in the token is used. Finally there are a couple of functions for searching the table for a code by a name.

A singleton pattern was used for the table instance so that a global table instance would not be used. However, the table entry array was global (though only within the table source file) along with a static pointer to the instance (within the table class and defined in the table source file). Is this really a singleton? In any case, this singleton pattern will be temporarily replaced with a single global table instance (until the new table model starts to get implemented).

One problem with the current table is when it comes time to add a new variable to a table entry structure, like the just added expected data type member. If a value for one of the entries is missed, the compiler reports the error, but error is reported against the end of the array giving no clue which entry has the problem. Finding the problem entry is very time consuming. (This actually occurred).

Another problem is with the auto-generation of the code enumeration. The actual code enumerators were defined as comments in the table source file, which a awk script combed through an generated a header file with the code enumeration definition. This design was an attempt to eliminate the problem of matching the code enumeration with the entries.

This issue was that this auto-generated header file is dependent on the table source file, the main header file includes this header file, and all source files are dependent on the main header file. So every time the table source file is modified, the entire project needs to be rebuilt, and this is becoming a nuisance. Therefore, the first goal will be to eliminate this.

Sunday, December 7, 2014

Pre-Table – Expected Data Type

The expected data type member of the Expression Info structure was initialized during table initialization based on the type (operator or function), number of operands and data types of the operands. This variable was moved to the table entry so that the preset expression info instances are not modified. This resolved the issue with the REPEAT$ function. An initialization value needed to be added to all of the table entries, a drawback of the current flat table entry array (though this will be changed in time). The default data type value was used.

This is about the end of the preliminary changes for the table. The next post will begin to describe the new table design. Since the changes will be significant, an attempt will be made to transition to the new design in a series of smaller changes. The movement of the expected data type could be considered the first of these changes.

[branch table commit 81992b2d51]

Pre-Table – Expression Return Data Type

The data type table entry member was originally used for both the next expected token data type for a command and the return data type for operators and functions. The next expected data type was used with the now replaced token centric translator. Since the translator is now command centric, a next expected data type for commands is not needed. Since the return data type is expression related, it was moved into the Expression Info structure.

The preset expression info structure instances were updated to include a return data type. The table entries were updated where the data type initialization value was removed and the expression info structure pointer name updated or the return data type value added to the expression structure constructor call.

For the REPEAT$ function entry, the same preset expression info structure instance that is also used by several assign operators could not be used since the REPEAT$ function required a different expected data type.

The data type access function was replaced with a return data type function that checks if the expression info structure is present before attempting to access the return data type member. If the structure is not present, the None data type is returned.

[branch table commit 0186dcb76b]

Pre-Table – Multiple Member

The multiple table entry member identified codes that either could be the first word of a two-word command (for example INPUT), was a two-word command (INPUT PROMPT), could be the first character of a two-character operator (<), or was a two-character operator (<=). Having an entire member for just this was unnecessary since a flag could be used for this purpose. In addition, it was not necessary to identify two-character operators.

Therefore, the multiple table entry member was replaced with a new Two table flag (a Multiple table flag already existed) and its access function removed. The Multiple enumeration contained separate enumerators for both characters and words, though were assigned to the same value and even contained enumerators for three characters or words, even though there were no codes that used these. This enumeration was removed.

Some changes were made to the Table Flag enumeration, including assigning it an underlying type of an unsigned integer (32 bits). Instead of assigning the enumerators to hard to read hexadecimal constants, they were assigned values using the shift operator where an unsigned one value is shifted by a unique number of bits. The shift operation is calculated during compilation. The Null table flag enumerator was removed; the default table flag value (TableFlag{}) is used instead. The flag table entry member type was also changed to an unsigned integer. The has flag access functions were changed to return a boolean value.

The parser get identifier and get operator routines were updated to use the has flag access function with the Two flag instead of using the multiple access function. In the table header file, there were a few remaining previously missed uses of a Qt type (quint16) on function pointer definitions that were replaced with the standard equivalent type (uint16_t).

[branch table commit 2f5edfc30e]

Saturday, December 6, 2014

Pre-Table – Unary Operator Detection

The expression information structure within the table entry contained a unary code member, which was used to determine if a code was or could be a unary operator. This member contained the code of the unary operator or a null code. Only four codes contained a non-null code, which included the two negate codes (double and integer), the NOT operator, and the main subtract code. The subtract code contained the first negate code, but the others contained their own code.

This was not efficient use of the member as it was only used for four codes especially considering there was another way to determine a unary operator, namely checking if token type is Operator and the number of operands is one.

This required a change in how the negate and subtract operators are associated with each other. Originally these codes were not associated as the unary code member was used to get from the subtract code to the negate code. Without the unary code member, an association was needed. It did not make sense and was problematic to associate the negate code to the subtract code, which already had a number of associated codes.

Therefore the subtract code was associated with the negate code as a secondary associated code (the negate code already has the integer negate code associated to it). Making it the second associated code makes sense as the subtract has two operands. This change required some changes with how the translator handles unary operators.

When getting an operand, the translator get expression routine checked if the current token was a unary operator, and if it was, changed it to a unary operator. Now it just checks if the current token is not a unary operator before calling the get operand routine. When getting an operator, if the operator was a unary operator, an error was thrown immediately. This was changed to check if the unary operator has a secondary associated code (a binary operator), it if it does, the token is changed to the secondary associated code, otherwise an error is thrown as before.

The table initialization that sets the expected data type member was changed to look through the primary associated codes only for unary operator instead of the secondary associated codes. There was an is unary operator function definition in the table class definition that was not used and didn't have a function, so it was removed. The unary code argument was removed from all of the expression info constructor calls.

[branch table commit b472f6a36d]

Pre-Table – Sub-String Assignments

In researching the requirements for the new table design, specifically how to implement associated codes (which will be renamed alternate codes), there was an issue with the associated codes for sub-string assignments (LEFT$, MID$ and RIGHT$).

The sub-string assignment codes are the first associated code for the sub-string codes. The sub-string assignment-keep codes are the first associated code of the sub-string assignment codes. For recreation, the original sub-string code was the second associated code for each of these assignment codes. This was necessary since the sub-string code contained the actual number of arguments as the sub-string assignment codes contain only two arguments (one for the value and one for the reference being assigned).

This circular association back to the original sub-string code was going to be a problem with the new table design and so the sub-string code associated were removed. The first change made to the sub-string assignment table entries was to put back the original sub-string keyword name. Since the debug name is the combination of the primary and secondary names, the secondary names were changed to remove redundancy, for example, the debug name for AssignLeft changes to LEFT$(Assign. The affected expected test outputs were updated accordingly.

The assign string recreate function was modified to use the keyword name to look up the original sub-string code instead of using the second associated code for the sub-string assignment codes. This caused a problem with the MID3 codes as the MID2 code was found, which contained the wrong number of arguments. This was resolved by adding the Multiple table flag to the MID3 codes. When this flag is set, the sub-string code found is incremented to the next code. This change caused a problem during table initialization checking, which was modified to only check Multiple flagged entries if the Reference flag is not also set.

The assign string recreate function was also using the fact that the second associated index was set zero to detect an assignment-keep code (which are used in string list assignment statements). The string built so far needs to be put back onto the recreation stack. With the above changes, the assignment-keep codes no longer have associated codes. This was resolved by adding a new Keep table flag to identify the assignment-keep codes.

[branch table commit f0c7ebdacd]

Pre-Table – Maximum Checks

There were two constants defined in the table source file, one for the maximum number of operands and one for the maximum number of associated codes. During table initialization, while it is iterating over all of the table entries, it looks for the largest operand count and associated code count for any entry. If these largest values are larger than the maximum, then a table initialization error occurs (the application then aborts).

I was not able to determine why these constants and checks were put in. There is nothing that necessarily limits the number of operands or associated codes. These maximum constants were only used for these checks. Therefore, these constants and checks were removed.

[branch table commit 489c59904c]

Friday, December 5, 2014

Pre-Table – Standard String Members

The string members of the table entries were changed to standard strings including the return value of the their access functions. This includes these members and their uses:

name - the main keyword name of the code; blank for an internal code
name2 - the second keyword name of a two-word command; blank for a one-word commands; also used as the debug name for other codes like operators and internal functions
option - the debug name for codes that support the option sub-code (to reduce the number of sub-codes, this single sub-code has a different meanings for different codes, for example Question for the input begin string code, LET for assignment codes, and Keep for the intermediate sub-string assignment codes)

Besides the access functions for these members, there is also a debug name access function that returns the debug name for a code. The purpose of this name is to represent each code with a unique name. For example, there are several add codes depending on the data types of the operands, which includes the +, +%1, +%2, +%, and +$ strings. For all but the primary add code (taking double operands), these strings were put into the name2 member. When the name2 was empty, the name member would be returned for the debug name.

This mechanism was modified where the name2 member now contains only the second part of the debug name. So, for the add operators, all contain + for their name member, and the name2 members contain the "", %1, %2, %, and $ strings. The debug name function now always combines these two strings. This also works for internal codes that have a blank name member.

However, the resulting debug names for internal functions are now a little different. For example, the ASC, MID$, and STR$ functions previously had debug names (first line below where the 2 and 3 represent the number of arguments and the % represents an integer argument) become (where the name2 members are shortened to "", 2, 3, %):

ASC( ASC2( MID2$( MID3$( STR$( STR%$(
ASC( ASC(2 MID$(2 MID$(3 STR$( STR$(%

For the non-sub-string assignments, their name member is set to = and their name2 member contain the name of their code (Assign, Assign%, Assign$, etc.). The = is necessary so that the assignment statements are recreated correctly. The result of combining these two names are the strings =Assign, =Assign%, =Assign$ etc. This was left as is instead of trying to code around it since this only affects test output.

Obviously the change in debug names affected the expected outputs of many of the tests. These were updated accordingly. Since standard strings cannot be initialized to a null pointer, all blank strings in the table were changed to the "" empty string.

[branch table commit 4f13d11a05]

Thursday, December 4, 2014

Pre-Table – Removed Test Names Header

The test names header file contained an array of C-style strings indexed by a code enumerator each with the name of that code. This file was automatically generated by an awk script that extracted the information from the table class source file. This array was only used by the tester print token function (itself only called from the tester parse input function).

There was really no reason that the code name itself couldn't be used for this test output, specifically the debug name name for the code, which included the secondary name if set or the primary name. The print token function was modified to use the debug name instead. The auto-generating test names awk script was removed and the CMake build file was updated accordingly. The expected parser test output files were updated.

A problem was found in the table debug name access function. For a two-word command, this function was only returning the secondary name. This only affected the INPUT PROMPT command, which was being output as just PROMPT. This function was modified to also get the primary name if the multiple table entry member is set to Two Word (with no space between to the two words). This affected two of the expected encoder test results, which were updated.

[branch table commit 34c45230f6]

Wednesday, December 3, 2014

New Table Implementation

The Table class is only remaining class to be modified to utilize C++ and the STL. The table was poorly designed and was implemented in a way that was more appropriate for C than is was for C++. Goals of the new implementation include:

Eliminating the auto-generated header files, one containing code enumeration and one containing a code names array of C-style strings.
Eliminating some unnecessary table entry variables.
Eliminating the use of the untyped C-style macros.
Reduce the number of values needed to define each code table entry. Currently many entries have default values. This was somewhat alleviated by the Expression Information structure where some entries (like commands) don't have an instance of one these. This may possibly be accomplish using inheritance and templates.
Making setup of the table entries easier and less error prone especially when it comes to the associated codes (which will be renamed alternate codes).

I have been investigating how to redesign the table. Along the way I discovered some things about the table that could be simplified. So before beginning the new design, these things will be done that may make the transition to the new design a little easier. This work will begin in a new table topic branch.

Tuesday, December 2, 2014

Miscellaneous Minor Changes

A bunch of miscellaneous minor changes have been accumulating during this topic branch that were finally taken care of, which included:

Adding the C++11 override keyword to functions in derived dictionary information classes that implement virtual function from the base abstract dictionary class.
Changing the dictionary information array functions to return a reference to the vector instead of a pointer to the array contained within the vector. The pointer to the array was originally returned as it was thought would allow more efficient access to the array elements during runtime, but the vector bracket operator should essentially be the same.
Adding the noexcept keyword to all parser functions that don't throw exceptions.
Replacing several Qt forever macros with an empty for (to remove dependency on the Qt header files).
Removing two c_str function calls from recreator functions that were previously missed (these were originally added to work with QString, but were later replaced with standard strings).
Correcting some minor code formatting issues.
Removing two unused table search functions.
Changing two newline '\n' character outputs in tester class functions to std::endl, which outputs the newline character plus flushes the output stream. This was helpful during debugging to know which input line was being processed when some sort of crash occurred (otherwise the output was still in a buffer in memory making it difficult to know which line it was on).
Changing the header argument of the tester translate input function from a constant C-style character string to an rvalue reference to a standard string. An rvalue reference requires a temporary value, which is what is provided by calls to this function that don't use the blank string default.
Assigning the underlying to the sub-code enumeration to an unsigned 16-bit integer (uint16_t), something possible by C++11. The enumerator values were changed to unsigned 16-bit values (where the four leading zero digits were removed).

[branch misc-cpp-stl commit 69ae2f0b15]

This concludes this development topic and the misc-cpp-stl branch was merged to the develop branch and deleted. The Table class is the only remaining class that has not been transitioned from Qt to the STL and will also be reimplemented for better C++ utilization.

[branch develop merge commit 7f303f1bb1]

Sunday, November 30, 2014

Translator – Better Token Handling

A common pattern in the translator functions was a reference to a token pointer argument that a token was pass into and out of the function. The token returned was the next token that the function could not process (a terminating token). Several of the functions allowed an unset token pointer, in which case, they would get a token, otherwise they would use the token passed in. This was a strange pattern and somewhat difficult to work with.

The translator functions were modified to a better pattern where a member token pointer was added to the translator class to hold the current token. The translator functions were modified to use this current token member. The token is moved out of this member variable when successfully processed (consumed). The token pointer reference argument was removed from the translator functions.

The get token function was modified to put the token obtained from the parser into this new member, and the token argument was removed. If the current token member already has a token, then no action is taken.

The process command function when modified was reduced to only a few lines and since it was only called once, its code was moved to the get commands function. The command token and token arguments were removed from the LET, PRINT and INPUT translate functions, which were modified to get the current token from the translator. Several access functions for the current token were added to the translator including getting a constant reference to the token pointer member, reseting the token pointer member, and moving the token pointer out of the member. The latter two force a new token to be obtained upon the next get token function call.

The get operand function was modified to leave the current token pointer member empty if a valid operand was processed (added to the output list and pushed to the done stack). This forced the next call to the get token function to obtain a new token from the parser.

For sub-string assignments, the get operand function set the reference flag of the token if the token was a sub-string function (LEFT$, MID$, or RIGHT$) and a reference was requested. The process internal function function to identify a sub-string assignment and to request a string variable for the first argument of the function. Upon return, this reference flag was cleared. This reference flag toggling was replaced by passing the reference enumerator as an argument.

There was some code in the get operand function that set the code of the token to the Define Function with Parentheses code enumerator. These statements were moved to the process parentheses token where the similar statement reside for setting the Array and Function codes.

Some other minor changes were made including changing the RPN list token access function to return a reference to the token pointer instead of the token pointer itself (to prevent the copying of the token pointer and updating the shared pointer use counts), reorganizing some code in the various routines, renaming some local token variables, and updating comments for changes made.

[branch misc-cpp-stl commit 828f210f18]