Interactive BASIC Compiler Project

Saturday, October 12, 2013

Encoder Class – Removal

The next step is to implement the code to insert the encoded lines into the program (currently the translated RPN lists are stored). Part of this process that is already implemented is to increment the use counts for dictionary items when a line is inserted into the program. The encoder handled this part. Related to this is when a line is removed from the program - the use counts need to be decremented for dictionary items that were on the line (deleting a dictionary item when it is no longer used).

This remove line process appeared to be mirror to the encode (add) line process and therefore the question arose whether the encoder class should also contain a remove routine. Looking at the current encoder class, it currently contained the single encode routine and no member variables (except for a reference to the table instance). It also required a pointer to the current program unit, which is only used to pass to the encode functions for each code with an operand. A reference to the encoder was not passed to the encode function since the encoder class did not contain anything that was needed.

So the conclusion was made that the encoder class was not really necessary and that the encode routine should be in the program model (which contains all the dictionaries needed for encoding) along with the remove routine that will be implemented. Therefore, the encoder class was removed and the encode routine was moved to the program model class.

[commit 85827d49e8]

Friday, October 11, 2013

Program – Unit Vs. Model

The original plan was for program unit class to contain the code and dictionaries for a single unit (main routine, subroutine, or function) and the program model to contain all the program units. The program model class contained the functions for manipulating the program (inserting, replacing and removing lines), which needed to be moved to the program unit class. However, these functions are triggered by signals from the edit box, and also generate signals for updating the program view dock widget. The program model class still needs to be connected to receive and generate these signals.

Eventually when subroutines and functions are implemented, additional edit box windows can be created for each program unit opened. Each edit box opened will be connected to a program unit. This implies that the program model should only contain a single program unit that will be connected to an edit box when opened for editing, and there will be a program model for each program unit. Something else will need to contain a list of all the program units.

Currently, the main window class contains the edit box and program model instances. It is logical that when subroutines and functions are implemented, the main window will contain the list of program units, which are really program unit models. As each program unit is opened for editing, the main window will create an additional edit box window.

Therefore, the program unit class members were merged into the program model class. While doing this, all the member variables were placed at the end of the class definition to follow the same style that Qt uses. This was also done with the program word class. The other classes that don't currently follow this convention will be modified at some future point.

The constructor of the program model class was modified to get the table instance directly instead of being passed the table instance as an argument. The table instance is then used to create the translator instance before proceeding to create the dictionaries. Eventually the constructor will need to be aware if a subroutine or function program unit is being created and get the translator instance along with pointers to the global dictionaries (like the constant and remark dictionaries) from the main routine program unit.

[commit f3b76d057c]

Sunday, October 6, 2013

Encoding – Second Phase (Tagged)

This concludes the implementation of the encoder for the planned components needed to getting all modules up to the run-time working. The next step is fully integrating the encoder into the program model, where lines are inserted into, placed and removed from the program. Before continuing, the source was tagged version v0.5.2.

[commit cf579cd389]

Constant String Dictionary

The constant string dictionary was a little tricky to get working because of a difficult to resolve memory leak. The constant string dictionary will also be an information dictionary, but was not defined as an information template dictionary directly. The information required for this dictionary contains a pointer to a QString instance. At run-time, an array of QString pointers will be used for the string constants (though the pointers will be wrapped within a class syntactically).

As for number constants, a constant string information class was added with a lone QString pointer to the string constant value. Even though the information will contain a new instance of a QString, due to the implicit sharing feature of Qt, the underlying string will reference the original string in the token. The original string will also similarly be referenced by the key list and key map members of the dictionary. After the token is deleted, the string is still referenced three times (information element, key list and key map). The string is not deallocated until all three instances are removed.

When the constant string information element is created for a new element, a new QString instance is created from the string in the token. To prevent a memory leak, these QString instances need to be deleted. A destructor was needed for the constant string dictionary, so a new constant string dictionary class was created. This new class overloads the destructor, which deletes all the QString instances in the information elements. It will be necessary to nullify this QString pointer when the string constant is removed from the dictionary. The destructor then just deletes all the instance pointers and if one is null then no action is taken.

Initially, an information class instance was created locally in the encode routines and a reference to it was passed to the information dictionary add routine. For constant strings, a QString instance was created in this local instance. This was not an issue when the copy of the instance was put into the information vector for a new or reused instance as the instance pointer was copied into the vector. However, a memory leak occurred when the constant was already in the dictionary and the instance was left hanging (the QString instance was not deleted). This was previously resolved in the information dictionary template class by having its add routine create the instance only when adding a new item.

To complete the implementation, a constant string dictionary was added to the program unit class with an access function. The constant string encode and operand text routines were added that simply call the dictionary add and string routines. Finally, pointers to these functions were added to the associated table entry. The expected results file for encoder test #1 was updated for the numeric constants that are now in the output. A new statement was added with a string containing an embedded double quote character.

[commit 8a85668458]

Constant Number Dictionary

The information template dictionary will be used for the constant number dictionary. The additional information will hold the double and integer value of the constant, and is obtained from the same members in the token. Even through for double constants that can't be converted to an integer, the integer value (which is not set) is still copied to the information element, but it won't actually be used at run-time.

A constant number information class was defined containing the double and integer values. This class was given a blank default constructor (required for the QVector class), and a constructor taking a token pointer to initialize the values (required for the information template dictionary).

To complete the implementation, a constant number dictionary was added to the program unit with allocation and an access function. The constant number encode and operand text routines were added that simply call the dictionary add and string routines. Finally, pointers to these functions were added to the associated table entries. The expected results file for encoder test #1 was updated for the numeric constants that are now in the output.

[commit babfb758ec]

Information Template Dictionary

The numeric and string constant dictionaries have slightly different requirements and so will be handled separately. Both of these dictionaries will have additional information beyond what is in the base dictionary class, namely the constant values themselves. For both of these dictionaries, this additional information will be stored in a QVector. At run-time, the constant data for the QVector will be obtained as an array. This will allow for fast access of the constant values by the indexes stored in the program code.

The Qt QList and QVector classes are almost identical and both can access elements by index. The big difference is that the elements in the QVector are guaranteed to be stored in consecutive memory addresses. This allows the data to be access like a C/C++ array. If this consecutive access is not required, then the QList class should be used, though the only advantage appears to be in the time it takes to prepend items to the beginning of the list (QList is faster).

An information dictionary template class was created based on the dictionary class, which adds a vector of generic information class elements. The add routine was overloaded, which first calls the base dictionary add routine. The base dictionary add routine was modified from just returning whether a new item was added or not, to returning the status of whether a new entry was appended to the end of the list, an freed entry was reused, or if the entry already exists. This status is then used to either append a new information element, replace an information element, or do nothing if the item already exists.

The QVector class was chosen over the QList class for the information so that an array can be used at run-time for constants. There will be other dictionaries with information that won't have this run-time array requirement, but as noted above, the only disadvantage to using QVector instead of QList is prepending to the beginning of the list, and that operation will not be needed.

One requirement for the generic information class is that it must have a constructor that takes a pointer to a token. For constants, the information is obtained from the token. Eventually for other dictionaries, other information beyond what is contained in a token may be needed, at which time, additional argument(s) will be added to the constructors and to the information dictionary add routine, which now just takes a pointer to a token.

[commit 6ffb6901a1]

Saturday, October 5, 2013

Variable Dictionaries

When the program is run, the values for the variables need to be initialized (numbers to zero, and strings to empty). If all three data types of variables are stored in the same dictionary, the initialization routine would need to check the data type for each variable before initializing its value. This is not necessary if a variable dictionary contains a single data type, in which case it just has to loop through and initialize a single value type. The variable dictionaries will not hold the actual variable values. The run-time module will allocate and own the arrays with the values.

For now, no other information is needed in the variable dictionaries. The variable name is stored as the dictionary key, and the data type is implied by which dictionary contains the variable. Therefore at this moment, the base dictionary class is sufficient for the variable dictionaries. However, eventually addition information will need to be stored with the variable.

To complete the implementation of the variable dictionaries, three dictionaries, one for each data type, were added to the program unit class with the associated allocation and access functions. Encode and operand text routines were implemented for each of the three data types. These routines call the add (for encode) or string (for operand text) routines of the variable dictionary for the associated data type. These functions were placed in a the new basic.cpp source file, which will also contain the run-time routines for the variable and variable reference codes along with other miscellaneous functions (the two rem functions were also moved into this file).

Finally, pointers to these functions were placed in the six table entries for these codes (variable and variable reference times the three data types). Several new statements were added to encoder test #1 so that there are more than one variable of each data type. The expected results file for this test was updated for the variable names that are now in the output along with the output of the new statements.

[commit 2843b44761]

Operand Text / Remark Operands

Next up was to retrieve the text for the operand (comment string) of the remark command or operator from the dictionary using the index that is stored in the program code. The recreator also needs to do this, therefore, a mechanism was implemented to allow for the getting the text for the debug output, which will later be used for the recreator.

A new operand text function pointer was added to the table entry. This function is only needed for codes that have an operand, and an access function was added for this pointer. These functions have arguments for the pointer to the current program unit (to access the dictionaries) and the operand. The rem operand text function was implemented to get the text of the remark for an operand (an index) by passing it to the new dictionary string access function (gets the text from the key list member).

The program line text routine was renamed to debug text since it is only used for generating text for test output. Also, since this routine needs access the program unit for the dictionaries to look index values, it was moved to the program unit class. Since the program line class was only created to add the text routine to a QVector, this class was removed and QVector<ProgramWord> is now used.

The operand debug text (renamed) routine outputs the index of the operand along with the value of the operand, and needed to output the text of the operand. To get the pointer to the operand text function, it would need to be passed the code for the operand and would need the table instance. Instead, this functionality is handled by the debug text routine and the resulting text is passed to this routine to add the word index and operand index values.

The debug text routine also needs the table instance, so a table reference was added to the program unit class. This routine will also be called when encoded debug text is integrated into the program view of the GUI. The test routine called this routine with a program line (now a program word vector), however, when used for the program view, a program word vector would need to be extracted from the program unit code vector. This would require a vector to be copied from part of the total program code vector. To prevent this copy, the debug text was modified to take a program word pointer and a count of the number of words in the line.

Some minor cleanup of the table entries were performed on a separate commit (placed function pointers on their own lines, corrected the camel casing of the rem encode routine, and fixed a comment). The expected results file of encoder test #1 was updated for the remark comments that are now in the output.

[commit f5fe1efa4f] [commit bf9264bafb]

Thursday, October 3, 2013

Remark Dictionary

The remark dictionary is the first dictionary implemented. This is the simplest dictionary as it only needs to store the string of a comment, so the base dictionary is sufficient for this dictionary. The ProgramUnit class was added to contain the remark dictionary, which is defined as a pointer and allocated in the ProgramUnit constructor. A pointer was used since all the other program units will use the same remark dictionary.

Previously, the encoder encode routine just set the operand to zero and was modified to call the encode function for the code with an operand. For now, if the code does not have an encode function, the operand is still set to zero. The encode functions contain arguments for the pointer to the current program unit and the pointer to the token being encoded. Additional arguments will probably be needed, but these are sufficient for now.

The rem encode function was implemented, which simply passes the token to the add function of the remark dictionary and returns the index of the entry to be set as the operand word. The pointer to an encode function was added to the table entry structure and a table access function was added to return this pointer. A pointer to the rem encode function was added to the REM command and REM operator code table entries.

For testing, a temporary program unit was added to the main tester run routine and a pointer to it is passed to the encode input routine, which passed to the encode routine. Additional remarks were added to encoder test #1 including one set of duplicate comments to test that a single entry is stored in the dictionary. Currently only the index value is output as the string (of the remark) is not yet retrieved, which will be the subject of the next change. For now, the current output of this test (though incomplete) was copied for the expected results. Though this makes the test pass, the purpose for changing this file was detect changes as more of the encode routines are implemented.

[commit d2222df1d8]

Base Dictionary Class

A dictionary will hold a list of strings (keys) for the program for a particular type of item (remark, constants, variables, arrays, etc.). The index of the name will be stored in the program. When the program is running, the index will be used to access an item (constant, variable value, etc.). When the program is converted back to text (recreated) for display in the GUI, the index will be used to recover the originally entered string.

There are several member variables and functions that every dictionary will have. The variables include a stack of index values of entries that have been removed sp they can be reused, a list of entry keys, a map of entry keys to index values, and a entry use count. The functions include adding and removing entries to and from the dictionary.

The key list variable (QStringList) holds the list of keys. The index of the key within this list is the index stored in the program. The index is converted back to a string by using the key list. The purpose of the key map variable is for an efficient way of looking up a string in the dictionary for its index. The string could be looked up in the key list but would require a slow sequential search since it is unsorted. The key map variable (QMap<QString,&nbspint>) where the map stores the string as the key and the index as the value and allows for fast look up of keys. Since the QString class supports the implicit sharing feature of Qt, there is only one actual copy of each string.

The use count variable (QList<quint16>) hold the number of times the string is used in the program. On first use, the use count entry for the key is set to one. For each additional use, the use count for the key is incremented, and for each time it is removed from the program, the use count is decremented. When the use count for a key drops to zero, the key is removed from the dictionary. The use count list mirrors the key list.

The add function starts by looking up the string of a token in the key map to get its index. If not in the key map, the string is added to the dictionary. If the free stack is empty, then the index for the string is set to the size of the key list and the string is appended to it. A value of one is appended the use count list. Otherwise, an index of popped from the free stack and the token string is set at that location in the key list and a one is set in the use count list. The token string is added to the key map. If already in the dictionary, the use count entry for the string is incremented. The status of whether a new entry was added to the dictionary is returned, which will be used by derived dictionaries to know when to add additional information to the dictionary.

The remove function decrements the use count for index of an entry. If the use count is zero, the key is obtained from the key list and removed from the key map. The string in the key list is cleared (freeing the string). The now unused index is pushed to the free stack for reuse when a new string is added.

Tuesday, October 1, 2013

Program Components And Organization

The dictionaries are needed for the next step of the encoder to be implemented, which means that the encoder needs access to the dictionaries. Temporary dictionaries could be implemented in the tester class and maid available to its encoder instance. Eventually though, the program model will be the owner of the dictionaries. Some design was needed for what components the program model will contain for a program. The current plan follows...

The program will be organized into program units. The program model will contain a list of program units. Each unit will represent the main routine (only one), a subroutine (zero or more) or a function (zero or more) and will contain these components (only the main routine program unit with the components in bold are being implemented at this time):

Name (only subroutine and function program units)
ID - index within the program unit list (main routine will be the first)
Code - array of program words
Line Information - information about each line in the program
Block Information - information about multi-line statements (IF, FOR, etc.)
Remark Dictionary
Number Constant Dictionary
String Constant Dictionary
Subroutine Dictionary
Function Dictionary
Common Variable Dictionary
Common Array Dictionary
Common Define Function Dictionary
Variable Dictionary
Array Dictionary
Define Function Dictionary
Argument Dictionary (only subroutine and function program units)
Return Value (only function program units)

There will be only one instance of the italicized components, which will belong to the main routine program unit, but each of the subroutine and function program units will contain a pointer to these dictionaries. The common dictionaries are for COMMON statements, which will indicate variables, arrays, or defined functions that are accessible outside a given program unit (similar to C/C++ global variables). Further details of each component will be described as each is implemented.

Thursday, September 26, 2013

REM Command Correction

The operands of program instructions will hold indexes to dictionary entries that will contain the text of the instruction (for example, variable names, the original strings of constants, the strings of remarks, etc.). The remark dictionary will be implemented first since it will just hold the strings of the comments. The other dictionaries will required looking up strings to find if a variable or constant already exists.

But first, a problem was discovered with how REM statements are parsed. The REM command should be recognized regardless of what characters follow the command. The parser for the most part did this already except that a space was required after the command. However, the parser should not require the space, consider these examples:

REMARK this should be a valid commented
REM any number of spaces should be allowed

The first statement was rejected because it was assumed to be an assignment of the REMARK variable and expected an equals instead of this. The second statement was valid but all the spaces were removed from the comment string.

The parser get identifier was modified to first look for a statement starting with the three characters R-E-M and store all the characters after this before scanning for a word (valid identifier characters up to a invalid identifier character). Some minor code simplification was also done in replacing the sequence of setting the token code, type and data type with a call to the existing set token table routine that performed these steps. Two additional statements were added to translator test #15 (remark tests) similar to the two examples above.

[commit c52d65a479]

Tuesday, September 24, 2013

Encoding Program Lines (Begin)

The initial version of the encode routine was implemented that takes the token list generated by the translator and converts it to a vector of program words, the contents of which will be inserted into the program. The size for the program word vector is obtained from the token list, which was calculated at the end of a successful translation. For now, the operand words are set to zero for instructions with an operand. Dictionary look ups will be required to generate the indexes for the operands.

The ProgramWord class was implemented containing a single unsigned short (16-bit) word variable with instruction and operand access functions. This class does not determine whether the word is an instruction or an operand, which is up to the user of this class. There are access functions for getting the code from an instruction word, checking if an instruction word has a sub-code set, setting an instruction to a code and sub-code, getting the operand word, and setting the operand word.

The ProgramLine class was implemented to contain a vector of program words and is based on the QVector class of the ProgramWord class. A class was implemented instead of just using a vector so that a text routine could be added to convert the program line into text for testing. The instruction text and operand text routines were implemented in the ProgramWord class. I single text could not be implemented since by itself, this class does not know whether the word is used as an instruction or an operand.

The test encode input routine was modified to call the encode routine upon a successful translation and output the encoded program line (a vector of program words) using the program line text routine. Since the operand words are set to zero, encoder test #1 currently fails since the text for operands is do no match. This will be handled as the dictionaries are implemented.

[commit 60add1604a]

Sunday, September 22, 2013

Condensing Sub-Code Bits

A program word will consist of instruction words and operand words. The instruction word will consist of the instruction code and sub-codes. The sub-codes will generally only be used to recreate the original program and will not be used during execution (though there will be a few exceptions). The code will reside in the lower 10 bits of the instruction word, which will allow for 1,024 different codes (which should be sufficient). This leaves the upper 6 bits for the sub-codes.

There are already five sub-codes that need to be stored in the instruction word, including the Parentheses, Colon, LET, Keep and Question sub-codes. This only leaves one bit for another sub-code and there are a lot more commands to implement. The sub-code bit usage needed to be reduced. As it so happens, the LET, Keep and Question sub-codes will never be on the same code, so only one bit is really required for these sub-codes.

To accomplish this, these three sub-codes were replaced with the new Option sub-code. The LET and INPUT translate routines now set this sub-code instead of the individual sub-codes that were removed. This new sub-code also can be reused for other commands requiring an individual sub-code.

The bits of the sub-codes were also rearranged with the sub-codes that will be used in the instruction words (Parentheses, Colon, and Option) in the same bit positions. This will prevent having two sets of sub-code definitions. A new Program Mask sub-code definition was added that will be used the mask the program sub-codes from the other token sub-codes when the instruction is created.

So that the proper sub-code can be output during testing, an option name variable was added to the table entries. Commands used the option sub-code will have the text name of the sub-code in the option name. In the token text routine, if a token has the Option sub-code, the option name is output (or the string "BUG" if the code does not have an option name).

[commit e721a3e3ae]

Encoding – First Phase - Revisited (Tagged)

The tokenCodes0.5 branch of development was successful so it was merged back into branch0.5 (a fast forward merge meaning the branch0.5 pointer was simply moved).

Since the commands.h header file will contain definitions for other things (like routines for constants), it was renamed to the more generic basic.h, which was also added as a dependency to the application binary in the CMake build file so that it appears in the Project source file list within QtCreator.

In the table entry initialization, it turns out that using "" (blank string) and NULL (pointer) for an initializer to a constant QString variable has the same effect, therefore, all "" were replaced with NULL in the table entries.

The first phase required for encoding is complete (again) except now it is now performed by the translator. This is a good point to tag version v0.5.1. Work will now begin again in encoding the tokens into the internal program code format to be stored in the program model.

[commit b461229bcd] [commit f64d4401d2]

Translator – Determine Code Size

The output assign codes routine contained two loops, the first to assign codes to tokens without codes, and the second to assign program word index values and determine the encoded size of the tokens. Since the token codes are now assigned within the translator routines, the first loop was left to just looking for unimplemented token types. This now can be accomplished in the second loop by simply checking if the token does not have a valid code.

When this routine was being modified, I realized that this routine along with the output append and output insert routines should be members of the RPN List class and not the Translator class since they only deal with the RPN list and do not use any of the translator member variables (with one exception). Therefore, these functions were moved to the RPN List class. The append and insert routines remain in the translator class as output list access functions for the command translate routines, however, the translator routines were changed to access the output list functions directly.

The output assign codes routine was renamed to the set code size routine since that is basically what it does besides setting program word indexes (related) and checking for token types not yet supported (needed only during development). The reset code size and increment code size access functions were removed since the code size member variable can be access directly. This routine now returns a boolean value (false meaning an unimplemented token was found, which is returned). An argument for the reference to the table instance to access the flags for codes was also required.

[commit d0e39b01f2]