Interactive BASIC Compiler Project

Saturday, November 1, 2014

Parser – Number Error Corrections

Qt functions are currently being used to convert strings of numbers to a double or an integer in the get number routine. This routine will be changed to use STL functions. While investigating this, a few problems were discovered with how some of the number errors were being reported.

The "expected sign or digits for exponent in floating point constant" error was being reported even when the exponent sign was present. A new "expected digits for exponent in floating point constant" error was added for this situation. When an incorrectly formed number contained a single decimal point followed by the start of an exponent ('E'), the "expected digits in mantissa of floating point constant" error only pointed to the decimal point. The error was changed to point to both the decimal point and the 'E' character.
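The distinction between the two exponent errors can be sketched as follows. This is only an illustration, not the actual get number routine: the ExponentStatus enumeration and the checkExponent function are hypothetical names, and the caller is assumed to pass the index of the 'E' character.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical status names; the comments give the messages quoted above.
enum class ExponentStatus {
    Good,
    ExpectedSignOrDigits,  // "expected sign or digits for exponent ..."
    ExpectedDigits         // "expected digits for exponent ..."
};

// Checks the exponent that follows the 'E' found at index ePos.
ExponentStatus checkExponent(const std::string &input, std::size_t ePos)
{
    std::size_t pos = ePos + 1;
    bool haveSign = pos < input.size()
        && (input[pos] == '+' || input[pos] == '-');
    if (haveSign) {
        ++pos;
    }
    if (pos < input.size()
            && std::isdigit(static_cast<unsigned char>(input[pos]))) {
        return ExponentStatus::Good;
    }
    // once a sign has been seen, only digits can still be missing
    return haveSign ? ExponentStatus::ExpectedDigits
                    : ExponentStatus::ExpectedSignOrDigits;
}
```

With this sketch, an input such as 1.5E+ gives the new digits-only error, while 1.5E alone still gives the sign-or-digits error.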

The translator was not reporting the "expected command" error correctly when there was a number error: the error pointed to the number error, which was either not at the beginning of the command or did not have a length of one. This occurred because the number error was not correctly reported as an unexpected token error when a reference was requested (at the beginning of a statement).

This was corrected by adding a reference argument to the get token translator routine with a default of None. Only when this argument is None are number tokens allowed. When an unknown token error is returned from the parser, the reference argument is used to generate the appropriate expected error status. For the first token obtained from the get commands translator routine, this argument was set to All, which prevents number tokens (an unexpected token error is returned for all numbers, including number errors).

The get operands translator routine was modified to pass its reference argument directly to the get token call. Since get token now generates the appropriate error for references, this routine no longer needs to intercept the error to return the appropriate error or set the error length to one for references. The status is simply returned when the status is not Good. The LET translate routine handles reporting errors when neither a command nor a reference starts a line. The section handling errors was structured poorly and was rewritten.

Certain types of errors were reported differently as a result of these changes. Previously, the error for an incorrect statement like 34=A was reported as an "expected item for assignment" error pointing to the 34. Now the "expected command" error is reported pointing to only the first character of the number. Both errors are technically correct, and it would be difficult to report the previous error. The error was changed to "expected command or item for assignment" since both are applicable at the beginning of a statement.

The expected results for parser test #3 (numbers), translator tests #1 (assignments), #3 (more assignments), and #14 (parser errors) were updated for these changes. Some additional tests were added to translator test #14 for the new "expected digits for exponent" error. Many of the translator test results were also updated for the expected command message change.

[branch parser commit 20e46cc617]

Thursday, October 30, 2014

Parser – Unique Pointers

The parser routines create a token held in a shared pointer upon return. The main function operator routine returns this shared pointer. The token is not actually being shared, just moved until it reaches the caller. There is no reason to use a shared pointer in the parser as a standard unique pointer is sufficient. The parser routines were changed to return a unique pointer. Another alias was added for a unique token pointer:

using TokenUniquePtr = std::unique_ptr<Token>;

Unfortunately, there is no equivalent function for std::unique_ptr like the std::make_shared() function for std::shared_ptr (though one has been added for C++14) so unique pointers must be initialized using the new operator with the unique token pointer alias constructor:

return TokenUniquePtr{new Token {pos, len, type, dataType, m_input}};

The callers of the parser operator function did not need to be modified since there is a shared pointer constructor that takes a unique pointer as an argument (the shared pointer takes ownership from the unique pointer). The table new token function was also modified to return a unique token pointer.
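The reason callers needed no changes can be seen in a small sketch. The Token structure here is a simplified stand-in (its members are assumptions), and makeToken and nextToken are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for the project's Token class (members are assumptions).
struct Token {
    Token(std::size_t col, std::size_t len, std::string str)
        : column{col}, length{len}, text{std::move(str)} {}
    std::size_t column;
    std::size_t length;
    std::string text;
};

using TokenPtr = std::shared_ptr<Token>;
using TokenUniquePtr = std::unique_ptr<Token>;

// Creates the token on the return statement. C++11 has no std::make_unique,
// so the new operator is used with the alias constructor.
TokenUniquePtr makeToken(std::size_t column, std::size_t length,
    const std::string &input)
{
    return TokenUniquePtr{new Token {column, length,
        input.substr(column, length)}};
}

// A caller still using shared pointers: the returned temporary unique
// pointer converts to a shared pointer, which takes over ownership.
TokenPtr nextToken(const std::string &input)
{
    TokenPtr token = makeToken(0, 3, input);
    return token;
}
```

The conversion only works from a temporary (or explicitly moved) unique pointer, which is exactly what a function return value is.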

One other small unrelated change was made to the get identifier routine with the creation of the REM command token. This code was simplified as it was not necessary to copy the comment string from the input into a temporary string before creating the token. The string can be passed directly when the token is created. The new position can simply be set to the length of the input string. This was already done for the remark operator in the get operator routine.

[branch parser commit 336ad07bf8]

Wednesday, October 29, 2014

Parser – Operator Tokens

The get operator routine was modified to create a new token upon returning when a valid token is found. If the first character is not the start of an operator, a default token pointer is returned. The existing table new token function is used to create the new token upon return. The flow of the function was cleaned up by checking for an invalid operator first, a remark operator next and finally for a two-character operator.

In the main function operator routine, the call to get operator was changed like the other get function calls, and the member token initialization was finally removed along with the token member.

[branch parser commit 632ce89f80]

Parser – Constant String Tokens

The get string routine was modified to create a new token upon returning when a valid token is found. If the first character is not the start of a string constant (a double quote), a default token pointer is returned. A token constructor was added to support creating a string constant token, which in addition to the column and length takes the string constant without the surrounding double quotes.

This routine was changed from setting characters into the token string (by a length index counter that was not otherwise used) to simply appending the characters to a local string (since the token is not created until the return statement). The former will not work with standard strings. This local string is moved to the constructor, though this has no effect with a QString (copies if class doesn't support move), but will with standard strings.
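A minimal sketch of this approach with standard strings is shown below. It is not the actual get string routine: the StringToken structure and the getString signature are assumptions, and the handling of a missing closing quote (simply stopping at the end of the input) is a guess made to keep the example short.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-in for a string constant token (members are assumptions).
struct StringToken {
    StringToken(std::size_t col, std::size_t len, std::string str)
        : column{col}, length{len}, text{std::move(str)} {}
    std::size_t column;
    std::size_t length;
    std::string text;  // the constant without the surrounding double quotes
};

// Returns a default (null) pointer if no string constant starts at pos.
std::shared_ptr<StringToken> getString(const std::string &input,
    std::size_t &pos)
{
    if (pos >= input.size() || input[pos] != '"') {
        return {};
    }
    std::size_t start = pos++;
    std::string text;  // local string, appended to one character at a time
    while (pos < input.size() && input[pos] != '"') {
        text += input[pos++];
    }
    if (pos < input.size()) {
        ++pos;  // skip the closing double quote
    }
    // the token is created only here, and the local string is moved into it
    return std::make_shared<StringToken>(start, pos - start, std::move(text));
}
```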

The set string character token access function was removed since this routine was the only caller. In the main function operator routine, the call to get string was changed like the other get function calls with the member token initialization moved below this.

[branch parser commit 87e4ed4fb8]

Tuesday, October 28, 2014

Parser – Constant Number Tokens

The get number routine was next to be modified to create a new token upon returning when a valid token is found. The character parsing part of the routine was left intact except the two instances where no number is found were changed to return a default token pointer. The token creation lines at the end were replaced with return statements creating a token in a shared pointer. Two more constructors were added to the token class to support these return statements.

The first, in addition to the column and length takes the string of the number and the integer value of the number, and automatically sets the type to constant and the data type to integer. The integer value member is initialized to the integer value, however, the double value member is set to the integer value in the body to do the conversion from integer to double (which can't be done with an initializer because the types are different).

The other constructor also takes the string of the number, the double value and a flag for whether a decimal point was present, and sets the type to constant. The body checks if the double value is within the range of an integer, and if it is, sets the data type to integer and sets the double sub-code only if there was a decimal point. The translator uses this sub-code to determine if a constant can be used as a double even though the data type is integer (a hidden conversion from integer to double code is not needed). For values outside the integer range, the data type is set to double (indicating conversion to an integer is not possible).

The body of the second constructor was taken from the get number routine because this code primarily sets token members (via access functions), and it seemed appropriate to do this within the token class. Since the body was not trivial, the constructor was put into the token source and not the header file. Another reason was that the C-style integer minimum and maximum constants were replaced with C++ standard numerical constants from the limits STL header file (no reason to burden source files including the token header file with another header file).
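The range check in the second constructor might look roughly like this. The NumberToken structure, its member names and the DataType enumeration are simplified assumptions, not the project's token class; only the use of std::numeric_limits from the limits header reflects the change described above.

```cpp
#include <limits>
#include <string>
#include <utility>

enum class DataType { Integer, Double };

// Simplified stand-in for the token class (member names are assumptions).
struct NumberToken {
    NumberToken(std::string str, double value, bool decimalPoint)
        : text{std::move(str)}, dblValue{value}
    {
        if (value >= std::numeric_limits<int>::min()
                && value <= std::numeric_limits<int>::max()) {
            // fits in an integer; remember if it was written as a double
            dataType = DataType::Integer;
            intValue = static_cast<int>(value);
            doubleSubCode = decimalPoint;
        } else {
            // conversion to an integer is not possible
            dataType = DataType::Double;
        }
    }

    std::string text;
    double dblValue;
    DataType dataType{DataType::Double};
    int intValue{0};
    bool doubleSubCode{false};
};
```

For example, 2147483647 stays an integer constant under this sketch, while 2147483648 becomes a double constant.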

In the main function operator routine, the call to get number was changed like the call to get identifier with the member token initialization moved below this. It appears redundant to declare an if-scoped token pointer at each if statement, but if there was a single token pointer for the entire routine, it would first be initialized to a default value, then reinitialized at each if statement. The if-scoped variable is initialized directly with the return value of the get routine.

[branch parser commit d328c0a720]

Monday, October 27, 2014

Parser – Create Token As Needed

The parser will be modified to create a token only when a valid token is found in the input string and is returned directly. This means that the token will be created on a return statement, which is automatically moved to the caller since the created token (in a shared pointer) is temporary and going out of scope.

Once all the get routines in the parser are changed, it will no longer be necessary to have a member variable to hold the token and there will be no worry of a token being left allocated for an error. Right now the returned token will be a shared pointer, though this is not necessary. The return pointer will be changed to a unique pointer, which can be assigned to a shared pointer.

The get identifier routine was the first to be modified. Most of the token creation lines were replaced with return statements creating a token in a shared pointer:

return std::make_shared<Token>(pos, len, type, dataType, m_input);

To support this, two new constructors were added to the token class. One, in addition to the column, length, type and data type values, takes the input string (from which a string is created using the column and length values) as shown above. The other constructor takes a code and an optional string, and is used by a new token function added to the table class that uses the table to set the type and data type values of the new token.

Once all the locations in the get identifier routine were replaced, it could be seen that the code was repetitive, so the whole function was reorganized and reduced. If no valid identifier token is found, a default token pointer is returned, which the caller can check as a boolean.

The main function operator routine was modified to support this partial transition. When the end-of-line is reached, a new token is created and returned (using the new token table function). The get identifier routine is called in an if statement by itself, receiving the return value in an if-scoped variable, which is returned if set:

if (TokenPtr token = getIdentifier()) {
    return token;
}

For now, the current creation of a new token was moved to after the statements above. It will continue being moved as each get routine is changed until all have been changed, at which time it will be removed along with the token member pointer.

[branch parser commit e34fbccacc]

Sunday, October 26, 2014

Parser – STL Preparations

The changes required to make the parser routines use the STL are going to be extensive, but an attempt will be made to break the changes into smaller incremental changes. Since the parser routines make use of various table functions, it will be necessary to modify the table entries and its functions to use STL. Some preparatory changes were made.

The table entries are divided into several groups for searching, which include plain words, parentheses words, data type words and symbols. The parser utilizes these groups when searching for whether a string has a code. The data type words section was empty and upon consideration it was concluded that this group is not needed. This group may have been originally conceived for internal functions that don't have arguments (for example, a DATE$ function). However, these internal functions can go into the plain word group (the RND no argument function is already in this group). This group type along with its bracketing entries was removed.

[branch parser commit 0043105154]

The issue of the token being left allocated when the parser throws an exception could be resolved by not creating a token until a valid token is found. Uses of the token member before an exception is thrown were examined and the only use was in the creation of the error exception. The only token members used were its column and length. The column was always the same as the current input position (except one instance) and the length was always 1.

All of the throw statements were modified to not use the token, instead using the current input position with a length of 1. One case used a length of 2 (2 was previously used). For the "floating point constant is out of range" error, the statement to set the new input position was moved to after the point where the error is thrown.

For the "expected sign or digits for exponent in floating point constant" error, two columns were reported, the column at the beginning of the number (for operator state) and the alternate column at the beginning of the error (for operand state). A number token is no longer accepted when invalid, so only the alternate column was being used. This mechanism was not needed, and for this error, the position of the error only is reported. This mechanism was also removed from the tester print error function.

[branch parser commit e92da57ef1]

Saturday, October 25, 2014

Parser – Exceptions

When an error was detected, the parser set its internal error status to an enumerator for the error (either unknown token or a number error) and returned an error token with its column and length set to the error. So to throw an exception, the status, column and length values need to be included in the exception thrown. A simple Error structure was added to hold this information. The parser routines were modified to throw an error exception when detecting an error. The necessary values are included for this structure:

throw Error {Status::Error, m_token->column(), m_token->length()};

The set error functions were removed along with the error status member and its access function. Since the Error token type enumerator doesn't indicate an error token anymore, this enumerator was removed. The table entries that used this enumerator were changed to the default token enumerator (which required the first enumerator to be set to 1). A leftover check in the get identifier routine, which set the token string to "invalid two word command" for an error token type, was removed since this can no longer occur.

The translator get token routine was modified to catch parser exceptions (using C++ try and catch blocks). For no exception, the Good status enumerator is returned. For an exception, an error token is created from the column and length in the error structure. (The token constructor was modified with an additional length argument and to use the C++ initializer syntax.) The rest of the catch section remains the same except the status in the error structure is returned. The creation of an error token may be removed later if the translator is modified to the exception model for handling errors.
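The throw and catch sides can be sketched together as below. The Error structure follows the description above (status, column, length), but the Status enumerators, the parseDigit stand-in for a parser routine, and the getToken signature are assumptions made so the example is self-contained.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

enum class Status { Good, UnknownToken };

// Simple structure carrying the error details with the exception.
struct Error {
    Status status;
    std::size_t column;
    std::size_t length;
};

// Hypothetical stand-in for a parser routine: accepts one digit or throws.
int parseDigit(const std::string &input, std::size_t pos)
{
    if (pos >= input.size()
            || !std::isdigit(static_cast<unsigned char>(input[pos]))) {
        throw Error {Status::UnknownToken, pos, 1};
    }
    return input[pos] - '0';
}

// Hypothetical stand-in for the translator get token routine: returns Good
// when no exception is thrown, otherwise the status from the error structure.
Status getToken(const std::string &input, std::size_t pos, int &value,
    Error &error)
{
    try {
        value = parseDigit(input, pos);
        return Status::Good;
    }
    catch (const Error &caught) {
        error = caught;  // column and length remain available to the caller
        return caught.status;
    }
}
```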

The tester parse input routine was also modified to catch parser exceptions. The while loop was replaced with a forever loop and the more flag removed. For no exception, the print token routine is called. The routine continues with the next token unless an End-of-Line token was returned. For an exception, the print error routine is called directly, and the routine returns immediately. The use of exceptions in the parse input routine allowed some simplifications in these print routines.

The print token routine previously had an error status argument. If the token type was an error, then the error status was passed to the print error routine with the token column and length. This check and call were removed, and so was the error status argument. The tab argument was always true, so it was also removed. The column, length and status arguments of the print error function were replaced with an error structure reference argument. This required creating a temporary error structure in the translate input and encode input routines (perhaps later the translator will be modified to throw an error and the program module modified to return an error structure).

It should be noted that the token created at the beginning of the parser function operator routine is left allocated when an exception is thrown. Previously the token was moved to the caller when the token contained an error, the same as with a good token. This is not a problem because the parser will go out of scope once the translator handles the error and returns. The parser routines may be able to be changed to only create a token when a good token is found. This will be considered as the parser is changed to use the STL.

[branch parser commit 14265956f7]

Parser Errors – Removed Data Type

When the parser returned an error, it set the data type of the token to Double to indicate a number error or None for an unknown token. This was necessary since a number error could be returned when the parser was in operator state. This no longer occurs after the last change as the unknown token error is returned if the parser finds a character of a number when numbers are not allowed, so the error status alone can be used to determine the error type. Setting of the token data type for errors was removed. This reduces the amount of data to send back with an exception.

When the parser returned an error, the get token function of the translator returned the special Parser status enumerator. The translator routines used this enumerator along with the token data type to determine if the error was an unknown token or a number error. The parser error type can now be determined directly with the error status from the parser, so the get token function was changed to return this status instead of the Parser enumerator.

The checks in the rest of the translator routines for the Parser enumerator and None data type were changed to just check for unknown token. The check in the get operand routine to set the token length to 1 for non-references when there was a parser error was removed since this is the only possible error. When getting the token after an argument in the process internal function routine, the check for a unary operator (an error) was moved to before the check for an error, which was changed to return all errors except unknown token. When this token is not a comma or closing parenthesis, the appropriate error is determined for the error or bad token.

The special Parser status enumerator was no longer used, so it was removed. This concludes all the prep work for changing parser errors into exceptions.

[branch parser commit ae9e97696e]

Thursday, October 23, 2014

Parser – Numbers (Operand State)

Upon making the next set of changes, I realized that State with Operator and Operand enumerators were not accurate terms for what the parser did with this option. All Operator state did was prevent numeric constants, but still allowed other operand type tokens (like string constants, identifiers, and functions). This was renamed Number with Yes and No enumerators, which more clearly expresses what the code does.

This change made it obvious where some simplifications could be made to the get token function of the translator. The value of the number (previously state) argument was selecting number (Yes) if the desired data type was not the default data type value (indicating the caller wants an operator token). When the desired data type is string, the number token would be invalid, so the condition of the desired data type is not string was added to the setting of the number argument.

Number tokens no longer are returned when looking for a string (an unknown token error will be returned). For error tokens, there was a check for operator state and the data type was double (indicating a number error), or the desired data type is string, which set the token length to 1 (so the error only points to the first character) and the token data type set to None (to indicate to callers that there is no number error). These situations now return an unknown token error for numbers with the length set to one and data type set to None, so this check was not needed.

The other check was much more involved (and confusing) but basically said if looking for an operand and there was not a number error, return an expression expected error. Otherwise return a parser error. This check was changed to if the desired data type was not empty (expecting operator) and not None (indicating a PRINT function is allowed) and there was an unknown token error, then return an expression expected error.

[branch parser commit 5690b1b4e9]

Wednesday, October 22, 2014

Parser – Operand State

When the translator is expecting an operator, both types of parser errors (unrecognizable character or numeric constant) are treated the same, and the error reported is appropriate for the situation (expecting an operator, comma, closing parenthesis, etc.). However, when the translator is expecting a non-reference operand, the unrecognizable character error is reported as some type of expecting expression error and numeric constant errors are reported as is (one of the six). For reference operands, the translator reports the appropriate expecting variable error.

A while ago, an operand state was added to the parser so that negative constants would be correctly interpreted instead of the unary negate operator and a positive constant. The get number function checked the operand state when a '-' appears at the beginning of the number. For the operator state, on a '-' this function terminated, indicating a numeric constant was not found, and the '-' would then be interpreted as an operator by the get operator function.

To simplify checking for parser errors in the translator and ease the transition to exceptions, the parser was modified to return a single error ("unknown token" replacing "unrecognizable character") when in operator state since this error now included unexpected numeric constants. This error will not be seen by the user and is used by the translator to report appropriate errors. In operator state, there is no reason to parse a numeric constant.

The get number function is now only called for the operand state, so this function no longer needs to check for the operand state itself. Since the operand state variable was now only used in the main operator function, it no longer needed to be a member. The boolean argument value was also changed to an enumeration class (named State with values of Operator and Operand), so values are more explicit in calls to the parser operator function.
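The effect of the state on a leading '-' can be illustrated with a short sketch. Only the State enumeration class comes from the description above; the startsNumber function is a hypothetical helper that is simplified to integers (no leading decimal point) and is not the parser's actual logic.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Enumeration class replacing the former boolean argument.
enum class State { Operator, Operand };

// Returns true if a numeric constant should be parsed at the start of input.
bool startsNumber(const std::string &input, State state)
{
    if (state == State::Operator) {
        return false;  // numeric constants are not parsed in operator state
    }
    std::size_t pos = 0;
    if (pos < input.size() && input[pos] == '-') {
        ++pos;  // in operand state a '-' may begin a negative constant
    }
    return pos < input.size()
        && std::isdigit(static_cast<unsigned char>(input[pos])) != 0;
}
```

A call such as startsNumber(text, State::Operand) says what is being requested, where a bare true or false argument would not.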

The default value for this state argument was also removed, requiring an argument value to be added from the Tester class call, which previously used the default (incorrectly operator state). The operand state is used so that numerical constants are parsed (since numerical constants are no longer parsed in operator state).

These changes caused a problem in the translator where the wrong error was reported when assigning a numeric constant (for example, 2=A) as the parser was now returning an unknown token error for numeric constants since the translator requested a command token in operator state. Previously a valid non-command token (numeric constant) was returned and passed to the LET translate routine, which reported the appropriate error ("expected item for assignment"). Now with an error, the "expected command" error was reported.

The get commands function was modified by adding the Any data type argument to the get token call, which puts the parser in operand state. This works because any token not a command token is passed to the LET translate routine, which reports an error for any non-reference token. The check for an error from this get token call was no longer necessary since error tokens are passed to the LET translate routine (which reports the appropriate error).

These changes affected a result for parser test #3 (number tests) with the '-2147483647' number, which was being parsed as a negate operator followed by a positive integer. This test was intended to check for the maximum negative integer constant, which it now does since the operand state is used for testing. Two additional numbers were added to check one beyond both the maximum positive and negative constants (parsed as double constants). The results for parser test #5 were also updated for the change in the "unknown error" status.

[branch parser commit 04c4fea820]

Sunday, October 19, 2014

Parser – Errors (Exceptions)

One C++ feature not currently being used is exceptions (though exceptions were used a while back for table initialization, this code was removed when this initialization was redesigned). The Qt library functions do not use (throw) exceptions, but the Standard Template Library (STL) functions can.

It is possible that exceptions could be used by the parser to throw exceptions for parser errors, which fall under two types, errors with constants (six of them related to incorrectly formed numbers or numbers out of range) and unrecognizable characters. Exceptions may also be able to be used for translator errors, but this will be considered later when the Translator class undergoes improvements.

The handling of parser errors was recently redesigned (see post), where the goal was to remove the dependency on the Qt translation functions for the error messages. This design still requires the caller to ask the parser for the error status code when it sees that the last token returned has an error. Before adding exceptions, some additional improvements can be made to the Parser class that will simplify the use of exceptions.

Saturday, October 18, 2014

Parser – Function Operator

The Parser class has a single purpose, to take an input string and return tokens of this string. This is basically like how the special function operator class works. So the first improvement made to the Parser class was to change it to a function operator class.

The set input function with a single string argument used to set the input string, plus initialize the position and operator state members, was removed. An input string argument was added to the constructor and initialization of these other members was added. The token function used to return pointers to tokens from the input string was changed to the operator function:

TokenPtr token(bool operandState) → TokenPtr operator()(bool operandState)

Callers now set the input through the constructor and retrieve tokens like this:

QString input = {...};
...
Parser parse {input};
...
TokenPtr token = parse();

Note that the instance was renamed from "parser" to "parse" as this makes the code read a little better. No argument is shown because the operand state argument has a default value (which is not shown in the function definition above).
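For readers unfamiliar with function operator classes, here is a self-contained sketch of the pattern. The WordParser class is hypothetical and far simpler than the Parser class: it only splits the input on spaces and returns standard strings rather than token pointers.

```cpp
#include <cstddef>
#include <string>

// Hypothetical function operator class: the input is set by the constructor
// and each call of the instance returns the next space-separated word.
class WordParser
{
public:
    explicit WordParser(const std::string &input)
        : m_input{input}, m_pos{0} {}

    // Returns the next word, or an empty string at the end of the input.
    std::string operator()()
    {
        m_pos = m_input.find_first_not_of(' ', m_pos);
        if (m_pos == std::string::npos) {
            m_pos = m_input.size();
            return {};
        }
        std::size_t end = m_input.find(' ', m_pos);
        if (end == std::string::npos) {
            end = m_input.size();
        }
        std::string word = m_input.substr(m_pos, end - m_pos);
        m_pos = end;
        return word;
    }

private:
    std::string m_input;
    std::size_t m_pos;
};
```

An instance created as WordParser parse {"LET A = 5"} is then called like a function, so each parse() call reads as an action.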

The Tester class for the most part looks like above except that the input string is a standard string and is converted to a C-style string with the c_str() string member function (which is then implicitly converted to a QString). This won't be needed once the parser uses standard strings.

The Translator class changes take a slightly different form because it contains a parser pointer member defined as a std::unique_ptr. Previously, a single instance was created for the life of the translator instance. There is no reason to do this since there is nothing in the parser instance that needs to be retained between translations. A new parser instance is created for each translation and must be dereferenced to obtain tokens:

m_parse.reset(new Parser {input});
...
token = (*m_parse)(operand);

And finally before returning, the translate function resets the parser member pointer to the default pointer (calls the reset function with no argument), which deletes the parser instance. The Translator class will also be changed to a function operator class so this final reset won't be necessary.

[branch parser commit 9e782539f3]

Utility – Base File Name

Most of the simpler transition to using the STL classes has been completed, though there are still quite a few Qt classes in the non-GUI classes. For example, the string member of the Token class is still a QString, but the Parser class needs significant changes to use this member as a standard string. Since these non-GUI classes need major changes, each will be handled in a separate topic branch. Before concluding the stl branch, some minor refactoring was done.

The base file name function was created in the Command Line class when the Qt dependency was removed from the Tester class. This function takes a standard string file path argument and returns a standard string base file name, but uses a QFileInfo function to do its work (which is the easiest platform independent way to handle file name paths because, for instance, Windows and Linux use different directory separator characters: back slash vs. forward slash).

All Qt dependency has been removed from the Command Line class except for this static function. There was no other logical class to put this function so that it could be used by both the Command Line and Tester classes. A new Utility class was created to hold this function. Its header file includes the standard string header file and its source file contains the QFileInfo header, which shields the users from having to know about Qt. This class, like the Status Message class (see post), was made so that it can't be instanced or used as a base class. (Other similar functions can be added in the future.)
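One way to make such a class impossible to instance or derive from is sketched below. This is an assumption about the approach, not the project's code, and the function body is a plain standard string stand-in for the QFileInfo call: it treats both '/' and '\' as separators on every platform, which is a simplification (a back slash is a legal file name character on Linux).

```cpp
#include <string>

// Sketch of a utility class holding only static functions.
class Utility final          // final: cannot be used as a base class
{
public:
    Utility() = delete;      // deleted constructor: cannot be instanced

    // Returns the part of the path after the last directory separator.
    static std::string baseFileName(const std::string &filePath)
    {
        std::string::size_type slash = filePath.find_last_of("/\\");
        return slash == std::string::npos
            ? filePath : filePath.substr(slash + 1);
    }
};
```

Callers use the qualified name, for example Utility::baseFileName(path), and need only the standard string header.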

The Tester class had one remaining dependency on the Command Line class. The instance pointer of the Command Line was passed to the run function as an argument. This instance was only used to call the copyright statement function. This argument was changed to a standard string for the copyright statement, which is now generated in the Command Line constructor and passed to the run function.

[branch stl commit 3020cd6827]

This concludes the initial (simpler) changes transitioning non-GUI classes to STL use. The stl branch was merged into the develop branch and deleted. A new branch will be created for the next set of C++11/STL related changes, which will be the replacement of Qt with the STL in the Parser class.

[branch develop merge commit a8bd956bb0]

Friday, October 17, 2014

Command Line – File Path

There are a few more items in the Command Line class that are dependent on Qt. One of these is the file name member, which contains the path name if a file was specified on the command line. The member along with its access function was changed to a standard string.

The Main Window class also contains a file name member that holds either the path of the file specified on the command line (obtained from the command line instance) or the last file that was loaded. This member was also changed to a standard string. The program path is also passed to many of the functions within this class. These were modified to take a standard string.

The version function in the Command Line class was modified to return a standard string. This function previously converted the C-style release string to a QString, and then the first digit was found using a QRegExp with the index-of function. A std::regex class is new to the C++11 STL, though unfortunately, this class is not implemented in GCC 4.8. There are a number of possible solutions to accomplish the same thing, but a simple C-like loop to look for the first digit was selected because the release string is a C-style string. Once the first digit is found, a standard string is created starting from this digit character and returned. This function was made static since it doesn't use any members.
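The loop described above might look something like the following sketch. The function name and the sample release string in the usage note are assumptions for illustration, not the project's actual code.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for the version function: scans a C-style
// release string for its first digit and returns a standard string
// starting at that digit (empty if the string contains no digit).
std::string versionFromRelease(const char *release)
{
    assert(release != nullptr);  // precondition: a valid C string

    for (const char *p = release; *p != '\0'; ++p)
    {
        // cast avoids undefined behavior for negative char values
        if (std::isdigit(static_cast<unsigned char>(*p)))
        {
            return std::string(p);
        }
    }
    return std::string();
}
```

With a made-up release string such as "release0.6.3", this returns "0.6.3". No regular expression support is needed, which sidesteps the missing std::regex implementation in GCC 4.8.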

The copyright statement function in the Command Line class contained a translate call for the "Copyright" word. This function is called from the Tester class (no translation needed) and from the About box in the Main Window (translation needed). The function was modified to take the copyright string as an argument with a default of the untranslated copyright word. The About box passes the translated word. With this change, the translate macros could be removed from this class.
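A minimal sketch of the default argument approach is shown here. The function name, the year, and the holder text are placeholders (assumptions), since the actual statement text isn't given in this post.

```cpp
#include <string>

// Hypothetical sketch: the caller may supply an already translated
// "Copyright" word; otherwise the untranslated word is used.
// The year and holder below are placeholder values, not the
// project's real copyright text.
std::string copyrightStatement(const std::string &copyright = "Copyright")
{
    const std::string year = "2014";            // placeholder
    const std::string holder = "Example Author"; // placeholder
    return copyright + " (C) " + year + " " + holder;
}
```

A caller like the Tester class would call copyrightStatement() with no argument, while the About box would pass the result of its own translate call. This keeps the translate macros out of the class that owns the function.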

A problem was corrected in the Command Line constructor where the file name on the command line wasn't stored in the file name member, so the file name on the command line was ignored. This problem was introduced when the argument list was changed to a standard list of standard strings.

The constructor of the Main Window was modified to better handle the error when the command line file doesn't exist or the last used program no longer exists. When the command line file doesn't exist, an error is output to the standard error stream. When the last used program no longer exists, a warning box is displayed.

[branch stl commit c998b1ffaa]

Thursday, October 16, 2014

Memory Testing Issues – Resolved

After discovering a default Mint 13 system (kubuntu backports not used) containing Qt 4.8.1 did not exhibit the sporadic memory errors, some further investigation was done. The errors also did not occur when Qt 4.8.2 was built from source. Before blaming the Qt 4.8.2 from the kubuntu backports, the build directory was wiped and the application was rebuilt from scratch. The sporadic memory errors were no longer occurring, so there must have been a corrupted file in the build directory causing the errors.

Some memory testing investigation was also done on Mint 17 (based on Ubuntu 14.04). The conclusion previously was that valgrind 3.10.0.SVN reported errors differently than 3.7.0 (Mint 13) or 3.9.0 (built from the latest source available). The source for 3.10.0 is now available and 3.10.0 built from source on Mint 13 did not report any additional memory errors. The issue was found to be with the ld-2.19.so library on Mint 17 (Mint 13 has ld-2.15.so). This library appears to contain low-level memory allocation functions.

A different error suppression file was needed for the newer version of this library. The CMake build file was modified to detect the presence of ld-2.19.so (either 32-bit or 64-bit). If present, then a different error suppression file is copied to the build directory. This error suppression file generated on Mint 17 is independent of the version of Qt (the Qt libraries are not referenced), so no configuration of the file is needed.

The error suppression files generated for Qt 4.8.2 and 4.8.6 are also independent of the version of Qt; however, the one generated for Qt 4.8.4 is not. To create a suppression file that works with all versions (at least the ones tested), the suppression file needs to be configured for the specific version of Qt. The file generated from Qt 4.8.4 was used, and this file works with Qt 4.8.1, 4.8.2, and 4.8.6 once all references to "4.8.4" along with the installation directory of Qt are changed.

There are now two error suppression files, one for Mint 13 (ld-2.19.so not present) and one for Mint 17 (ld-2.19.so present). The one for Mint 13 is configured for the version of Qt detected, but the one for Mint 17 is not. Mint 17 has Qt 4.8.6, so no other version of Qt should be present. This commit was put in the develop branch since it is not related to the STL changes.

[branch develop commit 2fb73b6892]