The crash on Translator test 12 was caused because the token leaks were only being output after a successful translation but not when an error was reported. After a number of errors, the token leaks built up and when they were finally output, their information didn't line up with the current input line causing a crash when accessing outside of a buffer. Once that problem was fixed, the token memory leaks were fixed one by one.
For one of the leaks, an incorrect fix was applied causing multiple token deletes. For some reason these tokens were not output. A segmentation fault occurred when attempting to debug this problem before it even got to the output code when it was attempting to delete the string in the token a second time. Setting the string pointer to NULL didn't help – apparently the memory was already being used for something else.
Apparently gdb (under NetBeans) catches these segmentation faults where the program running by itself does not. In any case, at least the code was at least detecting the extra delete condition. Details of the memory leaks found and corrected so far after the continue. A common theme is that the token leaks occur when something is popped from the stack and not put anywhere because any tokens in any of the stacks, the pending parentheses token or the output list get deleted by the Translator's cleanup routine.
Sunday, January 30, 2011
Saturday, January 29, 2011
Token Leaks – Implementation
To implement token leak detection, a token pointer list was added to the Token class as a static member and a token list element pointer was added as a regular member. The new and delete operators were overloaded in the Token class to be able to maintain this token list.
The overloaded new operator function first allocates an array of bytes for the size argument, which is type cast to a Token pointer. The token pointer is appended to the static token pointer list and the element pointer returned from the list append function is put into the element pointer member of the token. The value of the token pointer is returned.
The overloaded delete operator function type casts the void pointer to a token pointer. If the element pointer is NULL, then this token has already been deleted. Otherwise the element pointer in the token is used to remove the token pointer from the token pointer list. The element pointer of the token is set to NULL to indicate that it has been deleted (to detect multiple deletes of the same token). The memory used by the token is then deleted.
This additional error that can occur when a token being deleted more than once. To also catch these errors, when a token is being deleted, if its element pointer is NULL, then a copy of the token will be added to a new delete list, which will be a list of tokens. A copy of the token needs to be made because the memory has already been deleted and could be reused by another allocation. It's also possible that the memory may have already been reused and the token copy will be garbage, but at least it will be known there is an extra delete somewhere.
Finally, the test code was modified to output any tokens that have not been deleted once the RPN output list has been output and cleaned up (all of its tokens deleted). The tokens in the delete list are also output. Once the code was working and running, many token leaks were found (in six of the Translator tests), and Translator test 12 (more error tests) was crashing. Time to fix some problems...
The overloaded new operator function first allocates an array of bytes for the size argument, which is type cast to a Token pointer. The token pointer is appended to the static token pointer list and the element pointer returned from the list append function is put into the element pointer member of the token. The value of the token pointer is returned.
The overloaded delete operator function type casts the void pointer to a token pointer. If the element pointer is NULL, then this token has already been deleted. Otherwise the element pointer in the token is used to remove the token pointer from the token pointer list. The element pointer of the token is set to NULL to indicate that it has been deleted (to detect multiple deletes of the same token). The memory used by the token is then deleted.
This additional error that can occur when a token being deleted more than once. To also catch these errors, when a token is being deleted, if its element pointer is NULL, then a copy of the token will be added to a new delete list, which will be a list of tokens. A copy of the token needs to be made because the memory has already been deleted and could be reused by another allocation. It's also possible that the memory may have already been reused and the token copy will be garbage, but at least it will be known there is an extra delete somewhere.
Finally, the test code was modified to output any tokens that have not been deleted once the RPN output list has been output and cleaned up (all of its tokens deleted). The tokens in the delete list are also output. Once the code was working and running, many token leaks were found (in six of the Translator tests), and Translator test 12 (more error tests) was crashing. Time to fix some problems...
List Class Update
The List class is currently keeping track of the beginning and end of the linked list with a master element. The master element is created when the list is created. A list element contains a previous element and next element link pointers along with the value of the element. The amount of memory allocated for the master element is less the element value using a strange syntax for subtracting the size of the value from the total size of the element – not ideal.
When a list is created, the links of the master element are set to point to the master element itself. The list is empty when the master element's next and previous pointers are the same as the master element pointer. When linking through the list, the first element is the master element's next pointer. The end of the list is detected when the master element is reached when linking to an element's next pointer.
In considering a design for the token linked list, it was realized that the master element is not really necessary. All that is needed is a pointer to the first element in the list. Or, for a first-in-last-out list, a pointer to the last element. The last pointer method will be used and the variable will be named tail. The list is linked circularly meaning that the last element's next pointer points to the first element in the list and the first element's previous pointer points to the last element. If the list is empty, then tail will be set to NULL.
Using the single tail pointer, the first element in the list is now obtained using tail->next assuming the list is not empty, otherwise NULL will be returned. The last element is now obtained simply by using tail. When appending items to end of the list, the new element is linked into the list and tail is set to the new element. When removing the last element from the list, a check is needed to see if the list is empty after removing the item.
When a list is created, the links of the master element are set to point to the master element itself. The list is empty when the master element's next and previous pointers are the same as the master element pointer. When linking through the list, the first element is the master element's next pointer. The end of the list is detected when the master element is reached when linking to an element's next pointer.
In considering a design for the token linked list, it was realized that the master element is not really necessary. All that is needed is a pointer to the first element in the list. Or, for a first-in-last-out list, a pointer to the last element. The last pointer method will be used and the variable will be named tail. The list is linked circularly meaning that the last element's next pointer points to the first element in the list and the first element's previous pointer points to the last element. If the list is empty, then tail will be set to NULL.
Using the single tail pointer, the first element in the list is now obtained using tail->next assuming the list is not empty, otherwise NULL will be returned. The last element is now obtained simply by using tail. When appending items to end of the list, the new element is linked into the list and tail is set to the new element. When removing the last element from the list, a check is needed to see if the list is empty after removing the item.
Sourceforge Uploads Now Working
Sourceforge is accepting uploads again and the files for the latest pre-release are now there.
Friday, January 28, 2011
Token Leaks – Initial Design Thoughts
When considering a design to keep track of tokens created, there needed to be an easy way to remove the token once it has been deleted. Tokens left in the list would be memory leaks. The first idea was a simple linked list where each element consisted of a previous link, a next link, a pointer to the token and some type of identifier of where in the code that the Token was allocated. A pointer to the list element would be added to the Token class.
When a token is created, it would be added to the list and a pointer to the list element would be put into the token. When a token is deleted, it would use the element pointer to delete the item from the list. For tokens still in the list at the end, the identifier would indicate where the token was created, though the contents of the token itself would probably be sufficient in determining why it was not deleted.
At first, a special token list class was considered. It would only need a pointer to the first element in the list. Some code on how this would work (adding and deleted elements) was sketched out. But wait, there was already a List class, and then it was realized that the List class doesn't need a master element to keep track of the first and last elements in the list. Only a pointer to the first (or last) element is needed...
When a token is created, it would be added to the list and a pointer to the list element would be put into the token. When a token is deleted, it would use the element pointer to delete the item from the list. For tokens still in the list at the end, the identifier would indicate where the token was created, though the contents of the token itself would probably be sufficient in determining why it was not deleted.
At first, a special token list class was considered. It would only need a pointer to the first element in the list. Some code on how this would work (adding and deleted elements) was sketched out. But wait, there was already a List class, and then it was realized that the List class doesn't need a master element to keep track of the first and last elements in the list. Only a pointer to the first (or last) element is needed...
Thursday, January 27, 2011
Memory Leak Detection
Memory leaks have been a continual problem. Frequently another one is discovered like the recent pending parentheses issue. Some short of memory leak detection is needed. In C++, the new and delete operators can be overloaded so that a custom memory management can be used. While this is a possible solution, a memory manager with leak detection can be very involved. Something simpler is desired.
Analyzing the current source code, looking for all the uses of new operators currently used, here's what was found:
Analyzing the current source code, looking for all the uses of new operators currently used, here's what was found:
Tokens – allocated in a number of places; these have been where most of the memory leaks have been found
Strings – for now these are allocated when creating tokens, so if the tokens get deleted these will get deleted (in the Token destructor)
Stacks – the arrays in the stacks upon first creation and when expanded; the arrays will be deleted when the stacks go out of scope (as part of the destructors of the classes that use them)
Lists – the elements of the list, which are deleted when the lists go out of scope (as part of the destructors of the classes that use them)
Translator Output list – this is a one time allocation in the Translator when a new translation is initiated; it is handed over to the caller when getting the translated results; so it's up to the caller to clean this up (currently handled in the test code)
RpnItems – everyone of these that are allocated is appended to the Translator's output list, so these will be handled when this list is cleaned up
Table – one time allocation in the main function, which is not important as there is only one, but a delete was added at the end of the main function
Error Lists – these are allocated as part of the Token and Table initialization when an error is detected, but were never deleted, which is not important as these were only created when there was an error (and the program exits), but deletes were addedSo it looks like Tokens are the only specific area where there is currently a problem with memory leaks. Next, a simple solution for detecting token leaks...
Sourceforge Not Accepting New Uploads
Sourceforge does not appear to be accepting new uploads at the moment. Last night it just seemed to be taking a long time to be distributed to the mirrors, but this morning the same message appeared. This may explain the problem. For some reason it did except the new release notes. For now the previously posted link for the latest pre-release does not work.
Wednesday, January 26, 2011
Translator – First/Last Operand – Pre-Release
The first and last operand changes are now working and all tests are working correctly except for three statements in Translator test 12 (more error tests). The expected results for these statements were changed in anticipation of having the Translator assume that tokens with parentheses on the left side of an assignment are arrays since functions with arguments will never be there. However, one little issue remains. Consider these statements:
This is a good time to make another pre-release before continuing. The file ibcp_0.1.14-pre-4-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now, before continuing on, it's time to do something about detecting memory leaks...
Z$ = A$ + (B + C)Because the open and close parentheses are now attached to operator tokens, the second + in the two statements above, when the error is detected (at this +), the whole parenthetical expression including the parentheses are reported instead of the expression within the parentheses. This new behavior is reasonable so it will be left as is for now.
Z = A + (B$ + C$)
This is a good time to make another pre-release before continuing. The file ibcp_0.1.14-pre-4-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now, before continuing on, it's time to do something about detecting memory leaks...
Tuesday, January 25, 2011
Translator – First/Last Operand – More on Parentheses
An old undiscovered bug was found where the pending parentheses token was not deleted when the previous token added to the output list's parentheses sub-code was set when it was determined that the parentheses was not necessary (the parentheses sub-code allows the parentheses to be reproduced). This again emphasizes the need for memory leak detection (more on this soon).
There was a conflict between a closing parentheses token being used as a pending parentheses and as a last operand. It's possible that a pending parentheses will become a token added to the output list (occurs when there are two sets of unnecessary parentheses), so when this token is no longer needed as a last operand, it can't be deleted. The solution was for two new sub-codes, one where the closing parentheses token is being used for a pending parentheses (Used) and one where it is being used for a last operand (Last).
When the pending parentheses is set, the token's Used sub-code is set. When a closing parentheses token is set as a last operand, the token's Last sub-code is set. When the pending parentheses token is no longer needed, if its Last sub-code is set, the Used sub-code will be cleared, otherwise the token will be deleted. If the pending parentheses is added to the output, its Used sub-code will be left set to prevent it from getting deleted. Similarly, when a closing parentheses token is no longer needed as a last operand, if its Used sub-code is set, its Last sub-code will be cleared, otherwise the token will be deleted.
In the closing parentheses handler, when a closing parentheses is processed for a parenthetical expression, the opening parentheses (previously deleted) will now be set as the first operand and the closing parentheses (also set to the pending parentheses) will not be set as the last operand of the token on top of the done stack. The closing parentheses token's Last and Used sub-codes will be set.
There was a conflict between a closing parentheses token being used as a pending parentheses and as a last operand. It's possible that a pending parentheses will become a token added to the output list (occurs when there are two sets of unnecessary parentheses), so when this token is no longer needed as a last operand, it can't be deleted. The solution was for two new sub-codes, one where the closing parentheses token is being used for a pending parentheses (Used) and one where it is being used for a last operand (Last).
When the pending parentheses is set, the token's Used sub-code is set. When a closing parentheses token is set as a last operand, the token's Last sub-code is set. When the pending parentheses token is no longer needed, if its Last sub-code is set, the Used sub-code will be cleared, otherwise the token will be deleted. If the pending parentheses is added to the output, its Used sub-code will be left set to prevent it from getting deleted. Similarly, when a closing parentheses token is no longer needed as a last operand, if its Used sub-code is set, its Last sub-code will be cleared, otherwise the token will be deleted.
In the closing parentheses handler, when a closing parentheses is processed for a parenthetical expression, the opening parentheses (previously deleted) will now be set as the first operand and the closing parentheses (also set to the pending parentheses) will not be set as the last operand of the token on top of the done stack. The closing parentheses token's Last and Used sub-codes will be set.
Saturday, January 22, 2011
Translator – First/Last Operand – Parentheses
Previously it was thought that parentheses could be ignored for the first and last operands because parentheses are always valid in a numeric expression or a string expression, though parentheses would be strange in string expression since there is currently only one string operator (concatenate), but still valid. However, consider this statement:
Currently, when a closing parentheses is processed, the open parentheses token is deleted and the closing parentheses is temporarily saved in the pending parentheses variable, which is checked later to see if it was necessary or not (the parentheses sub-code of the previous operand is set if the parentheses were unnecessary). To be continued...
Z$ = A$ + -(B+C)The entire -(B+C) sub-expression should be reported as an “expecting string expression” error. However, only the B+C is being reported before the unary operator fix and -(B+C after this fix. In both cases, the C was the last operand of the negate (where the expression type was invalid). The parentheses were removed. Had the statement been:
Z$ = A$ + INT(B+C)The error would have been correctly reported since the closing parentheses is being attached to the last operand of the function. Therefore, both the open and closing parentheses also need to be carried through the operands.
Currently, when a closing parentheses is processed, the open parentheses token is deleted and the closing parentheses is temporarily saved in the pending parentheses variable, which is checked later to see if it was necessary or not (the parentheses sub-code of the previous operand is set if the parentheses were unnecessary). To be continued...
Translator – First/Last Operand – Unary Operators
Expression type errors involving unary operators were not pointing to the correct tokens. This was caused because no operands were being attached to unary operators pushed to the done stack. Consider this statement:
The solution was to only attach the last operand from the operand popped from the done stack and leaving the first operand set to nothing of the unary operator token, which will cause the code to use the negate token itself as the first operand.
Z$ = A$ + -BThe -B tokens should be reported as an “expecting string expression” error. However, only the B token was being reported because the first and last operand of the negate operator was being set to the first and last operand of its operand from the done stack, which was the B token.
The solution was to only attach the last operand from the operand popped from the done stack and leaving the first operand set to nothing of the unary operator token, which will cause the code to use the negate token itself as the first operand.
Friday, January 21, 2011
Translator – First/Last Operand – Initial Debugging
Initially, all the Translator tests were failing. The source of the crash was the new delete close parentheses routine, which also needed to check if the token type has a table entry. The crash occurred because when token didn't have a table entry as it was trying to access the table entry to check for a close parentheses code.
Next, a very old bug that went undiscovered was occurring in the expression only test mode used for the first four Translator tests. The problem was that the final result was left on the done stack and the Null token was left on the hold stack. The crash occurred on the first error test when the cleanup routine was called and it tried checking for a close parentheses in the last operand – the token if was checking no longer existed. The End-of-line handler was modified to cleanup the stacks in expression mode and to check to make sure that there is only the result item on the done stack.
A series of problems were then discovered relating to unary operators and parenthetical expressions. Also, a realization was made that the first and last operands need to be saved for operands of arrays and used functions so that the Encoder can report errors correctly for array subscripts and function arguments. More on these problems to follow...
Next, a very old bug that went undiscovered was occurring in the expression only test mode used for the first four Translator tests. The problem was that the final result was left on the done stack and the Null token was left on the hold stack. The crash occurred on the first error test when the cleanup routine was called and it tried checking for a close parentheses in the last operand – the token if was checking no longer existed. The End-of-line handler was modified to cleanup the stacks in expression mode and to check to make sure that there is only the result item on the done stack.
A series of problems were then discovered relating to unary operators and parenthetical expressions. Also, a realization was made that the first and last operands need to be saved for operands of arrays and used functions so that the Encoder can report errors correctly for array subscripts and function arguments. More on these problems to follow...
Sunday, January 16, 2011
Translator – First/Last Operand Implementation
As the hold stack was being changed from holding just a token pointer to a structure containing a token, a first operand token and last operand token pointers, it was noticed that the end of the Operator and Equal (for the equality operator) token handlers contained almost identical code, both needing modified for the hold stack item change. The code was moved and replaced with a call to a new process first operand routine.
A second token pointer argument was add to the process final operand routine, which is called to add a token to the output list by the add operator routine (to add an operator from the hold stack), add print code routine (to add a hidden print code for a specific data type), and the close parentheses token handler (to add a function or array from the hold stack). This second token pointer will contain the first operand token (from the add operator routine), nothing (from the add print code routine), or the close parentheses token (from the close parentheses token handler).
The find code routine was modified to return the first and last operand token pointers from the operand on the done stack that is being checked. These arguments are pointers to token pointers with a null pointer default value – a caller only passes these when it needs them. The find code routine contains local pointer variables if the caller doesn't pass pointers since first and last operand tokens will be needed if an error occurs.
The line in the close parentheses token handler that deletes the close parentheses token was removed (since this token is now being passed to the process final operand routine to be stored temporarily as the last token pointer for arrays and functions). Several locations need to delete this token when it is no longer needed. This procedure first needs to check if the last operand token pointer is set, if the token is a close parentheses token, and then needs to delete the token. A new delete close parentheses inline function was implemented that performs this procedure.
Finally, a through inline function was implemented in the Token class that takes a second token argument and calculates the total length for a token that includes up through to this second token as described here and returns a pointer to the first token so that it can be conveniently assigned to the token being returned. Now for debugging and testing...
A second token pointer argument was add to the process final operand routine, which is called to add a token to the output list by the add operator routine (to add an operator from the hold stack), add print code routine (to add a hidden print code for a specific data type), and the close parentheses token handler (to add a function or array from the hold stack). This second token pointer will contain the first operand token (from the add operator routine), nothing (from the add print code routine), or the close parentheses token (from the close parentheses token handler).
The find code routine was modified to return the first and last operand token pointers from the operand on the done stack that is being checked. These arguments are pointers to token pointers with a null pointer default value – a caller only passes these when it needs them. The find code routine contains local pointer variables if the caller doesn't pass pointers since first and last operand tokens will be needed if an error occurs.
The line in the close parentheses token handler that deletes the close parentheses token was removed (since this token is now being passed to the process final operand routine to be stored temporarily as the last token pointer for arrays and functions). Several locations need to delete this token when it is no longer needed. This procedure first needs to check if the last operand token pointer is set, if the token is a close parentheses token, and then needs to delete the token. A new delete close parentheses inline function was implemented that performs this procedure.
Finally, a through inline function was implemented in the Token class that takes a second token argument and calculates the total length for a token that includes up through to this second token as described here and returns a pointer to the first token so that it can be conveniently assigned to the token being returned. Now for debugging and testing...
Saturday, January 15, 2011
Translator – Sub-String Functions
Sub-string functions add a wrinkle to storing the first and last token because these are not actually stored on the done stack (their string operand is kept on the stack). Consider this statement:
To resolve this, when a sub-string function is processed (popped from the hold stack and added to the output list, but not put on the done stack where its string operand remains), the sub-string token will be put into its operand's first token on the done stack. If this token already has a first token, it will be overwritten.
In the statement above, the first operand of A$ will be set to the LEFT$( token. The +$ token will then inherit the LEFT$( as its first token, followed by the >$ token. When the error is finally reported, the range of tokens will correctly be from LEFT$( to C$. Now on to the implementation of the first and last operand...
Z$= LEFT$(A$, 1) + B$ > C$When the +$ token is processed, the operand on the stack will be A$ left there by the LEFT$( function, which will be considered the first operand of the +$ token. This would then be passed on as the first operand of the >$ token. When the error is reported, the range will incorrectly be from A$ to C$ instead of LEFT$( to C$.
To resolve this, when a sub-string function is processed (popped from the hold stack and added to the output list, but not put on the done stack where its string operand remains), the sub-string token will be put into its operand's first token on the done stack. If this token already has a first token, it will be overwritten.
In the statement above, the first operand of A$ will be set to the LEFT$( token. The +$ token will then inherit the LEFT$( as its first token, followed by the >$ token. When the error is finally reported, the range of tokens will correctly be from LEFT$( to C$. Now on to the implementation of the first and last operand...
Friday, January 14, 2011
Translator – Functions and Arrays
Reporting range of tokens (sub-expressions) gets a little more involved when arrays and functions (internal or user) are in the expression. Consider this statements:
When the array or function token is popped from the done stack, if the last token is a closing parentheses, then it will be deleted, unless it is being assigned as the last operand to an operator token as would happen in the statement above, the closing parentheses to C$( would be assigned to the >$ token as its last token. When the error gets reported, the range would be from the A$( token and the final ) token of the C$(.
There is no issue with regular parentheses expression as these are valid in any expression, therefore, if the statement above included parentheses around the expression, the error would be reported as (in other words, the parentheses are valid, but not the expression inside):
Z$ = A$(1) + B$(2) > C$(3)The range of tokens would be A$( to C$( and reporting only up to C$( would not be correct since the 3 and ) tokens should be included. Therefore the closing parentheses of an array or function should be assigned the last token for an array or function token. This implies that closing parentheses tokens cannot not be immediately deleted.
When the array or function token is popped from the done stack, if the last token is a closing parentheses, then it will be deleted, unless it is being assigned as the last operand to an operator token as would happen in the statement above, the closing parentheses to C$( would be assigned to the >$ token as its last token. When the error gets reported, the range would be from the A$( token and the final ) token of the C$(.
There is no issue with regular parentheses expression as these are valid in any expression, therefore, if the statement above included parentheses around the expression, the error would be reported as (in other words, the parentheses are valid, but not the expression inside):
Z$ = (A$(1) + B$(2) > C$(3))This attachment of closing parentheses to tokens will take place in the closing parentheses token handler and will only apply to arrays and functions, not simple parentheses expressions. Also worth noting is that tokens (arrays and functions) will only have a last token assigned, the first token will be blank. In this case the token itself will be considered the first token. Finally, sub-string functions add another wrinkle to this scheme...
^^^^^^^^^^^^^^^^^^^^^-- expected string expression
Thursday, January 13, 2011
Translator – First and Last Operand
When an error occurs, the inclusive range between the first and last tokens of the offending expression will be reported. Take the previous example statement Z$=A$+B$>C$. The >$ token will be on top of the done stack when the expression type error is detected (the >$ operator returns an integer and cannot be assigned to a string variable). The entire expression from A$ to C$ should be highlighted.
This will be accomplished by saving the first and last operand for each operator as the expression is processed. When an expression type error is detected, these operands will be available. When assigning the first operand to an operator, if the operand on done stack has a first operand assigned to it, then that operand will be used, otherwise the operand token on the done stack will be assigned. The same for the last operand.
In the statement above, the +$ operator will have A$ for its first operand (since A$ does not have a first operand assigned to it) and B$ for its last operand (B$ does not have a last operand). At the >$ operator, its first operand will be assigned to A$ (which is the +$ operand's first operand) and C$ for its last operand (C$ does not have a last operand).
The first operand is no longer saved on the done stack except for string operands (a specific type operator has already been selected), the first operand needs to be attached to the token on the hold stack where the operator is residing until put into the output list. Therefore, the hold stack needs to hold structures containing a token pointer and a first operand token pointer instead of just a token pointer – a new hold stack item will be defined to hold these.
The done stack currently holds output list element pointers (the output list consists of RPN items each containing a token pointer, number of operands and attached operand array, if any). The done stack will also need to hold the first and last operand token pointers. Therefore, the done stack needs to hold structures containing an output list element item pointer along with the first and last token pointers – a new done stack item will be defined to hold these.
This will be accomplished by saving the first and last operand for each operator as the expression is processed. When an expression type error is detected, these operands will be available. When assigning the first operand to an operator, if the operand on done stack has a first operand assigned to it, then that operand will be used, otherwise the operand token on the done stack will be assigned. The same for the last operand.
In the statement above, the +$ operator will have A$ for its first operand (since A$ does not have a first operand assigned to it) and B$ for its last operand (B$ does not have a last operand). At the >$ operator, its first operand will be assigned to A$ (which is the +$ operand's first operand) and C$ for its last operand (C$ does not have a last operand).
The first operand is no longer saved on the done stack except for string operands (a specific type operator has already been selected), the first operand needs to be attached to the token on the hold stack where the operator is residing until put into the output list. Therefore, the hold stack needs to hold structures containing a token pointer and a first operand token pointer instead of just a token pointer – a new hold stack item will be defined to hold these.
The done stack currently holds output list element pointers (the output list consists of RPN items each containing a token pointer, number of operands and attached operand array, if any). The done stack will also need to hold the first and last operand token pointers. Therefore, the done stack needs to hold structures containing an output list element item pointer along with the first and last token pointers – a new done stack item will be defined to hold these.
Wednesday, January 12, 2011
Parser – Token Length
In researching for the last post, some issues were found in the Parser with respect to the token length. The token class contained a length member, but it was contained in the union along with the double value and integer values members used for constants. The thinking was that length would never be used with number constants; the length would be obtained from the string of the constant (which is being kept so that the constant could be reproduced exactly as entered).
The test routine contained code for determining the length of string constants. This required adding two for the surrounding double quotes (which are not stored) and adding one for each double quote contained in the constant (which is entered with two double quotes). It turned out that the length of the original string constant, including the surrounding double quotes was already being stored in the length member of the token. The code in the test routine was redundant and unnecessary (and was removed).
Generally the length member was being used for the length of the token, but it wasn't being set for identifiers (variables, arrays, and user functions) and number constants where the length could be obtained from the string member.
It will be easier if the length member always contained the length of the token so it will not required to obtain the length from the string member for certain token types. The length member was moved out of the union so it can also be used for number constants, and in the Parser, the length member was set for identifiers and number constants so now all token types are covered.
Tuesday, January 11, 2011
Translator – Errors with a Range of Tokens
Currently when the Translator detects an error, the error code is returned along with the token where the error was detected. The caller uses the column and length values in the token to point to the token. (Later the token will be highlighted in a different color.)
Now it is desired that a range of tokens be highlighted to show an entire sub-expression that contains an error. While it is possible to modify all the functions within the Translator for a second return token, there is an easier solution.
The column and length members of the token are used to output the error – where the token starts and how long it is. The Translator, upon detecting an error with a range of tokens, only needs to modify the length token where the beginning of the error is detected, to include all the tokens. The Translator can do this by using the column of the first token, the column of the last token and the length of the last token. The equation would be:
Now it is desired that a range of tokens be highlighted to show an entire sub-expression that contains an error. While it is possible to modify all the functions within the Translator for a second return token, there is an easier solution.
The column and length members of the token are used to output the error – where the token starts and how long it is. The Translator, upon detecting an error with a range of tokens, only needs to modify the length token where the beginning of the error is detected, to include all the tokens. The Translator can do this by using the column of the first token, the column of the last token and the length of the last token. The equation would be:
first->length = last->column – first->column + last->lengthSo when an error is detected and a range of tokens are involved, the length of the token being returned will be set using the equation above.
Monday, January 10, 2011
Translator – First Operand (Updated Design)
Upon considering the first operand design (see posts starting on July 8, 2010), There still be some confusion with the reported error message and the token at which it is pointing. Consider this simple statement (and the error produced currently):
Z$ = A$ + B$ > C$This does not make sense, because first, the integer expression is valid. The error occurs because this integer expression cannot be assigned to a string variable. And it could be confusing because a string expression cannot be placed where the greater than operator is. Alternatively it could indicate that a string operator is expected at the greater than (later why this was ruled out). The previously proposed first operand design would produce this error:
^-- expected string expression
Z$ = A$ + B$ > C$The error is now pointing to the beginning of the integer expression. But, one could say wait a minute, A$+B$ is a string expression, why is this an error? A better solution would produce this error:
^^-- expected string expression
Z$ = A$ + B$ > C$This is much better as it highlights the entire expression, an integer expression, and says that a string expression is expected here. This can be accomplished by, in addition to keeping track of the first operand (the A$), also keeping track of the last operand (the C$). The error would point to all tokens between the first and the last of the invalid expression. Now, how can these two tokens be passed back where currently only one token is being returned when an error occurs? Next, the solution to this...
^^^^^^^^^^^^-- expected string expression
Sunday, January 9, 2011
Translator – Expression Type Pre-Release
There are two main problems with the Twelfth (more error statements) test, namely the handling of the first operand of operators first mentioned on July 8, 2010, and recognizing tokens with parentheses before assignment operators as arrays only (implying the operands are numeric subscripts).
All of the other tests are succeeding, so it's probably a good idea to kick another pre-release first. This is the first release since development was switched to NetBeans, so there are some changes in the released files. NetBeans is not required for building the project. There is now full Makefile for building the program. NetBeans was used to initially generate this file, but it was modified to simplify it. The VIDE2 project file ibcp.vpj will no longer be part of the source files.
Also no longer part of the source files is codes.txt and test_codes.h. Both of these files are now be generated by the new Makefile using the codes.awk and test_codes.awk awk scripts. The MinGW and MSYS packages both need to be installed in order to use the Makefile. The program build has only been tested with GCC 4.4.0 version for MinGW. Also, the awk program is needed for the awk scripts (I don't remember if this is part of MSYS or whether it needs to be installed separately).
The Makefile is currently configured for building the release program with no symbols or debugging information. Debugging can be enabled by changing BASICOPTS from "-O -s" to "-g" in the Makefile. The file ibcp_0.1.14-pre-3-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now on to implementing the first operand design...
All of the other tests are succeeding, so it's probably a good idea to kick another pre-release first. This is the first release since development was switched to NetBeans, so there are some changes in the released files. NetBeans is not required for building the project. There is now full Makefile for building the program. NetBeans was used to initially generate this file, but it was modified to simplify it. The VIDE2 project file ibcp.vpj will no longer be part of the source files.
Also no longer part of the source files is codes.txt and test_codes.h. Both of these files are now be generated by the new Makefile using the codes.awk and test_codes.awk awk scripts. The MinGW and MSYS packages both need to be installed in order to use the Makefile. The program build has only been tested with GCC 4.4.0 version for MinGW. Also, the awk program is needed for the awk scripts (I don't remember if this is part of MSYS or whether it needs to be installed separately).
The Makefile is currently configured for building the release program with no symbols or debugging information. Debugging can be enabled by changing BASICOPTS from "-O -s" to "-g" in the Makefile. The file ibcp_0.1.14-pre-3-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Now on to implementing the first operand design...
Saturday, January 8, 2011
Translator – More Expression Type Debugging
The next problem was occurring when there was an unexpected operand when an operator, comma or closing parentheses was expected (for example: Z$=MID$(A$,B C). This was only occurring for functions with multiple arguments (ASC, MID$ and INSTR) when at the last argument of the shorter form. The “expected operator or closing parentheses” error was produced, but a comma could follow for the optional argument.
A checked was added to the parentheses status routine when at the last argument. If the function has the Multiple flag set, indicating there is another form with more arguments, then the “expected operator, comma or closing parentheses” error is produced instead.
One problem remained, the statement Z$=MID$(A$,B, produced an “expected string expression” at the end of the line where it should have been an “expected numeric expression” error. The problem also occurred on the last optional argument of the INSTR and ASC functions. The problem was caused because the top of the count stack contained the index to the first form of these functions instead of to the second form with the optional argument.
When the code went to get the argument's data type, it ran of the end off the operand data type array incorrectly picking up a string data type. The coding error was on the line that incremented the index of the token on the hold stack to the next form of the function and assigned it to the index on top of the count stack. A post-increment C operator was used instead of the pre-increment operator and the index on the count stack was not actually changed.
Now all the test statements in the Eleventh Translator test are working (and on expected error needed to be corrected on the Third test because of the last fixed – it was missed previously). On to the Twelfth (more error statements) test...
A checked was added to the parentheses status routine when at the last argument. If the function has the Multiple flag set, indicating there is another form with more arguments, then the “expected operator, comma or closing parentheses” error is produced instead.
One problem remained, the statement Z$=MID$(A$,B, produced an “expected string expression” at the end of the line where it should have been an “expected numeric expression” error. The problem also occurred on the last optional argument of the INSTR and ASC functions. The problem was caused because the top of the count stack contained the index to the first form of these functions instead of to the second form with the optional argument.
When the code went to get the argument's data type, it ran of the end off the operand data type array incorrectly picking up a string data type. The coding error was on the line that incremented the index of the token on the hold stack to the next form of the function and assigned it to the index on top of the count stack. A post-increment C operator was used instead of the pre-increment operator and the index on the count stack was not actually changed.
Now all the test statements in the Eleventh Translator test are working (and on expected error needed to be corrected on the Third test because of the last fixed – it was missed previously). On to the Twelfth (more error statements) test...
Translator – More Expression Type Debugging
Now that the data type was moved back from the expression information structure back to the main table entry structure, so that it is available for commands as well as operators and functions, and the code was working again the way it was, it was back to debugging the Eleventh (error tests) Translator test (after committing the working code to CVS) to continue fine tuning the error messages produced.
The next problem to tackle was the generic “expected expression” error being produced when a binary operator occurred in an expression when only a unary operator was expected. This error needed to be changed to the more specific “expected <numeric/string> expression” errors. This was accomplish by calling the new get expression data type routine to get the current data type and producing the appropriate error.
Next there were several error test statements that contained an unexpected comma in an expression, which were producing “expected operator or end-of-statement” errors. This was the correct for some statements (for example: Z=A+B, and Z=A,B=1 and Z=A,), but, not for others (for example: Z=, and Z,Y=, and Z$=,) where the correct error should have been “expected <numeric/string> expression” errors.
These errors are being handled in the comma token handler. These conditions can be determined by looking at the done stack. If the done stack is not empty, the expression up to the comma is valid, so expecting another operator or end-of-statement is appropriate. If the done stack is empty, then there is no valid expression yet for the assignment, so expecting an expression is appropriate. Again, to determine the type of expression, the new get expression data type routine is used.
The next problem to tackle was the generic “expected expression” error being produced when a binary operator occurred in an expression when only a unary operator was expected. This error needed to be changed to the more specific “expected <numeric/string> expression” errors. This was accomplish by calling the new get expression data type routine to get the current data type and producing the appropriate error.
Next there were several error test statements that contained an unexpected comma in an expression, which were producing “expected operator or end-of-statement” errors. This was the correct for some statements (for example: Z=A+B, and Z=A,B=1 and Z=A,), but, not for others (for example: Z=, and Z,Y=, and Z$=,) where the correct error should have been “expected <numeric/string> expression” errors.
These errors are being handled in the comma token handler. These conditions can be determined by looking at the done stack. If the done stack is not empty, the expression up to the comma is valid, so expecting another operator or end-of-statement is appropriate. If the done stack is empty, then there is no valid expression yet for the assignment, so expecting an expression is appropriate. Again, to determine the type of expression, the new get expression data type routine is used.
Thursday, January 6, 2011
Translator – Command Expression Type
There was a problem getting the data type for the token on top of the command stack. There is no issue with assignment tokens as they contain a data type (they are just assignment operators put on the command stack).
A while back, a new field was added to the Table entries for the expression type with possible values of None, Numeric, String and Any to be used with command entries. The idea at the time was that this would be used for the expression type checking (for the generation of the correct error messages). However, this field was not being used (the original expression type code added at the time was abandoned because it had too many problems – it is currently commented).
This first attempt at the current problem was to use this expression type and convert it into a data type (None/Any to None, Numeric to Double, and String to String). The only place this was used was in the new get expression data type routine for commands. But this wasn't working for assignment commands (which were operators and their entries didn't have the expression type set – the data type in the expression information structure had the return value).
So it was concluded that the expression type field should be removed and the data type moved back from the expression information structure into the main entry structure (where it was originally). When the Parser finds and creates a command token, it sets the data type of the token to data type from the table entry. Currently for all tokens, the Parser was getting the data type from the expression information structure or none if there was no structure (as is the case for commands).
A while back, a new field was added to the Table entries for the expression type with possible values of None, Numeric, String and Any to be used with command entries. The idea at the time was that this would be used for the expression type checking (for the generation of the correct error messages). However, this field was not being used (the original expression type code added at the time was abandoned because it had too many problems – it is currently commented).
This first attempt at the current problem was to use this expression type and convert it into a data type (None/Any to None, Numeric to Double, and String to String). The only place this was used was in the new get expression data type routine for commands. But this wasn't working for assignment commands (which were operators and their entries didn't have the expression type set – the data type in the expression information structure had the return value).
So it was concluded that the expression type field should be removed and the data type moved back from the expression information structure into the main entry structure (where it was originally). When the Parser finds and creates a command token, it sets the data type of the token to data type from the table entry. Currently for all tokens, the Parser was getting the data type from the expression information structure or none if there was no structure (as is the case for commands).
Wednesday, January 5, 2011
Translator – Expression Type Implementation
A new routine was implemented to handle getting the current data type of the expression processed so far. This routine is called when the unexpected end of expression occurs to get the data type so that the correct “expected ...” error can be reported, and is also called before pushing an opening parentheses token to the hold stack to set its data type (so that it can be carried through the rest of the expression).
Once the routine was working, it was simplified. First, the token on top of the hold stack is checked for an operator before looking for the other conditions (parentheses, internal functions, and tokens with parentheses). This reduced the number of tests overall. An empty hold stack causes a bug error since at least the Null token should be on the stack.
If the top hold stack token is the Null token, then the data type is obtained from the token on top of the command stack. This caused another problem with commands (more on this in the next post). If the top token is an opening parentheses, then its data type is used (which could be none). Otherwise the data type of the operator's operand is used (first operand for a unary operator, second operand for binary operator).
If the top token on the hold stack is not an operator, then the top of the count stack is used to determine where to get the data type. If the count stack is empty, a bug error is returned since this situation should not happen. If the top is an internal function, then the data type of the current operand of the function is used. If the top is a parentheses, a bug error is returned since this situation should also not happen (parentheses token should have been on top of the hold stack). Otherwise the top contains a token with parentheses (an array or non-internal function) where the data type cannot be determined, so is set to None.
Once the routine was working, it was simplified. First, the token on top of the hold stack is checked for an operator before looking for the other conditions (parentheses, internal functions, and tokens with parentheses). This reduced the number of tests overall. An empty hold stack causes a bug error since at least the Null token should be on the stack.
If the top hold stack token is the Null token, then the data type is obtained from the token on top of the command stack. This caused another problem with commands (more on this in the next post). If the top token is an opening parentheses, then its data type is used (which could be none). Otherwise the data type of the operator's operand is used (first operand for a unary operator, second operand for binary operator).
If the top token on the hold stack is not an operator, then the top of the count stack is used to determine where to get the data type. If the count stack is empty, a bug error is returned since this situation should not happen. If the top is an internal function, then the data type of the current operand of the function is used. If the top is a parentheses, a bug error is returned since this situation should also not happen (parentheses token should have been on top of the hold stack). Otherwise the top contains a token with parentheses (an array or non-internal function) where the data type cannot be determined, so is set to None.
Saturday, January 1, 2011
Translator - Expression Type With Parentheses
Determining the expression type to report the appropriate error within a parenthetical expression without another operator on the hold stack involves a little more work. To accomplish this, the expression type needs to carried from before the opening parentheses.
To prevent having to scan backwards into the expression, when the opening parentheses is pushed on to the hold stack, the opening parentheses token will be assigned a data type, if possible, otherwise the data type will be left at the default data type of none (which will cause the generic “expected expression” error if an error occurs).
Here are several possibility of tokens that will be on the hold stack when an opening parentheses token is about to be pushed onto the hold stack and how to determine the data type to put into the parentheses token:
To prevent having to scan backwards into the expression, when the opening parentheses is pushed on to the hold stack, the opening parentheses token will be assigned a data type, if possible, otherwise the data type will be left at the default data type of none (which will cause the generic “expected expression” error if an error occurs).
Here are several possibility of tokens that will be on the hold stack when an opening parentheses token is about to be pushed onto the hold stack and how to determine the data type to put into the parentheses token:
Operator – get the expected operand data type of operatorThe first four possibilities are the same as before, so a common function will probably be implemented to handle determining the data type for the opening parentheses token. The last possibilities will require looking for the expected data type for the command on the command stack. Here are some examples of commands and expected data types for expressions:
Internal function – get the expected argument operand data type of function
Token with parentheses – can not determine data type
Opening parentheses – inherit this token's data type
Null – Look at data type for item on top of command stack
LET A = ( - assign token on command stack will have data type of variableThe IF statement has not been defined as of yet, but most likely will expect an integer expression (the result of all logical operators). However, a double expression will be accepted either with a different IF code expecting a double expression or the more likely, a hidden integer conversion code will be inserted. The IF will be defined later. Now on to the implementation of this...
PRINT ( - PRINT can take any type of expression, therefore none
INPUT PROMT ( - INPUT PROMPT expects string expression
IF ( - IF expects numerical expression
Translator - Expression Type Determination
In the Eleventh (error statements) tests, the are many statements for testing the correct error message in a line that contains something unexpected, like either an unexpected comma or end-of-line. One of these was just corrected (see last post) – the situation when the Translator is not in a parenthetical expression including inside array subscripts or function arguments. This was detected when the count stack is empty.
The situation when in an internal function has also been handled when the operand processing was implemented. The code simply checks for the expected operand type for the current operand. An internal function is detected when the number of expected operands for the item on top of the count stack is non-zero. The current operand number is in the number of operands in top item on the count stack.
For the remaining situation, the code was simply returning an “expected expression” error, since it was assumed that the type of expression could not be determined. Turns out, this is not entirely true. Two sub-conditions to this situation are inside a parentheses and inside an array or non-internal function (a token with a parentheses). Each of these can be further sub-divided into whether this is an operator or not. Here are examples of each of these four conditions (terminated with an unexpected end-of-line):
For the two cases with an operator, the data type of the operand or the operator on top of the hold stack can be used to determine the expected type of expression to follow (just like for a non-parenthetical expression just corrected). For the first case, with just a parentheses, determining the expression type is a little more involved...
The situation when in an internal function has also been handled when the operand processing was implemented. The code simply checks for the expected operand type for the current operand. An internal function is detected when the number of expected operands for the item on top of the count stack is non-zero. The current operand number is in the number of operands in top item on the count stack.
For the remaining situation, the code was simply returning an “expected expression” error, since it was assumed that the type of expression could not be determined. Turns out, this is not entirely true. Two sub-conditions to this situation are inside a parentheses and inside an array or non-internal function (a token with a parentheses). Each of these can be further sub-divided into whether this is an operator or not. Here are examples of each of these four conditions (terminated with an unexpected end-of-line):
Z = ( in parentheses, no operatorIn the third case, the expected expression type can't be determined because the Translator does not know if A is an array (expression must be numeric for a subscript) or a function (expression can be any type), therefore the “expected expression” error is appropriate. In the other cases, the expression type can be determined and in these examples, this would mean an “expected numeric expression” error.
Z = (A+ in Parentheses, with operator
Z = A( in token with parentheses, no operator
Z = A(B+ in token with parentheses, with operator
For the two cases with an operator, the data type of the operand or the operator on top of the hold stack can be used to determine the expected type of expression to follow (just like for a non-parenthetical expression just corrected). For the first case, with just a parentheses, determining the expression type is a little more involved...
Subscribe to:
Posts (Atom)