Interactive BASIC Compiler Project

Thursday, May 20, 2010

Translator – Sub-String Assignments (Testing)

The changes to support sub-string assignment were implemented, which included adding the SubStr data type entries to the conversion code array in the match code routine; changing the data type to SubStr for the LEFT$, MID$ and RIGHT$ table entries; and adding the sub-string reference checking. Initial testing started with the existing test inputs – no problems were discovered.

Upon trying a sub-string assignment, I discovered that it didn't work because there was no AssignSubStr associated code, so this code was added along with its table entry. Now many of the existing test inputs were failing. Next, I discovered that the maximum associated codes constant needed to be changed from 2 to 3 because Assign now had three associated codes. The wrong value was causing the find code routine to malfunction.

I thought it would be best to calculate both the maximum operands and maximum associated codes automatically during the Table initialization, so moving these constants to members of the Table class seemed to be the best solution. Unfortunately, these values need to be constants because they are used to define the sizes of several arrays.

It would still be prudent to have the Table initialization at least check that these constants agree with what is in the table entries. So code was added to the Table initialization to find the maximum operands and associated codes as it scans the entries for code checking. Two new table error types were added for these errors, which are reported by exceptions from the Table constructor. Testing continues...
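A minimal sketch of such a consistency check is shown here. All names (TableEntry, TableError, checkTableMaximums, MaxOperands, MaxAssocCodes) and the simplified entry layout are assumptions for illustration, not the actual IBCP code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compile-time constants (hypothetical names). They size fixed arrays,
// so they cannot be computed at run-time.
const int MaxOperands = 3;
const int MaxAssocCodes = 3;

// Hypothetical, simplified table entry.
struct TableEntry {
    int nOperands;    // number of operands the code takes
    int nAssocCodes;  // number of associated codes the code has
};

// Hypothetical error thrown from the Table constructor.
enum TableErrorType { MaxOperands_TableError, MaxAssocCodes_TableError };
struct TableError {
    TableErrorType type;
    int found;  // maximum actually found in the table entries
};

// Scan the entries, find the real maximums and throw if either one
// disagrees with its constant.
void checkTableMaximums(const std::vector<TableEntry> &entries)
{
    int maxOperands = 0;
    int maxAssocCodes = 0;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        if (entries[i].nOperands > maxOperands)
            maxOperands = entries[i].nOperands;
        if (entries[i].nAssocCodes > maxAssocCodes)
            maxAssocCodes = entries[i].nAssocCodes;
    }
    if (maxOperands != MaxOperands)
        throw TableError{MaxOperands_TableError, maxOperands};
    if (maxAssocCodes != MaxAssocCodes)
        throw TableError{MaxAssocCodes_TableError, maxAssocCodes};
}
```

With a check like this, forgetting to bump a constant after adding an associated code shows up as an exception at start-up instead of as a malfunction in the find code routine.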

Wednesday, May 19, 2010

Translator – Sub-String Assignments

There will be several associated codes for assigning strings from a reference string or temporary string to a reference string or to a sub-string. The Encoder will determine which code to use once the string type for each operand is determined. For the Encoder to know if a sub-string is present, a new SubStr data type is needed, which only the LEFT$, MID$ and RIGHT$ functions return.

The Translator will check to make sure that sub-string assignments are valid by making sure the string argument is a reference string, at least if it is a Paren or NoParen token type. If it turns out to be a user function, then the Encoder will report the error. The Translator will also make sure that a compound sub-string assignment like LEFT$(RIGHT$(B$,3),2)="AB" is not entered.

The find code routine needs to handle sub-strings where the operands are popped off of the done stack and the reference flag is checked. For the first operand (last operand popped), if the data type of the token is SubStr (indicating a sub-string function) and the operand is a String, then the token's reference flag is set to the operand's reference flag (the reference flag is transferred). The operand's reference flag is then cleared, as is currently implemented. This assumes that the first argument of the sub-string function is the string being assigned.

If the operand's data type is anything but String (SubStr or TmpStr), then the reference flag is not transferred. For the invalid compound sub-string assignment, since the data type of the operand will be SubStr, the reference flag of the function's token will not be set. When the reference flag is checked for an assignment operator, the reference flag will only be set if the string operand was a reference, otherwise it won't be set and an error will be returned.
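The reference flag transfer described above can be sketched as follows. The Token structure and the function name transferReference are hypothetical simplifications; only the SubStr, TmpStr and String data type names come from the design:

```cpp
#include <cassert>

enum DataType { Double_DataType, Integer_DataType, String_DataType,
                TmpStr_DataType, SubStr_DataType };

// Hypothetical, simplified token.
struct Token {
    DataType dataType;
    bool reference;
};

// Called for the first operand of a function token (the last operand
// popped from the done stack).
void transferReference(Token &function, Token &operand)
{
    // Only a sub-string function with a plain String operand inherits
    // the flag; a SubStr or TmpStr operand never passes it on.
    if (function.dataType == SubStr_DataType
        && operand.dataType == String_DataType) {
        function.reference = operand.reference;
    }
    operand.reference = false;  // cleared in every case
}
```

For LEFT$(RIGHT$(B$,3),2)="AB", the inner RIGHT$ token receives the flag from B$, but the outer LEFT$ sees a SubStr operand, so its flag stays clear and the assignment operator reports the error.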

Tuesday, May 18, 2010

Sub-String Assignments

Many BASICs support sub-string assignments. The syntax that will be supported is the same as that used in GW-Basic, QuickBASIC, FreeBASIC, etc. Here is an example sub-string assignment along with its translation:

MID$(A$,5,2) = "AB"        A$ 5 2 MID3$ "AB" AssignSubStr

Notice the new code AssignSubStr and the lack of <ref> on A$. When the MID$ is processed by the Translator, the reference flags of its operands are cleared, hence no A$<ref>. However, the MID3$ needs to have its token reference flag set when the assign operator checks for a reference, since that MID3$ token will be on the done stack. Consider how this statement is processed at run-time.

At the A$, a reference string is pushed on the evaluation stack, that is, its pointer and length are copied to the stack. When the MID3$ is processed, it will pop the 2 and 5 off of the stack. The pointer on top of the stack will be changed to point to the fifth character in A$ and the length is set to 2 (assuming that A$ is at least 6 characters long).

When the AssignSubStr is processed, the "AB" string constant will be popped off of the stack. Two characters of this value are copied directly to the pointer that is on top of the stack (which is pointing to the fifth character in A$). A different AssignSubStr vs. AssignStr code is required because no allocation occurs, only a copy to an existing character array. An AssignSubStrTmp code is also required for the case that the value being assigned is a temporary string, which needs to be deleted after the characters are copied.
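Here is a rough sketch of these run-time steps. The StrItem structure and the function names are hypothetical, and copying at most the length of the target is an assumption of this sketch rather than something decided above:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stack item for a string: a pointer into a character
// array (not NUL-terminated) plus a length.
struct StrItem {
    char *ptr;
    int len;
};

// MID3$ on a reference string: only the copy on the stack is adjusted.
// 'start' is 1-based as in BASIC. This sketch assumes the range lies
// inside the string and does no clamping.
StrItem mid3(StrItem s, int start, int count)
{
    assert(start >= 1 && count >= 0 && start - 1 + count <= s.len);
    s.ptr += start - 1;
    s.len = count;
    return s;
}

// AssignSubStr: copy into the existing character array, no allocation.
// Copies at most target.len characters (assumption of this sketch).
void assignSubStr(StrItem target, StrItem value)
{
    int n = value.len < target.len ? value.len : target.len;
    std::memcpy(target.ptr, value.ptr, n);
}

// AssignSubStrTmp: same copy, then the temporary's array is deleted.
void assignSubStrTmp(StrItem target, StrItem value)
{
    assignSubStr(target, value);
    delete[] value.ptr;
}
```

Because mid3 receives its StrItem by value, the pointer and length stored in the variable A$ itself are never changed, only the characters they refer to.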

Since LEFT$ and RIGHT$ are also sub-string functions, there is no reason these functions can't also be used to assign sub-strings – and as it turns out, no extra code is required. Next, how sub-string assignments will be handled by the Translator...

Monday, May 17, 2010

Translator – Sub-Strings

The processing of operators and internal functions with string operands needs to be delayed until encoding because, in some cases, the Translator does not know if a string is a reference (from a variable or array element) or a temporary (from a user string function), since it does not know if an identifier is a variable, array element or user function.

At run-time, if the string argument to a sub-string function is a reference string, then its result is also a reference string in that the string does not need to be deleted. If the string argument is a temporary string, then its result is also a temporary string in that the string needs to be deleted.

Each sub-string function will have an associated code. The main code will have a reference string operand and will return a reference string. The associated code will have a temporary string operand and will return a temporary string.

It would appear that sub-strings have no impact on the Translator. However, there is one more capability that needs to be considered, and that is sub-string assignments where a portion of a string is assigned without affecting the rest of the string...

Translator – Saved Operand Correction (Release)

The problem where the saved operand pointers were not pointing to a conversion code that was inserted (they were pointing to the original operand) was a simple correction. All that was needed was to set the operand array element to the output of the output list append call where the conversion code token was inserted. The ibcp_0.1.11-dev-2-src.zip file, another developmental release, has been uploaded at Sourceforge IBCP Project along with the binary for the program.

Sunday, May 16, 2010

Sub-Strings – Details

The results of sub-string functions (LEFT$, MID$ and RIGHT$) will use the same character array as the argument string. Exactly how this is accomplished depends on whether the argument string is a reference string or a temporary string.

For a reference string, the pointer and length of the character array are copied to the evaluation stack. This means that the pointer and length on the stack can be modified without affecting the actual pointer and length of the variable or array element. For LEFT$, only the length on the stack needs to be changed. For MID$ and RIGHT$, both the pointer and length need to be changed depending on the integer arguments.

For a temporary string, the sub-string operation is a little more involved because the pointer to the character array on the stack cannot be modified – it is needed to delete the character array when the temporary string is no longer needed. The length is not needed to delete the character array, so it can be modified. Since the pointer cannot be modified, the portion of the sub-string needs to be moved to the beginning of the character array. Again for LEFT$, only the length needs to be changed, the sub-string is already at the beginning of the array. But for MID$ and RIGHT$, the sub-string result needs to be moved to the beginning of the array.
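The difference between the two cases can be sketched for RIGHT$ and LEFT$ as follows. StrItem and the function names are hypothetical, and clamping the count to the string length (with a non-negative count) is an assumption of this sketch:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stack item: pointer to a character array plus a length.
struct StrItem {
    char *ptr;
    int len;
};

// RIGHT$ on a reference string: move the pointer on the stack copy.
StrItem rightRef(StrItem s, int count)
{
    if (count > s.len)
        count = s.len;
    s.ptr += s.len - count;
    s.len = count;
    return s;
}

// RIGHT$ on a temporary string: the pointer must keep pointing at the
// start of the allocation so it can be deleted later, so the
// characters are moved to the front instead.
StrItem rightTmp(StrItem s, int count)
{
    if (count > s.len)
        count = s.len;
    std::memmove(s.ptr, s.ptr + (s.len - count), count);
    s.len = count;
    return s;
}

// LEFT$ is the same for both kinds of string: only the length shrinks.
StrItem leftStr(StrItem s, int count)
{
    if (count < s.len)
        s.len = count;
    return s;
}
```

std::memmove is used rather than std::memcpy because the source and destination ranges can overlap when more than half of the string is kept.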

For reference strings, there is no allocation, copying or deletion required for the sub-string operation. For temporary strings, there is still a moving of characters, but there is no allocation of a new character array or deletion of the old character array. After the expression is evaluated, the temporary string will be deleted. There could be some unused space in the character array from a sub-string operation when the temporary string is assigned and used as is. Enough of discussing how sub-strings will work at run-time, next, what impact sub-strings have on the Translator...

Sub-Strings

Because strings are variable length, they are dynamically allocated as needed during run-time. It is therefore beneficial to reduce as much as possible the amount of allocating, copying and deleting of the character arrays. This is why reference strings, the values of string variables and array elements, are used as-is so a new character array does not need to be allocated and copied to in order to put the reference string on the evaluation stack. But this will require extra code to know when to delete a temporary string and not to delete a reference string on the stack, which will be accomplished with additional associated codes.

There is another way to reduce some allocation and deleting of temporary strings for the sub-string internal functions, aka LEFT$, MID$, and RIGHT$. These functions can have a reference string or a temporary string as an argument. The obvious way to implement these functions is to create a new temporary character array of the appropriate size for the resulting string, copying the characters from the argument string to the new array and if the argument string is a temporary string, to delete it.

There is a simpler way to handle sub-strings that will eliminate the allocation of a new character array, the deletion of the temporary string argument if present, and the copying for a reference string. Consider how a string is stored: there is a character array, a pointer to the character array, and the length of the character array. The pointer and length make up the members of the String class along with the allocated array. A resulting sub-string will never be larger than the string argument, so why not use the same character array, since it has already been allocated? Next, details on how this will work with reference strings and temporary strings...

Saturday, May 15, 2010

Translator – Temporary Strings (Release)

Before testing the changes, the test code was modified to output the operands that were saved by the Translator within square brackets separated by commas. This change was made to see if operands were being saved correctly for the correct tokens (arrays, user functions, operators with string operands and internal functions with string operands). Only the primary operand token is output, so if the operand is an operator, that is what is output.

The code appears to be working, at least for all the current test inputs with one exception (no new test inputs were added at this time). If a conversion operator was inserted for an operand, the operand is not pointing to the conversion operator. For example, for the statement Z$=MID$(A$,B+C,D), the MID$ token is output as MID3$([A$,+,C], but should be output as MID3$([A$,CvtInt,CvtInt] since conversion operators were inserted for the + and D operands.

Other than this problem, the code appears to be working. Since more work is needed to complete string data type handling in the Translator and not much was changed (temporary strings, which had a minor effect on the Translator; and operands are saved for later processing as needed for the Encoder), the code is being released only as a developmental release and ibcp_0.1.11-dev-1-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program.

The rest of the string data type handling has to do with the sub-string functions (aka LEFT$, MID$ and RIGHT$), which will handle strings slightly differently during run-time...

Translator – Temporary Strings (Implementation)

Upon making the changes to add temporary strings, I realized that there are more string operators than just the CatStr operator – there are the equality, relational, assign, and assign list operators, all of which can have string operands (either reference or temporary). For the equality and relational operators, the character arrays of any temporary strings need to be deleted after the comparison is made.

For the assign operator, if the operand is a reference string, then a new character array needs to be allocated and the reference string copied. However, for a temporary string, the string variable's character array can be set to the temporary character array (making it a reference string) and the current character array is deleted, eliminating the need to allocate and copy. For assigning a list, the temporary string can only be used for one of the string variables in the list, the rest need new arrays allocated.
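The two assignment cases might look roughly like this. The String structure is a simplified stand-in for the String class, and the name assignStrTmp is hypothetical:

```cpp
#include <cassert>
#include <cstring>

// Simplified stand-in for the String class: pointer plus length.
struct String {
    char *ptr;
    int len;
};

// Assigning a reference string: a new array is allocated and the
// characters are copied, because the source still owns its array.
void assignStr(String &var, String value)
{
    char *copy = new char[value.len];
    std::memcpy(copy, value.ptr, value.len);
    delete[] var.ptr;  // release the variable's current array
    var.ptr = copy;
    var.len = value.len;
}

// Assigning a temporary string: the variable simply takes over the
// temporary's array, so nothing is allocated or copied.
void assignStrTmp(String &var, String value)
{
    delete[] var.ptr;
    var.ptr = value.ptr;
    var.len = value.len;
}
```

In assignStr the copy is made before the old array is deleted, so a statement like A$=A$ still works in this sketch.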

There is already a String flag used for immediate commands, and since its value doesn't conflict with any of the existing flags (immediate command flags are separate), it can also be used for operators and internal function codes that have string operands. A note was added to the code to make sure its value is not used for another flag value.

A constructor was created for the new RpnItem structure that takes a token pointer, number of operands and a pointer to an operand array as arguments (the latter two default to 0 and NULL). If the number of operands and operand array pointer are supplied, an operand array will be allocated for a non-zero number of operands and the operand pointers will be copied from the supplied array. A destructor was also created to delete the token and the operand array.
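A sketch of what such a structure could look like follows. The Token stand-in, the member names and the disabled copying are assumptions of this sketch, not the released code:

```cpp
#include <cassert>
#include <cstddef>

struct Token { int code; };  // hypothetical stand-in for the real token

struct RpnItem {
    Token *token;
    int nOperands;
    RpnItem **operand;  // array of pointers to other output list items

    RpnItem(Token *tok, int n = 0, RpnItem **ops = NULL)
        : token(tok), nOperands(0), operand(NULL)
    {
        // Allocate and fill the array only for a non-zero operand count.
        if (n > 0 && ops != NULL) {
            nOperands = n;
            operand = new RpnItem *[n];
            for (int i = 0; i < n; ++i)
                operand[i] = ops[i];
        }
    }

    ~RpnItem()
    {
        delete token;
        // Only the array is deleted here; in this sketch the items it
        // points to are assumed to be owned by the output list.
        delete[] operand;
    }

private:
    // Copying is disabled in this sketch to avoid double deletion.
    RpnItem(const RpnItem &);
    RpnItem &operator=(const RpnItem &);
};
```

An operand such as a variable is created with just a token, while an operator with string operands passes the array of items popped from the done stack.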

Translator – Temporary Strings

The impact of temporary strings on the Translator is not much since the work of finding associated codes will be left for the Encoder. However, there are a few changes that are required – the main one is passing the operands to the Encoder so that it has easy access to them.

A new TmpStr data type will be added. The data type of the string operator (CatStr) and the internal functions that return a string will be changed to TmpStr. This includes the internal functions CHR$, REPEAT$, SPACE$ and STR$. Something else is needed for the functions LEFT$, MID$ and RIGHT$, which will be revealed shortly. Any string DefFunP and DefFunN token types will also be changed to TmpStr since defined (one-line) string functions will return a temporary string during run-time.

The conversion code table used by the match code function needs to be expanded to include the new TmpStr data type. As far as the Translator is concerned, the String and TmpStr data types are the same, therefore the conversion code table entries for String to TmpStr and TmpStr to String will be set to the Null code.
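As an illustration, an expanded conversion code table could be laid out like this. CvtInt and the Null code are mentioned in these posts, while CvtDbl_Code, Invalid_Code, the enum layout and the function name are assumptions made for the sketch:

```cpp
#include <cassert>

enum DataType { Double_DataType, Integer_DataType, String_DataType,
                TmpStr_DataType, sizeof_DataType };

// CvtDbl_Code and Invalid_Code are hypothetical names.
enum Code { Null_Code, CvtInt_Code, CvtDbl_Code, Invalid_Code };

// Indexed as cvtCode[have][need]. Null_Code means no conversion is
// needed; Invalid_Code means the data types cannot be converted.
static const Code cvtCode[sizeof_DataType][sizeof_DataType] = {
    /* have Double  */ {Null_Code, CvtInt_Code, Invalid_Code, Invalid_Code},
    /* have Integer */ {CvtDbl_Code, Null_Code, Invalid_Code, Invalid_Code},
    /* have String  */ {Invalid_Code, Invalid_Code, Null_Code, Null_Code},
    /* have TmpStr  */ {Invalid_Code, Invalid_Code, Null_Code, Null_Code}
};

// Look up the conversion code for an operand of type 'have' where an
// operand of type 'need' is expected.
Code conversionCode(DataType have, DataType need)
{
    return cvtCode[have][need];
}
```

The four Null_Code entries in the lower right corner are what make String and TmpStr interchangeable as far as the Translator is concerned.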

For the string operator and internal string functions that have string operands, the operands need to be attached to the operator/function token within the output list. To accomplish this, the output list will be changed from a token pointer to a pointer to a new RpnItem structure that will contain the token pointer, the number of operands (0 if not applicable) and a pointer to an array of output list item (RpnItem) pointers.

There will be a new String flag added for these codes so that the Translator can easily identify them. When this flag is set, the number of operands will be set and the output list pointer array will be allocated. This array will be filled from the operands popped from the done stack. The done stack will also be changed to a stack of RpnItem structure pointers.

This saving of operands functionality is also needed for parentheses tokens, which can be arrays or user functions. The processing of array subscripts and user function arguments also needs to be delayed until the Encoder since the Translator doesn't know the difference between an array (whose subscripts all need to be integers) or a user function (whose argument types will be contained in the dictionary and defined in a function definition statement).

Friday, May 14, 2010

Temporary Strings – Design

During run-time the string operator and internal string functions need to delete the character array of a temporary string operand, but not of a reference string. Figuring out the design for temporary strings has proven difficult. The ultimate goal is fast execution at run-time. Two solutions were considered.

The first solution (easy to implement), would be to add a temporary flag to the String class. After an operation was finished with an operand, it would be deleted. If the temporary flag is set, then the character array would be deleted, otherwise, no action is taken. This was not desirable because the run-time would need to check for the temporary string flag, adding to execution time.

The other solution (more involved to implement) would not require checks for temporary strings during run-time. This solution is desirable, and will require separate codes depending on whether the operands are reference strings or temporary strings.

For instance, the CatStr code would handle the case where both operands are reference strings. There would be three associated codes to the CatStr main code, one for when the first operand is a temporary string and second operand is a reference string (CatStrT1), one for when the first operand is a reference string and the second is a temporary (CatStrT2), and one for when both operands are temporary strings (CatStrTT). These associated codes would delete the temporary operand(s) after the concatenation.
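Choosing between these four codes, once it is known which operands are temporary, is then a simple lookup. A small sketch, where the function name selectCatStr and the _Code suffixes are hypothetical:

```cpp
#include <cassert>

enum Code { CatStr_Code, CatStrT1_Code, CatStrT2_Code, CatStrTT_Code };

// Pick the concatenation code from which operands are temporary.
// Bit 0 of the index is set when the first operand is temporary and
// bit 1 when the second operand is temporary.
Code selectCatStr(bool firstIsTmp, bool secondIsTmp)
{
    static const Code codes[4] = {
        CatStr_Code,    // reference + reference
        CatStrT1_Code,  // temporary + reference
        CatStrT2_Code,  // reference + temporary
        CatStrTT_Code   // temporary + temporary
    };
    return codes[(firstIsTmp ? 1 : 0) | (secondIsTmp ? 2 : 0)];
}
```

The decision is made once, before run-time, so each of the four run-time routines can delete exactly the operands it knows are temporary without testing a flag.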

Unfortunately, the presence of temporary strings is not necessarily known during translation. The reason is that identifiers could be variables, array elements, or user functions. Variables and array elements produce reference strings. However, user functions, like internal functions, produce temporary strings, except within a user function where the function name will be used as a variable holding the value that will be returned by the function.

Remember that the Translator does not determine what the identifier is – this job is left to the Encoder, which has access to the Dictionary. The Dictionary will contain all the identifiers and what they are. Next, what changes are required in the Translator for temporary strings...

Wednesday, May 12, 2010

Reference Vs. Temporary Strings

During run-time as expressions are executed, the result from each operator or function is pushed back onto the evaluation stack. This result is considered a temporary value, which will be used by the next operation (operator, function or command). The evaluation stack holds these temporary values. However, the actual character arrays for strings, including temporary string values, will not be contained in the stack elements, only a pointer to (and the length of) the character array. Consider this string expression and its translation:

A$+B$+C$        A$ B$ +$ C$ +$

As this expression is executed, the value of A$ is pushed onto the stack. For double and integer variables, values are simply copied to the stack. However, for strings, only its character array pointer and length will be copied to the top item on the stack. This prevents having to allocate a new array and to copy the actual string value of A$ to the array. This item on the stack is a reference string. Similarly, the value of B$ is pushed onto the stack.

At the first +$ operator, a string needs to be created to hold the concatenation of A$ and B$. First the A$ and B$ operands are popped from the stack. The size of the result character array is allocated for the total length of A$ and B$. The contents of A$ and B$ are copied to this character array. The resulting string's length and character array pointer are then pushed on to the stack. This item on the stack is a temporary string.

The value of C$ is pushed on to the stack. At the second +$ operator, another temporary string needs to be created to hold the concatenation of A$+B$ (now a temporary string on the stack) and C$ (a reference string). Both operands are popped off of the stack and a temporary string is created as before. However, for this operator, after the result has been pushed on to the stack, the character array of the temporary string holding A$+B$ needs to be deleted. Next, what implications this has on the Translator...
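The execution of A$+B$+C$ can be simulated with a small sketch. StrItem, EvalStack and the function names are hypothetical; the two routines differ only in whether the first operand's character array is deleted:

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stack item: pointer to a character array plus a length.
struct StrItem {
    char *ptr;
    int len;
};
typedef std::vector<StrItem> EvalStack;

// Allocate a new array holding the characters of a followed by b.
static StrItem concat(StrItem a, StrItem b)
{
    StrItem r;
    r.len = a.len + b.len;
    r.ptr = new char[r.len];
    std::memcpy(r.ptr, a.ptr, a.len);
    std::memcpy(r.ptr + a.len, b.ptr, b.len);
    return r;
}

// +$ with two reference operands: nothing is deleted.
void catStr(EvalStack &stack)
{
    StrItem b = stack.back(); stack.pop_back();
    StrItem a = stack.back(); stack.pop_back();
    stack.push_back(concat(a, b));  // result is a temporary string
}

// +$ where the first operand is a temporary string: its array is
// deleted after the result has been pushed.
void catStrTmpFirst(EvalStack &stack)
{
    StrItem b = stack.back(); stack.pop_back();
    StrItem a = stack.back(); stack.pop_back();
    stack.push_back(concat(a, b));
    delete[] a.ptr;
}
```

For the expression above, the sequence would be: push A$, push B$, catStr, push C$, catStrTmpFirst, leaving one temporary string on the stack.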

Monday, May 10, 2010

String Data Type

It is now time in the Translator development to consider how the program will be executed at run-time, which will determine what the Translator needs to do, and this applies to strings. Keep in mind that the goal is to do as much work and decision making as possible during translation so that it does not have to be done at run-time, which will aid in fast execution.

During execution, operands are pushed onto the evaluation stack, popped off by operators and functions, evaluated and the result pushed back onto the stack. Each evaluation stack element can hold a double or an integer. Strings are a little more involved. Strings are variable length and can be quite large, therefore a stack element cannot hold a string (or the stack would be very large).

C++, the underlying language, does not have a built-in string type. Strings are handled as character arrays. These arrays need to be allocated for the appropriate size as needed and deallocated when no longer needed. A stack element will be a union of the different types that can be pushed onto the stack (double, integer, reference, etc.). For a string, the item in the union will be the previously implemented String class, which contains the length and pointer to the string's character array. Next, the difference between reference strings and temporary strings...
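A stack element along these lines might be declared as in the sketch below. The member names are hypothetical, only the double, integer and string members are shown, and the String stand-in is kept free of constructors so that it can be a union member:

```cpp
#include <cassert>

// Simplified stand-in for the String class. It has no constructor or
// destructor so that it can be placed in a union.
struct String {
    char *ptr;  // pointer to the character array
    int len;    // length of the string
};

// One evaluation stack element (hypothetical member names). Only one
// member is in use at a time; which one is known from the code being
// executed, so no type tag is stored.
union EvalItem {
    double dblValue;
    int intValue;
    String strValue;
};
```

The element stays small no matter how long the string is, since only the pointer and length live on the stack.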

Sunday, May 9, 2010

Translator – Data Types and Assignments (Release)

Testing started by using the previous six translator test input sets. First, I discovered that a check was needed for a NULL expression information structure pointer in the data type and unary code access functions, specifically for the Null and EOL codes. I also discovered that any unary operators needed their own expression information structures since all the standard ones set the unary code to the Null code. Many other minor bugs discovered with these test inputs were corrected.

There were no differences with the first four test input sets, since these contain only expressions with no assignment operators. The fifth and sixth test input sets had differences because either the correct data type assignment operator is now being added or a hidden conversion operator is now being inserted. Even though care was taken to make sure all the assignments in the test inputs were valid, one statement was trying to assign an integer to a string, so this expression was corrected.

A seventh test input set was added. One of the errors tested pointed to a token that may not be appropriate. The statement “A=B$+5” produces a data type error as expected, but the error is “expected string” pointing to the 5. This is not wrong, the plus is processed first and with the first operand a string, the Translator expects a string for the second operand, hence the error. But this may be confusing to the user, the error should probably be “expected double” pointing to the B$. This will be left as is for now to be revisited later, since there is no easy fix for this.

The code now handles data types for assignment operators and ibcp_0.1.10-src.zip has been uploaded at Sourceforge IBCP Project along with the binary for the program. Next the string handling needs to be expanded...

Translator – Data Types and Assignments (Implementation)

The only issue that occurred as support for data types was being added to assignment operators was developing code to return the proper error and the proper token for assignment lists. The issue was that the error should point to the first token in the line that is in error, which is complicated by the fact that the items in the list are processed in reverse order, as that is how the tokens are pulled from the done stack.

This section proved to be difficult to implement, so the gory details won't be included here. An added complication was that a "not a reference" error should be reported on a token that is both not a reference and not the correct data type. Basically the code needs to keep track of the last operand processed. A reference error is reported on the current operand being processed, but a data type error is reported on the last operand processed, but only if it is a reference (otherwise a reference error has already been set for the token).

Now that these changes have been implemented and compile successfully, debugging and testing can begin...

Saturday, May 8, 2010

Translator – Expression Information (Revisited)

Sometimes programming is an evolutionary process. The final solution is not arrived at initially, so an intermediate solution is made. Once made, another better solution is made. And so on until arriving at a final solution. This seems to be the case with the expression information, which has taken a while to arrive at a good solution. The changes described in the last post were completed, but they didn't look very good (very cluttered) – maybe there was a better way, perhaps using the new operator as originally planned.

It was also observed that a lot of the operators had the same operands, for example one double, two doubles, two integers, etc. There was no reason to have an operand array with two double data types for every operator that requires this – in other words, the same array could be used for all the operators (or internal functions) that take two double operands. There only needs to be one data type operand array for each unique operand set – so there will be one DblDbl_OperandArray that will be used for all operators or internal functions that take two double operands, and so on for the other possible operand sets.

All the possible unique operand arrays will be defined using the convention Dbl, Int, and Str for each operand data type (e.g. StrInt for two operands, a string and an integer operand). All the associated codes arrays will be defined next, where each code has its own array. There will be two macros (defines) that will produce the arguments for the expression information constructor in the table entries, which will look like this:

new ExprInfo(Double_DataType, Null_Code, Operands(DblDbl), AssocCodes(Add))

In this example, the Operands macro will produce two arguments, a 2 for the size of the operand array and a pointer to DblDbl_OperandArray. The size of the array will be calculated within the macro. The AssocCodes macro will be similarly defined. The constructor will have default argument values, so if there are no associated codes, the AssocCodes() argument would not be included, the number of associated codes would be set to zero and the array pointer set to NULL. The same applies if there are no operands.
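One way these macros and the constructor could be written is sketched below. The ExprInfo member names, the Add_AssocCode array name and the associated codes listed for Add are assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>

enum DataType { Double_DataType, Integer_DataType, String_DataType };
// AddInt_Code and CatStr_Code are placeholder associated codes here.
enum Code { Null_Code, Add_Code, AddInt_Code, CatStr_Code };

struct ExprInfo {
    DataType dataType;                // data type returned
    Code unaryCode;                   // unary code, or Null_Code
    int nOperands;
    const DataType *operandDataType;
    int nAssocCodes;
    const Code *assocCode;

    // The trailing arguments default to "none".
    ExprInfo(DataType dt, Code unary,
             int nOps = 0, const DataType *ops = NULL,
             int nAssoc = 0, const Code *assoc = NULL)
        : dataType(dt), unaryCode(unary),
          nOperands(nOps), operandDataType(ops),
          nAssocCodes(nAssoc), assocCode(assoc) {}
};

// One shared array for every code taking two double operands.
static const DataType DblDbl_OperandArray[] = {
    Double_DataType, Double_DataType
};
// Each main code has its own associated codes array (hypothetical).
static const Code Add_AssocCode[] = { AddInt_Code, CatStr_Code };

// Each macro expands to two constructor arguments: the element count
// (computed with sizeof) and the array itself.
#define Operands(ops) \
    (int)(sizeof(ops##_OperandArray) / sizeof(ops##_OperandArray[0])), \
    ops##_OperandArray
#define AssocCodes(code) \
    (int)(sizeof(code##_AssocCode) / sizeof(code##_AssocCode[0])), \
    code##_AssocCode
```

Because the count is derived from the array with sizeof, adding an element to an array never requires touching the table entries that use it.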

As the table entries were being updated, it was observed that many of the entries had identical expression information constructor calls – for example, the math functions generally return a double, have no unary code, have one double operand and no associated codes. Again there is no reason to have a separate expression information structure for each one. So there will be common expression information structures defined for these in the form of Dbl_Dbl_ExprInfo (returns double and has one double operand) and Int_IntInt_ExprInfo (returns integer and has two integer operands), and so on. The address of these will be put into the table entries.

The table entries for the main codes are unique because each has its own associated codes array, so the new ExprInfo syntax listed above will be used for these. This may still not be the best solution, but it's good enough for now, and I need to stop fiddling with this and get data type handling for assignments implemented...