Sunday, May 16, 2010

Sub-Strings – Details

The results of sub-string functions (LEFT$, MID$ and RIGHT$) will use the same character array as the argument string. Exactly how this is accomplished depends on whether the argument string is a reference string or a temporary string.

For a reference string, the pointer and length of the character array are copied to the evaluation stack. This means that the pointer and length on the stack can be modified without affecting the actual pointer and length of the variable or array element. For LEFT$, only the length on the stack needs to be changed. For MID$ and RIGHT$, both the pointer and length need to be changed depending on the integer arguments.
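As a rough illustration of the reference string case, here is a minimal C++ sketch. The `StackStr` struct and the function names are hypothetical, not the interpreter's actual code, and the clamping of out-of-range arguments (including treating a start position of 0 as 1) is an assumption made to keep the example safe.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stack entry: a COPY of a variable's pointer and length.
// The character array itself still belongs to the variable.
struct StackStr {
    const char *ptr;
    std::size_t len;
};

// LEFT$(s, n): only the length on the stack changes.
StackStr refLeft(StackStr s, std::size_t n) {
    s.len = std::min(n, s.len);
    return s;
}

// MID$(s, start, n): start is 1-based; pointer and length both change.
StackStr refMid(StackStr s, std::size_t start, std::size_t n) {
    std::size_t offset = std::min(start > 0 ? start - 1 : 0, s.len);
    s.ptr += offset;
    s.len = std::min(n, s.len - offset);
    return s;
}

// RIGHT$(s, n): advance the pointer so the last n characters remain.
StackStr refRight(StackStr s, std::size_t n) {
    std::size_t keep = std::min(n, s.len);
    s.ptr += s.len - keep;
    s.len = keep;
    return s;
}
```

Because each function works on a copy of the pointer and length, the variable's own pointer and length are untouched, and nothing is allocated, copied or deleted.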

For a temporary string, the sub-string operation is a little more involved because the pointer to the character array on the stack cannot be modified – it is needed to delete the character array when the temporary string is no longer needed. The length is not needed to delete the character array, so it can be modified. Since the pointer cannot be modified, the characters of the sub-string need to be moved to the beginning of the character array. Again for LEFT$, only the length needs to be changed, since the sub-string is already at the beginning of the array. But for MID$ and RIGHT$, the sub-string result needs to be moved to the beginning of the array.
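The temporary string case might look something like the following C++ sketch. The `TempStr` struct and function names are hypothetical, and out-of-range arguments are simply clamped as an assumption. `std::memmove` is used rather than `std::memcpy` because the source and destination ranges can overlap within the same array.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical temporary string on the stack: it owns its character array,
// so ptr must keep pointing at the start of the allocation for delete[].
struct TempStr {
    char *ptr;
    std::size_t len;
};

// LEFT$: the result is already at the beginning; only the length changes.
void tempLeft(TempStr &s, std::size_t n) {
    s.len = std::min(n, s.len);
}

// MID$: slide the wanted characters down to the beginning (start is 1-based).
void tempMid(TempStr &s, std::size_t start, std::size_t n) {
    std::size_t offset = std::min(start > 0 ? start - 1 : 0, s.len);
    std::size_t keep = std::min(n, s.len - offset);
    std::memmove(s.ptr, s.ptr + offset, keep);
    s.len = keep;
}

// RIGHT$: slide the last n characters down to the beginning.
void tempRight(TempStr &s, std::size_t n) {
    std::size_t keep = std::min(n, s.len);
    std::memmove(s.ptr, s.ptr + (s.len - keep), keep);
    s.len = keep;
}
```

The pointer never changes, so whoever eventually deletes the temporary string can still pass the original pointer to `delete[]`; only the length and the array contents are updated.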

For reference strings, there is no allocation, copying or deletion required for the sub-string operation. For temporary strings, there is still a moving of characters, but there is no allocation of a new character array or deletion of the old character array. After the expression is evaluated, the temporary string will be deleted. There could be some unused space left in the character array by a sub-string operation when the temporary string is assigned and used as is. Enough of discussing how sub-strings will work at run-time, next, what impact sub-strings have on the Translator...

Sub-Strings

Because strings are variable length, they are dynamically allocated as needed during run-time. It is therefore beneficial to reduce as much as possible the amount of allocating, copying and deleting of the character arrays. This is why reference strings, the values of string variables and array elements, are used as-is, so that a new character array does not need to be allocated and filled just to put the reference string on the evaluation stack. But this will require extra code to know when to delete a temporary string and not to delete a reference string on the stack, which will be accomplished with additional associated codes.

There is another way to reduce some allocation and deleting of temporary strings for the sub-string internal functions, namely LEFT$, MID$, and RIGHT$. These functions can have a reference string or a temporary string as an argument. The obvious way to implement these functions is to create a new temporary character array of the appropriate size for the resulting string, copy the characters from the argument string to the new array and, if the argument string is a temporary string, delete it.
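To make the cost of that obvious approach concrete, here is a small C++ sketch of MID$ written that way. The `Str` struct, its `temporary` flag and the function name are hypothetical and only stand in for however the interpreter actually tells the two kinds of string apart; arguments are clamped as an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical string entry; 'temporary' marks an array owned by the stack.
struct Str {
    char *ptr;
    std::size_t len;
    bool temporary;
};

// MID$ the obvious way: allocate, copy, then delete a temporary argument.
Str naiveMid(Str arg, std::size_t start, std::size_t n) {
    std::size_t offset = std::min(start > 0 ? start - 1 : 0, arg.len);
    std::size_t keep = std::min(n, arg.len - offset);
    char *fresh = new char[keep];                // allocation
    std::memcpy(fresh, arg.ptr + offset, keep);  // copying
    if (arg.temporary) {
        delete[] arg.ptr;                        // deletion
    }
    return Str{fresh, keep, true};               // result is always temporary
}
```

Every call pays for an allocation and a copy, plus a deletion when the argument is itself a temporary string.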

There is a simpler way to handle sub-strings that will eliminate the allocation of a new character array, the deletion of the temporary string argument if present, and the copying for a reference string. Consider how a string is stored: there is a character array, a pointer to the character array, and the length of the character array. The pointer and length make up the members of the String class along with the allocated array. A resulting sub-string will never be larger than the string argument, so why not use the same character array, since it has already been allocated? Next, details on how this will work with reference strings and temporary strings...